Author of this article: Sticnarf Chen, TiKV Committer and PingCAP R&D engineer. He is keen on open source technology, has rich experience in the field of distributed transactions, and is currently focused on optimizing the performance of TiDB's distributed transactions.

TiDB provides native distributed transaction support, and reducing the latency of distributed transactions is a continuous direction of optimization. The Async Commit feature introduced in TiDB 5.0 greatly reduces the latency of transaction commit. This feature was contributed mainly by Sticnarf Chen, Lei Zhao (youjiali1995), Nick Cameron (NRC), and Zhou Zhenjing (MyonKeminta).

This article introduces the design ideas, principles, and key implementation details of Async Commit.

Extra latency in Percolator

TiDB transactions are based on the Percolator transaction model. You can refer to our previous blog post to learn more about the commit process of the Percolator transaction model.

The diagram above shows the commit process before Async Commit was introduced. After a user sends a COMMIT statement to TiDB, TiDB must complete at least the following steps before it can return the result of the COMMIT to the user:

  1. Prewrite all keys concurrently;

  2. Get a timestamp from PD as the Commit TS;

  3. Commit the primary key.

The critical path of the entire commit process includes at least two network interactions between TiDB and TiKV. Only the commit of the secondary keys is done asynchronously in the background by TiDB.
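For illustration, this critical path can be sketched roughly as follows. This is a minimal Go sketch of the steps listed above; prewriteKeys, getTSFromPD, and commitKey are hypothetical stand-ins for the real TiDB/TiKV RPCs, not TiDB's actual API:

```go
package main

// Hypothetical stubs standing in for the real TiDB/TiKV RPCs; they are
// named for illustration only and are not TiDB's actual API.
func prewriteKeys(primary []byte, secondaries [][]byte, startTS uint64) error { return nil }
func getTSFromPD() uint64                                                     { return 0 }
func commitKey(key []byte, startTS, commitTS uint64) error                    { return nil }

// commitTransaction sketches the commit critical path before Async Commit:
// the user only gets a response after prewrite, a PD round trip, and the
// commit of the primary key have all finished.
func commitTransaction(primary []byte, secondaries [][]byte, startTS uint64) error {
	// 1. Prewrite all keys (primary and secondaries) concurrently.
	if err := prewriteKeys(primary, secondaries, startTS); err != nil {
		return err
	}
	// 2. Get a timestamp from PD as the Commit TS.
	commitTS := getTSFromPD()
	// 3. Commit the primary key; the transaction state is only decided here,
	//    so this extra TiKV round trip sits on the user-visible critical path.
	if err := commitKey(primary, startTS, commitTS); err != nil {
		return err
	}
	// Commit the secondary keys asynchronously in the background.
	go func() {
		for _, k := range secondaries {
			_ = commitKey(k, startTS, commitTS)
		}
	}()
	return nil
}

func main() {}
```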

In TiDB's transaction model, we can roughly regard the TiDB node as the coordinator of the transaction and the TiKV nodes as its participants. In traditional two-phase commit, the coordinator usually stores its data locally on its own node, whereas in the TiDB transaction model all transaction-related data is stored on TiKV. Therefore, while traditional two-phase commit can determine the transaction state once the first phase completes, TiDB needs to finish part of the second phase to persist the transaction state on TiKV before it can respond to the user.

However, this also means that there is room for improvement in TiDB's previous transaction model: can the Percolator transaction model be improved so that the state of a transaction is determined as soon as the first phase completes, without additional network interaction?

Improving the condition for transaction completion

Before Async Commit was introduced, a transaction was committed only when its primary key was committed. The goal of Async Commit is to move the point at which the transaction state is determined forward to the completion of prewrite, making the entire second phase of the commit asynchronous. That is, for an Async Commit transaction, as long as all keys of the transaction are prewritten successfully, the transaction is committed successfully.

The following is an example of an Async Commit transaction (you may notice that the Commit TS is no longer obtained from PD after prewrite, and a timestamp is instead fetched from PD before prewrite; the reason will be explained later):

To achieve this goal, we have two main problems to solve:

  • How to determine whether all keys of the transaction have been prewritten successfully.

  • How to determine the Commit TS of the transaction.

How to find all keys of a transaction

Before Async Commit was introduced, the state of a transaction was determined solely by its primary key, so every secondary key only needed to store a pointer to the primary key. If a secondary key is found uncommitted, the current transaction state can be determined by querying the state of the primary key:

To determine the state of an Async Commit transaction, we need to know the state of all of its keys, so we must be able to find every key of the transaction starting from any one of them. Therefore we made a small change: we still keep the pointer from each secondary key to the primary key, and in addition store pointers to all secondary keys in the value of the primary key:

The primary key stores the list of all secondary keys, but obviously, if a transaction contains a very large number of keys, they cannot all be stored on the primary key. Therefore Async Commit transactions cannot be too large. Currently, we only use Async Commit for transactions that contain no more than 256 keys and whose keys total no more than 4096 bytes in size.

The commit time of a large transaction is long in itself, and saving one network interaction does not noticeably improve its latency, so we do not plan to use a similar multi-level structure to make Async Commit support larger transactions.
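As a rough sketch of this design (hypothetical field and constant names, not TiKV's actual lock format), the primary lock carries the list of secondary keys, and the client only chooses Async Commit for transactions that are small enough:

```go
package main

// Lock is a hypothetical, simplified view of a TiKV lock record; the real
// lock format contains more fields and uses a different encoding.
type Lock struct {
	Primary        []byte   // every secondary lock points back to the primary key
	UseAsyncCommit bool     // marks the lock as belonging to an Async Commit transaction
	Secondaries    [][]byte // stored only on the primary lock: all secondary keys
	MinCommitTS    uint64   // recorded during prewrite (explained in the next section)
}

// Limits mentioned above for enabling Async Commit.
const (
	maxAsyncCommitKeys      = 256
	maxAsyncCommitTotalSize = 4096 // bytes, total size of all keys
)

// canUseAsyncCommit sketches the client-side eligibility check.
func canUseAsyncCommit(keys [][]byte) bool {
	if len(keys) > maxAsyncCommitKeys {
		return false
	}
	total := 0
	for _, k := range keys {
		total += len(k)
	}
	return total <= maxAsyncCommitTotalSize
}

func main() {}
```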

How to determine the Commit TS of a transaction

The state of an Async Commit transaction must be determined when prewrite completes, and the Commit TS, as part of the transaction state, is no exception.

By default, TiDB transactions satisfy the snapshot isolation level and linear consistency. We want these properties to hold for Async Commit transactions as well, so determining an appropriate Commit TS is critical.

During prewrite, TiKV calculates and records a Min Commit TS for each key of an Async Commit transaction. The maximum of the Min Commit TS of all keys of the transaction is the Commit TS of the transaction.
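In other words, once every prewrite response reports the Min Commit TS recorded for its key, the coordinator can compute the final Commit TS locally, without another PD round trip. A minimal sketch (hypothetical function name):

```go
package main

// commitTSFromPrewrite sketches the rule above: the Commit TS of an Async
// Commit transaction is the maximum of the Min Commit TS values that TiKV
// returned for each key during prewrite. A sketch, not TiDB's actual code.
func commitTSFromPrewrite(minCommitTSOfKeys []uint64) uint64 {
	var commitTS uint64
	for _, ts := range minCommitTSOfKeys {
		if ts > commitTS {
			commitTS = ts
		}
	}
	return commitTS
}

func main() {}
```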

The following sections describe how the Min Commit TS is calculated and how it enables Async Commit transactions to satisfy snapshot isolation and linear consistency.

Ensuring snapshot isolation

TiDB implements snapshot isolation through MVCC. To maintain snapshot isolation, we need to ensure that reading with the Start TS as the snapshot timestamp always yields a consistent snapshot.

For this reason, each snapshot read from TiDB updates the Max TS on TiKV 1. During prewrite, the Min Commit TS is required to be larger than the current Max TS 2, that is, larger than the timestamps of all previous snapshot reads, so Max TS + 1 can be used as the Min Commit TS. After an Async Commit transaction commits successfully, snapshot isolation is not broken, because its Commit TS is larger than the timestamps of all earlier snapshot reads.

In the following example, transaction T1 writes keys x and y. T2's read of y updates Max TS to 5, so the subsequent prewrite of y by T1 gets a Min Commit TS of at least 6. If T1's prewrite of y succeeds and T1 commits successfully, its Commit TS must be at least 6. Therefore, when T2 reads y again later, it does not see T1's new value, and the snapshot of transaction T2 remains unchanged.

T1: Begin (Start TS = 1)
T1: Prewrite(x)
T2: Begin (Start TS = 5)
T2: Read(y) => Max TS = 5
T1: Prewrite(y) => Min Commit TS = 6
T2: Read(y)
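The interaction between reads and prewrite in this example can be sketched as follows. This is a simplified Go sketch with hypothetical names; the real implementation also relies on the in-memory lock described in note 2:

```go
package main

import "sync/atomic"

// maxTS mimics the Max TS maintained on each TiKV node: the largest snapshot
// timestamp seen by any read on this node.
var maxTS uint64

// onSnapshotRead is called with the Start TS of every snapshot read, as in
// "T2: Read(y) => Max TS = 5" above.
func onSnapshotRead(startTS uint64) {
	for {
		cur := atomic.LoadUint64(&maxTS)
		if startTS <= cur || atomic.CompareAndSwapUint64(&maxTS, cur, startTS) {
			return
		}
	}
}

// minCommitTSForPrewrite returns Max TS + 1, a Min Commit TS larger than all
// snapshot reads observed so far, as in "T1: Prewrite(y) => Min Commit TS = 6".
func minCommitTSForPrewrite() uint64 {
	return atomic.LoadUint64(&maxTS) + 1
}

func main() {
	onSnapshotRead(5)            // T2 reads y with Start TS = 5
	_ = minCommitTSForPrewrite() // T1's prewrite of y gets Min Commit TS = 6
}
```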

Ensuring linear consistency

There are actually two requirements for linear consistency:

  • Sequentiality

  • Real-time behavior

Real-time behavior requires that, immediately after a transaction commits successfully, its modifications can be read by newly started transactions. The snapshot timestamp of a new transaction is obtained from the TSO on PD, so this requires that the Commit TS not be too large: it must not exceed the largest timestamp allocated by the TSO plus 1.

As mentioned in the snapshot isolation section, one possible source of the Min Commit TS is Max TS + 1. The timestamps used to update Max TS all come from the TSO, so Max TS + 1 must be less than or equal to the smallest timestamp not yet allocated by the TSO. Besides the Max TS on TiKV, the coordinator TiDB also imposes a constraint on the Min Commit TS (described below), and that constraint does not push it beyond the smallest unallocated timestamp on the TSO either.

Sequentiality requires that the logical order in which things happen does not violate the physical order in which they happen. Specifically, suppose there are two transactions T1 and T2, and T2 starts committing after T1's commit has finished. Then T1's commit should logically also happen before T2's, which means that T1's Commit TS should be smaller than T2's Commit TS. 3

To ensure this property, TiDB obtains a timestamp from PD's TSO before prewrite and uses it as a lower bound for the Min Commit TS. Because of the real-time guarantee above, the timestamp T2 obtains before prewrite must be greater than or equal to T1's Commit TS, and since this timestamp is not used to update Max TS, the two cannot actually be equal. T2's Commit TS is at least as large as the timestamp it obtained before prewrite, so T2's Commit TS is guaranteed to be larger than T1's Commit TS, which satisfies the sequentiality requirement.

To sum up, taking the larger of Max TS + 1 and the timestamp obtained from PD before prewrite as the Min Commit TS of each key, and taking the maximum Min Commit TS over all keys as the Commit TS, guarantees both snapshot isolation and linear consistency.
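Expressed as a small sketch (hypothetical names; the real calculation in TiKV involves more details), the per-key calculation combines the two constraints, and the overall Commit TS is then the maximum of these values over all keys, as described earlier:

```go
package main

// keyMinCommitTS sketches how the Min Commit TS of a single key can be chosen
// during prewrite: it must exceed every snapshot read seen so far on the TiKV
// node (Max TS) and must be at least the timestamp TiDB fetched from PD before
// prewrite. Hypothetical names, not TiKV's exact logic.
func keyMinCommitTS(tikvMaxTS, tidbMinCommitTS uint64) uint64 {
	minCommitTS := tikvMaxTS + 1
	if tidbMinCommitTS > minCommitTS {
		minCommitTS = tidbMinCommitTS
	}
	return minCommitTS
}

func main() {}
```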

One-phase commit (1PC)

If a transaction only updates non-index columns of one record, or only inserts one record that has no secondary index, it involves only a single Region. In a scenario involving only a single Region, is it possible to skip the distributed transaction commit protocol and commit the transaction in just one phase? This is certainly feasible, but the difficulty lies in how to determine the Commit TS of a one-phase-committed transaction.

Async Commit lays the groundwork for calculating the Commit TS, which solves the difficulty of implementing one-phase commit. We use the same method as Async Commit to calculate the Commit TS of a one-phase-committed transaction, and the transaction is committed directly through a single interaction with TiKV:

One-phase commit does not go through the distributed commit protocol, which reduces the number of writes to TiKV. So if a transaction involves only one Region, using one-phase commit not only reduces transaction latency but also improves throughput. 4
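Conceptually, the choice between the two commit paths might look like this (a Go sketch with hypothetical helpers, not TiDB's actual API; note 4 gives the precise condition):

```go
package main

// Hypothetical stubs standing in for the real decision and RPCs.
// fitsInSingleRegionRequest should report whether all mutations can be sent
// to one Region in a single TiKV request (placeholder logic here).
func fitsInSingleRegionRequest(keys [][]byte) bool                { return len(keys) == 1 }
func prewriteWith1PC(keys [][]byte) (commitTS uint64, err error)  { return 0, nil }
func runTwoPhaseCommit(keys [][]byte) error                       { return nil }

// commit sketches the choice between the two paths: if all mutations can be
// sent to a single Region in one request, commit them in one phase; TiKV
// calculates the Commit TS in the same way as for Async Commit.
func commit(keys [][]byte) error {
	if fitsInSingleRegionRequest(keys) {
		_, err := prewriteWith1PC(keys)
		return err
	}
	return runTwoPhaseCommit(keys)
}

func main() {}
```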

The one-phase Commit feature was introduced in TiDB 5.0 as part of Async Commit.

Causal consistency

As mentioned above, obtaining the Min Commit TS from the TSO ensures sequentiality. What happens if we omit this step? Wouldn't that save the latency of one more network interaction between PD and TiDB?

In that case, however, we can construct an example that violates sequentiality. Suppose x and y are on different TiKV nodes; transaction T1 modifies x and transaction T2 modifies y. T1 starts earlier than T2, but the user only tells T1 to commit after T2 has committed successfully. For the user, then, transaction T1 starts committing after transaction T2 has committed, so if sequentiality held, T1 should logically commit later than T2.

If the step of obtaining a Min Commit TS before prewrite is omitted, T1's Commit TS may be 2 and T2's Commit TS may be 6. If there is a transaction T3 whose Start TS is 3, it can observe that T1 committed logically before T2 did, even though T2 physically committed first, so linear consistency does not hold.

T1: Begin (Start TS = 1)
T3: Begin (Start TS = 3)
T2: Begin (Start TS = 5)
T2: Prewrite(y) => Min Commit TS = 6
The user notifies T1 to commit
T1: Prewrite(x) => Min Commit TS = 2
T3: Read(x, y)

In addition, the notion of a snapshot may not be exactly what you expect. In the following example, the later-started transaction T2 notifies transaction T1 to commit, and T1's Commit TS may end up smaller than T2's Start TS.

The user probably does not expect T2 to be able to read T1's modification of x afterwards. In this scenario repeatable read is not broken, but whether this still counts as snapshot isolation is debatable 5.

T1: Begin (Start TS = 1)
T2: Begin (Start TS = 5)
T2: Read(y)
The user notifies T1 to commit
T1: Prewrite(x) => Min Commit TS = 2
T2: Read(x)

We call this weaker level of consistency causal consistency: transactions that have a causal relationship are committed in the same logical order as their physical commit order, while the commit order of transactions without a causal relationship is undetermined. We consider two transactions to have a causal relationship if and only if the data they lock or write intersects. Causality here only covers causality known to the database; it does not include external causality such as the "application-layer notification" in the example above.

The conditions required to trigger such an anomaly are harsh, so we give users a way to skip obtaining the Min Commit TS: a transaction started with START TRANSACTION WITH CAUSAL CONSISTENCY ONLY does not obtain a Min Commit TS from PD at commit time. If your scenario does not involve controlling, from outside the database, the commit order of two transactions that run at the same time, you can consider lowering the consistency level to save the time it takes TiDB to obtain a timestamp from PD's TSO.
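From a client, this roughly looks like the following (a hedged Go sketch using database/sql; the start statement follows the syntax mentioned above, and the table t and column v are hypothetical):

```go
package main

import (
	"context"
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // MySQL-protocol driver commonly used with TiDB
)

// updateWithCausalConsistency sketches starting a causal-consistency-only
// transaction from a client application.
func updateWithCausalConsistency(ctx context.Context, db *sql.DB) error {
	// Pin one connection so all statements run in the same session.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	// With this start statement, commit skips fetching a Min Commit TS from PD's TSO.
	if _, err := conn.ExecContext(ctx, "START TRANSACTION WITH CAUSAL CONSISTENCY ONLY"); err != nil {
		return err
	}
	// Hypothetical table and column, for illustration only.
	if _, err := conn.ExecContext(ctx, "UPDATE t SET v = v + 1 WHERE id = 1"); err != nil {
		return err
	}
	_, err = conn.ExecContext(ctx, "COMMIT")
	return err
}

func main() {}
```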

Performance improvement

Async Commit moves the transaction completion point forward to the end of prewrite and makes the commit of the primary key asynchronous. The larger the share of total time spent committing the primary key, the more Async Commit improves latency. Small transactions with few interactions usually benefit the most. Conversely, in some scenarios the improvement from Async Commit is not obvious:

  • For transactions that contain many statements and long interaction logic, the commit takes only a small share of the total transaction time, so the improvement from Async Commit is not obvious.

  • For transactions that contain many keys and write a large amount of data, the prewrite time is significantly longer than the time needed to commit the primary key, so the improvement from Async Commit is also not obvious.

  • Async Commit does not reduce the amount of reads and writes on TiKV, so it cannot raise the maximum throughput. If the system is already close to its throughput limit, Async Commit will not bring a significant improvement.

In the sysbench oltp_update_index scenario, a transaction writes only a row record and an index key. It is also an auto-commit transaction with no additional interaction, so in theory Async Commit can significantly reduce its latency.

Actual testing confirms this. As shown in the figure above, when running sysbench oltp_update_index at a fixed 2000 TPS with Async Commit enabled, the average latency decreased by 42% and the p99 latency decreased by 32%.

If a transaction involves only one Region, the one-phase commit optimization further reduces commit latency, and the maximum throughput also increases because fewer writes are made to TiKV.

As shown in the figure above, sysbench oltp_update_non_index was tested at a fixed 2000 TPS. In this scenario a transaction writes to only one Region. After one-phase commit is enabled, the average latency is reduced by 46% and the p99 latency by 35%.

Conclusion

Async Commit removes the latency of one TiKV write round trip from TiDB transaction commit, which is a significant improvement over the original Percolator transaction model. Newly created TiDB 5.0 clusters have Async Commit and one-phase commit enabled by default. For clusters upgraded from an earlier version to 5.0, you need to manually set the global system variables tidb_enable_async_commit and tidb_enable_1pc to ON to enable Async Commit and one-phase commit.
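For example, after the upgrade the two variables can be enabled from a client session like this (a sketch using Go's database/sql; the variable names are the ones given above, and the user must be allowed to change global system variables):

```go
package main

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // MySQL-protocol driver commonly used with TiDB
)

// enableAsyncCommit turns on Async Commit and one-phase commit globally
// after an upgrade to 5.0.
func enableAsyncCommit(db *sql.DB) error {
	if _, err := db.Exec("SET GLOBAL tidb_enable_async_commit = ON"); err != nil {
		return err
	}
	_, err := db.Exec("SET GLOBAL tidb_enable_1pc = ON")
	return err
}

func main() {}
```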

Due to space constraints, this article only covers the key design points of Async Commit. Interested readers can read the Async Commit design document for more details. In the future, we will continue to improve the performance of TiDB transactions and the experience of using TiDB, so that more people can benefit from TiDB.

You are welcome to contact us:

The main responsibility of the Transaction SIG is to discuss and plan the future development of TiKV's distributed transactions and to organize community members for development and maintenance. You can now find us in the #sig-transaction channel of the TiKV community Slack.

Since the release of TiDB 4.0, a total of 538 contributors have submitted 12,513 PRs to help us deliver this milestone release for enterprise core scenarios, and Async Commit is just one representative of these PRs. To thank all contributors to version 5.0, the TiDB community has prepared custom 5.0 merchandise. If you are also a 5.0 contributor, please fill out the form and leave us your address before May 5.

Notes

1 To ensure that the Max TS is still large enough after a Region leader transfer or a Region merge, TiKV also obtains the latest timestamp from PD to update the Max TS when these events happen. ↩

2 During prewrite, TiKV adds an in-memory lock on the keys being prewritten to prevent newer snapshot reads from breaking this constraint. As a result, read requests with Start TS ≥ Min Commit TS are blocked for a short time. ↩

3 If the commits of T1 and T2 overlap in time, the logical order of their commits is undetermined. ↩

4 To be precise, one-phase commit is only used when the transaction can be committed with a single write request to TiKV. To improve commit efficiency, larger transactions are split into multiple requests, and one-phase commit is currently not used for them even if all the requests involve the same Region. ↩

5 If we allow T1 to be regarded as logically committed before T2 starts (since linear consistency is not satisfied here), this case can still be considered to satisfy snapshot isolation. ↩