Author: Lab Chen/Big Data Open Lab

In the previous article, “Comparison of Common Technologies and Solutions of Database Recovery Subsystem (I)”, we introduced the Logging & Recovery subsystem of database management systems and discussed in detail the concept and implementation of ARIES, the mainstream recovery algorithm based on Physical Logging. In this article, Professor Qing of East China Normal University introduces the principle of Logical Undo Logging and the recovery technologies of two database systems, SQL Server (Azure) and Silo.

— Logical Undo Logging —

In the previous article, we briefly outlined the optimization idea of Early Lock Release: releasing locks on index nodes early increases concurrency, but it also creates dependencies between transactions and can lead to cascading rollbacks. For example, suppose a transaction has released its lock but fails while its log is being flushed and must be rolled back. Its lock has already been acquired by the next transaction, so that transaction must be rolled back together with the first one, which greatly hurts system performance.

Logical Undo Logging is introduced into the recovery system to mitigate this problem. Its basic idea is to undo the logical operation when a transaction rolls back, rather than restoring the physical state of the data: the undo of an insert is a delete, and the undo of adding 1 to a data item is subtracting 1.

In addition, Logical Undo introduces the concept of Operation Logging: certain operations of a transaction are logged in a special way. When an operation starts, a special record is written in the form <Ti, Oj, operation-begin>, where Oj is a unique operation ID. While the operation runs, the system writes ordinary physical redo and undo records. When the operation ends, a special record <Ti, Oj, operation-end, U> is written, where U is the logical undo for the set of changes the operation has made.

For example, suppose an index insert puts a new key-value pair (K5, RID7) into leaf node I9: K5 is stored at position X of the I9 page, replacing Old1, and RID7 is stored at position X+8, replacing Old2. The log first records <T1, O1, operation-begin>; the physical redo and undo information for the updates at X and X+8 is then logged as usual; and at the end the record <T1, O1, operation-end, (delete I9, K5, RID7)> is written, where (delete I9, K5, RID7) is the logical undo of T1's operation, namely, delete K5 and RID7 from the page of I9.
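To make the record structure concrete, here is a minimal sketch in Python of the log this insert produces. The record classes and field layout are assumptions of the sketch, not any engine's actual log format:

```python
from dataclasses import dataclass

# Hypothetical log record types for operation (logical undo) logging.
@dataclass
class OperationBegin:
    tid: str          # transaction ID, e.g. "T1"
    op_id: str        # unique operation ID, e.g. "O1"

@dataclass
class PhysicalUpdate:
    tid: str
    page: str         # e.g. index page "I9"
    offset: int       # byte position within the page
    old_value: object # physical undo information
    new_value: object # physical redo information

@dataclass
class OperationEnd:
    tid: str
    op_id: str
    logical_undo: tuple  # e.g. ("delete", "I9", "K5", "RID7")

# The log produced by T1's insert of (K5, RID7) into index page I9
# (offsets are illustrative stand-ins for X and X+8):
log = [
    OperationBegin("T1", "O1"),
    PhysicalUpdate("T1", "I9", 0x0, "Old1", "K5"),
    PhysicalUpdate("T1", "I9", 0x8, "Old2", "RID7"),
    OperationEnd("T1", "O1", ("delete", "I9", "K5", "RID7")),
]
```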

Once the operation has finished, its lock can be released early, allowing another transaction to insert its own <Key, RID> pair into the page. This may cause all key values on the page to be reordered, so K5 and RID7 no longer sit at X and X+8. A physical undo would have to remove K5 and RID7 from X and X+8, but since their positions have changed, undoing by the original log records is no longer feasible. Logically, however, the undo only needs to delete K5 and RID7 from index node I9. This is why the logical command (delete I9, K5, RID7) is added to the log: the undo simply follows it.

During rollback, the log is scanned backwards. If an operation has no operation-end record, physical undo is performed: the lock is released only when the operation ends, so it is still held and no other transaction can have modified the locked item. If the operation-end record exists, the lock has already been released and only logical undo is possible; all records between operation-begin and operation-end are skipped, because their undo has been replaced by the logical undo. If a <Ti, Oj, operation-abort> record is encountered, the operation was already aborted, and the intermediate records are skipped back to <Ti, Oj, operation-begin>. When the scan reaches the <Ti, start> record, all of the transaction's log records have been undone, and a <Ti, abort> record is written to mark the end of the undo.
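The backward scan just described can be sketched as follows, reusing the record classes from the previous sketch; `execute_logical_undo`, `write_page`, and `append_log` are hypothetical stand-ins for the storage engine:

```python
# Hypothetical helpers standing in for the storage engine.
def execute_logical_undo(cmd):        # e.g. re-run a delete on the index
    print("logical undo:", cmd)

def write_page(page, offset, value):  # restore bytes at their old position
    print(f"physical undo: {page}@{offset} <- {value}")

def append_log(rec):                  # append a record to the WAL
    print("log:", rec)

def rollback(records, tid):
    """Undo one transaction by scanning its log records from newest to
    oldest (`records` holds this transaction's entries, oldest first).
    Operation-abort records would be handled like operation-end but
    with no logical undo to run; omitted for brevity."""
    i = len(records) - 1
    while i >= 0:
        rec = records[i]
        if isinstance(rec, OperationEnd):
            # Lock already released: physical undo is unsafe, so run
            # the logical undo and skip back to the operation-begin.
            execute_logical_undo(rec.logical_undo)
            while not (isinstance(records[i], OperationBegin)
                       and records[i].op_id == rec.op_id):
                i -= 1
        elif isinstance(rec, PhysicalUpdate):
            # No operation-end seen: lock still held, physical undo is safe.
            write_page(rec.page, rec.offset, rec.old_value)
        i -= 1
    append_log(("abort", tid))        # <T, abort> marks the end of the undo
```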

It is important to note that all redo operations remain physical. Logical redo would be very complex to implement, for example in determining the order in which redo records must be applied, so in most systems all redo information is physical. Physical redo also does not conflict with early lock release: redo happens only after a failure, when the system has stopped and all locks are gone, so the log can simply be replayed from scratch.

– SQL Server: Constant Time Recovery –

Most commercial database systems, such as SQL Server, use ARIES-style recovery, so the time to undo all uncommitted transactions is proportional to the work each of those transactions performed: the more operations a transaction did, the longer its undo takes. Because a single statement may update many records, undo must revert those records one by one and can take arbitrarily long. Correctness is guaranteed, but for cloud services and systems with high availability requirements this is unacceptable. Constant Time Recovery (CTR) addresses this by combining ARIES with multi-version concurrency control to achieve recovery in fixed time: rather than undoing via the information in the original WAL records, it uses the version information already present in a multi-version database system to bring the data back to a correct state, and thereby guarantees that recovery completes within a bounded time in all cases.

  • Concurrency control for MS-SQL

SQL Server introduced multi-version concurrency control in 2005, but those early versions were used only to implement the Snapshot isolation level, not recovery: they let transactions read data as of a snapshot timestamp. Under this scheme, updates are made in place on the data page, and the old version is not discarded but moved to a separate Version Store, a special table that only allows data to be appended (append-only). The record holds a pointer to its previous version, which in turn points to an even older one, forming a version chain; at read time, the transaction's timestamp determines which version of the data to return. Updates to this early Version Store did not need to be logged: after a failure and restart, no pre-crash snapshot can ever be read again, so only the latest version has to survive. Old versions no longer visible to any active transaction are discarded by a garbage-collection mechanism based on the current timestamp.
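A snapshot read over such a version chain might look like the following sketch (the `Version` structure and its timestamp field are assumptions of the sketch; SQL Server's actual Version Store layout differs):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: object
    begin_ts: int                 # commit timestamp of the writing transaction
    prev: Optional["Version"]     # link into the Version Store (older version)

def snapshot_read(latest: Version, snapshot_ts: int):
    """Walk the chain from the in-place row into the Version Store
    until a version visible at `snapshot_ts` is found."""
    v = latest
    while v is not None and v.begin_ts > snapshot_ts:
        v = v.prev                # follow the link to the older version
    return v.value if v else None
```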

CTR builds on the original Version Store to implement a Persistent Version Store, which stores old versions durably. With CTR, updates to the Version Store are logged in preparation for recovery, which increases its volume and cost. CTR therefore offers two strategies for storing old versions: in-row versioning and off-row versioning.

In-row versioning handles updates that change only a small amount of data: instead of writing the old version to the Version Store, a delta describing the change in the attribute's value is appended to the record itself. The goal is to reduce versioning overhead, since a change written to the same location costs relatively little extra disk I/O.

Off-row versioning uses a special system table that stores the old versions of all tables; its inserts are redo-logged through WAL. It is used when in-row versioning cannot fully absorb an update because the change is too large or the page holding the record has no free space. For example, if Col 2 of row A4 is 444 and is updated to 555, a delta recording the version change is written after the row; whether this works depends on the size of the change and on the free space of the page where the record resides. If there is room, the delta is written in-row; if not, the old version must be placed in the off-row versioning table.
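The choice between the two strategies can be summarized in a small sketch; the threshold and the string labels are illustrative assumptions, not SQL Server's actual policy:

```python
IN_ROW_DELTA_LIMIT = 128  # hypothetical threshold (bytes) for in-row deltas

def choose_version_store(delta_size: int, page_free_space: int) -> str:
    """Keep small old-version deltas in the row itself when the page has
    room; otherwise push the old version to the off-row version-store
    table (whose inserts are WAL-logged so they can be redone)."""
    if delta_size <= IN_ROW_DELTA_LIMIT and delta_size <= page_free_space:
        return "in-row delta"      # same page, little extra disk I/O
    return "off-row version table"

# Example: a small update fits in-row; a large one is pushed off-row.
assert choose_version_store(16, 4096) == "in-row delta"
assert choose_version_store(100_000, 4096) == "off-row version table"
```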

CTR recovery runs in three phases (Analysis, Redo, and Undo). The Analysis phase, as in ARIES, determines the state of every transaction at the time of the crash, such as active or committed. During Redo, both the main table and the Version Store are replayed up to the crash state; once Redo completes, the database can already begin serving requests. The third phase is Undo: since the Analysis phase identified which transactions were uncommitted, the Undo phase simply marks those transactions as aborted. Because every version of a record carries the ID of the transaction that wrote it, a later transaction reading that version first checks the state of the writing transaction; if it aborted, the version is ignored and the next older version is read instead. Following the chain to find an earlier version costs some extra work on each read, but it drastically reduces the database's offline time. After service resumes, the system removes the obsolete versions in the background through garbage collection, a mechanism called Logical Revert.
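The key point, that undo costs work per transaction rather than per record, can be sketched like this (hypothetical structures, greatly simplified from the CTR design):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: object
    writer_tid: int               # transaction that wrote this version
    prev: Optional["Version"] = None

aborted = set()                   # transactions marked Abort after analysis

def undo_phase(uncommitted_tids):
    """CTR-style undo: cost proportional to the number of uncommitted
    transactions, not to the amount of work they performed."""
    aborted.update(uncommitted_tids)

def visible_version(v):
    """Reader-side check: skip versions written by aborted transactions.
    Logical Revert later migrates the last committed version back into
    the main table in the background."""
    while v is not None and v.writer_tid in aborted:
        v = v.prev
    return v
```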

  • Logical Revert

Logical Revert works in two ways. The first is a background process called Background Cleanup that scans data blocks and decides which versions are garbage to collect: if the latest version in the main table came from an aborted transaction, it takes the last committed version from the Version Store and puts it back into the main table. Even before this happens, readers will still find the correct data in the Version Store, so version migration can proceed lazily through the background garbage-collection process. The second path is on update: if a transaction updating a record finds that the version in the main table belongs to an aborted transaction, it simply overwrites that version, since the correct committed version is in the Version Store.

As a result, CTR recovery takes effectively constant time: the database is available as soon as the first two phases finish, and the duration of those phases depends only on the distance to the last Checkpoint. If Checkpoints are triggered at a fixed log size, then once the Redo phase ends the database is restored, and the recovery time never exceeds a fixed bound.

– Silo: Force Recovery –

Silo is a high-performance in-memory database prototype developed jointly by Harvard University and MIT that tackles the throughput collapse caused by excessive concurrency. With one worker thread per CPU core, throughput initially scales with the number of cores in the absence of contention, but beyond a certain point it drops, due to a bottleneck on some shared resource. Although threads execute transactions independently, every transaction must obtain a transaction ID before committing. The transaction ID is global: a transaction acquires its commit ID through an atomic_fetch_and_add(&global_tid) operation, with a global manager role updating the counter by 1 per request so that transaction IDs are globally unique and increasing. The rate at which this counter can be updated therefore becomes the upper bound on system performance: as concurrency grows, transactions compete to obtain IDs, wait times lengthen, and throughput falls.
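The contended step is just a single shared counter. A minimal Python model of it follows; the real Silo code is C++ and uses a hardware atomic_fetch_and_add rather than a lock:

```python
import threading

global_tid = 0
tid_lock = threading.Lock()   # stands in for the hardware atomic

def acquire_commit_tid() -> int:
    """Every committing transaction serializes here: the equivalent of
    atomic_fetch_and_add(&global_tid, 1). As core counts grow, this
    single contended location becomes the throughput ceiling."""
    global global_tid
    with tid_lock:
        global_tid += 1
        return global_tid
```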

  • Optimistic concurrency control

Silo’s answer to this bottleneck is optimistic concurrency control on a multi-core, shared-memory database. Optimistic concurrency control (introduced in “In-Memory Database Parsing and Comparison of Mainstream Products (III)”) assumes transactions do not affect each other while executing and checks for conflicts only at commit time; if there is no conflict, the transaction obtains a global transaction ID and commits. Silo removes the need for a per-transaction global ID by designing Force Recovery around the concept of Group Commit: time is divided into epochs of 40 milliseconds each, an epoch contains all transactions that fall within its time window, and the group of transactions commits together by committing the epoch. The drawback of this design is that a transaction running longer than 40 milliseconds spans epochs, which complicates commit and recovery.

In Silo, each transaction is distinguished by a Sequence Number plus an Epoch Number: the Sequence Number orders transactions during execution, and the Sequence Number and Epoch Number together determine the recovery strategy. Each transaction has a transaction ID (TID); transactions are group-committed per epoch, and commits are serialized by Epoch Number. The TID is 64 bits: the high bits hold the Epoch Number, the middle bits the Sequence Number, and the low three bits are status bits. Every record stores the TID of the transaction that last wrote it, with the status bits holding latch-like information used when the record is accessed. An important difference from a traditional disk-based DBMS is that lock management lives together with the record itself; there is no separately managed Data Buffer or Lock Table.
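The TID layout can be modeled as plain bit packing; the exact bit widths below are assumptions of the sketch rather than Silo's actual split:

```python
# Illustrative 64-bit TID layout: [ epoch | sequence | 3 status bits ].
STATUS_BITS = 3
SEQ_BITS = 29

LOCK_BIT   = 1 << 0   # record is locked (acts as the latch)
LATEST_BIT = 1 << 1   # this is the latest version of the record
ABSENT_BIT = 1 << 2   # record is logically deleted/absent

def pack_tid(epoch: int, seq: int, status: int = 0) -> int:
    return (epoch << (SEQ_BITS + STATUS_BITS)) | (seq << STATUS_BITS) | status

def unpack_tid(tid: int):
    status = tid & ((1 << STATUS_BITS) - 1)
    seq = (tid >> STATUS_BITS) & ((1 << SEQ_BITS) - 1)
    epoch = tid >> (SEQ_BITS + STATUS_BITS)
    return epoch, seq, status
```

Because the lock bit sits inside the TID word itself, locking a record and tagging its version are the same memory word, which is what lets Silo drop the separate lock table.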

  • Three phases of transaction commit

Because Silo uses standard optimistic concurrency control, conflicts are checked only at commit time. During execution, before commit, a read first stores the data item's TID in the local read-set and then reads the value; a modification adds the changed record to the local write-set.

A Silo transaction commits in three phases. In the first, all records in the local write-set are locked; the lock information lives in the status bits of the record's TID and is set by an atomic compare-and-set. The transaction then reads the current Epoch. A dedicated thread advances the Epoch Number (adding 1 every 40 ms), and all transactions only read the Epoch Number rather than competing to write it. The second phase validates the read-set: every data item in Silo carries the TID of the last transaction that updated it, so if a record's TID has changed, or the record is locked by another transaction, the record changed between the read and the commit and the transaction must roll back. In the final phase the transaction's TID is generated: at commit, the Sequence Number in the TID must be the smallest value greater than the TIDs of all records read, which keeps TIDs increasing. The result is returned to the client only after the transaction's log records have been flushed to disk.
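The three phases can be condensed into the following simplified model; the `Record` class, the latch standing in for the TID lock bit, and the TID packing are all assumptions of the sketch:

```python
import threading

class Record:
    """A record with its value and the TID of its last writer."""
    def __init__(self, value, tid=0):
        self.value, self.tid = value, tid
        self.latch = threading.Lock()      # stands in for the TID lock bit

def silo_commit(read_set, write_set, current_epoch):
    """Simplified model of Silo's three-phase commit.
    read_set:  {Record: TID observed when it was read}
    write_set: {Record: new value}"""
    locked = sorted(write_set, key=id)     # fixed order avoids deadlock
    for rec in locked:                     # Phase 1: lock the write set...
        rec.latch.acquire()
    epoch = current_epoch                  # ...then read the epoch (read-only)
    try:
        for rec, seen in read_set.items():     # Phase 2: validate reads
            if rec.tid != seen or (rec.latch.locked() and rec not in write_set):
                return False                   # changed since read: abort
        # Phase 3: pick a TID above everything observed (illustrative
        # packing; real Silo also considers the write set and its own
        # last TID, and packs epoch | sequence | status bits).
        seq = 1 + max((t & 0xFFFFFFFF for t in read_set.values()), default=0)
        tid = (epoch << 32) | seq
        for rec, new_value in write_set.items():
            rec.value, rec.tid = new_value, tid
        return True                        # reply only once the log is durable
    finally:
        for rec in locked:
            rec.latch.release()
```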

  • Recovery – SiloR

SiloR is Silo's recovery subsystem. It uses Physical Logging and Checkpoints to guarantee transaction durability, and it parallelizes both. As mentioned earlier, logging is the slowest part of an in-memory database because it writes to disk, and disk I/O is the performance bottleneck of the whole architecture. SiloR therefore logs concurrently: the design assumes one log thread per disk, each serving a group of worker threads, with the log thread and its workers sharing a CPU socket.

Under this design, a log thread maintains a pool of log buffers. A worker thread that wants to execute first asks the log thread for a log buffer to write logs into; when the buffer is full, it hands the buffer back to the log thread for flushing and obtains a new one, blocking if none is available. The log thread periodically flushes full buffers to disk, and an emptied buffer can be handed back to the workers. As for log files, a new one is created every 100 epochs, and the old file is renamed according to a fixed rule in which the last part of the file name identifies the largest Epoch Number in the log. Each log entry records a transaction's TID together with its record updates (Table, Key, Old Value -> New Value).
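The buffer handoff between worker threads and a log thread can be modeled with two queues; the pool size, buffer limit, and entry format below are assumptions of the sketch:

```python
import queue

full_buffers = queue.Queue()         # buffers waiting to be flushed
free_buffers = queue.Queue()         # buffers available to workers
for _ in range(8):                   # pool size is an assumption
    free_buffers.put([])

BUFFER_LIMIT = 1024                  # "full" threshold, illustrative
EPOCHS_PER_FILE = 100                # a new log file every 100 epochs

def worker_log(entry, buf):
    """A worker appends (tid, table, key, old, new) entries to its buffer
    and hands the buffer back to the log thread when it fills up."""
    buf.append(entry)
    if len(buf) >= BUFFER_LIMIT:
        full_buffers.put(buf)
        return free_buffers.get()    # blocks if no free buffer exists
    return buf

def log_thread_loop(logfile):
    """The log thread flushes full buffers to disk and recycles them."""
    while True:
        buf = full_buffers.get()
        logfile.write("".join(f"{e}\n" for e in buf))
        logfile.flush()              # a real system would also fsync
        buf.clear()
        free_buffers.put(buf)
```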

Every core works the same way. A dedicated thread keeps track of which epochs have been fully flushed to disk and writes the latest durable epoch to a fixed location on disk, the persistent epoch (hereafter Pepoch). A transaction compares its own epoch with Pepoch: if it is less than or equal to Pepoch, its logs are durable and its results can be returned to the client.

  • The recovery process

Like ARIES, SiloR begins recovery from a checkpoint. The first step is to read the data of the last checkpoint and restore it into memory; since in-memory databases do not log indexes, the indexes must then be rebuilt in memory. The second step is log replay. An in-memory database only performs Redo, never Undo, and unlike ARIES the Redo is not applied in forward log order: logs are processed from newest to oldest, with the aim of restoring each record to its latest version. During replay, the system consults the Pepoch log file to find the last persistent epoch number; log records beyond Pepoch were never acknowledged to clients and can be ignored. Log replay uses value logging: for each log entry, if the record does not exist, it is created from the logged value; if it exists, the TIDs in the record and in the log are compared, and only if the log's TID is larger is a Redo needed, replacing the old value with the new value from the log. As this shows, SiloR's replay does not march step by step to the moment of the crash: where ARIES aims to restore the on-disk data to its exact final state, SiloR aims to restore the in-memory data to its latest correct state.
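The replay rule, keep the newest TID per key and ignore epochs past Pepoch, can be sketched as follows (using the illustrative TID layout from the earlier sketch):

```python
from dataclasses import dataclass

STATUS_BITS, SEQ_BITS = 3, 29       # same illustrative layout as earlier

def epoch_of(tid: int) -> int:
    return tid >> (SEQ_BITS + STATUS_BITS)

@dataclass
class RecoveredRecord:
    value: object
    tid: int

def replay(table: dict, log_entries, pepoch: int):
    """Sketch of SiloR value-logging replay: keep only the newest
    version of each key, ignoring epochs past the persistent epoch."""
    for tid, key, new_value in log_entries:
        if epoch_of(tid) > pepoch:
            continue                 # never acknowledged to clients: skip
        rec = table.get(key)
        if rec is None or rec.tid < tid:
            table[key] = RecoveredRecord(new_value, tid)
```

Because each entry carries the full value and the winner is decided by TID comparison, the entries can be replayed in any order, which is what makes SiloR's parallel replay across log files possible.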

– In-Memory Database Recovery –

For in-memory databases, the recovery system is simpler than in a disk-based DBMS: data is logged but indexes are not; only Redo logs are needed, not Undo logs; data is overwritten in place, so there are no dirty pages to manage and no Buffer Pool or page-flushing problems. However, in-memory databases are still constrained by the time spent synchronizing logs, because logs must still be flushed to non-volatile storage (disk). In the early 1980s, some research on in-memory databases assumed that memory contents would not be lost, for example battery-backed memory that survives a power failure for a period of time; technologies such as non-volatile memory (NVM) exist as well, but none of these are in universal use yet, so persistent storage still has to be considered.

A persistence mechanism is required to have minimal impact on performance, ideally affecting neither throughput nor latency. Beyond reducing the impact on transaction execution, the primary goal pursued by in-memory databases is recovery speed: recovering as fast as possible after a failure. Much current research therefore focuses on parallel logging for in-memory databases, although parallelism makes locks and Checkpoints more difficult to implement.

An in-memory database also needs Checkpoints, and the Checkpoint mechanism must interfere with normal processing as little as possible: taking a Checkpoint should not slow transaction execution, should not introduce latency, and should not consume excessive memory. Several design dimensions follow.

**Checkpoint consistency:** a consistent Checkpoint contains only the modifications of committed transactions, so no extra cleanup is needed at recovery; a fuzzy Checkpoint contains both committed and uncommitted transactions, and the uncommitted transactions must be removed during recovery.

**Checkpoint mechanism:** there are two main ways to produce a Checkpoint. The first is to build the system on multi-version storage and take the Checkpoint from a consistent version of the data. The second is to fork a child process at the operating-system level: the fork copies all data in memory (copy-on-write), and additional work is then needed to roll back changes that were in progress but not yet committed at the moment of the fork, as the sketch below illustrates.
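The fork-based approach (the one HyPer takes, reference 5) fits in a few lines, since `os.fork` gives the child a copy-on-write snapshot of the parent's memory. This POSIX-only sketch omits the rollback of uncommitted changes that a real engine must perform:

```python
import os, json

def fork_checkpoint(db_state: dict, path: str) -> int:
    """Write a checkpoint from a copy-on-write snapshot of memory.
    The parent keeps executing transactions; the child sees the state
    as of the fork and writes it out at its leisure."""
    pid = os.fork()                  # POSIX only
    if pid == 0:                     # child: private COW snapshot
        # A real engine would first roll back in-flight, uncommitted
        # changes captured by the snapshot; omitted in this sketch.
        with open(path, "w") as f:
            json.dump(db_state, f)
        os._exit(0)                  # never return into the parent's logic
    return pid                       # parent: continue processing
```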

**Checkpoint content:** a full Checkpoint writes the complete database, while an incremental Checkpoint records only the changes since the previous Checkpoint. The trade-off lies in the amount of data written per Checkpoint versus the amount of data needed during recovery.

**Checkpoint frequency:** there are three common policies. One is Checkpointing at a fixed time interval; another is Checkpointing after a fixed amount of log; the last is the forced Checkpoint, for example forcing a Checkpoint when a database is taken offline.

– Summary –

In this article, we continued our discussion of the database recovery subsystem, introducing the concept and implementation of Logical Undo Logging on top of the ARIES-style Physical Logging covered previously. In addition, we presented the recovery strategies of two database systems: SQL Server's Constant Time Recovery (CTR) and Silo's Force Recovery. The next article will discuss concurrency control techniques for database systems.

References:

1. Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., & Schwarz, P. (1992). ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems, 17(1), 94-162.

2. Antonopoulos, P., Byrne, P., Chen, W., Diaconu, C., Kodandaramaih, R. T., Kodavalla, H., … & Venkataramanappa, G. M. (2019). Constant time recovery in Azure SQL Database. Proceedings of the VLDB Endowment, 12(12), 2143-2154.

3. Zheng, W., Tu, S., Kohler, E., & Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 465-477).

4. Ren, K., Diamond, T., Abadi, D. J., & Thomson, A. (2016, June). Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 International Conference on Management of Data (pp. 1539-1551).

5. Kemper, A., & Neumann, T. (2011, April). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering (pp. 195-206). IEEE.
