Yan Ran, a senior technical expert on the OceanBase team at Ant Financial and one of its founding members, is currently responsible for transaction engine development and performance optimization. The following is a transcript of the speech:

First, let's address a question many people ask: what is the biggest difference between OceanBase and classic databases such as Oracle and SQL Server? OceanBase is a cloud-native database. Software is ultimately shaped by the hardware it runs on, so the first thing we need to understand is where hardware stands today.

IBM's solution, as shown above, is the most familiar one and is still used by many organizations today, from the System/360 more than 50 years ago to the IBM z14 released last year.

We talk about cloud computing every day, but what exactly is it? Cloud computing did not appear out of nowhere. Its lineage starts with DEC minicomputers, which were contemporaries of IBM mainframes and addressed storage and computing needs at an affordable price. The difference was that DEC machines were cheaper and easier to acquire.

DEC was in turn disrupted by chip-based solutions such as Sun workstations, which offered better performance, lower prices, and a more versatile platform. Then the PC server took over the baton: the CPU, memory, and storage could now come from different manufacturers, yet together they still provided a standardized computing and storage platform.

Compared to IBM mainframes, PC-server-based data centers provide computing and storage in a more industrialized manner, with better cost performance and easier scaling. However, the IBM mainframe is an integrated hardware and software design, and much of what makes it a mainframe is solved in hardware. Financial services, for example, have high reliability requirements, and mainframes guarantee reliability at the hardware level, with redundancy everywhere from storage to memory and even the CPU. A single PC server, by contrast, has a relatively high failure rate, so a completely different approach is needed to meet the requirements of high reliability and availability.

Here is a quote from Yang Zhenkun, founder of the OceanBase team: "Computers are not naturally suited to databases, but databases naturally chose computers." Why are computers inherently unsuited to databases? A key issue when dealing with data is its reliability and consistency, yet computer hardware fails with some probability in all kinds of ways: power loss, software bugs, operating system problems, even an entire machine room going down. Computers naturally make mistakes, while the problem we have to solve demands that no mistakes be made. Achieving the reliability, safety, and consistency a database requires on top of such hardware is therefore a very big challenge.

This is the serious challenge we face today: achieving the transactional service capability of a highly reliable database on less reliable machines.

OceanBase transaction

Of course, where there are challenges, there are opportunities. What did OceanBase do in the face of this challenge? In a cloud environment based on PC servers, OceanBase achieves elastic scaling while also delivering highly reliable and highly available database transactions, without relying on high-end hardware.

The hardware that OceanBase relies on is a generic cloud environment. Yet on this hardware, we can still provide database transactions with high performance and even financial-grade reliability.

What is financial-grade reliability? Everyone can feel it personally. If you write a post and it gets lost, you might be upset, but not terribly so. Now consider another scene: you transfer 100 yuan to a friend, but your friend receives only 98 yuan. That is genuinely frightening. In financial scenarios, users must be given adequate protection, and that is what we call financial-grade reliability. The difficulty lies in the attention to detail. The devil is in the details, and building database software demands very strong control over those details.

The points highlighted here cover several important properties of database transactions.

  • Implementation of transaction ACID in the OceanBase architecture:
  • Durability: transaction logging uses Paxos for multi-replica synchronization
  • Atomicity: two-phase commit ensures atomicity of cross-machine transactions
  • Isolation: a multi-version mechanism is used for concurrency control
  • Consistency: uniqueness and other constraints are enforced

OceanBase storage engine

OceanBase's architecture is based on the LSM tree. Why an LSM tree? What are its features?

In a classic database, all data is split into blocks and stored on persistent storage such as disk or SSD. To read data, a block must first be loaded into memory; to modify data, the block is likewise changed in memory. When memory fills up, dirty pages are flushed back to disk.

The LSM tree on which OceanBase is based instead collects changes in a single in-memory structure. The foreground holds the data in memory for as long as possible, and the merging into persistent storage happens periodically in the background. OceanBase has a mechanism called the daily merge: if the foreground can hold a day's changes, the background performs a merge every night.
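The idea can be shown with a minimal sketch. This is an illustrative toy, not OceanBase's actual code: writes accumulate in an in-memory memtable, and a periodic merge folds the memtable into the persistent baseline.

```python
# Minimal LSM-style store (illustrative toy, not OceanBase's implementation):
# writes accumulate in an in-memory memtable; a periodic "merge" folds the
# memtable into the persistent baseline and empties it.
class LsmStore:
    def __init__(self):
        self.baseline = {}   # stands in for sorted on-disk baseline data
        self.memtable = {}   # recent changes, kept in memory

    def put(self, key, value):
        self.memtable[key] = value          # no random disk write needed

    def get(self, key):
        if key in self.memtable:            # newest data wins
            return self.memtable[key]
        return self.baseline.get(key)

    def merge(self):
        """The nightly 'daily merge': fold the memtable into the baseline."""
        self.baseline.update(self.memtable)
        self.memtable.clear()

store = LsmStore()
store.put("han", "R&D")
store.merge()                               # baseline now holds the value
store.put("han", "Investment")              # newer change stays in memory
assert store.get("han") == "Investment"     # memtable shadows baseline
```

Random writes thus never touch the baseline directly; only the background merge rewrites persistent data, which is the property the daily merge exploits.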

OceanBase schedules this within the system. When the first machine is flushing dirty pages in the background, business traffic is directed to other machines, so foreground business is not affected. When the first machine finishes, traffic is switched back, and the second machine flushes its dirty pages in the background. Through this "rotation merge" approach, we exploit the time differences between machines and services to flush dirty pages in a clustered environment without disturbing the business.
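The scheduling pattern can be sketched as a simple rotation. This is a hypothetical simplification of the idea, not OceanBase's scheduler: at each step one replica merges while the others serve traffic.

```python
# Sketch of "rotation merge" scheduling (hypothetical simplification):
# each replica takes its turn to merge in the background while business
# traffic is routed to the replicas that are not merging.
def rotation_merge(replicas):
    schedule = []
    for i, merging in enumerate(replicas):
        serving = [r for j, r in enumerate(replicas) if j != i]
        schedule.append((merging, serving))  # one merges, the rest serve
    return schedule

sched = rotation_merge(["A", "B", "C"])
# Step 1: A merges while B and C serve; step 2: B merges; and so on.
assert sched[0] == ("A", ["B", "C"])
```

At every step the cluster keeps serving at full consistency; only the choice of which machines receive traffic changes.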

OceanBase's transaction engine design also takes advantage of the LSM tree structure: all execution state of a transaction is kept in memory, and the data modified by a transaction is persisted only after the transaction commits. Consequently, no undo is needed to implement transaction atomicity, which allows OceanBase to handle database transactions in a more concise way. That is the overall logic of the OceanBase storage engine.

OceanBase Memory transaction engine

Transaction execution in OceanBase takes place in memory. One of the biggest differences between memory and disk is that memory can be randomly addressed at byte granularity, which has the great benefit of allowing rich data structures.

For example, suppose the first row of the table above is updated: Han's salary is increased by 500,000. We need to write a new value for this column in Han's row, and the in-memory representation is simply a linked list. If Han, after his raise, suddenly wants to change departments to take on new challenges, we perform another update: a new node is added to the list recording his move from R&D to Investment.

This is multi-version concurrency control within the database: the time of each update is recorded so that modifications and reads do not interfere with each other, because even after a row is changed, we can still read its historical data directly. Linked-list-based data structures are also very programmer-friendly.
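The salary/department example above can be sketched as a version chain. This is a minimal illustration of the multi-version idea, with made-up names and version numbers, not OceanBase's row format.

```python
import itertools

# Sketch of a multi-version row: a linked list of versioned cell updates,
# newest first, as in the salary/department example (names and version
# numbers are illustrative only).
_clock = itertools.count(1)   # stand-in for commit version numbers

class Version:
    def __init__(self, column, value, prev):
        self.version = next(_clock)          # version of this update
        self.column, self.value, self.prev = column, value, prev

class Row:
    def __init__(self, **cells):
        self.head = None
        for col, val in cells.items():
            self.update(col, val)

    def update(self, column, value):
        self.head = Version(column, value, self.head)   # prepend newest

    def read(self, column, snapshot):
        node = self.head
        while node:              # walk back past versions newer than snapshot
            if node.column == column and node.version <= snapshot:
                return node.value
            node = node.prev
        return None

han = Row(salary=500_000, dept="R&D")
snap = han.head.version                    # snapshot taken before the raise
han.update("salary", 1_000_000)            # the raise
han.update("dept", "Investment")           # the department change
assert han.read("dept", snapshot=snap) == "R&D"          # old snapshot
assert han.read("dept", snapshot=10**9) == "Investment"  # latest state
```

A reader with an old snapshot simply skips the newer nodes, so writers never block readers.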

In a classic disk-based database, by contrast, both reads and writes operate on fixed-length data blocks, and information is expressed in block-based form. One of the biggest benefits of OceanBase's approach is that it greatly simplifies how data is expressed, which makes its operations much more efficient. This is one of the important reasons OceanBase can deliver more transaction capability in the current hardware environment.

Let's review the architecture of OceanBase again. After the data is sharded, OceanBase can spread it across many machines in multiple clusters for storage, providing linear scaling capability.

As shown in the figure above, the same data may have multiple replicas; P0, for example, resides on three machines, which may be in the same machine room or in different ones. All of them serve the same piece of data, but only one is the current leader, which handles database operations and synchronizes transaction changes to the other machines. The purpose of this data organization and distribution in OceanBase is to solve the database reliability problems discussed below.

Reliability vs. availability

High reliability and high availability are concepts that are widely mentioned by all database products. So what exactly do reliability and availability mean?

In a traditional database, ACID theory has no concept of availability; it only guarantees durability. But not losing data does not by itself guarantee continuity of service, and in real systems the ability to fail over is very important. That is why all commercial databases ship an availability solution. The classic scheme is master/slave synchronization: when the master fails, the slave continues to provide service, solving the availability problem.

OceanBase uses the Paxos protocol to address both reliability and availability. In any database, persisting a transaction means making its log durable; here, the transaction log is persisted to multiple replicas. With three replicas, we consider the transaction successful as long as the log has been persisted to the disks of two of them. That means when any one machine goes down, at least one copy of the log remains.

So wouldn't it be better, thinking the other way around, to synchronize to as many machines as possible? Would it be more reliable if all three copies were synchronized? This is where availability comes in: if one of the machines has a network failure, or is too loaded to respond, do you consider the transaction successful?

If success required all three, the transaction could not be considered successful while one machine is unresponsive. The Paxos protocol requires only two machines to synchronize: even if one does not answer, we still consider the transaction successful, because the majority succeeded.

The failure of any one of the three replicas therefore neither loses data nor interrupts the system's service. This is a fine balance between reliability and availability. If you need more protection, you can choose a five-replica deployment, which tolerates two machine failures without compromising either reliability or availability.
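The majority rule above reduces to a one-line check. This sketch only illustrates the quorum arithmetic, not the Paxos protocol itself.

```python
# The majority rule discussed above: a log write "commits" once a majority
# of replicas have persisted it, so one slow or failed replica neither
# loses data nor blocks the transaction.
def is_committed(acks, total_replicas):
    return acks >= total_replicas // 2 + 1   # strict majority

# Three replicas: two acks commit; one unresponsive machine does not block.
assert is_committed(2, 3) is True
assert is_committed(1, 3) is False
# Five replicas tolerate two failures while staying available.
assert is_committed(3, 5) is True
```

The same formula explains the five-replica choice: the quorum is three, so two simultaneous failures still leave a majority able to both preserve the data and keep serving.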

So we add an A to ACID: we want the transaction to be not only reliable but also available.

Distributed transaction two-phase commit protocol

If a database runs on a single machine, it writes logs only to its own log file: if the write succeeds, the transaction succeeds; if it fails, the transaction fails. With multiple machines, however, machine A may succeed while machine B fails. What do we do then?

The two-phase commit protocol exists to solve exactly this problem: commit is no longer a single-step success. It requires agreement among multiple machines, and the transaction is defined as successful only when every machine has succeeded.

In practice, though, the two-phase commit protocol is rarely used. Why? Mainly because it is complicated: the theory is beautiful, but there are a lot of details. In OceanBase's business scenarios, however, the two-phase commit protocol must be used.

The foundation OceanBase builds on is that every participant is highly available, because Paxos guarantees high availability for each partition; no single machine failure stops the service, and that is a very important premise. In addition, because Paxos synchronization introduces delays across networks and machine rooms, the multiple log writes of the original two-phase commit protocol become expensive. So OceanBase lets the coordinator keep only in-memory state and write no log at all. A very important benefit is low commit latency, and because all participants are highly available, the usual problems of two-phase commit, such as the protocol getting stuck when the coordinator goes down, do not arise.
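The basic shape of the protocol can be sketched as follows. This is a textbook-style toy, not OceanBase's implementation: in particular, it does not model the Paxos-backed participants or the logless in-memory coordinator described above.

```python
# Simplified two-phase commit (illustrative toy; OceanBase's real protocol
# keeps coordinator state in memory and relies on Paxos-backed participants,
# which this sketch does not reproduce).
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "active"

    def prepare(self):
        # Phase 1 on a participant: persist redo, hold locks, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, outcome):
        # Phase 2: apply the coordinator's unanimous decision.
        self.state = outcome

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare.
    outcome = "commit" if all(p.prepare() for p in participants) else "abort"
    # Phase 2: broadcast the decision; commit only if everyone prepared.
    for p in participants:
        p.finish(outcome)
    return outcome

ok = [Participant(), Participant()]
assert two_phase_commit(ok) == "commit"
mixed = [Participant(), Participant(can_commit=False)]
assert two_phase_commit(mixed) == "abort"
```

The fragile point of the textbook protocol is the window after prepare: a participant that voted yes must wait for the decision. With highly available participants and a restartable coordinator, that window stops being a single point of failure.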

OceanBase transaction isolation

Transaction isolation is about concurrency control, and OceanBase uses multi-version concurrency control. A read request takes the currently published version of the system as its read snapshot. When a transaction commits, it is assigned a monotonically increasing version number, which becomes the version of all the data modified by that transaction.

In the single-machine case, a single log record determines whether a transaction commits, and the log position determines the transaction's version number, which must increase consecutively. For example, log 230 might be a prepare record. A prepare record does not mean the transaction has committed, while log 232 might be the commit record of another transaction. For that transaction to be visible, version 232 must be readable, yet the transaction at 230 is still in an undecided state. This requires another control mechanism: for a two-phase-commit transaction between prepare and commit, all of its modified rows are locked and may not be read.

The impact is small, because no user interaction happens between prepare and commit; it is a millisecond-level operation, meaning a row is locked only for a fraction of a second during commit. A read operation simply waits for the transaction to commit before deciding whether the row version is visible.
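The visibility rule just described can be sketched as follows. This is a hypothetical simplification: a reader raises instead of blocking, and the version numbers echo the log positions used in the example above.

```python
# Sketch of the rule described above (hypothetical simplification): a row
# touched by a transaction between prepare and commit is locked, and a
# snapshot read must wait for the commit before deciding visibility.
class RowLock:
    def __init__(self):
        self.prepared_txn = None      # txn holding the prepare lock, if any

def snapshot_read(values_by_version, lock, snapshot):
    if lock.prepared_txn is not None:
        # A real system would block here; the wait is milliseconds at most,
        # since prepare-to-commit involves no user interaction.
        raise RuntimeError("wait for prepared transaction to finish")
    # Otherwise return the newest version no later than the snapshot.
    visible = [v for v in values_by_version if v <= snapshot]
    return values_by_version[max(visible)] if visible else None

versions = {230: "old value", 232: "new value"}
lock = RowLock()
assert snapshot_read(versions, lock, snapshot=231) == "old value"
lock.prepared_txn = "txn-at-230"
try:
    snapshot_read(versions, lock, snapshot=233)
except RuntimeError:
    pass  # the reader must wait while the prepare lock is held
```

Once the prepared transaction commits or aborts, the lock clears and the read proceeds with an unambiguous version ordering.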

Conclusion

Based on the technological innovations above, OceanBase truly achieves high reliability and high availability for transactions in the cloud environment, with good performance as well. We hope OceanBase can help more businesses meet their data storage and query needs, no longer trapped by the high prices and poor scalability of traditional commercial databases.