This article is published by the Tencent Cloud Database team.

Preface

CynosDB is a new-generation distributed database that is 100% compatible with MySQL and PostgreSQL. It supports flexible storage expansion, shares a single copy of data between one primary and multiple replica instances, and outperforms community MySQL and PostgreSQL. CynosDB adopts a shared-storage architecture, and the cornerstone of its elastic expansion and high cost-effectiveness is the CynosDB File System (CynosFS): a user-mode distributed file system developed by Tencent Cloud. This article describes the overall core architecture design of CynosDB and CynosFS.

Challenges and Solutions

CynosDB is an architecture designed natively for the public cloud. Its core idea is to build on resource pooling to deliver the public cloud's advantages of high cost-effectiveness, high availability, and elastic scaling. The biggest technical challenge of resource pooling is efficient and stable elastic scheduling, which is also the cornerstone of high availability for public cloud products. Computing, network, and storage are the basic public cloud IaaS products; the first two already have mature elastic scheduling architectures, while storage does not, because the cost of moving data is too high. A database is a typical compute + storage product. To achieve resource pooling, it needs:

- **Storage and compute separation:** compute resources (mainly CPU and memory) can then be pooled using existing, mature container and virtual machine technologies.

- **Distributed storage:** data is divided into normalized blocks, and a distributed scheduling system is introduced to elastically schedule storage capacity and I/O, thereby pooling storage resources.

Is there, then, an existing architecture for database products that meets these two requirements well? We found that the combination of virtual machines and distributed block storage (cloud disks) fits very well, and mainstream public cloud platforms already have very mature products built on it. Tencent Cloud's MySQL Basic Edition, launched earlier this year, and RDS on AWS are both based on this architecture. Figure 1 illustrates it briefly:

Figure 1

However, this architecture has the following shortcomings:

- **Heavy network I/O:** for a single database instance, a large amount of data must be written to the cloud disk, including the WAL log, page data, and the double write (or full page write) used to prevent partial page writes. On top of that, the cloud disk keeps multiple replicas of all of this data.

- **Primary and replica instances do not share data:** this wastes a large amount of storage and further increases network I/O.

As a result, database products based on this architecture cannot match the throughput of a primary/replica architecture deployed on physical machines, and latency also suffers, which significantly affects OLTP workloads. Moreover, each database instance keeps its own copy of the data (in practice, two or three copies), which is also expensive.

To address these two deficiencies, CynosDB adopts the following design:

- **Log sinking:** CynosDB sinks the WAL log to the storage tier. The database instance only writes WAL records to the storage tier and never writes page data (neither ordinary page writes nor double write / full page write). The WAL also serves as the Raft protocol log used to synchronize the multiple data replicas, further reducing network I/O.

- **Primary and replica instances share data:** the primary and replica instances of CynosDB share a single copy of the stored data, further reducing network I/O and greatly reducing storage capacity.

Applying these two ideas to the traditional cloud-disk-based architecture yields the CynosDB architecture shown in Figure 2:

Figure 2

The components in the figure include:

- **DB Engine:** the database engine, supporting one primary and multiple replica instances.

- **Distributed File System:** provides distributed file management and translates file read/write requests into the corresponding block read/write requests.

- **LOG/BLOCK API:** the read/write interface provided by the Storage Service; read and write requests are handled differently:

  - Write requests: modification logs (equivalent to WAL records) are sent to the Storage Service through the LOG API.

  - Read requests: data is read directly through the BLOCK API.

- **DB Cluster Manager:** responsible for HA management of a DB cluster with one primary and multiple replicas.

- **Storage Service:** responsible for log processing, asynchronous replay of logs onto block data, multi-version support for read requests, and so on. It is also responsible for backing up WAL logs to the Cold Backup Storage.

- **Segment (Seg):** the smallest unit (10 GB) of block data and logs managed by the Storage Service, and also the entity on which data is replicated. The three Segs of the same color in the figure store the same piece of data and are kept consistent with a consensus protocol (Raft); together they form a Segment Group (SG).

- **Pool:** multiple SGs logically form a continuous block device that stores the block data for the Distributed File System. The relationship between a Pool and its SGs is one-to-many.

- **Storage Cluster Manager:** responsible for HA scheduling of Storage Services and Segment Groups, and for maintaining the mapping between Pools and SGs.

- **Cold Backup Service:** incrementally backs up the WAL logs it receives; full and differential backups can then be generated flexibly from these incremental backups.

As you can see, all of the modules above except DB Engine and DB Cluster Manager form a user-mode distributed file system that is independent of the database engine; we named it CynosFS.
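To make the Pool, SG, and Segment relationships concrete, here is a minimal sketch. The 10 GB segment size and the three-replica SG come from the description above; the class layout and the `locate` helper are assumptions for illustration, not CynosFS's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SEG_SIZE = 10 * 1024 ** 3  # each Segment manages a 10 GB slice of the Pool


@dataclass
class Segment:
    node: str  # storage node holding this replica of the data


@dataclass
class SegmentGroup:
    sg_id: int
    replicas: List[Segment]  # three Segments with the same data, synced via Raft


@dataclass
class Pool:
    # SGs laid out back to back form one logically continuous block device.
    sgs: List[SegmentGroup] = field(default_factory=list)

    def locate(self, offset: int) -> Tuple[SegmentGroup, int]:
        """Map a byte offset in the Pool to (owning SG, offset inside that SG)."""
        idx, off = divmod(offset, SEG_SIZE)
        return self.sgs[idx], off


# Example: a Pool of two SGs; an offset beyond 10 GB falls into the second SG.
pool = Pool(sgs=[SegmentGroup(0, [Segment("n1"), Segment("n2"), Segment("n3")]),
                 SegmentGroup(1, [Segment("n4"), Segment("n5"), Segment("n6")])])
sg, off = pool.locate(15 * 1024 ** 3)
```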

Log sinking and asynchronous replay

Here CynosDB adopts the "log is the database" idea from the AWS Aurora paper and sinks the WAL log into the Storage Service. The compute layer only writes WAL records; the Storage Service asynchronously applies them to the corresponding blocks. Besides reducing write I/O, this is also the basis on which the Storage Service provides MVCC reads, and MVCC in turn is the basis for shared storage between the primary and replica instances. Inside the Storage Service, data and logs are managed per Segment. Figure 3 shows the basic write process.

Figure 3

1. The Storage Service receives the update log, writes it to disk, and concurrently initiates step 2.

2. It initiates Raft log replication.

3. Once the Raft majority commit succeeds, the log write is acknowledged as successful.

4. The log records are asynchronously attached to the update chains of the corresponding data blocks.

5. The logs in an update chain are merged into the data block.

6. The logs are backed up to the cold backup system.

7. Logs that are no longer needed are reclaimed.

Step 4 builds an update chain for each modified data block, which makes it possible to read multiple versions of the block and thus provides MVCC. The start of the MVCC window is controlled by how far the merge in step 5 has progressed.
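To illustrate steps 4 and 5, the sketch below models a block's update chain, a versioned read, and the merge that advances the MVCC window. The class and method names (`LogRecord`, `Block`, `read_at`, `merge_up_to`) are hypothetical, and `apply_delta` is simplified to a full-image replacement; this is a sketch of the mechanism, not CynosDB's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogRecord:
    lsn: int
    block_id: int
    delta: bytes  # binary (physical) change to the block


def apply_delta(image: bytes, delta: bytes) -> bytes:
    # Simplified: treat each record as a full new block image.
    # The real physical log applies a byte-range change to the page.
    return delta


@dataclass
class Block:
    base: bytes = b""       # block image with everything up to base_lsn merged in
    base_lsn: int = 0
    chain: List[LogRecord] = field(default_factory=list)  # step 4: pending updates

    def append(self, rec: LogRecord) -> None:
        # Step 4: attach the log record to this block's update chain.
        self.chain.append(rec)

    def read_at(self, read_lsn: int) -> bytes:
        # MVCC read: start from the merged base and apply only chain
        # records whose LSN does not exceed the requested read point.
        img = self.base
        for rec in self.chain:
            if rec.lsn <= read_lsn:
                img = apply_delta(img, rec.delta)
        return img

    def merge_up_to(self, lsn: int) -> None:
        # Step 5: fold old chain records into the base image. Versions
        # older than `lsn` can no longer be read, which moves the start
        # of the MVCC window forward.
        for rec in [r for r in self.chain if r.lsn <= lsn]:
            self.base = apply_delta(self.base, rec.delta)
            self.base_lsn = rec.lsn
        self.chain = [r for r in self.chain if r.lsn > lsn]
```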

CynosDB also pushes page CRC computation down into the Storage Service, making full use of the CPU resources on the storage servers.

MVCC implementation

The MVCC capability of CynosFS is the foundation of CynosDB's one-primary, multiple-replica architecture; this section describes it in detail.

First we introduce a series of concepts:

- **Mini-Transaction (MTR):** just as database transactions guarantee the ACID properties of user transactions, an MTR guarantees the consistency of on-disk page data. For example, a B+-tree insertion may modify up to three data pages when a node splits, and the modifications to those three pages must be applied as a single unit; that unit is an MTR. A database instance's on-disk data is updated with the MTR as the smallest unit.

- **Write-Ahead Log (WAL):** a common technique used by relational databases to ensure data consistency. The WAL is an incremental log stream whose records describe page modifications. Since CynosFS is a distributed file system independent of any specific database engine, its WAL records binary changes to data blocks; it is a physical log.

- **Log Sequence Number (LSN):** every record in the WAL has a unique sequence number, which increases monotonically starting from 1.

- **Consistency Point LSN (CPL):** an MTR may change several data pages on disk and therefore produce several log records; the last of them (the one with the largest LSN) is marked as a CPL. A CPL corresponds to an on-disk state in which the MTR has been applied completely, and is therefore a point at which an MVCC read is consistent.

- **Segment Group Complete LSN (SGCL):** the data pages of a database instance are distributed across multiple SGs, so each SG holds only part of the log records. The SGCL indicates that the SG has persisted all log records with LSN less than or equal to the SGCL. Because an SG uses the Raft protocol, its SGCL is simply Raft's CommitIndex.

- **Pool Complete LSN (PCL):** indicates that all SGs constituting the Pool have persisted every log record with LSN less than or equal to the PCL. Since LSNs are continuous and monotonically increasing, this also means there are no holes below the PCL.

- **Pool Consistency Point LSN (PCPL):** the largest CPL that is not greater than the PCL. Logically, it is the latest consistency point that has been persisted to disk and at which an MVCC read can be served.
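The last two definitions can be written down directly. The two helper functions below are only illustrative (the names are not from CynosDB): `persisted` is the set of LSNs persisted by any SG of the Pool, and `cpls` is the set of LSNs marked as CPLs.

```python
from typing import Set


def pool_complete_lsn(persisted: Set[int]) -> int:
    """PCL: the largest LSN such that every LSN from 1 up to it has been
    persisted somewhere in the Pool, i.e. the end of the hole-free prefix."""
    pcl = 0
    while pcl + 1 in persisted:
        pcl += 1
    return pcl


def pool_consistency_point(pcl: int, cpls: Set[int]) -> int:
    """PCPL: the largest CPL that is not greater than the PCL."""
    return max((lsn for lsn in cpls if lsn <= pcl), default=0)
```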

Write

First, all data page modifications made by the primary instance are transformed by the Distributed File System into Pool-level log records with continuous, monotonically increasing LSNs, which are then distributed to the different SGs. Each SG advances its SGCL after completing a Raft commit and returns; the primary instance advances the PCL and PCPL based on the progress of all SGs. The following two diagrams show how the PCL and PCPL advance:

Figure 4

In Figure 4, the Pool consists of two SGs, each with three Segments storing the same data. Log records with LSNs 1 through 8 have been persisted across the two SGs: records 1, 3, 4, and 7 in SG A, and records 2, 5, 6, and 8 in SG B. These 8 records belong to 3 MTRs: MTR-X (1, 2, 3), MTR-Y (4, 5), and MTR-Z (6, 7, 8). Records 3 and 5 are CPLs, indicating that the logs of MTR-X and MTR-Y have been fully persisted; record 8 is not a CPL, indicating that MTR-Z has more log records to come. By the definitions above, the PCL is 8 and the PCPL is 5.

Figure 5

In Figure 5, the last log record of MTR-Z, with LSN 9 and marked as a CPL, is persisted to SG A. By definition, both the PCL and the PCPL now advance to 9.
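Plugging the numbers from Figures 4 and 5 into the helper functions sketched above reproduces the values given in the text:

```python
# Figure 4: LSNs 1-8 are persisted across SG A and SG B; 3 and 5 are CPLs.
persisted = {1, 2, 3, 4, 5, 6, 7, 8}
cpls = {3, 5}
pcl = pool_complete_lsn(persisted)        # -> 8
pcpl = pool_consistency_point(pcl, cpls)  # -> 5

# Figure 5: LSN 9 (a CPL closing MTR-Z) is persisted to SG A.
persisted.add(9)
cpls.add(9)
pcl = pool_complete_lsn(persisted)        # -> 9
pcpl = pool_consistency_point(pcl, cpls)  # -> 9
```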

Read

This section focuses on the read flow on a replica instance. A replica has its own in-memory data; after the primary updates data, the replica's in-memory copy must be updated as well, otherwise inconsistencies can occur. To this end, the primary pushes log records together with the latest PCPL value to the replicas, so a replica's PCPL always lags slightly behind the primary's. This PCPL value determines the latest MTR that transactions on the replica are allowed to read.

To reduce replication lag, the primary may push log records to a replica before the PCPL has advanced past them, so the replica must make sure it never applies records whose LSN is greater than its local PCPL. A replica handles the pushed update logs in one of two ways: if the corresponding data page is in its buffer pool, it applies the record to that page; otherwise it simply discards the record.

When a read transaction on a replica accesses a data page (whether from the buffer pool or from the Storage Service), it takes the current PCPL as its Read Point LSN (RPL) and uses the RPL to obtain a consistent read (since an RPL is a PCPL value, it is guaranteed to cover only complete MTRs). CynosDB supports one primary and many replicas; each replica may run many read transactions, each with its own RPL. The smallest of these RPLs is called the min-RPL (MRPL). There may be several CPLs between the MRPL and the PCPL, and the RPL of any given read transaction may sit at any CPL in that range, so the Storage Service must be able to read data at any CPL point within it, that is, it must provide MVCC. CynosDB implements this with the data page plus its update log chain.

Finally, going back to Figure 3: an SG needs to periodically reclaim log records that are no longer useful. As the discussion above shows, the useless records are exactly those with LSN below the MRPL; records with LSN greater than or equal to the MRPL must be retained to support MVCC.
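The two rules just described, that a replica never applies records beyond its local PCPL and that an SG may reclaim records below the MRPL, can be sketched as follows, reusing the `Block` and `LogRecord` classes from the replay section (the function names are illustrative):

```python
from typing import Dict, List


def apply_on_replica(buffer_pool: Dict[int, Block],
                     records: List[LogRecord],
                     local_pcpl: int) -> None:
    """Handle log records pushed from the primary on a replica instance."""
    for rec in records:
        if rec.lsn > local_pcpl:
            continue  # never apply a record beyond the local PCPL
        block = buffer_pool.get(rec.block_id)
        if block is not None:
            block.append(rec)  # page is cached: attach the update to its chain
        # otherwise discard it; the page will be read from the Storage Service


def recycle(block: Block, mrpl: int) -> None:
    """Records with LSN below the MRPL can no longer be read by any
    transaction, so they are merged into the base image and discarded."""
    block.merge_up_to(mrpl - 1)
```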

Transaction commit

A database transaction consists of multiple MTRs, and the LSN of the last log record of its last MTR is called the Commit LSN (CLSN). To improve efficiency, log records are pushed to the Storage Service as an asynchronous pipeline, in parallel per SG. Because of this parallelism, the global order may be interleaved, but the records pushed to any single SG are monotonically increasing (though not necessarily contiguous) in LSN. As the PCPL advances, once it reaches a transaction's CLSN, that transaction can be acknowledged as committed.
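A sketch of this commit acknowledgement rule, assuming (as described above) that a transaction becomes committable once the PCPL has reached its CLSN; the names are illustrative:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingTxn:
    txn_id: int
    clsn: int  # LSN of the last log record of the transaction's last MTR


def acknowledge_commits(pending: List[PendingTxn], pcpl: int) -> List[PendingTxn]:
    """Return the transactions whose CLSN is covered by the current PCPL;
    these can be acknowledged to their clients as committed."""
    return [t for t in pending if t.clsn <= pcpl]
```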

Crash recovery

First, CynosDB persists the primary instance's PCPL, which we call the Last PCPL (L-PCPL for short). CynosDB must update the L-PCPL frequently enough to guarantee that log records with LSNs greater than the L-PCPL have not yet been recycled. On this premise, crash recovery proceeds through the following steps:

Figure 6

In Figure 6, the new primary instance reads an L-PCPL value of 3. Its Pool consists of two SGs: SG A has persisted log records (1, 4, 5), and SG B has persisted (2, 3, 6, 8).

Figure 7

The primary instance reads all log records with LSN greater than 3 from the two SGs, obtaining 4 and 5 from SG A and 6 and 8 from SG B, which yields the log stream 3, 4, 5, 6, 8. Because there is no record with LSN 7, record 8 is invalid; record 6 is not a CPL, so it is invalid as well. The latest PCPL is therefore computed to be 5, and this new PCPL is pushed to all replica instances.

Figure 8

Finally, the primary instance pushes the new PCPL = 5 to all SGs so that they can discard any log records with LSN greater than 5.

After these three steps, the data is back at the consistency point PCPL = 5, and the primary and replica instances can resume serving requests.
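The recovery-time PCPL recomputation walked through in Figures 6 to 8 can be sketched as follows (the function name is illustrative; `persisted` and `cpls` have the same meaning as in the earlier helpers):

```python
from typing import Set


def recover_pcpl(l_pcpl: int, persisted: Set[int], cpls: Set[int]) -> int:
    """Walk forward from the L-PCPL over the hole-free prefix of surviving
    log records and keep the largest CPL seen along the way."""
    pcpl, lsn = l_pcpl, l_pcpl
    while lsn + 1 in persisted:  # stop at the first hole (e.g. the missing 7)
        lsn += 1
        if lsn in cpls:          # only CPLs are valid consistency points
            pcpl = lsn
    return pcpl


# Figures 6/7: L-PCPL = 3; surviving records are (1, 4, 5) on SG A and
# (2, 3, 6, 8) on SG B; 5 is the only CPL above 3.
print(recover_pcpl(3, {1, 2, 3, 4, 5, 6, 8}, {3, 5}))  # -> 5
```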

Backup

In fact, CynosDB rarely needs an explicit full backup operation; it only needs to ensure that each SG's WAL log is incrementally saved to the cold backup system before it is recycled. Because the WAL logically contains every data change, incremental WAL backups can be used to generate full or differential backups whenever customers require them. These operations are handled offline by a separate system working from the incremental backup data, with no impact on CynosDB's online system. In addition, the instance's state and configuration information must be backed up so that it can be restored during recovery.
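The claim that full backups can be derived offline from incremental WAL backups can be sketched by replaying archived records onto base block images; this reuses the `LogRecord`/`apply_delta` sketch from earlier and is purely illustrative, not CynosDB's backup tooling:

```python
from typing import Dict, List


def materialize_full_backup(base_images: Dict[int, bytes],
                            wal: List[LogRecord],
                            up_to_lsn: int) -> Dict[int, bytes]:
    """Replay archived WAL records onto base block images to produce a
    full backup at the consistency point `up_to_lsn` (a PCPL/CPL)."""
    images = dict(base_images)
    for rec in sorted(wal, key=lambda r: r.lsn):
        if rec.lsn <= up_to_lsn:
            images[rec.block_id] = apply_delta(images.get(rec.block_id, b""),
                                               rec.delta)
    return images
```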


