
Author: Xu Zhongqing, lead of CynosStore, the distributed storage layer of Tencent Cloud's self-developed database CynosDB. He works on database kernel development and on database product architecture and planning. He previously worked at Huawei, joined Tencent in 2015, and has participated in the development of TBase (PGXZ), CynosDB, and other database products. His focus areas include relational databases, database clusters, and new database architectures. He is currently responsible for CynosStore, the distributed storage of CynosDB.

Moving enterprise IT systems to the public cloud is an ongoing trend. Database services are a key component of the public cloud, and they are one of the main considerations when enterprise customers move systems that have run on-premises for years onto the cloud. Relational database systems, meanwhile, have been around for roughly forty years since System R. With the growth of the Internet, businesses demand ever higher throughput from their database instances, and for many workloads the maximum throughput of a single physical machine can no longer keep up with the pace of the business. Database clustering is therefore a problem that many IT systems cannot avoid.

CynosDB for PostgreSQL is a cloud-native database developed by Tencent Cloud. Its core ideas, log-based storage and the separation of storage and compute, come from Amazon's cloud database service Aurora. At the same time, CynosDB differs from Aurora in many aspects of architecture and engineering implementation. Compared with a traditional stand-alone database, CynosDB mainly addresses the following problems:

Separation of storage and compute

The separation of storage and compute is one of the main characteristics that distinguishes a cloud database from a traditional one. Its purposes are 1) to improve resource utilization, so that users get exactly as many resources as they use; and 2) to make compute nodes stateless, which benefits the high availability of the database service and simplifies cluster management (failure recovery, instance migration).

The storage capacity automatically expands or shrinks

Traditional relational databases are limited by the resources of a single physical machine, both storage space and compute capacity. CynosDB uses distributed storage to overcome the limits of stand-alone storage. In addition, the storage keeps multiple replicas whose consistency is guaranteed by the Raft protocol.

Higher network utilization

The log-based storage design greatly reduces the network traffic during database operation.

Higher throughput

Traditional database clusters face a key problem: the tension between supporting distributed transactions and scaling cluster throughput linearly. In other words, most database clusters either support full ACID semantics or pursue excellent linear scalability, and you usually cannot have both. An example of the former is Oracle RAC, the most mature and complete database cluster on the market, which provides data access that is completely transparent to the application. However, the linear scalability of Oracle RAC has proved limited in practice, so most users deploy RAC primarily for high availability rather than for scale-out. Examples of the latter are the Proxy + open-source DB cluster schemes, which usually scale well linearly but impose significant restrictions on applications because they do not support distributed transactions; or they do support distributed transactions, but linear scalability degrades when cross-node writes are heavy. CynosDB raises the maximum throughput of the whole system by scaling out read-only nodes in a one-write-multiple-read configuration, which is sufficient for most public cloud users.

The following figure shows the product architecture of CynosDB for PostgreSQL. CynosDB is a database cluster built on shared storage that supports one write node and multiple read nodes.

Figure 1 CynosDB for PostgreSQL product architecture diagram

CynosDB is built on CynosStore, a distributed storage layer that provides a solid base for CynosDB. CynosStore consists of multiple Store Nodes and the CynosStore Client. The CynosStore Client is compiled into the DB (PostgreSQL) engine as a binary library; it provides the access interface for the DB and transfers the log stream between the master and slave DB instances. In addition, each Store Node continuously and automatically backs up data and logs to Tencent Cloud's object storage service COS to support PITR (point-in-time recovery).

I. Data organization of CynosStore

CynosStore allocates a storage space for each database, which we call a Pool; each database has one Pool. Expanding the database's storage space is implemented by expanding the Pool. A Pool is divided into multiple Segment Groups (SGs), each with a fixed size of 10 GB; we also call an SG a logical shard. A Segment Group consists of multiple physical segments, each corresponding to one physical replica, and the replicas are kept consistent through the Raft protocol. The segment is the smallest unit of data migration and backup in CynosStore. Each SG keeps its own data and the logs for that data over the most recent period.
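
To make the Pool / SG / segment hierarchy concrete, here is a minimal sketch in Go of how these objects might be modeled. The type and field names are illustrative assumptions for this article, not CynosStore's actual data structures; only the 10 GB shard size and the replica-per-segment relationship come from the description above.

```go
package main

import "fmt"

// sgSizeGB is the fixed size of one segment group (a 10 GB logical shard).
const sgSizeGB = 10

// Segment is one physical replica of a shard, hosted on a Store Node.
type Segment struct {
	StoreNode string // node that hosts this replica
}

// SegmentGroup (SG) is a logical shard whose replica segments are kept
// consistent by the Raft protocol.
type SegmentGroup struct {
	ID       int
	Replicas []Segment
}

// Pool is the storage space of one database; it grows by appending SGs.
type Pool struct {
	Name string
	SGs  []SegmentGroup
}

// Expand grows the pool by one SG built from free segments on three nodes.
func (p *Pool) Expand(nodes [3]string) {
	sg := SegmentGroup{ID: len(p.SGs)}
	for _, n := range nodes {
		sg.Replicas = append(sg.Replicas, Segment{StoreNode: n})
	}
	p.SGs = append(p.SGs, sg)
}

func main() {
	p := Pool{Name: "db1"}
	p.Expand([3]string{"store-1", "store-2", "store-3"})
	fmt.Printf("pool %s: %d SG(s), capacity %d GB\n", p.Name, len(p.SGs), len(p.SGs)*sgSizeGB)
}
```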

Figure 2 Data organization form of CynosStore

In Figure 2, CynosStore has three Store Nodes, and one Pool has been created in CynosStore. This Pool consists of three SGs, each with three replicas. CynosStore also has free segments that can be used to expand the current Pool or to create another one: three free segments can be combined into a new SG and allocated to a Pool.

II. Distributed storage based on asynchronous log writes

Traditionally, WAL (write-ahead logging) is used for transaction durability and failure recovery. Its most obvious benefits are that 1) after the database crashes, data pages can be recovered from the persisted WAL; and 2) by writing logs first instead of writing data directly, the random I/O of writing data pages is turned into the sequential I/O of writing logs on the critical path of write operations, improving database performance.

Figure 3 log-based storage

Figure 3 (left) illustrates, in the most abstract way, how a traditional database writes data: every time data is modified, the log must be persisted before the data page can be persisted. Log persistence is usually triggered in the following cases (a small sketch of the ordering rule follows the list):

1) When a transaction commits, all logs up to the maximum log point generated by the transaction must be persisted before the commit can be acknowledged to the client;

2) When the log cache space is insufficient, logs must be persisted before their cache space can be released;

3) When the data page cache is full, some data pages must be evicted to free cache space. For example, if the eviction algorithm chooses dirty page A, then all logs up to the log point of A's last modification must be persisted first, then A can be persisted to storage, and only then can A actually be evicted from the data cache.
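
The invariant behind all three cases is the same: a dirty page may only be written out after every log record up to that page's last modification point has been persisted. Below is a minimal, hedged sketch of that ordering rule; the names (LSN, walFlushedUpTo, evict) are assumptions for illustration, not PostgreSQL or CynosDB internals.

```go
package main

import "fmt"

// LSN is a monotonically increasing log sequence number.
type LSN uint64

// walFlushedUpTo is the highest LSN already persisted to the log device.
var walFlushedUpTo LSN

// flushWAL persists all log records up to lsn (stubbed: it only advances the marker).
func flushWAL(lsn LSN) {
	if lsn > walFlushedUpTo {
		walFlushedUpTo = lsn
	}
}

// Page is a cached data page carrying the LSN of its last modification.
type Page struct {
	ID      int
	PageLSN LSN
}

// evict writes a dirty page out, first enforcing the log-before-page rule.
func evict(p *Page) {
	if p.PageLSN > walFlushedUpTo {
		flushWAL(p.PageLSN) // rule: the log covering this change must be durable first
	}
	fmt.Printf("page %d written after WAL flushed to %d\n", p.ID, walFlushedUpTo)
}

func main() {
	evict(&Page{ID: 7, PageLSN: 42})
}
```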

In theory, the database only needs to persist logs: as long as it has every log from initialization up to the present, it can reconstruct the current contents of any data page. In other words, writing logs alone, without writing data pages, is sufficient to guarantee data integrity and correctness. In practice, however, database implementers do not do this, because 1) recovering each data page by traversing the log from the beginning would take far too long; and 2) the full log is much larger than the data itself and would require far more disk space.

So what if the persisted logs were stored on a device that had not only storage capability but also compute capability, and could replay the logs into the latest pages on its own? In that case, the database engine would no longer need to send data pages to the storage, because the storage could compute the new pages itself and persist them. This is the core of CynosDB's "log-based storage" philosophy. Figure 3 (right) depicts the idea in the same abstract form: compute nodes and storage nodes run on different physical machines, and in addition to persisting logs, the storage nodes can generate the latest data pages by applying those logs. The compute node therefore only needs to ship logs to the storage node, never data pages.
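
The following is a toy sketch of what "the storage replays the log into the latest page" means: a base page plus its SetByte-style records is enough to materialize the newest version. The record layout and the applyLogs helper are assumptions made for illustration.

```go
package main

import "fmt"

const pageSize = 16 // tiny page for the example; real pages are several KB

// LogRec overwrites len(Data) bytes at Offset of page PageID (a SetByte-style record).
type LogRec struct {
	PageID int
	Offset int
	Data   []byte
}

// applyLogs replays the records for one page onto its base image and returns
// the latest version, which is what a Store Node persists.
func applyLogs(base [pageSize]byte, recs []LogRec, pageID int) [pageSize]byte {
	page := base
	for _, r := range recs {
		if r.PageID != pageID {
			continue // records for other pages are applied by their own shard
		}
		copy(page[r.Offset:], r.Data)
	}
	return page
}

func main() {
	var base [pageSize]byte
	recs := []LogRec{
		{PageID: 1, Offset: 0, Data: []byte("hello")},
		{PageID: 1, Offset: 5, Data: []byte("!")},
	}
	page := applyLogs(base, recs, 1)
	fmt.Printf("latest page: %q\n", page[:6]) // "hello!"
}
```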

The following figure describes the structure of CynosStore with log-based storage.

Figure 4. CynosStore: Log-based storage

This figure shows how the database engine accesses CynosStore: through the CynosStore Client, with two core operations, 1) writing logs and 2) reading data pages.

The database engine passes database logs to CynosStore. The CynosStore Client is responsible for converting these logs into the CynosStore Journal format and serializing the concurrently generated journal records. Each journal record is then routed, according to the data page it modifies, to the appropriate SG and sent to the Store Nodes that host that SG. In addition, the CynosStore Client asynchronously listens for log-persistence acknowledgments from each Store Node, merges them, and tells the database engine the latest persisted log point.
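
A hedged sketch of the routing step: each journal record carries the page it modifies, and the page number determines which SG (and therefore which Store Nodes) receives the record. The 10 GB shard size comes from the text above; the contiguous page layout, the assumed page size, and the routeToSG function are simplifying assumptions.

```go
package main

import "fmt"

const (
	pageSize   = 8 * 1024 // assumed page size
	sgBytes    = 10 << 30 // 10 GB per segment group, as in the text
	pagesPerSG = sgBytes / pageSize
)

// JournalRec is one low-level log record addressed to a page in the Pool.
type JournalRec struct {
	PageNo uint64
	Body   []byte
}

// routeToSG maps a page number to the segment group that owns it,
// assuming pages are laid out contiguously across SGs.
func routeToSG(pageNo uint64) int {
	return int(pageNo / pagesPerSG)
}

func main() {
	recs := []JournalRec{{PageNo: 12}, {PageNo: pagesPerSG + 3}}
	for _, r := range recs {
		fmt.Printf("page %d -> SG %d\n", r.PageNo, routeToSG(r.PageNo))
	}
}
```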

When a data page accessed by the database engine misses the cache, the page must be read from CynosStore (a "read block" request). Read block is a synchronous operation. CynosStore also supports reading multiple versions of a page within a certain time window. Because Store Nodes do not replay logs at exactly the same pace, their consistency points differ, so the initiator of a read request must supply a consistency point to get the consistency the database engine requires; by default, CynosStore reads the page at the latest consistency point (the read point). Read-only database instances also rely on this multi-version capability of CynosStore in the one-write-multiple-read scenario.
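
A small sketch of what a version-aware synchronous read might look like: the caller supplies a read point, and the storage returns the newest page version at or before that point. The ReadBlock name and the in-memory version list are assumptions used only to illustrate the semantics.

```go
package main

import (
	"errors"
	"fmt"
)

// ReadPoint is a consistency point (a persisted, continuous log position).
type ReadPoint uint64

// version is one materialized version of a page, valid as of a log point.
type version struct {
	asOf ReadPoint
	data string
}

// pageVersions holds the versions an SG can still serve, oldest first.
var pageVersions = map[int][]version{
	1: {{asOf: 10, data: "v10"}, {asOf: 20, data: "v20"}},
}

// ReadBlock returns the newest version of the page that is not newer than rp,
// mimicking a synchronous, multi-version read from a Store Node.
func ReadBlock(pageID int, rp ReadPoint) (string, error) {
	var out string
	found := false
	for _, v := range pageVersions[pageID] {
		if v.asOf <= rp {
			out, found = v.data, true
		}
	}
	if !found {
		return "", errors.New("requested read point is older than retained versions")
	}
	return out, nil
}

func main() {
	data, err := ReadBlock(1, 15)
	if err != nil {
		panic(err)
	}
	fmt.Println(data) // "v10": the newest version at or before read point 15
}
```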

CynosStore provides two access layers: a block-device-level interface and a file-system-level interface built on top of it, called CynosBS and CynosFS respectively. Both use the same pattern of asynchronous log writes and synchronous data reads. So what does CynosDB for PostgreSQL gain from log-based storage compared with a PostgreSQL cluster built from separate servers?

1) Reduced network traffic. Once storage and compute are separated, the compute node cannot avoid sending data to the storage node. If we separated compute from the storage medium by simply running a traditional database on a network disk, the network would have to carry not only logs but also data pages, and the volume of page traffic would depend on the amount of concurrent writes, the database cache size, and the checkpoint frequency. CynosDB, with CynosStore as its base, only needs to ship logs to CynosStore, which reduces network traffic.

2) A better foundation for shared-storage clustering, in which multiple instances of a database (one write node, multiple read nodes) access the same Pool. With log-based CynosStore, as soon as the DB master node (the read/write node) has written a log to CynosStore, a slave node (a read-only node) can read the latest version of any data page modified by that log. It does not have to wait for the master node to persist the data page through a checkpoint before it can see the latest contents, which greatly reduces the latency between the master and slave database instances. Otherwise slave nodes could only advance their read point after the master persisted data pages: if the primary's checkpoint interval is too long, the lag between primary and secondary grows; if it is too short, the primary's network traffic grows.

Of course, the work of applying logs and persisting new data pages does not simply disappear; it moves down from the database engine into CynosStore. But as mentioned above, besides eliminating unnecessary network traffic, the SGs of CynosStore redo and persist in parallel, the number of SGs in a Pool can grow on demand, and the Store Node hosting an SG can be scheduled dynamically, so this work can be done in a flexible and efficient way.

III. CynosStore Journal

The CynosStore Journal (CSJ) plays the same role as a database journal, such as PostgreSQL's WAL. The difference is that CSJ has its own log format, decoupled from database semantics. PostgreSQL WAL can be generated and interpreted only by the PostgreSQL engine: if another storage engine were given a fragment of PostgreSQL WAL together with the base pages it modifies, it still could not reconstruct the latest page contents. CSJ aims to define a log format independent of any particular storage engine's logic, making it possible to build a general log-based distributed storage system. CSJ defines five journal record types:

1. SetByte: overwrites a contiguous region of a specified data page, at a specified offset and of a specified length, with the contents carried in the journal record.

2. SetBit: like SetByte, except that the minimum granularity is a bit. For example, hint-bit updates in PostgreSQL can be converted into SetBit records.

3. ClearPage: when a new page is allocated, it must be initialized. Since the previous contents of the newly allocated page do not matter, the page does not need to be read from the physical device; it only needs to be written as an all-zero page.

4. DataMove: some write operations move part of a page to another location within it; DataMove records describe these operations. For example, when PostgreSQL compacts a page during vacuum, DataMove records are smaller than the equivalent SetByte records.

5. UserDefined: the database engine always has operations that do not modify any particular page but still need to be logged. For example, PostgreSQL stores the latest transaction ID (XID) in WAL so that, after recovering from a failure, it knows from which XID to continue assigning. Such records are semantically tied to the database engine; CynosStore does not need to understand them, only to persist them. UserDefined describes this type of record: CynosStore persists it and provides a query interface for it, but skips it when applying CSJ.

These five journal types are the lowest level of log description. As long as writes are block- or page-based, they can be expressed with these five record types (a sketch of the record types follows). Of course, some engines, such as LSM-based storage engines, are not well suited to this low-level log format.
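
As one way to picture the five record kinds, here is a sketch of them as Go types. The article does not describe the wire format, so the field names and layout below are assumptions.

```go
package main

import "fmt"

// JournalKind enumerates the five low-level CynosStore record types described above.
type JournalKind uint8

const (
	SetByte     JournalKind = iota // overwrite bytes at an offset in a page
	SetBit                         // overwrite bits (e.g. PostgreSQL hint bits)
	ClearPage                      // initialize a newly allocated page to zeros
	DataMove                       // move bytes within a page (e.g. vacuum compaction)
	UserDefined                    // engine-private payload: persisted and queryable, never applied
)

// JournalRec is one record; only the fields relevant to its kind are set.
type JournalRec struct {
	Kind    JournalKind
	PageNo  uint64
	Offset  uint32
	Length  uint32
	SrcOff  uint32 // DataMove only: source offset within the page
	Payload []byte // SetByte/SetBit data, or the opaque UserDefined body
}

func main() {
	r := JournalRec{Kind: SetByte, PageNo: 42, Offset: 128, Length: 4, Payload: []byte{1, 2, 3, 4}}
	fmt.Printf("record kind=%d page=%d len=%d\n", r.Kind, r.PageNo, r.Length)
}
```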

Another characteristic of CSJ is out-of-order persistence: the CSJ of a Pool is routed to multiple SGs and written asynchronously, and the journal acknowledgments returned by the SGs arrive unsynchronized and interleaved. The CynosStore Client therefore has to merge these acknowledgments and advance the continuous CSJ point (the VDL).
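
A minimal sketch of the merge: acknowledgments arrive per SG in arbitrary order, and the client advances the VDL only as far as the acknowledged prefix is continuous. The tracker below is an illustrative assumption, not the actual CynosStore Client code.

```go
package main

import "fmt"

// vdlTracker advances the VDL: the highest log point up to which every
// journal record has been acknowledged by its segment group.
type vdlTracker struct {
	vdl   uint64          // current continuous persistence point
	acked map[uint64]bool // out-of-order acks not yet folded into the VDL
}

func newTracker() *vdlTracker { return &vdlTracker{acked: map[uint64]bool{}} }

// ack records one persisted log point and advances the VDL as far as the
// acknowledged prefix stays continuous.
func (t *vdlTracker) ack(lsn uint64) uint64 {
	t.acked[lsn] = true
	for t.acked[t.vdl+1] {
		delete(t.acked, t.vdl+1)
		t.vdl++
	}
	return t.vdl
}

func main() {
	t := newTracker()
	for _, lsn := range []uint64{2, 1, 4, 3} { // acks arrive out of order
		fmt.Printf("ack %d -> VDL %d\n", lsn, t.ack(lsn))
	}
}
```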

Figure 5 CynosStore log routing and out-of-order ACK

Whenever continuous logs are routed by data shard, the log acknowledgments will come back out of order, so they must be merged; Aurora has this mechanism, and so does CynosDB. For ease of understanding, we name the key points in the journal the same way Aurora does.

One concept worth explaining is the MTR (mini-transaction), the atomic write unit provided by CynosStore. The CSJ is a sequence of MTRs, one after another; every log record belongs to exactly one MTR, and the records of one MTR may belong to different SGs. For the PostgreSQL engine, it can be roughly understood as follows: one XLogRecord corresponds to one MTR, the log of a database transaction consists of one or more MTRs, and the MTRs of concurrent database transactions can be interleaved. CynosStore, however, does not understand or perceive the database engine's transaction logic; it only sees MTRs. A read request sent to CynosStore must supply a read point that does not fall on a log point inside an MTR. In short, the MTR is CynosStore's transaction.
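
A small sketch of the read-point rule: a legal read point is the boundary of a fully persisted MTR, so points that fall inside an MTR are skipped. The structures below are illustrative assumptions.

```go
package main

import "fmt"

// mtr is a batch of records that CynosStore persists and exposes atomically;
// endLSN is the log point at its boundary, which is a legal read point.
type mtr struct {
	startLSN, endLSN uint64
}

// latestReadPoint returns the newest legal read point: the end of the last
// MTR fully covered by the current VDL. Points inside an MTR are never chosen.
func latestReadPoint(mtrs []mtr, vdl uint64) uint64 {
	var rp uint64
	for _, m := range mtrs {
		if m.endLSN <= vdl {
			rp = m.endLSN
		}
	}
	return rp
}

func main() {
	mtrs := []mtr{{1, 3}, {4, 6}, {7, 9}}
	fmt.Println(latestReadPoint(mtrs, 8)) // 6: LSN 8 falls inside the third MTR
}
```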

IV. Failure recovery

When the primary instance fails, the log points persisted by each SG of the Pool may be globally discontinuous, that is, they may contain holes, and the contents of the logs corresponding to those holes are no longer known. For example, suppose three consecutive logs j1, j2, and j3 are routed to SG1, SG2, and SG3 respectively. At the moment of failure, j1 and j3 have been successfully sent to SG1 and SG3, but j2 is still in the network buffer of the machine running the CynosStore Client and is lost when the primary instance fails. When a new primary instance starts, the Pool contains the discontinuous logs j1 and j3, while j2 has been lost.

When this failure scenario occurs, the newly started primary instance queries each SG for all logs since its last continuously persisted log point, merges them, and computes a new continuous persisted log point (VDL), which becomes the new consistency point. The new instance then uses the Truncate interface provided by CynosStore to truncate all logs beyond the new VDL on each SG, and the first journal record generated by the new instance starts from the log point immediately after the new VDL.
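
A hedged sketch of that recovery step: each SG reports the log points it has persisted, the longest continuous prefix becomes the new VDL, and everything beyond the first hole is truncated. The scenario in main is hypothetical (log 9 lost in flight) and the helper names are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// recoverVDL merges the log points each SG reports as persisted, advances
// through the longest continuous prefix (the new VDL), and collects every
// point beyond the first hole so it can be truncated.
func recoverVDL(persistedPerSG map[string][]uint64) (vdl uint64, truncate []uint64) {
	var all []uint64
	for _, pts := range persistedPerSG {
		all = append(all, pts...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i] < all[j] })
	for _, lsn := range all {
		switch {
		case lsn == vdl+1:
			vdl++ // prefix still continuous
		case lsn > vdl+1:
			truncate = append(truncate, lsn) // beyond the first hole: discard
		}
	}
	return vdl, truncate
}

func main() {
	// Hypothetical scenario: logs 1-8 and 10 were persisted, log 9 was lost in flight.
	sgs := map[string][]uint64{
		"SG1": {1, 4, 7},
		"SG2": {2, 5, 10},
		"SG3": {3, 6, 8},
	}
	vdl, trunc := recoverVDL(sgs)
	fmt.Printf("new VDL = %d, truncate %v\n", vdl, trunc) // new VDL = 8, truncate [10]
}
```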

Figure 6: Log recovery process during failover

If Figure 5 happens to show the state at the moment a database instance fails, then Figure 6 shows how the new consistency point is computed after a read/write instance is restarted. The CynosStore Client computes a new consistency point of 8 and truncates all logs greater than 8, that is, logs 9 and 10 on SG2. The next log generated will start at 9.

V. Multi-copy consistency

CynosStore uses multi-Raft to keep the replicas of each SG consistent, and uses batching and asynchronous pipelining to improve Raft throughput. Using CynosStore's own benchmark, we measured the log persistence throughput of a single SG at 3.75 million log records per second. The benchmark writes logs asynchronously to test CynosStore throughput: the log types are SetByte and SetBit, the writer threads write logs continuously, a listener thread processes acknowledgment packets and advances the VDL, and the benchmark measures how fast the VDL advances per unit of time. 3.75 million records per second means that one SG persists 3.75 million SetByte and SetBit records every second. In this single-SG scenario, the average network traffic from the CynosStore Client to the Store Node is 171 MB/s, which is also the traffic from the Raft leader to each follower.

VI. One write, multiple reads

On top of the shared storage provided by CynosStore, CynosDB supports one read/write instance and multiple read-only instances on the same Pool to improve database throughput. Two problems have to be solved for shared-storage-based one-write-multiple-read:

1. How the master node (the read/write node) notifies the slave nodes (read-only nodes) of page changes. Since slave nodes also have buffers, when a page cached on a slave node is modified on the master node, the slave node needs a mechanism to learn of the modification and to either apply the change to the page in its buffer or re-read the new version of the page from CynosStore.

2. How a read request on a slave node obtains a consistent snapshot of the database. In PostgreSQL's traditional primary/standby mode, the standby builds a snapshot (the list of active transactions) from the snapshot and transaction information synchronized from the primary. A CynosDB slave node needs, in addition to the database snapshot (the active transaction list), a CynosStore snapshot (a consistent read point), because the sharded logs are applied in parallel.

If the storage of a shared-storage one-write-multiple-read database cluster cannot redo logs itself, there are two alternatives for synchronizing pages between master and slaves:

The first alternative is to synchronize only logs between master and slaves. If a cache miss occurs on a slave, it can only read the base page as of the last checkpoint from storage and replay, on top of it, all the changes to that page in the logs cached since that checkpoint. The key problem with this approach is that if the primary instance's checkpoint interval is too long, or the log volume is too large, the slave will spend too much time applying logs when the cache hit ratio is not high. In extreme cases a slave may repeatedly apply the same logs to the same page, which not only greatly increases query latency but also wastes a lot of CPU, and may greatly increase the lag between master and slave.

In the second alternative, the master instance serves reads of data pages from its memory buffer, and periodically synchronizes the changed page numbers and logs to the slaves. When reading a page, a slave uses the modified-page information synchronized from the master to decide whether to 1) use its own in-memory page directly, 2) replay the logs onto its in-memory page to produce the new version, 3) pull the latest page from the master instance, or 4) read the page from storage. This approach is somewhat like a simplified Oracle RAC. It has two key issues: 1) different slaves may ask the master for different versions of a page, so the master's page service may need to provide multiple versions; and 2) serving page reads may put a significant load on the master, because, besides the impact of having multiple slaves, a slave must pull the entire page whenever even a small part of it changes on the master. In general, the more frequently the master modifies data, the more frequently the slaves pull pages.

By contrast, CynosStore-based CynosDB still needs to tell slaves about dirty pages, but a slave has more flexible ways to obtain the new page: 1) replay the logs onto its in-memory page, or 2) read the page from a Store Node. The minimum information a slave needs for dirty-page synchronization is simply which pages the master has modified; synchronizing the log contents as well serves to speed up the slaves and reduce the load on the Store Nodes (see the sketch below).
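
A speculative sketch of the decision a read-only instance might make when the master reports a page dirty: use the cached page if it is already new enough, replay the synchronized logs when there are few of them, otherwise re-read the page from a Store Node. The threshold and function names are assumptions, not CynosDB's actual policy.

```go
package main

import "fmt"

// cachedPage is a page in the read-only instance's buffer along with the log
// point it reflects.
type cachedPage struct {
	data    []byte
	asOfLSN uint64
}

// refresh decides how a read-only instance brings a cached page up to the read
// point rp after the master reports it dirty: replay the synced logs when few,
// otherwise fall back to reading the page from a Store Node.
func refresh(p *cachedPage, pendingLogs int, rp uint64, maxReplay int) string {
	switch {
	case p.asOfLSN >= rp:
		return "use cached page as-is"
	case pendingLogs <= maxReplay:
		return "replay synced logs onto the cached page"
	default:
		return "re-read the page at the read point from a Store Node"
	}
}

func main() {
	p := &cachedPage{asOfLSN: 100}
	fmt.Println(refresh(p, 3, 120, 16))   // few pending logs: replay locally
	fmt.Println(refresh(p, 500, 120, 16)) // too many: fetch from storage
}
```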

Figure 7 CynosDB one write, multiple reads

Figure 7 depicts the basic framework with one write node and one read node (one master, one slave); a configuration with more read-only instances is simply a superposition of this one-write-one-read setup. The CynosStore Client (CSClient) of the master and that of the slave operate independently. The master's CSClient continuously sends the CynosStore Journal (CSJ) generated by the master instance to the slave instance. As soon as these continuous logs reach the slave, the DB engine on the slave can read the latest versions of the pages they modify, without waiting for all of them to be applied, which reduces the lag between master and slave. This is again the advantage of log-based storage: as long as the master instance has persisted the logs to a Store Node, a slave instance can read the latest version of the data pages modified by those logs.

VII. Conclusion

CynosStore is a distributed storage layer built from scratch and tailored to cloud databases. It has several natural architectural advantages: 1) it separates storage from compute and keeps the network traffic between them to a minimum; 2) it improves resource utilization and reduces cloud costs; 3) it makes it easier for database instances to support one write node with multiple read nodes; and 4) it offers higher performance than a traditional one-master-two-slave RDS cluster. Going forward, we will continue to improve CynosStore in terms of performance, high availability, and resource isolation.

This article is published on the Tencent Cloud + community with the author's authorization.