This article is published by the Tencent Cloud+ Community.

Author: Sun Xu, Tencent database development engineer with 9 years of database kernel development experience. He is familiar with database query processing, concurrency control, logging, and storage systems, and with the kernel implementation of databases such as PostgreSQL (including Greenplum and PGXC) and Teradata.

CynosDB is a database developed in-house by the Tencent database team, available in PostgreSQL-compatible and MySQL-compatible versions. This article uses the PostgreSQL-compatible version of CynosDB as an example to introduce our architecture design and optimization ideas.

1. Overview

PostgreSQL is widely regarded as the world's most advanced open-source database. The project began in 1986 and has more than 30 years of community history behind it. Its advanced architecture, reliability, and rich functionality are highly recognized in the industry. PostgreSQL also runs on a variety of operating systems and supports many index types and extensions; in particular, the PostGIS extension makes it easy for PostgreSQL to handle geographic data.

The PostgreSQL-compatible version of CynosDB, as a NewSQL product built on PostgreSQL, also offers good scalability. Its resource pooling allows users to achieve the same performance at a lower cost without sacrificing any of the database's original functionality.

2. Basic Architecture

Existing databases on the public cloud have some shortcomings:

1. Heavy network I/O. In the traditional primary/standby cloud architecture, a large amount of data has to be written to disk, including WAL logs, dirty pages, and double write or full page write data.

2. The primary and standby instances do not share data. This wastes a great deal of storage on the one hand, and further aggravates network I/O on the other. The result is low disk utilization and idle CPUs.

CynosDB solves the above problems through log sinking and shared storage, achieving a public cloud database with high cost performance, high availability, and elastic scaling. Its basic architecture is as follows:

Components in the architecture:

1. Master: the primary instance of the database, responsible for handling read/write transaction requests from applications.

2. Slave: a read-only instance of the database, responsible for handling application read requests; multiple slave instances are supported.

3. CynosStore Client: provides the interfaces for accessing CynosStore. The DB engine reads and writes its data files through these interfaces.

4. CynosStore: a distributed storage system that stores the database's data and logs and is responsible for converting logs into data pages.

5. Cluster management service: responsible for managing the whole system, for example storage expansion and instance creation.

6. Cold backup storage: used to store the system's logs.

The master instance sends its modifications to the storage system (CynosStore) in the form of logs, and CynosStore periodically merges these logs into data pages. As a result, unlike a traditional database, CynosDB does not need to write dirty pages to storage. A slave (read-only) instance does not execute write transactions and does not send logs to storage; instead, it reads pages from storage and receives logs from the master instance to keep the data pages in its memory up to date. If the page that a received log modifies is not in the slave's memory, the log is simply discarded.
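As a rough illustration, here is a minimal sketch of the slave-side behavior described above, using hypothetical type and function names rather than CynosDB's actual interfaces: a log received from the master is applied only if the target page is already cached, and discarded otherwise, because the storage layer merges it into the page anyway.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical shape of one log record received from the master. */
typedef struct ReceivedLog
{
    uint32_t    page_no;    /* page the master modified */
    uint16_t    offset;     /* byte offset within that page */
    uint16_t    length;     /* number of modified bytes */
    const char *data;       /* the new content */
} ReceivedLog;

/* Hypothetical buffer-pool lookup: returns NULL if the page is not cached. */
extern char *buffer_pool_lookup(uint32_t page_no);

void
slave_apply_received_log(const ReceivedLog *log)
{
    char *page = buffer_pool_lookup(log->page_no);

    if (page == NULL)
        return;     /* page not in this slave's memory: discard the log */

    /* Keep the cached copy consistent with the master's modification. */
    memcpy(page + log->offset, log->data, log->length);
}
```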

Architecturally, CynosDB separates storage from computing and pools resources, making it well suited to cloud deployment. In addition, only the log stream is transmitted between the computing and storage layers, and dirty pages never need to be written, which reduces the network traffic in the system. In general, CynosDB has the following advantages:

1. Elastic scaling of compute capacity. Slave nodes can be added quickly to expand read capacity, without making a full copy of the data.

2. Elastic scaling of storage capacity. Unlike a traditional database, CynosDB is not limited by the storage capacity of a single machine.

3. Full use of hardware resources. This alleviates the idle CPUs and low disk utilization of the traditional primary/standby architecture.

4. Easy backup. Backups run continuously in the background, with no user intervention required.

3. Computing Layer Architecture of PostgreSQL-Compatible CynosDB

CynosDB separates computing from storage, so the system is divided into two major parts: the computing layer and the storage layer. The computing layer is responsible for SQL parsing, log generation, and so on; the storage layer is responsible for data storage, log archiving, and log merging. This section uses the PostgreSQL-compatible version of CynosDB as an example to describe the computing layer architecture, shown in the figure below. To implement this NewSQL architecture, we redesigned parts of the PostgreSQL kernel:

The grey parts in the figure are native PostgreSQL kernel modules:

1. SQL: the PostgreSQL SQL engine, including lexical/syntax analysis, semantic analysis, query rewriting/optimization, and query execution. CynosDB makes no changes to the SQL layer, so it remains compatible with PostgreSQL's original SQL syntax and semantics.

2. Access: the access layer of the database, which defines how objects are organized and accessed. It includes:

- Heap: the table implementation and its access methods, including scan, update, insert, delete, and so on.

- btree/gin/gist/spgist/hash/brin: the implementations of the various index types and their operations, such as index scan and index insert.

- CLOG/MultiXact: transaction commit status, multi-transaction state, and other concurrency-related metadata.

The Access layer is the key module in our design and optimization. When database objects such as tables and indexes are modified, native PostgreSQL generates XLOG records and writes them to local log files. CynosDB also generates logs when these objects are modified, but instead of writing them to a local log file, it sends them to CynosStore.

3. Storage/buffer: the buffer pool and storage management, which call the file interfaces to read and write data files. In CynosDB, this layer uses the CynosStore Client to operate on files in CynosStore.

4. CynosStore Client: provides the interfaces for accessing CynosStore, through which the database performs its data file operations. The interfaces include a data page read interface and a log sending interface.

5. CynosStore: the log-based distributed block storage, which is not discussed further in this article.

The computing layer of CynosDB generates logs for its modifications to data files and sends them through the CynosStore Client to the distributed storage CynosStore, which periodically merges the logs into data pages. The important point here is that the computing layer does not write PostgreSQL's native XLOG; it uses a logging system and log format that we redesigned. As a result, CynosDB does not depend on PostgreSQL's native logging system, and this design also gives us room for further performance tuning. See the next section for details.

4. Architecture optimization

The architecture design of the CynosDB computing layer follows two main ideas:

1. "Minimal I/O": reduce network and disk I/O.

2. Efficient system design: an asynchronous logging design that reduces the CPU load on the computing layer.

These designs allow CynosDB to outperform an equivalently configured traditional database on the cloud. This section introduces the main optimizations in the computing layer.

4.1 Log System

CynosStore is a distributed block device that supports log writes and multi-version reads. The DB engine sends modifications to files in storage in the form of logs. The log format is <page number, page offset, modified content, modified length>, which describes the content modified at a given offset of a given page. Logs designed this way are idempotent.
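A minimal sketch of this log format follows, using hypothetical C types rather than CynosDB's actual definitions. Applying such an element is a plain byte overwrite, so replaying it any number of times leaves the page in the same state, which is what makes the log idempotent.

```c
#include <stdint.h>
#include <string.h>

/* <page number, page offset, modified content, modified length> */
typedef struct LogElement
{
    uint32_t page_no;   /* page number */
    uint16_t offset;    /* offset within the page */
    uint16_t length;    /* length of the modified content */
    char     data[];    /* the modified content itself */
} LogElement;

static void
apply_log_element(char *page, const LogElement *e)
{
    /* A pure overwrite: applying the same element twice changes nothing. */
    memcpy(page + e->offset, e->data, e->length);
}
```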

For example, the original PostgreSQL XLog format might be:

<relfilenode, pageno, offsetnum, infomask2, infomask, hoff, tuple_data>: insert a tuple at position offsetnum of the page (identified by relfilenode and pageno). During recovery, the inserted tuple is reconstructed from infomask2, infomask, hoff, tuple_data, and so on.

The same operation might generate the following logs in CynosDB. Suppose we insert a tuple into a page whose page number is n:

<n, 10, (char *)&pd_flags, 2>: save the page header field pd_flags to the log

<n, 12, (char *)&pd_lower, 2>: save the page header field pd_lower to the log

<n, 14, (char *)&pd_upper, 2>: save the page header field pd_upper to the log

<n, 36, (char *)&ItemIdData, 4>: save the third element of the ItemIdData array to the log

<n, 7488, (char *)tuple, 172>: save the tuple itself to the log

These entries record all the changes made to the page when the tuple is inserted. In the CynosStore Client they are assembled into an MTR (Mini-Transaction Record), which represents an atomic change to the database storage structure (for example, a B-tree structure change or a page structure modification), and the MTR is sent to storage as part of the log stream. During log replay, all logs of an MTR must be applied in full, otherwise the database storage structure would be corrupted: when storage merges an MTR into pages, it must ensure that every log in the MTR is applied, since any partial application would leave the page structure incorrect.
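The following is a minimal sketch of this MTR grouping, with hypothetical names rather than CynosStore's real interfaces: storage applies every element of an MTR onto the corresponding base pages before the result is made visible, so a reader never observes a half-applied structural change.

```c
#include <stdint.h>
#include <string.h>

typedef struct LogElement
{
    uint32_t page_no;
    uint16_t offset;
    uint16_t length;
    char    *data;
} LogElement;

/* An MTR: one atomic change to the storage structure, e.g. a B-tree split. */
typedef struct Mtr
{
    int         nelems;
    LogElement *elems;      /* elements in the order they were generated */
} Mtr;

/* Hypothetical storage-side helpers. */
extern char *get_base_page(uint32_t page_no);
extern void  publish_pages(const Mtr *mtr);

void
merge_mtr(const Mtr *mtr)
{
    /* First apply every element of the MTR onto its base page... */
    for (int i = 0; i < mtr->nelems; i++)
    {
        const LogElement *e = &mtr->elems[i];
        char *page = get_base_page(e->page_no);

        memcpy(page + e->offset, e->data, e->length);
    }

    /* ...and only then make the new page versions visible, so an MTR is
     * never observed half-applied. */
    publish_pages(mtr);
}
```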

We have optimized the PostgreSQL kernel to take advantage of this logging design, so the log volume overhead is similar to that of PostgreSQL's native XLOG. These optimizations and design points include:

1. Remove PostgreSQL's full page write (FPW) feature. To ensure that torn pages can be restored correctly after a crash, PostgreSQL writes the whole page into the log the first time the page is modified after a checkpoint; this feature is FPW, similar to MySQL's doublewrite. During crash recovery, the system replays logs on top of this full-page image as the base page and writes the recovered page back to storage, regardless of whether the page currently in storage is torn. Because CynosDB's logs are idempotent, when a torn page occurs the system simply replays the logs for that page to restore it to a consistent state. FPW is therefore unnecessary in CynosDB, which reduces the amount of logging.

2. Remove dirty page flushing. CynosDB records page changes as logs and can merge those logs onto the base page to obtain the latest page, so the compute node no longer needs to flush dirty pages; flushing the logs is enough.

The two optimizations above greatly reduce network I/O and log volume.

3. In addition to the kernel optimizations above, CynosDB also streamlines and compresses the logs themselves. Every CynosDB log carries a log header (LogHeader). If multiple logs that modify the same page share a single log header, the overhead of the extra headers is saved, as shown in the following figure:

LH stands for LogHeader, and a Log Element represents one change to a page. As shown in the figure above, two logs modify Block 1 and each originally carries its own log header (LH). After the log header merging optimization, a new MTR is formed in which the logs modifying Block 1 share a single log header.

If two logs that modify the same page are adjacent, you can further merge the two logs into one log. This reduces the number of log entries, resulting in faster log merging and page generation.
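Below is a minimal sketch of this second compaction step, with hypothetical types and under the assumption that "adjacent" means the two elements modify contiguous byte ranges of the same page: such elements can be folded into a single element, which also means a single header instead of two.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LOG_DATA_MAX 256            /* illustrative capacity only */

typedef struct LogElement
{
    uint32_t page_no;
    uint16_t offset;
    uint16_t length;
    char     data[LOG_DATA_MAX];
} LogElement;

/*
 * Try to fold element "next" into "prev". This is only legal when both
 * elements modify the same page and their modified byte ranges are
 * contiguous, the case described in the text above.
 */
static bool
try_merge_log(LogElement *prev, const LogElement *next)
{
    if (prev->page_no != next->page_no)
        return false;
    if (prev->offset + prev->length != next->offset)
        return false;
    if (prev->length + next->length > LOG_DATA_MAX)
        return false;

    /* Append next's bytes to prev: one element (and one header) remains. */
    memcpy(prev->data + prev->length, next->data, next->length);
    prev->length += next->length;
    return true;
}
```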

4.2 Page CRC

In PostgreSQL, the page CRC (checksum) is computed and filled in before a page is flushed. In CynosDB, filling it in on the compute node would generate an additional log to be written to storage, increasing both the CPU burden and the log volume of the compute node. To solve this problem, we pushed the CRC computation down into the storage layer, reducing the CPU load on the computing layer as well as the number of log entries.
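As a rough sketch of this idea: pg_checksum_page() is the checksum routine shipped with the PostgreSQL source, while the surrounding hook is a hypothetical storage-side step that fills in the page header checksum after log merging, instead of the compute node generating a log for it.

```c
#include "postgres.h"
#include "storage/bufpage.h"    /* PageHeader, pd_checksum */
#include "storage/checksum.h"   /* pg_checksum_page() */

/*
 * Hypothetical storage-side hook: after the logs have been merged onto the
 * base page and before the page is persisted or served, fill in the
 * checksum that PostgreSQL would otherwise compute on the compute node
 * before flushing the page.
 */
static void
finalize_merged_page(char *page, BlockNumber blkno)
{
    ((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
}
```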

4.3 Asynchronous Table Extension

A native PostgreSQL database stores data in a local file system, and its file extension operations are synchronous: they are reflected in the disk files immediately. In CynosDB, extension is implemented through logs, and flushing a log to storage for every single extension would hurt performance. We therefore implemented asynchronous file extension: extension logs are kept in the system's log buffer instead of being flushed to storage for each extension, and they are flushed when the transaction commits. This significantly improves the performance of bulk data import. In addition, an extension operation can extend a file by multiple pages at once, reducing the number of extension calls.
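A minimal sketch of the asynchronous extension idea, where all names and the batch size are hypothetical: the extension log covers several new pages at once and is appended to an in-memory log buffer, and the buffered logs only reach storage at commit time.

```c
#include <stdint.h>

#define EXTEND_BATCH_PAGES 8        /* extend several pages per call (illustrative) */

/* Hypothetical handles for a relation and the session's log buffer. */
typedef struct Relation { uint32_t relid; uint32_t nblocks; } Relation;
typedef struct LogBuffer LogBuffer;

extern void log_buffer_append_extend(LogBuffer *buf, uint32_t relid,
                                     uint32_t first_page, uint32_t npages);
extern void log_buffer_flush_to_storage(LogBuffer *buf);

void
relation_extend_async(Relation *rel, LogBuffer *logbuf)
{
    /* One extension log covering a whole batch of new pages; it stays in
     * the local log buffer for now, so there is no storage round trip per
     * extension call. */
    log_buffer_append_extend(logbuf, rel->relid, rel->nblocks,
                             EXTEND_BATCH_PAGES);
    rel->nblocks += EXTEND_BATCH_PAGES;
}

void
on_transaction_commit(LogBuffer *logbuf)
{
    /* The buffered extension (and other) logs are flushed to CynosStore
     * only at commit, amortizing the cost across the transaction. */
    log_buffer_flush_to_storage(logbuf);
}
```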

5. Future Work

Going forward, we will continue to explore areas such as new hardware and multi-master architectures, bringing more surprises and highlights to database products on the cloud.

This article has been published on the Tencent Cloud+ Community with the author's authorization.