Databases are in a period of transformation, driven at once by external and internal forces: the external force is changing user requirements, and the internal force is the emergence of new technology. Users' demands are shifting from physical ownership of data to logical ownership, so the cloud-service form of the database is becoming more widely accepted; the technology explosion shows itself in the productization of new storage media. Tencent's cloud-native database is a product of this change, delivering better database performance, availability, and reliability in the form of a cloud service. This article is compiled from the talk "Tencent Cloud TDSQL-C Architecture Exploration and Practice" given by Zhang Qinglin, Technical Director of Tencent Cloud Database, at the Techo TVP Developer Summit "A Song of Ice and Fire of Data — from Online Database Technology to Massive Data Analysis Technology". It gives a detailed introduction to the architecture of Tencent's cloud-native database and its breakthroughs and explorations on core metrics.



I. The past and present of Tencent cloud native database

Today's talk consists of three parts. The first part is the background of TDSQL-C: why we built it, its architecture, and where it stands today. The second part introduces the breakthrough innovations of TDSQL-C and what we have done, from a user's perspective, to make it convenient for users. The third part is the roadmap for TDSQL-C's future.

Let's start with the first part. We previously worked on product operation and R&D for CDB, where we encountered several problems that are hard to solve with a traditional database. The first is storage capacity: once the disk capacity of a single machine reaches a certain level, it causes trouble for the business. The second is scalability: anyone familiar with database operations knows the typical scenario of adding machines before a business campaign and removing them afterwards; the process generally goes through backup, building replicas, and creating instances, and its efficiency is poor. The third is availability, typically in disaster recovery: from a DBA's perspective, when an HA failover occurs, its timing is uncontrollable. The fourth is reliability: in the traditional single-node MySQL architecture there is one primary, with data on local storage, so when a local disk is damaged, data reliability is at risk. Traditional MySQL copes with this through data backup and recovery, but if replication lag is large or a heavy DDL is in flight, and an HA failover happens at that moment, the database service is effectively unavailable.

Based on the problems we encountered operating and maintaining traditional databases, and drawing on architectures in the industry, we independently developed Tencent's own database product with compute-storage separation.

This is the overall architecture diagram of TDSQL-C. Like traditional MySQL, it supports one read-write node plus multiple standby nodes, with up to 15 read-only nodes. The system is divided into a compute part and a storage part. Compute nodes are responsible for the traditional business logic of a database, such as transactions, locks, and common DML — in effect, every traditional database operation except data persistence. The compute layer does not persist data itself; persistence operations are pushed down to the storage layer, which is responsible for persisting the data. The storage layer is HiStore (network storage), which currently supports a maximum of 1 PB. Another difference from traditional replication is that the primary and standby databases synchronize via the Redo log: after the Redo log reaches a standby, only the buffer pool (BP) on the standby needs to be updated.

Now compare native MySQL's primary-standby architecture with the current TDSQL-C architecture. Native MySQL stores data locally, depending on the local disk and limited by local storage space. TDSQL-C stores data on a networked cloud disk with two tiers of storage: HiStore is responsible for storing the data, and a backup tier is responsible for storing backups.

Compute and storage are separated, but the primary and standby compute nodes share the same storage — the HiStore underneath. Unlike the traditional MySQL architecture, this is "computable storage", which shows in two ways. First, the primary compute node sends only Redo logs to the storage layer, and the storage layer persists data by applying the Redo log to its base pages; once the storage layer acknowledges receipt of the Redo log, the transaction is considered complete and business logic can continue. Second, while business logic keeps running, the storage layer asynchronously processes the Redo log: it merges the received Redo records into new page versions and persists them in HiStore. So, unlike traditional CDB or traditional MySQL, the compute layer performs no data-page persistence of its own, and there is no performance jitter associated with WAL-driven page flushing.

After HiStore persists the data, it reclaims the memory occupied by the Redo log. Our storage also has the following feature: when a local disk is mapped to network storage, each physical address maps to a cell in the storage space, so the logs are distributed by page, and each cell holds its corresponding data page and Redo log. The store is therefore required to support multiple versions of data: a specified version of a page is generated from its Base Page plus the Redo log passed down from the compute layer.
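The base-page-plus-redo versioning just described can be sketched as follows. This is a minimal, hypothetical model: real cells hold binary InnoDB pages and physical redo records, not the Python dicts and `(lsn, key, value)` tuples used here.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    """A storage cell holding one base page image and its redo log chain."""
    base_page: dict = field(default_factory=dict)  # page image as of base_lsn
    base_lsn: int = 0
    redo: list = field(default_factory=list)       # [(lsn, key, value)], LSN-ordered

    def append_redo(self, lsn, key, value):
        """Receive a redo record from the compute layer."""
        self.redo.append((lsn, key, value))

    def page_at(self, target_lsn):
        """Materialize the page version at target_lsn: base page + redo replay."""
        page = dict(self.base_page)
        for lsn, key, value in self.redo:
            if lsn > target_lsn:
                break
            page[key] = value
        return page

    def checkpoint(self, lsn):
        """Asynchronously advance the base page, then reclaim the applied redo."""
        self.base_page = self.page_at(lsn)
        self.base_lsn = lsn
        self.redo = [r for r in self.redo if r[0] > lsn]
```

Note that `checkpoint` is what lets the storage layer reclaim redo memory without losing the ability to serve newer versions: records above the checkpoint LSN are kept and still replay on top of the new base page.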

Having introduced the architecture, overall TDSQL-C has the following features. The first is massive storage with intelligent capacity expansion: storage is backed by HiStore, and the network storage can reach 1 PB, which is not in the same league as the tens of TB a single machine holds today. The second is that overall QPS can reach the millions, with performance scaling linearly. Third, it is compatible with MySQL and PostgreSQL, and because it does not shard data it avoids the problems brought by distributed transactions and locks while still naturally supporting distributed scale. Next is elasticity: compared with traditional RDS/CDB architectures, scaling does not depend on local physical or logical backups, whose time grows with backup size; instead it takes a snapshot directly from the file system and replays the Redo log, so expansion completes within roughly a minute — effectively second-level expansion. Finally there are the Serverless, backup, and failover capabilities, which we will look at below.

II. Breakthroughs in core metrics of Tencent cloud native database

1. Breakthrough one

Those are the architecture and features TDSQL-C supports, and where we stand today. In implementing TDSQL-C, a distributed product with compute-storage separation, we built several features to solve users' practical problems:

The first is the Serverless scenario. Before supporting Serverless, doing business development and operations meant first purchasing a billed database instance, including storage space and compute resources, and billing started the moment the instance was purchased. But business development is a low-frequency usage period: across the whole development cycle, the database is actually in use only a small fraction of the time. In the Serverless scenario, billing is based on the time you actually use: time is divided into slices, 12 per minute and 720 per hour, and only the slices in which the instance is actually in use count as billable time. When you buy a Serverless instance you get compute resources and storage space; compute is charged in full only while actually in use, and when idle only the storage space is billed, which reduces consumption and maximizes the user's benefit.
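The slice-based billing just described can be sketched roughly as follows. The function names and prices are illustrative, not Tencent Cloud's actual billing API; the 5-second slice granularity is inferred from the 12-per-minute / 720-per-hour figures above.

```python
def billable_slices(active_seconds, slice_seconds=5):
    """Round active compute time up to whole billing slices.
    12 slices/minute and 720 slices/hour imply 5-second slices."""
    return -(-active_seconds // slice_seconds)  # ceiling division

def serverless_cost(active_seconds, price_per_slice, storage_gb, storage_price_per_hour):
    """One hour's cost: compute is billed only for active slices,
    storage is billed for the full hour regardless of activity."""
    compute = billable_slices(active_seconds) * price_per_slice
    storage = storage_gb * storage_price_per_hour
    return compute + storage
```

An instance that is idle the whole hour pays `storage_gb * storage_price_per_hour` and nothing for compute, which is exactly the "only storage space is billed when not in use" behavior described above.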

Starting and stopping a Serverless instance requires no user involvement: it shuts down automatically when there is no access, and when a request arrives again, an intermediate layer detects that the instance is needed and quickly pulls it back up. So it has two characteristics: first, intelligent extreme elasticity with extreme start-stop speed; second, true pay-as-you-use.

To implement Serverless, we did the following. Since data now lives on network storage, network latency increases; to achieve fast start and stop, we decoupled the BP (buffer pool) from MySQL, so that from the moment an instance is purchased and the BP is created, the BP is independent of the MySQL process.

Because the MySQL startup and shutdown process itself takes time, we analyzed it and parallelized it. To achieve true "pay for what you use", we also changed space allocation: when a table is created, MySQL normally pre-allocates storage in extents, but we only count a 1 MB extent's space as billable once data is actually written to a page within it. From the user's point of view, billed storage is thus reduced close to the minimum, and we will continue optimizing toward true page-level usage billing.

2. Breakthrough two

A single machine's memory is relatively small compared to its storage: it may have only a few hundred GB of memory, or under a TB, while its storage runs to hundreds of TB — a typical IO-bound workload. To address this, we added a second-level cache beneath the BP. In the IO-bound scenario, pages evicted from the BP are not simply discarded; instead they are written out to local SSD or local AEP storage, so the next access can read them directly from local storage, minimizing network I/O consumption. In our test scenarios, overall performance improved severalfold even when the hit ratio was low, below 50%.

When local disk space serves as the second-level cache, the capacity is, first of all, controllable: we can configure how much local storage is used as the BP's second-level cache. Its management resembles BP memory management: the local second-level cache is evicted first, and when the local files are not large enough, a series of eviction algorithms decides what to drop. Because the cache lives in local files, reads from it use no network I/O; local I/O thus offsets network I/O consumption, and the overall performance gain is significant. This diagram shows the architecture of our local second-level cache.
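A minimal sketch of the demote-on-evict idea behind the second-level cache follows. The class and parameter names are invented for illustration; the real implementation manages fixed-size pages on an SSD/AEP device, not in-memory Python dicts.

```python
from collections import OrderedDict

class TwoLevelCache:
    """Buffer pool (L1, memory) that demotes evicted pages to a local-disk
    second-level cache (L2) instead of dropping them; only a miss in both
    levels costs a network read."""

    def __init__(self, l1_capacity, l2_capacity, fetch_from_network):
        self.l1 = OrderedDict()          # LRU order: oldest first
        self.l2 = OrderedDict()
        self.l1_cap = l1_capacity
        self.l2_cap = l2_capacity
        self.fetch = fetch_from_network  # stands in for a network I/O
        self.network_reads = 0

    def get(self, page_id):
        if page_id in self.l1:           # memory hit
            self.l1.move_to_end(page_id)
            return self.l1[page_id]
        if page_id in self.l2:           # local-disk hit: no network I/O
            page = self.l2.pop(page_id)
        else:                            # miss in both levels
            self.network_reads += 1
            page = self.fetch(page_id)
        self._put_l1(page_id, page)
        return page

    def _put_l1(self, page_id, page):
        self.l1[page_id] = page
        if len(self.l1) > self.l1_cap:
            victim, vpage = self.l1.popitem(last=False)
            self.l2[victim] = vpage      # demote instead of discard
            if len(self.l2) > self.l2_cap:
                self.l2.popitem(last=False)  # only here is a page truly dropped
```

The key point is in `_put_l1`: eviction from the buffer pool becomes a demotion, so a re-read that would have been a network round trip becomes a local read.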

3. Breakthrough three

The third breakthrough is imperceptible backup. Traditional CDB or RDS architectures use logical or physical backups, which have two typical problems: both involve a large amount of I/O and occupy many system resources, and they take a global lock in order to obtain a consistent position for subsequent incremental backups.

We pursue two goals in backup. One is imperceptible backup, so that users are unaware of it. The other is extreme speed, because traditional restore works from local data: for a logical backup, re-importing the data; for a physical backup, copying it and then replaying incremental backups. Only with imperceptible backup and extremely fast restore can second-level expansion be achieved.

This is backup and restore in TDSQL-C. A backup is actually a file-system snapshot taken by HiStore: when a backup command is issued, HiStore first snapshots the current data, and subsequent writes go to new storage. The backup then copies the snapshotted data and uploads it to COS; because the underlying data is spread over many cells, the backup runs in parallel. Restore is likewise parallel: the Redo log is dispatched to each cell, which holds its original data plus the Redo log, and every cell applies its Redo log independently. So both backup and restore are parallel, and restore does not replay logical Binlog changes but directly applies physical changes, so restore speed is at the GB level. Relying on local computable storage and HiStore's features, we achieve automatic imperceptible backup and second-level restore.
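The per-cell parallel restore can be sketched like this. It is a simplified model: `apply_redo` stands in for replaying physical redo records onto a cell's snapshot image, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def apply_redo(cell):
    """Replay one cell's redo records onto its snapshot page image."""
    page = dict(cell["snapshot"])
    for key, value in cell["redo"]:
        page[key] = value
    return page

def parallel_restore(cells, workers=4):
    """Restore all cells concurrently: each cell independently applies its
    own redo log to its own snapshot, so restore time is bounded by the
    largest cell rather than the total data size."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(apply_redo, cells))
```

This is why the restore is described as GB-level: there is no single serial Binlog replay; every cell's physical redo apply proceeds in parallel with the others.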

III. Future exploration of TDSQL-C

Building on what we are doing now — going from 0 to 1 on one hand, and on the other addressing problems met during operations — the future direction of TDSQL-C has two main points. The first is minimalist database operations. From the operations perspective, the first challenge is intelligence: anyone who has used MySQL or CDB has likely hit problems with the MySQL optimizer, for example an execution plan going wrong after an upgrade, or going wrong once the data volume reaches a certain size. The TDSQL-C kernel optimizes the statistics that feed the optimizer and tunes execution plans dynamically; for example, if an execution plan goes wrong during an upgrade, it can be corrected automatically and reported to operations.

Because building a compute-storage-separated product required large modifications to the kernel code, and we made further modifications for storage capacity and extreme read/write performance, we take corresponding measures to guarantee logical rigor, including introducing industry testing infrastructure and, in principle, keeping kernel modifications under control. From the cost perspective, most of the cost in the Serverless scenario comes from storage, so users' cost should be reduced. This can be done in two ways. One is ensuring users pay only for what they actually use: storage is allocated in larger units, and we need to truly achieve page-level billing of storage space. The other concerns the three replicas: through data compression, erasure coding, and similar techniques, storage space can be cut by roughly one third. This is also a future direction for us.

The third point is efficiency. With DDL, once a single table reaches the TB level, MySQL will first scan the index, read all the data the index covers, sort each run, merge-sort, and then create the index — a series of steps that is very time-consuming. There are two ways to optimize this. The first is Instant DDL: supporting more changes, such as column size changes, as instant operations. The second is parallelizing index creation: all the steps just mentioned — scanning the B-tree, sorting, and building the B+-tree — run in parallel. Where Instant DDL applies, the DDL completes in seconds; where it does not, parallel index processing is used.
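The parallel index build pipeline — partitioned scan, concurrent sort, merge — can be sketched as follows. This is a simplified illustration, not InnoDB's actual sort-buffer and bulk-load code; the partitioning and worker count are arbitrary choices.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_index_build(rows, key, workers=4):
    """Sketch of parallel index creation: partition the table scan, sort
    each partition concurrently, then merge the sorted runs into the final
    key order (the merge step that precedes bulk-loading the B+-tree)."""
    chunk = max(1, len(rows) // workers)
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(lambda p: sorted(p, key=key), parts))
    # k-way merge of the sorted runs yields globally ordered index entries
    return list(heapq.merge(*runs, key=key))
```

The speedup comes from the sort phase: each partition sorts independently, so for a TB-scale table the dominant O(n log n) work is spread across workers, while the final merge remains a single linear pass.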

For business-oriented low-code database development, especially on a product like TDSQL-C, the stored data can be very large — hundreds of TB, or PB level near the maximum capacity — so data analysis capability must improve accordingly. First, we will raise the maximum storage and, through the optimizer and a parallel execution framework, make maximum use of machine resources. We will also use new storage media such as AEP or SSD, greatly enhancing the local second-level cache described earlier and pushing whatever can be done locally to its fastest.

For business types such as finance or government with disaster-recovery compliance requirements, we will provide full-link audit, including field encryption, and optimize TDSQL-C's product capabilities along the principle of global multi-region disaster recovery with nearby access.

Finally, to integrate AP (analytical processing) capability into TDSQL-C, we are currently developing a column-store engine built into TDSQL-C, using local column storage to accelerate analysis across the whole scenario.

About the speaker

Zhang Qinglin

Technical Director of Tencent Cloud Database

Responsible for kernel R&D of the Tencent Cloud databases CDB and TDSQL-C. Technical Director of Tencent Cloud Database, Tencent Cloud Evangelist, MySQL architect, leader of the cloud-native database kernel R&D team in Tencent's Cloud Architecture Platform Department, MariaDB Foundation board member and MariaDB community version developer, focused on MySQL kernel development and related architecture and evolution work.