POLARDB is Alibaba Cloud's independently developed next-generation cloud-native distributed database. It is 100% compatible with open source databases such as MySQL and PostgreSQL and highly compatible with Oracle syntax, so RDS customers can migrate to POLARDB without modifying application code and gain greater capacity, higher performance, lower cost, and more flexibility.

POLARDB is currently Alibaba Cloud's fastest-growing database product and is widely used in industries such as Internet finance, government services, new retail, education, gaming, and social live streaming.

As a new generation of cloud-native database built on a compute-storage separation architecture, POLARDB's compute nodes implement SQL parsing and optimization, parallel query execution, and lock-free high-performance transaction processing, while in-memory state is synchronized between compute nodes through a high-throughput physical replication protocol.

The storage layer is based on the distributed file system PolarFS, which uses the ParallelRaft consensus algorithm to maintain strong consistency across multiple data replicas. Multi-version page management for the storage engine is carried out in the storage layer, supporting a cluster-wide Snapshot Isolation level across compute nodes.

01 An advanced architecture based on compute-storage separation

Between compute nodes and storage nodes, an intelligent interconnect protocol that understands database semantics pushes filter and projection operators down from the compute layer to the storage layer. To keep transaction and query latency low and to reduce the delay of state synchronization between compute nodes, compute and storage nodes are connected over a 25 Gb high-speed RDMA network and communicate through a user-space network protocol stack that bypasses the kernel.

Thanks to this compute-storage separation architecture, POLARDB can scale flexibly from 1 compute node (2 CPU cores) to 16 compute nodes (up to 1,000 cores), and a single instance's storage capacity can grow elastically from 10 GB to 100 TB as usage requires.


Separating compute nodes from storage nodes gives POLARDB real-time horizontal scaling. Because the compute capacity of a single database instance is limited, the traditional approach is to build multiple database replicas to share the load and scale the database out.

However, this approach requires storing multiple full copies of the data, and the frequent synchronization of log data incurs high network overhead. Moreover, in a traditional database cluster, adding a replica requires synchronizing all incremental data, which compounds the synchronization latency.

POLARDB stores database files and log files such as redo logs on a shared storage device, so the primary instance and all replicas share the same full data and incremental logs. Only in-memory metadata needs to be synchronized between nodes, and the MVCC mechanism guarantees consistent reads across nodes. This neatly solves the data synchronization problem between the primary instance and its replicas, greatly reduces cross-node network overhead, and lowers replication lag between replicas.
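
To make the cross-node consistency idea concrete, here is a minimal sketch, under our own assumptions, of how a replica could serve a consistent read by waiting until its locally applied redo position reaches the snapshot LSN the read requires. POLARDB's actual protocol is internal; `ReplicaNode`, `apply_redo`, and `consistent_read` are hypothetical names.

```python
import threading

class ReplicaNode:
    """Toy model of a replica applying redo metadata from shared storage."""

    def __init__(self):
        self.applied_lsn = 0               # redo position applied so far
        self._cv = threading.Condition()

    def apply_redo(self, upto_lsn):
        """Called by the replication thread as redo metadata arrives."""
        with self._cv:
            self.applied_lsn = max(self.applied_lsn, upto_lsn)
            self._cv.notify_all()

    def consistent_read(self, snapshot_lsn, read_page):
        """Serve a read that must observe all changes up to snapshot_lsn."""
        with self._cv:
            # Wait until this node has caught up to the snapshot point.
            self._cv.wait_for(lambda: self.applied_lsn >= snapshot_lsn)
        return read_page()                 # page now reflects snapshot_lsn

node = ReplicaNode()
node.apply_redo(100)
print(node.consistent_read(100, lambda: "page@lsn>=100"))
```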

02 Improving transaction performance: POLARDB's kernel-level optimizations

POLARDB applies many optimizations at the kernel level to improve transaction performance. By attacking a series of scalability bottlenecks with lock-free algorithms and various parallel optimizations that reduce or even eliminate lock contention, the system's scalability is greatly improved.

Drawing on our experience with the large-scale, high-concurrency workloads of the Double 11 shopping festival, we also implemented optimizations for hot data such as inventory records. For simple, repetitive queries, POLARDB can fetch results directly from the storage engine, bypassing optimizer and executor overhead.

We also further optimized the already efficient physical replication. For example, we added metadata to the redo logs to reduce the CPU cost of log parsing; this simple optimization cuts log parsing time by 60%. We also reuse certain data structures to reduce pressure on the memory allocator.

POLARDB also uses a number of algorithms to optimize redo log application, for example applying logs only to data pages that are present in the buffer pool. We optimized the page cleaner and the doublewrite buffer as well, greatly reducing the cost of this background work. Together, these optimizations let POLARDB far outperform MySQL, reaching up to 6x MySQL's performance in write-heavy benchmarks such as sysbench oltp_insert.
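
Below is a minimal sketch of the "apply only if cached" idea. It is an illustration under assumed data structures (`Page`, `RedoRecord`, and a plain dict standing in for the buffer pool are all hypothetical), not POLARDB's actual redo-apply path.

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    lsn: int = 0            # LSN of the last change applied to this page
    data: bytes = b""

@dataclass
class RedoRecord:
    page_id: int
    lsn: int
    payload: bytes

def apply_redo(records, buffer_pool):
    """Apply redo records only to pages already cached in the buffer pool.

    Pages not in the pool are skipped: when later read from shared
    storage they can be brought up to date on demand, so eagerly
    replaying redo for them would be wasted work.
    """
    for rec in records:
        page = buffer_pool.get(rec.page_id)   # dict: page_id -> Page
        if page is None:
            continue                          # not cached: skip replay
        if rec.lsn > page.lsn:                # replay only newer changes
            page.data = rec.payload           # toy "apply": overwrite
            page.lsn = rec.lsn

bp = {1: Page(1)}
apply_redo([RedoRecord(1, 5, b"new"), RedoRecord(2, 7, b"skip")], bp)
print(bp[1].lsn, bp[1].data)   # 5 b'new'; page 2 untouched (not cached)
```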

03 Parallel Query Support

To speed up complex queries involving subqueries and joins (as in the TPC-H benchmark), the POLARDB query processor supports parallel queries that execute a single query simultaneously on several or all available CPU cores. A parallel query divides one query task (currently only SELECT statements are supported) into multiple subtasks that are processed in parallel, following an overall leader-worker concurrency model.

The leader thread generates the parallel query plan and coordinates the other components of parallel execution. The parallel plan includes sub-operations such as parallel scan, parallel multi-table join, parallel sort, parallel grouping, and parallel aggregation.

A message queue serves as the communication layer between the leader thread and the worker threads: workers send data to the leader through the message queue, and the leader sends control information to the workers the same way.

Worker threads do the actual execution. The leader thread parses the query statement, generates the parallel plan, and then launches multiple worker threads to process the parallel tasks. For efficiency, workers do not re-optimize the query; they directly copy the plan fragments generated by the leader, which requires that every node type in the execution plan tree support being copied.

After scanning, aggregating, sorting, and so on, each worker returns its intermediate result set to the leader. The leader collects the result sets from all workers, performs any necessary second-stage processing (such as merge sort or a final GROUP BY), and returns the final result to the client.
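
The following is a minimal, runnable sketch of this leader-worker model using Python threads and a shared queue. It is only an analogy for the mechanism described above (POLARDB's implementation lives in its database kernel); the predicate, sharding, and merge step are toy stand-ins.

```python
import heapq
import queue
import threading

def worker(shard, out_q):
    """Execute a plan fragment on one shard: scan, filter, local sort."""
    local = sorted(row for row in shard if row % 2 == 0)  # toy predicate
    out_q.put(local)                   # send intermediate result to leader

def leader(table, parallelism=4):
    """Split the scan, run workers, then merge-sort the partial results."""
    out_q = queue.Queue()              # the leader/worker message queue
    n = max(1, len(table) // parallelism)
    shards = [table[i:i + n] for i in range(0, len(table), n)]
    threads = [threading.Thread(target=worker, args=(s, out_q))
               for s in shards]
    for t in threads:
        t.start()
    partials = [out_q.get() for _ in shards]  # collect from all workers
    for t in threads:
        t.join()
    return list(heapq.merge(*partials))       # leader's second-stage merge

print(leader(list(range(20)), parallelism=4))
```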


The parallel scan layer exploits the data structure characteristics of the storage engine to balance the workload. The goal of data partitioning is to divide the data to be scanned into partitions so that all workers are kept as evenly busy as possible. In a storage engine whose storage structure is a B+ tree, partitioning starts at the root; if the root does not yield enough partitions (at least the degree of parallelism), partitioning continues at the next level.

For example, if six partitions are needed but the root node can only be split into four, we descend to the next level and partition further, and so on. In practice, to distribute scan segments evenly across worker threads, the B+ tree is divided into as many partitions as possible; a worker whose filter is highly selective, and which therefore finishes its current partition early, automatically attaches to the next unclaimed partition and continues. This auto-attach mechanism balances the load across all threads.
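
Here is a toy sketch of range partitioning plus auto-attach, under the assumption that partitions are key ranges taken from one B+ tree level and that idle workers pull the next unclaimed range from a shared queue; none of this is POLARDB's actual code.

```python
import queue
import threading

def partition_keys(level_keys, parallelism):
    """Turn the separator keys of one B+ tree level into key ranges.

    level_keys: separator keys at the shallowest level that yields at
    least `parallelism` ranges; descend a level if there are too few.
    """
    bounds = [None] + list(level_keys) + [None]   # None = open-ended
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

def scan_worker(partitions, results, scan_range):
    """Auto-attach: keep pulling the next unclaimed partition until done."""
    while True:
        try:
            lo, hi = partitions.get_nowait()
        except queue.Empty:
            return                                # no partitions left
        results.extend(scan_range(lo, hi))        # scan one key range

# Usage with a toy "table" of integers and a range scan over it:
data = list(range(100))
ranges = partition_keys([20, 40, 60, 80], parallelism=4)  # 5 ranges
parts = queue.Queue()
for r in ranges:
    parts.put(r)
out = []
scan = lambda lo, hi: [x for x in data
                       if (lo is None or x >= lo) and (hi is None or x < hi)]
workers = [threading.Thread(target=scan_worker, args=(parts, out, scan))
           for _ in range(3)]                     # 3 workers, 5 partitions
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(out))  # 100: every row scanned exactly once
```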


04 A new generation of cost-based optimizer

Customer workloads on the cloud are diverse, and a wrong execution plan leads to slow queries. To address this systematically, POLARDB introduces a new generation of cost-based optimizer. It implements a new compressed histogram that automatically detects high-frequency values and describes them precisely, and its selectivity calculation takes both value frequency and value range into account, handling the data skew that is widespread in real applications.

POLARDB bases much of its cost estimation on this improved histogram, for example estimating the result size of a join between two tables, which is a decisive factor in join cost and join order optimization. MySQL can only produce rough estimates from empirical formulas; whether using rows_per_key on an indexed column or a default parameter value without an index, the estimation error is large, and these errors are amplified across a multi-table join, yielding inefficient execution plans.

In POLARDB, histograms are used to merge the overlapping portions of the joined columns' value ranges, with different estimation algorithms adapted to different histogram types, which greatly improves estimation accuracy and helps the optimizer choose better join orders. On randomly generated normally distributed data, multi-table join queries run 2.4x to 12x faster after this optimization; in TPC-H tests, the join order of several queries changes and performance improves by 77% to 332%.

POLARDB also optimizes the records_in_range logic with histograms. MySQL uses index dives to estimate the number of records in a range for indexed filter conditions, which is CPU-intensive for short OLTP queries. After replacing index dives with histogram-based estimation, the response time of most queries in Taobao's core e-commerce workload was cut in half.
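
As a rough illustration of the compressed-histogram idea, the sketch below (our own simplification, not POLARDB's implementation) keeps the top-k high-frequency values as exact singleton buckets, puts the rest into equi-depth range buckets, and estimates a range predicate's selectivity from both:

```python
from collections import Counter

def build_compressed_histogram(values, top_k=3, n_buckets=4):
    """Singleton buckets for high-frequency values; equi-depth for the rest."""
    freq = Counter(values)
    singletons = dict(freq.most_common(top_k))      # value -> exact count
    rest = sorted(v for v in values if v not in singletons)
    step = max(1, len(rest) // n_buckets)
    buckets = []                                    # (lo, hi, count)
    for i in range(0, len(rest), step):
        chunk = rest[i:i + step]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return singletons, buckets, len(values)

def estimate_range_selectivity(hist, lo, hi):
    """Estimate the selectivity of `lo <= x <= hi` from the histogram."""
    singletons, buckets, total = hist
    n = sum(c for v, c in singletons.items() if lo <= v <= hi)
    for b_lo, b_hi, c in buckets:
        if b_hi < lo or b_lo > hi:
            continue                                # bucket misses range
        width = max(b_hi - b_lo, 1)
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo))
        n += c * overlap / width                    # assume uniform in bucket
    return n / total

# Skewed data: the value 5 is "hot" and dominates the distribution.
data = [5] * 50 + list(range(100))
hist = build_compressed_histogram(data, top_k=1)
print(estimate_range_selectivity(hist, 0, 10))      # ~0.41; hot value exact
```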



05 Self-developed distributed file system PolarFS: high reliability, high availability, and database co-design

The storage layer of POLARDB is PolarFS, a distributed file system developed independently by Alibaba Cloud. PolarFS is the first high-performance distributed storage system in China designed for database applications, with low latency and a fully user-space I/O stack (see the VLDB 2018 paper "PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database"). It provides low-latency, high-performance I/O comparable to a local SSD architecture, along with excellent storage capacity and performance scaling in distributed cluster mode.

As storage infrastructure co-designed deeply with POLARDB, PolarFS's core competitiveness lies not only in performance and scalability, but more deeply in the set of highly reliable, highly available, database co-designed storage technologies accumulated over time while serving demanding POLARDB customer workloads and operating at public-cloud scale.

To let POLARDB distribute queries across multiple compute nodes while maintaining global Snapshot Isolation semantics, PolarFS supports storing the multi-version pages dynamically generated by the B+ tree of the POLARDB storage engine.

To reduce read/write conflicts, modern databases generally provide transaction isolation levels such as RC, SI, and SSI within the framework of MVCC concurrency control. Under MVCC, each B+ tree page dynamically maintains a series of versions, allowing concurrently executing transactions to each access a different version of the same page.

In a POLARDB cluster, the B+ tree pages on different compute nodes may be at different versions because of cross-node replication delay, and multi-version storage can serve each node the version it needs. When a compute node writes a page to PolarFS, it supplies the page's version information (its LSN), and PolarFS stores the version metadata along with the page. When a compute node reads a page, it likewise supplies version information and receives the corresponding (possibly historical) version of the page from storage.

The POLARDB database layer periodically sends PolarFS the low-water mark of the versions in use across all compute nodes, and PolarFS uses this version number to garbage-collect historical versions that are no longer needed.
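
A minimal sketch of LSN-versioned pages with low-water-mark garbage collection might look as follows; the class and method names are hypothetical, and PolarFS's real interface is internal.

```python
import bisect
from collections import defaultdict

class MultiVersionPageStore:
    """Toy model of LSN-versioned page storage with low-water-mark GC."""

    def __init__(self):
        self._lsns = defaultdict(list)   # page_id -> sorted LSNs
        self._data = {}                  # (page_id, lsn) -> page bytes

    def write_page(self, page_id, lsn, data):
        bisect.insort(self._lsns[page_id], lsn)
        self._data[(page_id, lsn)] = data

    def read_page(self, page_id, read_lsn):
        """Newest version with LSN <= read_lsn (the reader's snapshot)."""
        lsns = self._lsns[page_id]
        i = bisect.bisect_right(lsns, read_lsn)
        if i == 0:
            raise KeyError("no version old enough")
        return self._data[(page_id, lsns[i - 1])]

    def gc(self, low_water_mark):
        """Drop versions below the cluster-wide low-water mark, keeping
        the newest version at or below it (still visible to readers)."""
        for page_id, lsns in self._lsns.items():
            i = bisect.bisect_right(lsns, low_water_mark)
            for lsn in lsns[:max(0, i - 1)]:
                del self._data[(page_id, lsn)]
            self._lsns[page_id] = lsns[max(0, i - 1):]

store = MultiVersionPageStore()
store.write_page(7, 10, b"v10")
store.write_page(7, 20, b"v20")
print(store.read_page(7, 15))    # b'v10': an older reader sees its version
store.gc(low_water_mark=20)      # v10 no longer needed by anyone
```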


Data reliability is the bottom line of every POLARDB design decision. In a distributed system, hardware, firmware, and software defects in disks, networks, and memory can all corrupt data, posing many challenges to reliability. On the storage side, the main threats are silent errors (lost writes, misdirected writes, block corruption, and so on); on the network and in memory, they are mainly bit flips and software bugs.

POLARDB and PolarFS therefore provide end-to-end, full-link data verification to preserve data reliability under all kinds of failures, including hardware faults, software faults, and operator error.

On the write path, data is verified at every intermediate hop, from the storage engine on the compute node all the way down to the disks of the PolarFS storage nodes, preventing corrupted data from being written.

On the read path, both PolarFS and the POLARDB storage engine verify checksums on the data they read, precisely detecting silent disk errors and preventing them from spreading.

When service traffic is low, a background data consistency scan runs continuously, checking that each replica's checksums are correct and that the replicas agree with one another. Verification during data migration is equally important: in any form of data migration, POLARDB verifies not only the checksums of the copy being migrated but also the consistency of the data across replicas, and migrates data to the target only when both checks pass. This prevents an error on a single replica from spreading through migration and corrupting data.
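
The sketch below illustrates this checksum discipline with CRC32, assuming a simple 4-byte footer format of our own invention; POLARDB and PolarFS use their own internal formats and checks.

```python
import zlib

def seal(payload: bytes) -> bytes:
    """Attach a CRC32 footer when data enters the write path."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify(block: bytes) -> bytes:
    """Check the footer at each hop (and on read); fail loudly on mismatch."""
    payload, crc = block[:-4], int.from_bytes(block[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise IOError("silent data corruption detected")
    return payload

def migrate(replicas: list) -> bytes:
    """Migrate only if every replica verifies AND all replicas agree."""
    payloads = {verify(r) for r in replicas}   # per-replica checksum check
    if len(payloads) != 1:
        raise IOError("replicas disagree; refusing to migrate")
    return payloads.pop()                      # safe to copy to the target

block = seal(b"page 42 contents")
assert verify(block) == b"page 42 contents"
print(migrate([block, block, block]))
```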

PolarFS also gives POLARDB fast physical snapshot backup and restore. Snapshots are a popular storage-system-level backup scheme. In essence, a redirect-on-write mechanism records the block device's metadata at the snapshot point and redirects subsequent writes on the storage volume to newly allocated blocks, leaving the original blocks untouched so the data can be restored as of the snapshot point in time.

A snapshot is thus a typical post-processing mechanism: no data is copied when the snapshot is taken. Instead, the backup work is spread over the time window of writes that occur after the snapshot is created, which makes both backup and restore respond quickly.
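
Here is a toy model of redirect-on-write: taking a snapshot copies only the logical-to-physical block map, and later writes are redirected to fresh physical blocks so the snapshot's blocks survive untouched. This illustrates the mechanism only; PolarFS's block layer is far more involved.

```python
class Volume:
    def __init__(self):
        self.blocks = {}        # physical store: block_no -> bytes
        self.live = {}          # logical -> physical mapping
        self.snapshots = []     # frozen copies of the mapping
        self._next = 0

    def _alloc(self, data):
        self.blocks[self._next] = data
        self._next += 1
        return self._next - 1

    def write(self, logical, data):
        # Redirect-on-write: never overwrite a block a snapshot may need;
        # always write to a fresh physical block and remap.
        self.live[logical] = self._alloc(data)

    def snapshot(self):
        self.snapshots.append(dict(self.live))   # copies metadata only
        return len(self.snapshots) - 1

    def read(self, logical, snap_id=None):
        mapping = self.live if snap_id is None else self.snapshots[snap_id]
        return self.blocks[mapping[logical]]

v = Volume()
v.write(0, b"v1")
s = v.snapshot()                 # instant: records the mapping, no data copy
v.write(0, b"v2")                # redirected to a new physical block
print(v.read(0), v.read(0, s))   # b'v2' b'v1'
```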

By combining the underlying storage system's snapshots with incremental redo log backups, POLARDB performs point-in-time recovery of user data far more efficiently than the traditional approach of full backups plus incremental logical logs.

06 High Oracle syntax compatibility at 1/10 the cost of commercial databases

In addition to being 100% compatible with MySQL and PostgreSQL, two of the most popular open source database ecosystems, POLARDB is highly compatible with Oracle syntax and offers cloud solutions for traditional enterprises at 1/10 the cost of commercial databases.

DMS replaces Oracle's GUI management tool OEM, and POLARDBPlus replaces the SQL*Plus command-line tool, preserving Oracle DBAs' working habits. On the client side, the OCI and O-JDBC driver SDKs can be swapped for libpq and a JDBC driver; only the .so and JAR packages need to be replaced, with no change to the main application code.

Common Oracle SQL DML syntax is supported, as is almost all advanced syntax such as CONNECT BY, PIVOT, and LISTAGG. PL/SQL stored procedures and the built-in function libraries they use are also fully covered.

Some advanced features (such as security management and AWR) are offered with exactly the same output layout and operation syntax. Viewed as a whole, POLARDB comprehensively covers Oracle's operating methods, usage habits, ecosystem tools, SQL syntax, and output formats; combined with the migration assessment tool ADAM, applications can be migrated with little or no change.

Looking ahead: more new technologies and enterprise features are coming soon

Beyond the technologies described above, a number of new POLARDB technologies and enterprise-grade features will be released in the second half of 2019, improving POLARDB's overall availability, performance, and cost:

1) From elastic storage to elastic memory: warm buffer pool technology

POLARDB will support a "warm" buffer pool decoupled from the compute node process, which greatly reduces the impact on user workloads when a compute node restarts and lessens the impact of instance specification upgrades and downgrades (Serverless). Decoupled memory also makes dynamic, on-demand expansion and contraction possible.

2) Multi-fold performance gains and better DDL support (FAST DDL)

POLARDB will soon support parallel DDL, greatly reducing table-level DDL latency. This feature pushes parallelism to the extreme and can cut the time of DDL operations such as index building by nearly 10x. POLARDB has also implemented a number of optimizations at the DDL replication level, so that bulk DDL replicates across regions faster and with less resource consumption.

3) Cross-region Global Database support

POLARDB supports long-distance, cross-region physical replication, helping users build global database deployments. Data is replicated in real time to data centers around the world, so queries from global users can be answered quickly from the nearest local data center.

4) Partitioned table support

POLARDB supports 100 TB of storage capacity, but as a table grows, the depth of its indexes grows too, slowing down data lookup, and physical locks on certain single-table indexes cap the throughput of parallel DML.

Proper partitioning therefore becomes all the more urgent. Previously, many users relied on middleware outside the database to relieve the pressure on a single table.

With POLARDB's progress in areas such as parallel query, however, the sharding functions of such middleware can now be realized more effectively inside the database, in the form of partitioned tables.

Efficient partitioning not only enables us to support larger tables, but it also reduces global physical lock conflicts for some database indexes, thereby improving overall DML performance.

This mode also better supports hot/cold data separation, storing data of different "temperatures" on different storage media, which preserves access performance while reducing storage cost.

POLARDB has enhanced a series of partitioned-table features, including global indexes, foreign key constraints, and interval partitioning, making it better able to handle very large tables.

5) Row-level compression

POLARDB will soon launch row-level compression. The industry's common practice is to compress data at the page level with general-purpose algorithms (such as LZ77 or Snappy), but this carries excessive CPU cost: changing a single row requires decompressing the whole page, modifying it, and compressing it again.

Compressed pages can also bloat in some cases, triggering repeated page splits. POLARDB instead uses fine-grained row-level compression, compressing each data type in a way specific to it.

Data stays compressed both on disk and in memory, and only the rows a query actually touches are decompressed, never a whole page. Because data remains compressed everywhere except at query time, the logs also record compressed data, further shrinking log volume and the network pressure of shipping data and logs; the corresponding indexes likewise store only compressed data.

The overall reduction in data volume is enough to offset the extra decompression cost, so compression greatly reduces storage without degrading performance.
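
The contrast between the two granularities can be sketched as follows, using zlib as a stand-in for whatever codecs POLARDB actually employs (its type-specific row codecs are not public):

```python
import zlib

rows = [f"user-{i:06d},beijing,active".encode() for i in range(100)]

# Page-level: one compressed blob per page. Updating row 42 forces a
# full decompress -> modify -> recompress cycle for the entire page.
page = zlib.compress(b"\n".join(rows))
decompressed = zlib.decompress(page).split(b"\n")
decompressed[42] = b"user-000042,shanghai,active"
page = zlib.compress(b"\n".join(decompressed))   # whole page rewritten

# Row-level: each row compressed on its own. An update touches one row,
# and a point query decompresses only the row it needs.
crows = [zlib.compress(r) for r in rows]
crows[42] = zlib.compress(b"user-000042,shanghai,active")
print(zlib.decompress(crows[42]))                # only this row inflated
```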

6) In-memory column storage (HTAP)

In the traditional database world, analytical processing and online transaction processing are separate systems, so at the end of each business day the transactional data (together with previously analyzed data) usually has to be imported into a data warehouse, where the analysis is run to produce the corresponding reports.

An HTAP database eliminates the time and operational cost of moving data at scale, meets the needs of most enterprise applications in one system, and can produce T+0 analysis reports at the close of the trading day.

To meet this demand, POLARDB implements in-memory columnar tables that are kept synchronized with POLARDB's row-store data directly through its physical and logical logs. Analysis-oriented operators can then run real-time analytics over this columnar data, giving users one-stop analysis results.

7) X-Engine: a storage engine with hot/cold data tiering

Data volumes keep growing, but not all data is accessed equally often; access patterns almost always show a clear hot/cold distribution. Based on this characteristic, X-Engine adopts a tiered storage architecture: data is divided into multiple levels according to how hot or cold its access frequency is, with a storage structure designed for each level's access pattern and each level written to an appropriate storage device.

Unlike traditional B+ tree designs, X-Engine uses an LSM-tree as the foundation of its tiered storage. It employs multiple transaction processing queues and pipelining to reduce thread context-switch costs, and tunes the ratio of tasks at each pipeline stage so the whole pipeline stays full, greatly improving transaction throughput. Data reuse lowers the cost of merging (compaction) and reduces the performance jitter caused by cache eviction, and FPGA hardware is used to accelerate compaction further.
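
As a loose illustration of temperature-based tiering (not X-Engine's actual design, which layers an LSM-tree across multiple devices), consider this toy store that demotes the oldest flushed entries to a cold tier:

```python
from collections import OrderedDict

class TieredStore:
    def __init__(self, hot_capacity=4):
        self.memtable = {}                  # newest writes (hottest)
        self.hot = OrderedDict()            # e.g. kept on NVMe/SSD
        self.cold = {}                      # e.g. demoted to cheap storage
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self.memtable[key] = value

    def flush(self):
        """Flush the memtable into the hot tier; demote overflow to cold."""
        self.hot.update(self.memtable)
        self.memtable.clear()
        while len(self.hot) > self.hot_capacity:
            k, v = self.hot.popitem(last=False)   # oldest flushed entry
            self.cold[k] = v                      # compacted downward

    def get(self, key):
        for tier in (self.memtable, self.hot, self.cold):
            if key in tier:
                return tier[key]
        return None

s = TieredStore(hot_capacity=2)
for i in range(5):
    s.put(f"k{i}", i)
s.flush()
print(s.get("k0"), len(s.hot), len(s.cold))   # k0 was demoted to cold
```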

Compared with storage engines of similar architecture such as RocksDB, X-Engine's transaction performance is more than 10x higher.

For details, see the SIGMOD 2019 paper "X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing".

Today POLARDB not only supports Alibaba Group businesses such as Taobao, Tmall, and Cainiao, but is also widely used in government, retail, finance, telecommunications, manufacturing, and other sectors; 400,000 databases have already been migrated to Alibaba Cloud.

Running on the POLARDB distributed database, Beijing's public transport system quickly and smoothly dispatches more than 20,000 buses across the city, serving 8 million trips per day.

ZhongAn Insurance uses the database to process policy data, improving efficiency by 25%.


This article is original content from the Yunqi Community (Alibaba Cloud) and may not be reproduced without permission.