Introduction: On March 2, at the Alibaba Cloud open-source PolarDB enterprise-grade architecture launch event, Alibaba Cloud PolarDB kernel technology expert Yan Hua gave a talk detailing PolarDB HTAP. On top of PolarDB's storage-compute separation architecture, we developed an MPP distributed execution engine based on shared storage. It solves the problem that a single SQL statement can only run on one node and therefore cannot use the computing resources of other nodes or fully exploit the I/O bandwidth of the shared storage pool, while also providing elastic compute and elastic scaling. With this, PolarDB gains initial HTAP capability. This talk introduces the functional features and key technologies of PolarDB HTAP.

Live video review: developer.aliyun.com/topic/Polar… PDF download: developer.aliyun.com/topic/downl…

The following is organized from the video of the launch-event talk:

I. Background

Many PolarDB users need to run both TP and AP on one system. They use PolarDB to handle highly concurrent TP requests during the day, and then run AP reporting and analysis on the same PolarDB cluster at night, when TP traffic drops and the machines sit idle. Even so, the idle machine resources still cannot be fully utilized.

This is because the native PolarDB for PostgreSQL system faced two major challenges when handling complex AP queries. First, under the native PG execution engine, a single SQL statement can only be executed on a single node, whether serially or in parallel, so it cannot use the CPU, memory, and other computing resources of other nodes; it can only scale up vertically, not scale out horizontally. Second, the bottom layer of PolarDB is a storage pool whose I/O throughput is, in theory, virtually unlimited, yet because a single SQL statement runs on a single node, the CPU and memory of that one node become the bottleneck and the large I/O bandwidth of the storage side cannot be fully exploited.

PolarDB decided to build HTAP to solve these real-world pain points. Current HTAP solutions in the industry mainly fall into three types:

① TP and AP separated in both storage and compute. TP and AP are fully isolated and do not affect each other, but this approach has several problems in practice. First, TP data must be imported into the AP system, which introduces delay and reduces freshness. Second, a redundant AP system must be added, which raises total cost. Third, adding an AP system increases operation and maintenance complexity.

② TP and AP shared in both storage and compute. This minimizes cost and maximizes resource utilization, but two problems remain. First, because compute is shared, AP queries and TP queries running at the same time inevitably interfere with each other to some degree. Second, when the proportion of AP queries grows and compute nodes must be added, the data has to be redistributed, so fast elastic scale-out cannot be achieved.

③ TP and AP shared in storage but separated in compute. Thanks to its storage-compute separation architecture, PolarDB naturally supports this approach.

II. Principle

On top of PolarDB's storage-compute separation architecture, we developed a distributed MPP execution engine that provides cross-machine parallel execution as well as elastic compute and elastic scaling, giving PolarDB initial HTAP capability.

On the right is the architecture of PolarDB HTAP. The bottom layer is pooled shared storage: TP and AP share one set of stored data, which provides millisecond-level data freshness while reducing cost, and also allows compute nodes to be added quickly. This is the first major feature of PolarDB HTAP.

The upper layer consists of compute nodes with read/write separation. PolarDB has two execution engines to handle HTAP workloads: the single-node execution engine handles highly concurrent TP queries on the read/write (RW) node, while the distributed MPP execution engine handles complex AP queries on the read-only (RO) nodes, so TP and AP queries are naturally physically isolated. This TP/AP physical isolation, which decouples the TP and AP computing environments and eliminates CPU and memory interference between them, is the second major feature of PolarDB HTAP.

The third major feature of PolarDB HTAP is serverless elastic scaling, which removes the single-point limitation of the coordinator node in traditional MPP databases. An MPP query can be initiated on any read-only node, and both the set of MPP nodes and the degree of parallelism can be adjusted flexibly; both Scale Out and Scale Up are supported. Elastic adjustments take effect immediately and require no data redistribution.

Skew elimination is the fourth major feature of PolarDB HTAP. PolarDB HTAP can eliminate both data skew and computation skew, and schedules work based on PG buffer pool affinity.

At the heart of PolarDB HTAP is the distributed MPP execution engine, which is a typical Volcano-model engine. Consider a plan that first joins tables A and B and then aggregates the output; this is also how the single-node PG execution engine runs such a query.
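
As a rough illustration of the Volcano (iterator) model mentioned above, here is a minimal, self-contained Python sketch (not PolarDB code): every operator exposes a next() method, and a plan that joins two inputs and then aggregates is driven by pulling tuples from the root. The tables and operator classes are purely illustrative.

```python
class SeqScan:
    """Returns one tuple per call, or None when the input is exhausted."""
    def __init__(self, rows):
        self.rows = iter(rows)

    def next(self):
        return next(self.rows, None)


class HashJoin:
    """Joins the outer input to the inner input on the first column."""
    def __init__(self, outer, inner):
        self.outer, self.inner = outer, inner
        self.hash_table = None
        self.pending = []

    def _build(self):
        self.hash_table = {}
        row = self.inner.next()
        while row is not None:
            self.hash_table.setdefault(row[0], []).append(row)
            row = self.inner.next()

    def next(self):
        if self.hash_table is None:
            self._build()
        while not self.pending:
            row = self.outer.next()
            if row is None:
                return None
            self.pending = [row + m for m in self.hash_table.get(row[0], [])]
        return self.pending.pop()


class CountAgg:
    """Consumes its child and returns a single count tuple."""
    def __init__(self, child):
        self.child, self.done = child, False

    def next(self):
        if self.done:
            return None
        count = 0
        while self.child.next() is not None:
            count += 1
        self.done = True
        return (count,)


plan = CountAgg(HashJoin(SeqScan([(1, "a"), (2, "b")]),
                         SeqScan([(1, "x"), (3, "y")])))
print(plan.next())  # -> (1,): only one joined row survives the join
```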

In a traditional MPP execution engine, data is spread across nodes, and data on different nodes may have different distribution properties, such as hash distribution, random distribution, or replicated-table distribution. A traditional MPP engine inserts operators into the plan according to the distribution property of each table, so that upper operators do not need to be aware of how the data is distributed.
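
To make that planning idea concrete, here is a small hedged sketch of generic MPP behavior (not PolarDB's actual planner): given each table's distribution property and the join key, a redistribution ("motion") operator is inserted only when the data is not already co-located on the join key.

```python
def plan_join(left, right, join_key):
    """left/right are dicts like {"name": "A", "dist": ("hash", "id")}."""
    def co_located(tbl):
        kind, key = tbl["dist"]
        return kind == "replicated" or (kind == "hash" and key == join_key)

    inputs = []
    for tbl in (left, right):
        if co_located(tbl):
            inputs.append(f"Scan({tbl['name']})")
        else:
            # data is not already distributed on the join key: shuffle it
            inputs.append(f"Redistribute(Scan({tbl['name']}), key={join_key})")
    return f"HashJoin({inputs[0]}, {inputs[1]}, key={join_key})"


print(plan_join({"name": "A", "dist": ("hash", "id")},
                {"name": "B", "dist": ("random", None)}, "id"))
# HashJoin(Scan(A), Redistribute(Scan(B), key=id), key=id)
```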

PolarDB, however, uses a shared-storage architecture, so the underlying shared data can be accessed in full by every compute node. If a traditional MPP execution engine were used directly, every worker would scan the full data set, producing duplicate results while gaining no divide-and-conquer speedup from the scan, so it would not be a true MPP engine.

Therefore, in the PolarDB distributed MPP execution engine, we drew on the ideas of the Volcano model paper and made all scan operators concurrent, introducing the PxScan operator to abstract away shared storage. The PxScan operator maps shared-storage data to shared-nothing data: through coordination among the workers, the target table is divided into multiple virtual partitions of data blocks, and each worker scans its own virtual partitions, achieving a cross-machine distributed parallel scan.
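
Below is a minimal sketch of the virtual-partitioning idea behind PxScan, assuming a table of `nblocks` data blocks on shared storage. The function name and chunk size are illustrative, not PolarDB internals.

```python
def virtual_partitions(nblocks, nworkers, chunk=4):
    """Yield (worker_id, start_block, end_block) slices, round-robin by chunk,
    so each worker scans a disjoint slice of the shared table."""
    for start in range(0, nblocks, chunk):
        worker = (start // chunk) % nworkers
        yield worker, start, min(start + chunk, nblocks) - 1


for worker, lo, hi in virtual_partitions(nblocks=20, nworkers=3):
    print(f"worker {worker}: blocks {lo}-{hi}")
```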

The data scanned by the PxScan operator is then redistributed by the shuffle operator, after which each worker executes the rest of the plan under the Volcano model just as a single node would.

This is the core of the PolarDB distributed MPP execution engine: the shuffle operator hides data distribution, and the PxScan operator hides shared storage.
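
For intuition, here is a tiny sketch of what the shuffle step does with the tuples PxScan produces: each row is routed to a target worker by hashing a distribution key, so downstream operators on each worker see the data as if it had been hash-distributed. This is illustrative only, not PolarDB's implementation.

```python
def shuffle(rows, key_index, nworkers):
    """Partition scanned rows across workers by hashing the key column."""
    buckets = {w: [] for w in range(nworkers)}
    for row in rows:
        buckets[hash(row[key_index]) % nworkers].append(row)
    return buckets


scanned = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
for worker, rows in shuffle(scanned, key_index=0, nworkers=2).items():
    print(f"worker {worker} receives {rows}")
```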

A traditional MPP database can initiate an MPP query only on a designated node, so on each node only a single worker can scan a given table. To support serverless elastic scaling in a cloud-native setting, we introduced a distributed transaction consistency guarantee.

First, a node is selected as the coordinator node, and its ReadLSN serves as the agreed LSN; the smallest snapshot version among all MPP nodes is chosen as the globally agreed snapshot version. Using LSN wait-and-replay together with global snapshot synchronization, data and snapshots are guaranteed to be consistently readable no matter which node initiates an MPP query.
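
The following is a hedged sketch of that consistency idea: the coordinator's read LSN becomes the agreed LSN, the oldest snapshot among the participating nodes becomes the global snapshot, and each worker waits until its replay has reached the agreed LSN before it starts scanning. All field names and the polling loop are illustrative, not PolarDB internals.

```python
import time


def wait_for_replay(node, target_lsn, poll=0.01):
    """Block until the node's replay has caught up to the agreed LSN."""
    while node["replay_lsn"] < target_lsn:
        time.sleep(poll)                 # in reality: wake on WAL replay progress
        node["replay_lsn"] = target_lsn  # stand-in for actual replay advancing


def start_mpp_query(coordinator, workers):
    agreed_lsn = coordinator["read_lsn"]
    global_snapshot = min(n["snapshot"] for n in [coordinator] + workers)
    for node in workers:
        wait_for_replay(node, agreed_lsn)   # "wait and replay"
        node["snapshot"] = global_snapshot  # every worker reads the same version
    return agreed_lsn, global_snapshot


coord = {"read_lsn": 1000, "snapshot": 55, "replay_lsn": 1000}
ros = [{"read_lsn": 990, "snapshot": 53, "replay_lsn": 990}]
print(start_mpp_query(coord, ros))  # (1000, 53)
```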

To achieve serverless elastic scaling, we also exploit the shared storage itself: the external dependencies needed by each module on the coordinator link are placed on shared storage, and the parameters a worker node needs at run time are synchronized from the coordinator node over the control link, so that both the coordinator node and the worker nodes are stateless.

Based on these two design points, the elastic scaling of PolarDB has the following advantages:

① Any node can act as the coordinator node, which solves the coordinator single-point problem of traditional MPP databases.

② PolarDB can Scale Out horizontally (elastic scaling of compute across nodes) as well as Scale Up vertically (elastic adjustment of per-node parallelism), and the elastic scaling takes effect immediately without any data redistribution.

③ It allows more flexible scheduling policies for workloads: different business domains can run on different sets of nodes. As shown on the right of the figure, business domain SQL 1 can use nodes RO1 and RO2 for its AP queries, while business domain SQL 2 can use nodes RO3 and RO4; the compute nodes used by the two domains can be scheduled elastically.

Skew is an inherent problem of traditional MPP and mainly comes from data-distribution skew and computation skew. Data-distribution skew is usually caused by uneven data sharding; in PG, storing large objects in TOAST tables introduces some unavoidable unevenness in data distribution. Computation skew is usually caused by differences in concurrent transactions, buffers, network, and I/O jitter across nodes. Skew leads to the bucket effect in traditional MPP execution: the slowest worker determines the overall execution time.

To address this, PolarDB implements an adaptive scanning mechanism. As shown on the right of the figure above, the coordinator node coordinates the working mode of the worker nodes: when data is scanned, the coordinator node creates a task manager in memory and schedules the worker nodes according to the scanning tasks. The coordinator runs two threads: a data thread that serves the data links and collects and aggregates tuples, and a control thread that serves the control links and drives the scan progress of each scan operator.

A worker that scans in units of blocks can take on multiple blocks of data, so the more capable workers naturally do more of the work. For example, in the figure above, the workers on RO1 and RO3 each scan 4 data blocks, while the worker on RO2, unaffected by the computation skew, is able to take on more and scans 6 data blocks.

The PolarDB adaptive scanning mechanism also takes PG buffer pool affinity into account, ensuring that each worker scans a fixed set of data blocks as far as possible, which maximizes buffer pool cache hits and reduces I/O overhead.
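
Here is a hedged sketch of the adaptive scanning idea: a coordinator-side task manager hands out block ranges on demand, preferring ranges a worker is assumed to have cached (buffer-pool affinity), so faster workers simply come back for more work and skew is absorbed. The data structures below are illustrative only.

```python
from collections import deque


class TaskManager:
    def __init__(self, block_ranges, affinity):
        self.pending = deque(block_ranges)  # e.g. [(0, 3), (4, 7), ...]
        self.affinity = affinity            # worker -> set of preferred ranges

    def next_range(self, worker):
        """Hand out the next block range when a worker asks for more work."""
        if not self.pending:
            return None
        # prefer a range this worker likely already holds in its buffer pool
        for i, rng in enumerate(self.pending):
            if rng in self.affinity.get(worker, set()):
                del self.pending[i]
                return rng
        return self.pending.popleft()


tm = TaskManager([(0, 3), (4, 7), (8, 11), (12, 15)],
                 {"RO1": {(8, 11)}, "RO2": set(), "RO3": {(0, 3)}})
print(tm.next_range("RO3"))  # (0, 3): affinity hit
print(tm.next_range("RO2"))  # (4, 7): no affinity, take the next pending range
print(tm.next_range("RO1"))  # (8, 11): affinity hit
```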

III. Functions and Features

After continuous iterative development, PolarDB HTAP now supports five main features for Parallel Query:

① All basic operators are supported, including scan operators, join operators, and aggregation operators, as well as SubqueryScan, HashJoin, and so on.

② Optimizations for shared-storage operators, including sharing of the shuffle operator, ShareSeqScan, and ShareIndexScan. ShareSeqScan and ShareIndexScan mean that when a large table joins a small table, the small table is handled with a mechanism similar to a replicated table, which reduces broadcast overhead and improves performance.

③ Partitioned table support. Hash/Range/List partitioning is supported, as is static and dynamic pruning of multi-level partitions. In addition, the PolarDB distributed MPP execution engine supports Partition Wise Join on partitioned tables.

④ Flexible control of parallelism, at the global, table, session, and query levels (see the sketch after this list).

⑤ Serverless elastic scaling: an MPP query can be initiated from any node with any combination of MPP nodes, cluster topology information is maintained automatically, and shared-storage mode, primary/standby mode, and three-node mode are all supported.
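
As a sketch of item ④, the following assumes a hypothetical precedence in which a query-level setting overrides the session level, which overrides the table level, which overrides the global level; the function and parameter names are illustrative and are not PolarDB configuration parameters.

```python
def effective_dop(global_dop, table_dop=None, session_dop=None, query_dop=None):
    """Resolve the degree of parallelism: the most specific level set wins
    (hypothetical precedence: query > session > table > global)."""
    for dop in (query_dop, session_dop, table_dop, global_dop):
        if dop is not None:
            return dop
    return 1


print(effective_dop(global_dop=2))                              # 2: global default
print(effective_dop(global_dop=2, session_dop=8))               # 8: session overrides
print(effective_dop(global_dop=2, session_dop=8, query_dop=4))  # 4: query overrides
```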

Since this is HTAP, MPP support for DML is indispensable. Based on PolarDB's read/write-separation architecture and the serverless elastic-scaling design of HTAP, PolarDB Parallel DML (PDML) supports two modes: single-write multi-read and multi-write multi-read. Single-write multi-read means there are multiple read workers on the RO nodes but only one write worker on the RW node. Multi-write multi-read means there are multiple read workers on the RO nodes and multiple write workers on the RW node; in this mode, read concurrency and write concurrency are completely decoupled. The two PDML modes suit different scenarios, and users can choose between them based on the characteristics of their workload.
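
Below is a hedged sketch of the two PDML shapes: several read workers produce rows, and either one write worker (single-write multi-read) or several write workers (multi-write multi-read) consume them. Python queues and threads stand in for the real data links; nothing here mirrors PolarDB internals.

```python
from queue import Queue
from threading import Thread


def read_worker(rows, out_queue):
    """A read worker streams its slice of rows to the write side."""
    for row in rows:
        out_queue.put(row)


def write_worker(in_queue, written):
    """A write worker consumes rows until it sees a sentinel."""
    while True:
        row = in_queue.get()
        if row is None:          # sentinel: no more rows
            break
        written.append(row)


def run_pdml(read_partitions, n_write_workers):
    q, written = Queue(), []
    writers = [Thread(target=write_worker, args=(q, written))
               for _ in range(n_write_workers)]
    readers = [Thread(target=read_worker, args=(part, q))
               for part in read_partitions]
    for t in writers + readers:
        t.start()
    for t in readers:
        t.join()
    for _ in writers:
        q.put(None)              # one sentinel per write worker
    for t in writers:
        t.join()
    return written


parts = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(sorted(run_pdml(parts, n_write_workers=1)))  # single-write multi-read
print(sorted(run_pdml(parts, n_write_workers=3)))  # multi-write multi-read
```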

The PolarDB distributed MPP execution engine can be used not only for queries and DML but also to accelerate index building. OLTP workloads contain a large number of indexes, and roughly 80% of index-creation time is spent sorting data and building index pages, with the remaining 20% spent writing the index pages. As shown on the right of the figure above, PolarDB uses the RO nodes to sort the data with distributed MPP, builds the index pages in a pipelined fashion, and uses batched writes to speed up writing the index pages.
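
A minimal sketch of the split described above, under the assumption that the sort can be parallelized per slice and the resulting index pages are then written in batches; the page size and merge strategy are illustrative, not PolarDB internals.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def build_index(key_slices, page_size=4):
    # distributed sort: each worker sorts its own slice of key data
    with ThreadPoolExecutor() as pool:
        sorted_runs = list(pool.map(sorted, key_slices))
    # merge the sorted runs and cut the result into index "pages"
    merged = list(heapq.merge(*sorted_runs))
    pages = [merged[i:i + page_size] for i in range(0, len(merged), page_size)]
    return pages  # a real build would batch-write these pages to storage


slices = [[9, 1, 5], [7, 3, 8], [2, 6, 4]]
for page in build_index(slices):
    print(page)
# [1, 2, 3, 4] / [5, 6, 7, 8] / [9]
```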

Currently, the index-build acceleration feature supports both normal creation and online creation of ordinary B-tree indexes.

The figure above shows the advantage of distributed MPP over PolarDB's native single-node parallelism. We built a 1 TB TPC-H environment on 16 PolarDB RO instances (16 cores and 256 GB of memory each) for comparison testing. Compared with single-node parallelism, distributed MPP makes full use of the computing resources of all RO nodes and the I/O bandwidth of the underlying shared storage, fundamentally solving the HTAP challenges described above. Of the 22 TPC-H queries, 3 were accelerated by more than 60x and 19 by more than 10x, with an average speedup of 23x. We also tested the effect of elastically scaling compute resources: as the total number of CPU cores was increased from 16 to 128, overall TPC-H performance and the performance of each query improved almost linearly, which verifies the serverless elastic-scaling capability of PolarDB HTAP.

When the total number of CPU cores was further increased to 256, the performance gain was no longer significant, because the I/O bandwidth of the PolarDB shared storage was saturated and became the bottleneck. This also shows that the computing power of the PolarDB distributed MPP execution engine is very strong.

In addition, we compared PolarDB's distributed MPP with a traditional MPP database, again using 16 nodes with 16 cores and 256 GB of memory each. On 1 TB of TPC-H data, with per-node parallelism fixed at 1 to match the traditional MPP database, PolarDB's performance reaches 90% of the traditional MPP database's. The most fundamental reason is that traditional MPP databases use hash distribution by default: when the join keys of two tables are their respective distribution keys, a local join can be performed directly without a shuffle. PolarDB, in contrast, uses a shared storage pool, and the data scanned in parallel by PxScan is effectively randomly distributed, so it must be shuffled before subsequent processing, just like redistributed data in a traditional MPP database. Therefore, for TPC-H queries involving table joins, PolarDB incurs more network shuffle overhead than a traditional MPP database.

PolarDB's distributed MPP scales elastically without data redistribution, so on the same fixed set of 16 machines it can keep increasing per-node parallelism to make full use of each machine's resources. With per-node parallelism of 8, PolarDB's performance is 5-6 times that of the traditional MPP database, and as per-node parallelism increases linearly, PolarDB's overall performance increases linearly as well; only a configuration parameter needs to be changed, and the change takes effect immediately.

Performance tests were also run on PolarDB HTAP's index-build acceleration. With 500 GB of data, when the index covers 1, 2, or 4 columns, distributed MPP parallel building is about 5 times faster than single-node parallel building; when the number of indexed columns is increased to 8, the improvement is about 4 times.

This article is original content from Alibaba Cloud and may not be reproduced without permission.