Welcome toTencent Cloud + community, get more Tencent mass technology practice dry goods oh ~

This article was published by Li Yuesen in Cloud + Community column

Li Yuesen, chief architect of Tencent Cloud PostgreSQL, architect of Tencent Database team, responsible for the architecture design and development of the core database of wechat Payment merchant system, core member of PostgresQL-X2 community, has won a number of national invention patents. Engaged in PG kernel development and architecture design for more than 10 years.

Before 2015, the rapid development of wechat Payment business required a database to safely and efficiently support the core business of wechat Payment merchant system. This task fell on PostgreSQL developed by Tencent database team.

In July 2016, Tencent Cloud released its cloud database PostgreSQL, providing two versions of kernel optimization version and community version developed by Tencent, as well as two solutions of distributed cluster architecture (internal code of distributed cluster postgresqL-XZ). At present, the cloud database PostgreSQL is running stably in Tencent’s core businesses such as big data platform, CNDIAN.com and Tencent Video.

PostgreSQL distributed cluster PostgresQL-XZ developed by Tencent

Postgresql-xz is a localized version of PostgresqL-XC, which supports horizontal scaling of database clusters. Although PostgresQL-XC is powerful, it has significant bottlenecks in performance, scalability, security, and operation and maintenance. Tencent PostgreSQL has made significant improvements and enhancements in these areas after years of accumulation. As the core database for wechat Payment, Tencent PostgreSQL is positioned as a secure, efficient, stable and reliable database cluster. Tencent PostgresQL-Xz is used as a representative to introduce the optimization and improvement of PostgreSQL developed by Tencent.

Optimization of transaction management system

Postgresql-xc has an obvious disadvantage in the Transaction management system solution itself, that is, the Transaction management mechanism can become the bottleneck of the system, GTM (Global Transaction Manager) will limit the scale of the system. As shown in Figure 1, each CN(Coordinator) request requests gXID (global transaction ID) and GSnapshot (global snapshot) information from GTM and sends the information along with SQL statements to DN (Datanode database node) for execution. In addition, the postgresqL-XC management mechanism, only the primary DN can obtain the GXID, while the standby DN does not have its own GXID, so it cannot provide read-only services, which is also a waste of the system.

Figure 1

Tencent postgresqL-Xz has improved the transaction management mechanism. After the improvement, CN no longer obtains GXID and GSnapshot from GTM. Each node uses its own local XID (transaction ID) and gSnapshot (snapshot), so that GTM does not become the bottleneck of the system. In addition, the STANDBY DN machine can also provide read-only services and make full use of idle resources of the system. As shown in Figure 2, the optimized transaction management system architecture is as follows:

Figure 2

Realization and optimization of the standby machine read-only

Of course, the optimization of transaction management system provides the basis for the standby DN read-only, but the original cluster does not have load, scheduling and other capabilities. In this regard, we have also made a lot of innovations, which can be summarized as follows:

  1. Normal CN and read-only CN are separated.
  2. Normal CN stores metadata information about the primary DN
  3. Read-only CN stores metadata information about the standby DN
  4. Logs are synchronized between DNS in hot Standby mode

In these ways, the cluster can provide the standby DN read-only function with intelligent load capability to make full use of system resources.

Figure 3

3. Capacity expansion solution with minimum service interruption

Rapid service growth inevitably requires resource expansion. The implementation of community version makes capacity expansion costly and requires a long period of service interruption. In postgresqL-xC, the DN=Hash(row) % nofdn is used to determine the storage node of a record:

That is, the hash value is computed on the distributed column and then used to modulo the number of nodes in the cluster to determine which node the record goes to (Figure 4).

This scheme is simple, but in practical application, it needs a long time to shut down capacity expansion. After capacity expansion, the number of nodes increases and data cannot be read or written according to the original distribution logic. Therefore, node data needs to be redistributed. Rebalancing data requires shutdown, manual migration and rebalancing to each node. For larger trading systems, the rebalancing process can take several days because the original nodes store huge amounts of data. Believe this is completely intolerable for the business.

Figure 4.

Therefore, we introduce a new classification method — Sharded table. The data distribution of Shardedtable is as follows (FIG. 5) :

  1. Introduce an abstract middle layer, the Shard Map. Each entry in the Shard Map stores the mapping between Shardid and DN.
  2. Every record in the Sharded table passesHash(row) % #shardmap entryTo determine which SharDID the record is stored in by querying the STORED DN of shardmap.
  3. Shardid information allocated to the node is stored in each DN, and visibility is judged.

Through the above scheme, when adding new nodes for capacity expansion, it is only necessary to map some Shardid in ShardMap to the newly added nodes and move the corresponding data there. Capacity expansion only needs to switch the mapping relationship in ShardMap, which shortens the time from days to seconds.

Figure 5

4. Data skew solution

Data skew refers to the fact that in a distributed database system, the physical space of some DN is insufficient and the physical space of others is large due to the distribution of physical nodes, hash, or Shard. For example, if merchants are used as distribution keys, the daily data volume of JINGdong is definitely different from that of an ordinary e-commerce company. A month’s data of a large merchant may fill up the physical space of a DN, and the system has only one way to shut down and expand. Therefore, we must have an effective means to solve the data skew, to ensure that the system can still run efficiently and stably when the table data distribution is not uniform.

Firstly, we divided the DN of the system into groups (see Figure 6 below). In each group:

  1. Contains one or more DNS
  2. Each group has a SHARdMap
  3. When building a Sharded table, you can specify the group to store in, that is, group1 or group2
  4. CN can access all groups, and CN also stores access mode information for all tables

Figure 6.

For users with a large amount of data in the system, special identification is carried out and whitelists are created for them, using different data distribution logic (figure 7 below). Common users use the default data distribution logic, namely:

Shardid = Hash(merchantid) % #shardmap

Large merchants use custom data distribution logic, namely:

Shardid = Hash(merchantid) % #shardmap + fcreate_timedayoffset from 1970-01-01

Figure 7.

By adding date offset into the group distribution logic of large merchants, the data of the same user can be evenly distributed among multiple nodes within the group. Thus effectively solve the problem of uneven data distribution.

Here is an example (figure 8 below) :

Figure 8.

Five.9000W record efficient sorting solution

In a list query scenario, the service receives the following QUERY SQL:

In the wechat payment scenario, a merchant has 300W data per day and over 9000W data in a month, which means PostgreSQL needs to quickly sort 9000W data, and the business logic requires second-level output to quickly obtain the sorting result.

To do this, we provide a table definition solution, which is to create a cluster partition table. Based on the above requirements, you can use a monthly table, that is, one table per month, and index the sort field FFINISH_time so that each partition scan can use the index.

CN pushes down the order BY and limit offset clauses to DN through a series of execution plan optimizations. The Merge Append operator is used to Merge the results of the subtables in the DN. The operator itself ensures that the output is ordered, that is, the Merge Append operator is used to Merge the results of the subtables in the index scan. Then the result of the node itself is guaranteed to be sorted. CN also merges the results of multiple DN using Merge Append to ensure that the output results are in order, thus completing the whole sorting process.

Here are the results of our performance test on sorting:

Through the test on the 24-core CPU, 64G memory model, 9000W data sorting can be completed in the minimum of 25 ms, QPS up to 5400.

Six. Parallel optimization

With the development of current hardware, system resources are becoming more and more abundant. Multiple cpus and large memory have become the standard configuration of the system. Making full use of these resources can effectively improve processing efficiency and optimize performance. Tencent began to optimize PostgreSQL multi-core execution in late 2014.

PostgreSQL9.6 community edition also includes some parallelism features, but it is not as rich as our PostgreSQL community edition.

  • The system creates a global shared memory manager and uses bitmap management algorithm to manage it
  • Executors of data are created at system startup to execute the fragments of the execution plan
  • A schedule queue is created, and all executors wait for the plan on the task queue
  • Each Executor corresponds to a result queue, to which the Executor hangs a pointer as it outputs a result
  • Schedule queues, result queues, and plan shard execution results are stored in the shared memory manager so that all processes can access these structures
  • When the Postgres session receives SQL, it determines whether it can be parallelized and distributes tasks. Read returns when there is a result in the result queue

We complete the optimization operator:

  • Seqscan
  • Hash join
  • Nestloop join
  • Remote query
  • Hash Agg
  • Sort Agg
  • Append

Through the test on the model with 24 core CPU and 64G memory, the optimization results of each operator are as follows:

Overall, the performance is generally 10-12 times that before optimization, and the optimization effect is obvious.

7. Tencent PostgresQL-XZ geo-redundant DISASTER recovery

Geo-redundant disaster recovery (Dr) is a prerequisite for financial databases. Data security is the most basic and important requirement for financial services. Therefore, we have built a comprehensive geo-redundant disaster recovery (Dr) capability for PostgresQL-XZ to ensure efficient and stable data disaster recovery. The specific two-site, three-center deployment structure is as follows:

Strong synchronization between nodes in the same city ensures strong consistency of data; Asynchronous synchronization on the enterprise network is adopted in remote places.

The CAgent is deployed on each physical server. The AGENT collects and reports machine status, generates alarms, and performs switchover.

At least one JCenter is deployed on each IDC. The JCenter collects and reports the status reported by each agent to the ZK cluster. Only one JCenter is active. The active JCenter not only reports status but also performs fault diagnosis and switchover. If the active JCenter is abnormal, the system automatically selects a standby JCenter to become the active JCenter.

The JCenter and CAgent are the control and adjudication nodes of the two centers.

For database nodes, at least one CN is deployed on each IDC. One DN is deployed in each center. One DN is active, and the other two are placed on the host in parallel as standby machines. One IS synchronous standby machine, and the other is asynchronous standby machine.

When a host fails, JCenter preferentially selects the standby host in the same city as the active host.

At present, Tencent Cloud has provided the cloud database PostgreSQL for internal testing, and will provide two versions of kernel optimization version and community version to meet the requirements of more customers.

Question and answer

How do I record PostgreSQL queries?

reading

Getting started with PostgreSQL

The PostgreSQL configuration is optimized

The PostgreSQL active/standby environment is set up

Machine learning in action! Quick introduction to online advertising business and CTR knowledge

This article has been authorized by the author to Tencent Cloud + community, more original text pleaseClick on the

Search concern public number “cloud plus community”, the first time to obtain technical dry goods, after concern reply 1024 send you a technical course gift package!

Massive technical practice experience, all in the cloud plus community!