
In a distributed environment, there are always scenarios that threaten system availability, such as host downtime or network failures. These can lead to user complaints, or even the loss of core data, harming business performance and reputation. How to keep a distributed system running normally, cope with all kinds of failure scenarios, and ensure the system remains highly available is a research focus for every enterprise.

At the PostgreSQL 2017 China Technology Conference, Zhao Haiming, a database technology expert at Tencent Cloud, shared some basic ideas for ensuring the reliability of distributed systems, using Tencent's Tbase reliability architecture as an example.

1. Tbase, a fully functional distributed relational database developed by Tencent

Tbase is a fully functional distributed relational database system developed by Tencent on top of the open-source distributed database PostgreSQL-XC (PGXC for short). Compared with PGXC, Tbase introduces the concept of GROUP into the kernel and proposes a double-KEY distribution strategy, which effectively solves the problem of data skew. In addition, based on data timestamps, data is divided into cold data and hot data and stored on different storage devices, which effectively reduces storage cost. This article takes Tbase as an example and analyzes, from top to bottom, the two systems that guarantee Tbase's reliability: the disaster recovery (DR) system and the cold backup system.
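Before moving on to the DR and cold backup systems, here is a minimal sketch of the intuition behind a double-KEY distribution: hashing a second key together with a skewed shard key spreads the hot key across nodes. This is only an illustration of the concept, not Tbase's kernel implementation; `NODE_COUNT`, the hash function, and the key names are assumptions.

```python
import hashlib

NODE_COUNT = 4  # illustrative number of Datanodes


def node_for(shard_key: str, secondary_key: str = "") -> int:
    """Route a row to a Datanode.

    With a single, skewed shard key (e.g. one merchant generating most
    orders), every row lands on the same node. Hashing a second key as
    well spreads that hot key across nodes, which is the intuition
    behind a double-KEY distribution.
    """
    digest = hashlib.md5(f"{shard_key}:{secondary_key}".encode()).hexdigest()
    return int(digest, 16) % NODE_COUNT


# A hot merchant_id alone would map every order to one node;
# adding order_id as the second key spreads the load.
print(node_for("merchant_42", "order_1001"))
print(node_for("merchant_42", "order_1002"))
```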

2. Split-brain in distributed system DR

A distributed system is usually composed of several physical servers connected over a network. Unlike a single-machine system, it involves many devices, so host (physical server) downtime and network failures are common occurrences (see the following figure).

Figure 2 Tbase DR system – split-brain fault scenario

When a node in the system becomes abnormal, a global scheduling cluster is usually needed to avoid split-brain. When a fault occurs, the original Master node is locked (fenced) by the global scheduling cluster, and an optimal Slave node is promoted to Master through an internal election. After the original faulty Master recovers, it is demoted to a Slave and rejoins the cluster. In this way, the system keeps working in active-standby mode and remains highly available.
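As a rough illustration of this failover sequence, the sketch below promotes the standby with the most replayed WAL and lets the recovered host rejoin as a slave. The data model and the "largest replayed LSN" election criterion are assumptions for illustration; Tbase's actual scheduler logic is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str            # "master" or "slave"
    replay_lsn: int = 0  # how much WAL the node has replayed (illustrative)


@dataclass
class Cluster:
    nodes: list[Node] = field(default_factory=list)

    def failover(self) -> Node:
        """Promote the 'best' slave when the master fails.

        Assumption: the optimal slave is the one with the largest
        replayed LSN; the real election criteria may differ.
        """
        slaves = [n for n in self.nodes if n.role == "slave"]
        candidate = max(slaves, key=lambda n: n.replay_lsn)
        candidate.role = "master"
        return candidate

    def rejoin_old_master(self, node: Node) -> None:
        # After the failed host recovers, it comes back as a slave.
        node.role = "slave"


cluster = Cluster([Node("dn1", "master", 120),
                   Node("dn2", "slave", 118),
                   Node("dn3", "slave", 115)])
new_master = cluster.failover()
print(new_master.name)  # dn2, the slave with the most replayed WAL
```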

Figure 3 Tbase DR system – DR goals

Digging into the internal scheduling process of a distributed system, two problems need to be solved: island detection and role verification.

  • Island detection: resolves Master DN split-brain when the network fails and then recovers.
  • Role verification: resolves Master DN split-brain when a host goes down and restarts.

Figure 4 Tbase DR solution – Master DN failure

When a host's network fails in a distributed system, that node becomes an island with no communication, so "island detection" is a vivid metaphor for the split-brain scenario. A distributed system usually divides island detection into the following steps:

1. Detect the island: the Agent deployed on each node sends network heartbeats to all hosts in the cluster to check connectivity in real time. If the Agent cannot connect to the Center, the host has become a network island.
2. Kill instances: once the Agent finds that its host has become a network island, it kills all CN/DN instances on the local machine.
3. DR switchover: when the cluster's Master DN is abnormal (or unreachable), the Center initiates a DR switchover to restore database service. Since the original Master DN has already been killed by the Agent, only the new Master DN provides read/write service, so no Master DN split-brain occurs.
4. Restore active-standby mode: after the island host's network recovers and the Agent can connect again, the Center sends a standby-rebuild command to the Agent on that host. The original Master DN is demoted to a new Slave DN, restoring the system's active-standby high-availability mode.

Through the Agent's island detection mechanism, Tbase keeps the system highly available under any Master DN failure.
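The following sketch shows an Agent-side loop in the spirit of steps 1 and 2 above: if the host cannot reach any Center it treats itself as a network island and stops the local instances. The Center addresses, ports, check interval, and the `pg_ctl` invocation are illustrative assumptions, not Tbase's internal protocol.

```python
import socket
import subprocess
import time

CENTERS = [("center-1", 5010), ("center-2", 5010), ("center-3", 5010)]  # assumed addresses


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Network heartbeat: try to open a TCP connection to a Center."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def island_detection_loop(interval: float = 3.0) -> None:
    while True:
        if not any(can_reach(host, port) for host, port in CENTERS):
            # The host can no longer reach any Center: it is a network island.
            # Kill the local DN instance so no stale master keeps accepting writes.
            subprocess.run(["pg_ctl", "stop", "-m", "immediate", "-D", "/data/dn"],
                           check=False)  # illustrative command and data directory
            break
        time.sleep(interval)
```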

Host down and restart

Role verification: when the faulty host restarts, the Agent and the Center verify the active/standby roles of the nodes monitored by the Agent through heartbeat packets. After the failover, the Center has already performed a DR switchover on the original Master DN of the faulty host, so the Center considers the DN node on that host to be a Slave DN. However, because the host was down during the switchover, the original Master DN never received the DR instruction. After the host restarts, the Agent on it still believes the DN node is the Master DN, and role verification between the Agent and the Center fails.

Role correction: when verification fails, the Center corrects the role, and the stale Master DN on the restarted host is demoted to a Slave DN so that it rejoins the cluster as a standby (see the figure below).

Figure 6 Tbase DR solution – role verification
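A minimal sketch of the role verification check is shown below: the role reported by the restarted host's Agent is compared with the role the Center recorded after the switchover, and a stale Master is demoted. The heartbeat payload and return values are assumptions for illustration, not Tbase's actual wire format.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatReport:
    node: str
    role_seen_by_agent: str  # role recorded locally on the restarted host


def verify_role(report: HeartbeatReport, role_in_center: dict[str, str]) -> str:
    expected = role_in_center[report.node]
    if report.role_seen_by_agent == expected:
        return "ok"
    if report.role_seen_by_agent == "master" and expected == "slave":
        # Stale master from before the crash: correct the role by demoting it
        # so it rejoins replication as a standby of the new master.
        return "demote_to_slave"
    return "escalate"


# After failover the Center already recorded dn1 as slave, but the
# restarted host still believes dn1 is master -> verification fails.
print(verify_role(HeartbeatReport("dn1", "master"), {"dn1": "slave"}))
```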

3. Geo-redundant DR scheme

After solving split-brain, the next problem a distributed system faces is how to handle a data-center-level failure. Tbase is currently used in the WeChat Pay system, so its design adopts a two-region, three-data-center architecture (as shown in the figure below). In short, high availability is achieved through a three-replica deployment in which Datanodes use strong synchronous replication to the same-city replica and asynchronous replication to the remote replica. In addition, an Agent is deployed on every host to collect the running status of each node, report that status to all Centers, and execute the operation commands issued by the Center. The Center aggregates the status, writes status information to the ZK cluster, and monitors the running status of each node; if an exception occurs, it initiates the failover process based on the arbitration result. The Center can also receive operation instructions from external users, generate distributed instruction plans, send them to the Agents for execution, and monitor the Agents' execution status.
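As a sketch of the "Center writes status to ZK" step, the snippet below uses the open-source `kazoo` ZooKeeper client to persist a node's status under an illustrative path. The ZK ensemble address, path layout, and JSON payload are assumptions, not Tbase's real schema.

```python
import json

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ZK ensemble
zk.start()


def report_status(node: str, status: dict) -> None:
    """Write the latest status of one node into the ZK cluster."""
    path = f"/tbase/status/{node}"
    data = json.dumps(status).encode()
    if zk.exists(path):
        zk.set(path, data)
    else:
        zk.ensure_path("/tbase/status")
        zk.create(path, data)


report_status("dn1", {"role": "master", "alive": True, "lsn": 120})
zk.stop()
```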

4. DR for scheduling nodes in a distributed system

To handle node failures, the distributed system introduces two scheduling components, Agent and Center. But the Agent and Center themselves run on hosts that can go down or lose network connectivity. Common faults in a distributed scheduling system include:

  • Fault 1: Center breakdown: the Center host goes down in the middle of a DR process; does the DR process fail?
  • Fault 2: status misjudgment: the Tbase system is running normally, but a network fault between the Center and an Agent makes the Center misjudge the status of the DNs monitored by that Agent, so the Center issues a wrong DR switchover instruction while the system is actually healthy.
  • Fault 3: instruction timeout: the Center sends instructions to Agents through network packets; what if an instruction is lost or times out?
  • Fault 4: instruction reordering: the Agent reports its execution status back to the Center while executing an instruction; what should be done when, for whatever reason, the Agent's replies arrive out of order?

For the preceding fault scenarios, the Tbase DR system provides the following solutions:

  • Task takeover: active and standby Centers are introduced to handle Center breakdown or network faults during the DR process.
  • Status arbitration: a ZK cluster is introduced to keep node status consistent across Centers and avoid status misjudgment and mistaken DR switchover.
  • Timeout retry: a timeout-retry mechanism resolves network timeouts in the communication between Agent and Center.
  • Instruction ID: each instruction is assigned a globally unique, increasing ID, which resolves instruction reordering in the communication between Agent and Center.

Status arbitration and task takeover address status misjudgment and Center breakdown, as shown in the following figure. The distributed system proceeds as follows:

1. Node failure: when a DN node becomes abnormal, the Agent collects its status information and reports the abnormal status to all Centers.
2. Status arbitration: when the Master Center receives an abnormal node status, it does not start the DR process immediately; instead, it sends status arbitration requests to all Slave Centers.
3. Status retrieval: on receiving the arbitration request, each Slave Center pulls the node status from ZK and replies to the Master Center.
4. Start DR: if more than half of the Slave Centers also judge the node to be abnormal, the Master Center initiates the DR process.
5. Execute DR: the Master Center generates the DR instruction plan, sends DR instructions to the Agents, monitors the Agents' execution status, and persists the DR log to the configuration database.
6. Center breakdown: if the Master Center goes down during the DR process, ZK starts a leader election and selects a new Master Center from the Slave Centers.
7. Task takeover: the newly elected Master Center pulls the DR log from the configuration database, regenerates the DR instruction plan, and sends DR instructions to the Agents to finish the remaining DR process.
8. Center restart: after the original Master Center recovers from its breakdown, it obtains its role from ZK, finds that it has been demoted to Slave, does not try to resume the DR process, and automatically becomes a new Slave Center, so no Center split-brain occurs.

Status arbitration keeps DN status consistent and prevents the Center from misjudging DN status and launching DR by mistake. Task takeover ensures that disaster recovery can continue even when the Master Center goes down. ZK leader election guarantees that only one Master Center exists in the system at any time, avoiding Center split-brain.
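The arbitration step (steps 2–4 above) boils down to a majority vote over the status views that the Slave Centers pull from ZK. Below is a minimal sketch of that vote; the data shapes are assumptions for illustration.

```python
def arbitrate(node: str, slave_center_views: list[dict]) -> bool:
    """Return True if more than half of the Slave Centers see `node` as down.

    Each entry in slave_center_views is one Slave Center's status map
    pulled from ZK.
    """
    votes = sum(1 for view in slave_center_views
                if not view.get(node, {}).get("alive", True))
    return votes > len(slave_center_views) // 2


views = [
    {"dn1": {"alive": False}},  # Slave Center A
    {"dn1": {"alive": False}},  # Slave Center B
    {"dn1": {"alive": True}},   # Slave Center C still sees it (network blip)
]
if arbitrate("dn1", views):
    print("majority agrees dn1 is down -> start DR switchover")
```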

Figure 9 Tbase DR solution – status arbitration + task takeover

Timeout retry and instruction IDs are introduced to solve message timeout and message reordering between Agent and Center. The process is as follows:

1. Timeout retry: after the Center sends an instruction to an Agent, it monitors the execution status of that instruction. If the Center receives no reply from the Agent within a certain period, it resends the instruction and retries.
2. Instruction ID: every time the Center sends an instruction, it assigns the instruction a globally increasing, unique ID. When the Agent reports the execution status back to the Center, it must reply with the original instruction ID.
3. ID increment: when the Center receives the Agent's reply, it either keeps listening or sends the next instruction as required. Before sending the next instruction, the Center first increments the instruction ID.
4. Message filtering: each time the ID is incremented, the Center updates its own task status; if the ID of a message returned by an Agent is smaller than the Center's current ID, the Center ignores the message and filters it out.

Timeout retry ensures that the Center can still deliver the instruction plan under abnormal conditions such as network jitter. The instruction ID lets the Center keep its task status up to date and ignore expired messages from Agents, preventing instruction disorder in the Agent's replies caused by network problems.
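The sketch below combines the two mechanisms: a monotonically increasing instruction ID plus resend-on-timeout, with stale replies (ID smaller than the current one) filtered out. The `transport` object with `send`/`recv` is a stand-in for the real Agent/Center channel, which is not described in the article.

```python
import itertools


class CenterChannel:
    def __init__(self, transport):
        self.transport = transport
        self.next_id = itertools.count(1)  # globally increasing instruction ID
        self.current_id = 0

    def send_instruction(self, payload, timeout=5.0, retries=3):
        self.current_id = next(self.next_id)
        for _ in range(retries):
            self.transport.send({"id": self.current_id, "payload": payload})
            reply = self.transport.recv(timeout=timeout)  # assumed: None on timeout
            while reply is not None and reply["id"] < self.current_id:
                # Expired reply to an earlier instruction: filter it out, keep waiting.
                reply = self.transport.recv(timeout=timeout)
            if reply is not None:
                return reply  # ID matches: update task state and move on
            # Timeout: fall through and resend the same instruction (timeout retry).
        raise TimeoutError(f"no acknowledgement for instruction {self.current_id}")
```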

5. Cold backup in a distributed system

There is, of course, a rare but real failure mode in which the entire database cluster fails completely. To further guarantee data reliability in this case, a cold backup system should be deployed in addition to the high-availability DR system. Tbase implements an automatic cold backup system based on PITR (point-in-time recovery): during operation, full data and incremental data are regularly backed up to HDFS, so Tbase can still recover data in extreme scenarios such as irreparable disk corruption. To improve cold backup efficiency and reduce the resources it consumes from serving nodes, Tbase runs the cold backup process on the standby of each data node. The upload mechanism is also optimized so that both the cold backup and the incremental backup are written through to HDFS without touching the local disk, reducing local disk I/O pressure. Network resources can also be throttled flexibly to further reduce system load.
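To illustrate the "write-through to HDFS without touching the local disk" idea, the sketch below pipes a PostgreSQL base backup taken from a standby straight into HDFS. The host names, HDFS path, and the choice of `pg_basebackup` plus `hdfs dfs -put -` are assumptions about how such a pipeline could look; Tbase's actual cold backup tooling is internal.

```python
import subprocess


def cold_backup_to_hdfs(standby_host: str, hdfs_path: str) -> None:
    """Stream a base backup from a standby node directly into HDFS."""
    # pg_basebackup -D - -Ft writes a tar stream to stdout;
    # `hdfs dfs -put - <dst>` reads that stream from stdin,
    # so nothing is staged on the local disk.
    backup = subprocess.Popen(
        ["pg_basebackup", "-h", standby_host, "-D", "-", "-Ft", "-X", "fetch"],
        stdout=subprocess.PIPE)
    subprocess.run(["hdfs", "dfs", "-put", "-", hdfs_path],
                   stdin=backup.stdout, check=True)
    backup.stdout.close()
    if backup.wait() != 0:
        raise RuntimeError("pg_basebackup failed")


cold_backup_to_hdfs("dn1-standby", "/backup/tbase/base_20240101.tar")
```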

Figure 11 Tbase cold backup system

6. One last word

At present, Tbase can be deployed in private clouds and is compatible with the PostgreSQL protocol. It addresses pain points such as storage cost, data skew, online scale-out, distributed transactions, and cross-node JOINs. Tbase is already running stably in systems such as WeChat Pay, e-government service halls, and public security.


This article was published by Tencent Cloud Community with the author's authorization. Please cite the original source when reproducing it: https://cloud.tencent.com/community/article/566868