1. Disaster recovery plan

The RTM system architecture was introduced earlier, where we noted that our disaster recovery design is active-active/multi-active. Many readers have asked why a master/slave architecture was not adopted, and how data consistency is achieved in multi-active disaster recovery (DR). This section describes the disaster recovery and multi-active design in detail.

The most common DR scheme is active-passive. The original version of the system adopted an active host with a cold standby: the Client communicates with the Master, and only the active host provides service. When the active host fails, an arbitration service switches the active and standby roles; the original active host goes offline and becomes the standby, while the original standby becomes active and serves clients.
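To make the switchover concrete, here is a minimal sketch of the arbitration logic in Python; the Node and Arbiter names are hypothetical illustrations, not the RTM system's actual code:

    # Minimal active-passive failover sketch. An arbiter health-checks the
    # active node and swaps roles when it fails; the old active rejoins as standby.

    class Node:
        def __init__(self, name):
            self.name = name
            self.role = "standby"
            self.healthy = True

        def serve(self, request):
            assert self.role == "active", "only the active node serves clients"
            return f"{self.name} handled {request}"

    class Arbiter:
        def __init__(self, active, standby):
            active.role, standby.role = "active", "standby"
            self.active, self.standby = active, standby

        def check_and_failover(self):
            if not self.active.healthy:
                # Promote the standby; the failed host comes back as standby.
                self.active, self.standby = self.standby, self.active
                self.active.role, self.standby.role = "active", "standby"

    arbiter = Arbiter(Node("host-a"), Node("host-b"))
    arbiter.active.healthy = False        # simulate a crash of the active host
    arbiter.check_and_failover()
    print(arbiter.active.serve("req-1"))  # host-b handled req-1

Note that in this model the standby does no work until it is promoted, which is exactly the idle-capacity drawback discussed below.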

The one-active/one-standby model is simple to design and highly available; after all, the probability that both the active and standby machines are unavailable at the same time is very low. Many backend systems adopt similar DR policies.

Active-passive DR also has some obvious defects. Half of the machines (the standby half) sit idle, and during an active/standby switchover the standby must absorb all of the active host's traffic in an instant, which can overload it. If active-passive DR has these problems, why do so many companies still adopt it? In practice, the choice of architecture depends on the project's context. In most cases the number of machines required is small, so machine redundancy is not the main concern; and the active-standby architecture is simple and stable, so it can be developed and brought online quickly.

Because the RTM system must handle high concurrency and the project demands high equipment utilization, we adopted a multi-active architecture.

In a multi-active architecture, every machine can serve the Client. If one machine goes down or fails, the Client simply connects to another service in the same region, as sketched below. A multi-active architecture maximizes machine utilization and reduces the cost of operating the service. It is also relatively simple to operate and maintain: at traffic peaks you add machines in parallel, and at traffic troughs you remove them. Dynamically adding and removing machines is very effective for cost control. But while a multi-active architecture looks attractive, one of the hard technical problems it forces you to solve is distributed data consistency.
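As a sketch of the client side of this failover, assuming hypothetical endpoint names and a stubbed-out send() call in place of a real RPC:

    # In a multi-active deployment every endpoint can serve, so on failure
    # the client simply retries the request against the next endpoint.
    import random

    ENDPOINTS = ["rtm-a.example", "rtm-b.example", "rtm-c.example"]
    DOWN = {"rtm-b.example"}             # simulate one failed machine

    def send(endpoint, payload):
        # Stand-in for a real RPC call.
        if endpoint in DOWN:
            raise ConnectionError(endpoint)
        return f"{endpoint} ok: {payload}"

    def request(payload, endpoints=ENDPOINTS):
        last_err = None
        for ep in random.sample(endpoints, len(endpoints)):  # spread load
            try:
                return send(ep, payload)
            except ConnectionError as err:
                last_err = err           # this machine is down; try the next
        raise RuntimeError("all endpoints failed") from last_err

    print(request("hello"))              # served by a healthy endpoint

Scaling up or down is then just a matter of adding or removing entries from the endpoint list that clients discover.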

2. Consistency algorithms

Consistency here means data consistency: in a distributed system, the value of the same piece of data must agree across multiple nodes.

1. Why is consistency required?

a) Data cannot be stored on only a single node (host); otherwise a single point of failure may occur.

b) Multiple nodes (hosts) must therefore hold the same data.

c) Consistency algorithms exist to solve these two problems together.

2. Classification of consistency:

a) Strong consistency: the system guarantees that the state of the cluster changes immediately after a write is committed. Models: Paxos, Raft (Multi-Paxos), ZAB (Multi-Paxos).

b) Weak consistency (also known as eventual consistency): the system does not guarantee that the cluster's state changes immediately after a write is committed, but over time the state converges until it is consistent. Models: the DNS system, the Gossip protocol. The specific consistency algorithms are easy to look up, and some of the ideas are well worth studying; they are the crystallized wisdom of many experts.
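As a toy illustration of eventual consistency, here is a gossip-style anti-entropy sketch; the versioned key-value model is an assumption for illustration, not any particular protocol's format:

    # Nodes exchange (version, value) pairs and keep the highest version.
    # After enough gossip rounds, every node converges to the same state.

    class GossipNode:
        def __init__(self):
            self.store = {}  # key -> (version, value)

        def put(self, key, version, value):
            if key not in self.store or version > self.store[key][0]:
                self.store[key] = (version, value)

        def sync_with(self, peer):
            # Exchange state in both directions; the higher version wins.
            for key, (ver, val) in list(peer.store.items()):
                self.put(key, ver, val)
            for key, (ver, val) in list(self.store.items()):
                peer.put(key, ver, val)

    nodes = [GossipNode() for _ in range(3)]
    nodes[0].put("group.name", 1, "old")
    nodes[2].put("group.name", 2, "new")   # a later write lands elsewhere
    for a in nodes:                        # a few rounds of anti-entropy
        for b in nodes:
            a.sync_with(b)
    print([n.store["group.name"] for n in nodes])  # all show (2, 'new')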

There is a classic scenario in the RTM system: N people in a group are distributed across different regions, and several of them modify the same group attribute at the same time. Each modification must be delivered to everyone in the group. Without a data-consistency guarantee, some of the N members are very likely to end up with data that differs from what the others received, which would be fatal for the business.

Suppose attribute A is modified by multiple people in different places at the same time. Because of network fluctuation and delay, Client-D, E, and F may receive the updates to attribute A in completely different orders, and the attribute values they render will differ accordingly. We therefore borrowed some ideas from the Paxos algorithm:

P1: An Acceptor must accept the first proposal it receives.

P2: If a proposal with value v is chosen, then every chosen proposal with a higher number must also have value v.

P2a: If a proposal with value v is chosen, then every higher-numbered proposal accepted by an acceptor must also have value v.

P2b: If a proposal with value v is chosen, then every higher-numbered proposal issued by a Proposer must also have value v.

P2c: There must exist a set S consisting of more than half of all acceptors such that a new proposal, Pnew for short, satisfies one of two conditions:

1) If no acceptor in S has accepted any proposal, then Pnew's number only needs to be unique and increasing, and Pnew's value may be anything.

2) If one or more acceptors in S have accepted proposals, take the accepted proposal with the highest number, say number N and value v. Then Pnew's number must be greater than N, and Pnew's value must equal v.
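Below is a minimal single-decree Paxos sketch that encodes P1 and P2c: acceptors promise and accept by proposal number, and a proposer that learns of an already-accepted value must adopt it. It is an illustration of the rules only, not the RTM system's implementation:

    class Acceptor:
        def __init__(self):
            self.promised = -1       # highest proposal number promised
            self.accepted = None     # (number, value) last accepted, if any

        def prepare(self, n):
            # Phase 1: promise to ignore proposals numbered below n.
            if n > self.promised:
                self.promised = n
                return True, self.accepted  # report prior acceptance (P2c)
            return False, None

        def accept(self, n, value):
            # Phase 2 (and P1): accept unless a higher promise was made.
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return True
            return False

    def propose(acceptors, n, value):
        # Gather promises from a majority, applying P2c to pick the value.
        grants = [prev for ok, prev in (a.prepare(n) for a in acceptors) if ok]
        if len(grants) <= len(acceptors) // 2:
            return None              # no majority; retry with a larger n
        already = [p for p in grants if p is not None]
        if already:                  # a value was already accepted: keep it
            value = max(already)[1]
        votes = sum(a.accept(n, value) for a in acceptors)
        return value if votes > len(acceptors) // 2 else None

    acceptors = [Acceptor() for _ in range(3)]
    print(propose(acceptors, 1, "v1"))  # 'v1' is chosen
    print(propose(acceptors, 2, "v2"))  # P2c forces the value back to 'v1'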

The derivation of the algorithm is still hard to follow, but its core point can be summarized: a consistency algorithm does not directly force every node's value to match; it guarantees consistency of the order in which data is written. If the data is written in the same order on every node, the end result is the same.

For example, in the classic scenario above, as long as all clients receive the attribute updates in the same order, whether that order is 2-1-3 or 3-1-2, the final attribute value on every client will be the same.
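A small sketch of that idea, using a hypothetical global sequence counter in place of a real ordering service: updates may reach each client in any order, but every client applies them strictly by sequence number, so all replicas converge:

    import itertools

    seq = itertools.count(1)             # hypothetical global sequencer

    def stamp(update):
        return (next(seq), update)       # attach a total-order number

    class ClientReplica:
        def __init__(self):
            self.pending, self.applied, self.value = {}, 0, None

        def receive(self, seqno, update):
            self.pending[seqno] = update              # may arrive out of order
            while self.applied + 1 in self.pending:   # apply strictly in order
                self.applied += 1
                self.value = self.pending.pop(self.applied)

    writes = [stamp(v) for v in ("2", "1", "3")]      # three concurrent edits
    d, e, f = ClientReplica(), ClientReplica(), ClientReplica()
    for s, u in writes:
        d.receive(s, u)                               # in-order delivery
    for s, u in reversed(writes):
        e.receive(s, u)                               # fully reversed delivery
    for s, u in (writes[1], writes[2], writes[0]):
        f.receive(s, u)                               # shuffled delivery
    print(d.value, e.value, f.value)                  # all converge to '3'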

3. Summary

All architectures and algorithms ultimately serve the actual business, and no single architecture is fit for every purpose. The best solution is the one found to suit the business after weighing it from multiple perspectives and dimensions.