In a distributed system, nodes are physically independent of one another and communicate and coordinate over the network. Thanks to the transaction mechanism, data operations on each individual node are guaranteed to satisfy ACID. However, an independent node cannot know exactly how transactions are executed on other nodes, so in theory two machines cannot reach a consistent state by themselves. If you want data consistency across multiple machines in a distributed deployment, you must ensure that a write is performed on all nodes or on none of them. The conventional solution is therefore to introduce a "coordinator" component that uniformly schedules execution across all distributed nodes, transforming the distributed transaction into multiple local transactions and then relying on retries to reach eventual consistency. The solution path for distributed-consistency problems can be summarized as: business workaround -> BASE flexible transactions -> CP rigid transactions, preferring the earliest option that works.

Distributed consistency classification

Strong consistency (Paxos, Raft, ZAB/ZooKeeper)

This consistency level is the most intuitive: whatever is written can be read back immediately, which gives the best user experience, but implementing it often has a large impact on system performance.

Weak consistency

This consistency level constrains the system only loosely: after a successful write, it promises neither that the written value can be read immediately, nor exactly when consistency will be reached, though it tries to reach a consistent state within a certain time level (for example, seconds).

Eventual consistency (DNS, Eureka)

Eventual consistency is a special case of weak consistency: the system guarantees that the data will become consistent within a certain period of time. It is singled out here because it is the most widely advocated of the weak-consistency models, and the model the industry favors for data consistency in large distributed systems.

CAP theorem

A distributed system cannot simultaneously satisfy the three basic requirements of C: Consistency, A: Availability, and P: Partition tolerance; at most two of them can be met at the same time.

C: Consistency

In a distributed environment, consistency means that data stays consistent across multiple replicas. Under this requirement, when a system in a consistent state performs an update operation, the system's data must remain in a consistent state afterwards.

A: Availability

Availability means that the service provided by the system must always be usable: every operation request must return a result within a finite time. The emphasis here is on "finite time" and "returning a result".

P: Partition tolerance

Partition tolerance requires the following of a distributed system: it must keep providing consistent and available service in the face of any network partition failure, unless the entire network environment has failed.

  • CA: Abandon partition tolerance and strengthen consistency and availability. This is the traditional choice of the standalone database; it gives up being distributed
  • AP: Abandon strong consistency in favor of partition tolerance and availability. This is the design choice of many distributed systems
  • CP: Rigid transactions; give up availability and pursue consistency and partition tolerance. Usually used in the financial industry, which requires strong consistency; facing the resulting performance cost, it cannot meet high-concurrency Internet scenarios

BASE theory

BASE is an acronym for Basically Available, Soft state, and Eventually consistent. The core idea of BASE theory is that even if strong consistency cannot be achieved, each application can adopt appropriate ways to achieve the final consistency of the system according to its own business characteristics. Let’s look at the three elements of BASE:

  • Basically available

Basic availability means that a distributed system is allowed to lose part of its availability when unforeseen failures occur. Note that this is by no means equivalent to the system being unavailable. For example:

  • Loss of response time: normally an online search engine returns query results to the user within 0.5 seconds, but because of a failure the response time grows by 1 to 2 seconds.
  • Loss of system function: under normal circumstances, consumers on an e-commerce site can complete almost every order, but during a big shopping promotion, to protect the stability of the shopping system against the surge in traffic, some consumers may be directed to a degraded page.

  • Soft state

Soft state means allowing data in the system to have intermediate states, on the understanding that such intermediate states do not affect the overall availability of the system; that is, delay is allowed while data is synchronized between replicas on different nodes.

  • Eventually consistent

Eventual consistency emphasizes that all replicas of the data, after a period of synchronization, finally reach a consistent state. Its essence is that the system must guarantee that the data ends up consistent, rather than guaranteeing strong consistency in real time.

Distributed transaction solutions

  • Rigid transactions: follow the ACID principles, strong consistency
  • Flexible transactions: follow BASE theory, eventual consistency. Unlike rigid transactions, flexible transactions allow data on different nodes to be inconsistent for a period of time, but require it to become consistent eventually.
    • Compensating transactions, synchronous (TCC, Saga)
    • Notifying transactions, asynchronous (best-effort notification, MQ transactional messages)

2PC

In the XA-protocol-based two-phase commit scheme, the first phase is the voting phase, in which every participant reports back to the coordinator whether its branch of the transaction can succeed; the second phase is the execution phase, in which the coordinator, based on the feedback from all participants, notifies every participant to synchronously commit or roll back all branches.
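
A minimal sketch of the coordinator's control flow in such a scheme; the `Participant` interface and all names here are hypothetical illustrations, not the API of any particular XA implementation:

```java
import java.util.List;

// Hypothetical participant in an XA-style two-phase commit.
interface Participant {
    boolean prepare(String xid);  // phase one: vote yes (can commit) or no
    void commit(String xid);      // phase two: make the branch durable
    void rollback(String xid);    // phase two: undo the prepared branch
}

class TwoPhaseCommitCoordinator {
    // Returns true if the global transaction committed on every participant.
    boolean execute(String xid, List<Participant> participants) {
        // Phase one (voting): a single "no" vote aborts the whole transaction.
        boolean allVotedYes = participants.stream().allMatch(p -> p.prepare(xid));

        // Phase two (execution): broadcast the same decision to everyone.
        // Rolling back a branch that never prepared is treated as a no-op.
        for (Participant p : participants) {
            if (allVotedYes) p.commit(xid); else p.rollback(xid);
        }
        return allVotedYes;
    }
}
```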

TCC

When a transaction starts, the business application registers with the transaction coordinator to open the transaction. The application then calls the Try interface of all participating services to complete phase one, the preparation step. Depending on what the Try calls return, the transaction coordinator then decides whether to call the Confirm or the Cancel interface; if such a call fails, it is retried.

The TCC scheme lets applications define the granularity of their database operations themselves, which makes it possible to reduce lock conflicts and improve throughput. Of course, TCC also has shortcomings, mainly in two respects:

  • It is highly invasive to applications. Every branch of the business logic must implement the Try, Confirm, and Cancel operations, so the intrusion into the application, and the cost, are high.
  • It is difficult to implement. Different rollback strategies are needed for different failure causes, such as network state and system faults, and to meet the consistency requirement the Confirm and Cancel interfaces must be idempotent.

As a result, TCC is mostly adopted by large companies with strong R&D capability and an urgent need for it. Microservices advocate services that are lightweight and easy to deploy, whereas much of the transaction-handling logic in a TCC scheme must be hand-coded, which is complicated and development-heavy.
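
A minimal in-memory sketch of a single TCC branch (an account deduction, with invented names), showing the Try/Confirm/Cancel split and why Confirm and Cancel must be idempotent:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative TCC branch: Try reserves, Confirm consumes, Cancel releases.
class AccountTccBranch {
    private final Map<String, Long> frozenByXid = new ConcurrentHashMap<>();
    private long balance = 100;

    // Try (phase one): check the resource and reserve it under the global xid.
    synchronized boolean tryDeduct(String xid, long amount) {
        if (balance < amount) return false;  // vote "no": insufficient balance
        balance -= amount;
        frozenByXid.put(xid, amount);        // reserved, not yet spent
        return true;
    }

    // Confirm (phase two): consume the reservation. Idempotent: a retried
    // call finds no frozen amount and changes nothing.
    synchronized void confirm(String xid) {
        frozenByXid.remove(xid);
    }

    // Cancel (phase two): return the reservation to the balance. Idempotent
    // for the same reason.
    synchronized void cancel(String xid) {
        Long amount = frozenByXid.remove(xid);
        if (amount != null) balance += amount;
    }
}
```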

Message-based eventual consistency scheme (reliable-message eventual consistency, local message table)

The message-consistency scheme uses message middleware to ensure consistency of data operations between upstream and downstream applications. The basic idea is to place the local operation and the message send inside a single transaction, ensuring that the local operation and the message send either both succeed or both fail. The downstream application subscribes to the message system and performs its own operation when a message arrives. The messaging scheme essentially transforms one distributed transaction into two local transactions, then relies on the downstream side's retry mechanism to achieve eventual consistency. The message-based eventual consistency scheme is also highly intrusive to the application and requires a lot of business rework, so its cost is high.

Seata

AT

The user only needs to write the "business SQL" of phase one; the Seata framework automatically generates the transaction's two-phase commit and rollback operations.
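
In code, AT mode typically amounts to annotating the entry point; a sketch assuming Seata's Spring annotation, with the two downstream services invented for illustration:

```java
import io.seata.spring.annotation.GlobalTransactional;

// Hypothetical downstream services containing plain "business SQL".
interface OrderService { void createOrder(long userId, long productId, int count); }
interface StockService { void deduct(long productId, int count); }

class PlaceOrderFacade {
    private final OrderService orderService;
    private final StockService stockService;

    PlaceOrderFacade(OrderService orderService, StockService stockService) {
        this.orderService = orderService;
        this.stockService = stockService;
    }

    // The annotation opens a global transaction; Seata records undo logs for
    // each local SQL statement and drives the two-phase commit or rollback.
    @GlobalTransactional(name = "place-order", rollbackFor = Exception.class)
    public void placeOrder(long userId, long productId, int count) {
        orderService.createOrder(userId, productId, count);
        stockService.deduct(productId, count);
    }
}
```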

TCC

The Try operation, as phase one, checks and reserves resources; the Confirm operation, in phase two, commits the business action; the Cancel operation releases the reserved resources. TCC mode does not take the global row lock used in AT mode, so TCC's performance is much higher than AT mode's.

  • Allow empty rollback: if Try times out and the distributed transaction rolls back, Cancel is triggered without a preceding Try; Cancel must treat this as an empty rollback and return success, otherwise the system will retry endlessly
  • Anti-hanging control: reject Try operations that arrive after an empty rollback (see the sketch after this list). When implementing TCC services, empty rollback must be allowed, but a phase-one Try request arriving after the empty rollback must be rejected. Before returning success for an empty rollback, Cancel records the transaction XID or business primary key, marking the record as rolled back; the Try interface checks that mark and skips the Try operation if the XID or business primary key is already marked as successfully rolled back
  • Idempotency control: all three methods must be idempotent, because the transaction manager may retry them after network congestion or timeouts
  • Visibility control of business data: the phase-one Try operation of a TCC service reserves resources. Until the phase-two operation runs, other transactions may need to read the reserved resource data, so the implementation must consider how to show this intermediate-state business data to users. The general design principle is: better to show nothing or too little than to show too much or show something wrong
  • Concurrent access control of business data: after phase one of a TCC service completes, the reserved resources are not released until the phase-two operation runs. If other distributed transactions modify these business resources in the meantime, concurrency problems between distributed transactions can arise
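
The empty-rollback, anti-hanging, and idempotency rules above boil down to a small per-xid state machine; a sketch with invented names (in production this record lives in a database table, not in memory):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-xid bookkeeping for empty rollback, anti-hanging, and idempotency.
class TccGuard {
    enum Status { TRIED, CANCELLED, CONFIRMED }
    private final Map<String, Status> statusByXid = new ConcurrentHashMap<>();

    // Try: recorded at most once (idempotency); rejected if a Cancel already
    // ran for this xid (anti-hanging: a late Try must not reserve anything).
    boolean beginTry(String xid) {
        return statusByXid.putIfAbsent(xid, Status.TRIED) == null;
    }

    // Cancel: if Try never arrived, record the empty rollback and report
    // "nothing to undo" so the transaction manager stops retrying.
    boolean beginCancel(String xid) {
        Status prev = statusByXid.putIfAbsent(xid, Status.CANCELLED);
        if (prev == null) return false;  // empty rollback recorded
        // Real undo happens only on the first Cancel after a successful Try.
        return statusByXid.replace(xid, Status.TRIED, Status.CANCELLED);
    }

    // Confirm: valid only after Try; retried calls are no-ops.
    boolean beginConfirm(String xid) {
        return statusByXid.replace(xid, Status.TRIED, Status.CONFIRMED);
    }
}
```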

Saga

Long transactions, asynchronous. Each participant provides a forward method and a compensation method, and, as in TCC, empty compensation and anti-hanging handling must be allowed for. There is no isolation, so correctness of the data must be ensured from the business side. Phase one commits the local database transaction directly, with no locks, so performance is high; participants can execute asynchronously, driven by events, giving high throughput; and the compensation service is simply the "reverse" of the forward service, easy to understand and easy to implement.

XA

A distributed strong-consistency solution, but its performance is low and it is rarely used.

Election and consensus algorithms

  • The Byzantine Generals Problem

On unreliable channels where messages can be lost, it is impossible to reach consistency through message passing alone; in practice the problem is tackled with asymmetric encryption (signed messages).

  • The Paxos protocol is difficult to understand

  • ZAB: an atomic message broadcast protocol. Because Paxos is too complex, ZooKeeper (ZK) implemented the ZAB protocol on the basis of Paxos

ZAB elects the node with the largest transaction ID as the leader, and its broadcast resembles a two-phase-commit atomic broadcast:

  1. The leader receives a request from the client.
  2. The leader generates a new transaction, assigns it a proposal/transaction ID, and notifies the other followers of the transaction. After receiving it, each follower adds the transaction ID to a queue, executes the transaction, and responds to the leader.
  3. Before committing a transaction, a follower checks whether its transaction ID is the smallest of all the transaction IDs in the queue. If so, it commits the transaction; if not, it waits for the commit command of the smaller transaction.

Besides the transaction ID, ZAB also carries a LeaderID (the leader epoch).
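
A toy model of the follower-side ordering check in step 3, under the simplifying assumption that transaction IDs are plain longs:

```java
import java.util.concurrent.PriorityBlockingQueue;

// Toy follower queue: a transaction commits only when it is the smallest
// pending transaction ID, which preserves the leader's proposal order.
class FollowerQueue {
    private final PriorityBlockingQueue<Long> pending = new PriorityBlockingQueue<>();

    void onProposal(long zxid) {
        pending.add(zxid);              // step 2: queue the transaction ID
    }

    // Step 3: commit only if no smaller transaction is still outstanding.
    boolean canCommit(long zxid) {
        Long smallest = pending.peek();
        return smallest != null && smallest == zxid;
    }

    void onCommit(long zxid) {
        pending.remove(zxid);           // applied; unblocks the next one
    }
}
```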

  • Raft algorithm

Raft is similar to democratic voting; the core idea is "the majority rules". At any moment a node is in exactly one of three states: follower, candidate, or leader, and these roles change as time and conditions change. A leader is elected with the help of randomized timeouts, and the leader keeps sending heartbeat packets to the other nodes; when a request comes in, the heartbeats carry the data as well. Raft only solves crash-failure problems; it does not tolerate malicious (Byzantine) nodes in the cluster. The algorithm has two main processes, leader election and log replication, the latter split into a log-recording phase and a data-commit phase. The maximum number of failed nodes Raft can tolerate is (N-1)/2, where N is the total number of nodes in the cluster.
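
The "majority rules" arithmetic and the (N-1)/2 bound, made concrete (a sketch of the quorum math only, not a Raft implementation):

```java
// Quorum arithmetic behind Raft's "majority rules".
final class RaftMath {
    // Votes needed to win an election or commit a log entry.
    static int quorum(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Maximum number of crashed (non-Byzantine) nodes the cluster tolerates.
    static int maxFailures(int clusterSize) {
        return (clusterSize - 1) / 2;
    }

    public static void main(String[] args) {
        // A 5-node cluster needs 3 votes and keeps working with 2 nodes down.
        System.out.println(quorum(5) + " votes, tolerates " + maxFailures(5));
    }
}
```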

  • PBFT algorithm

PBFT tolerates malicious (Byzantine) nodes; the maximum number of faulty nodes it can tolerate is (N-1)/3.

Consensus

Suppose a node currently holds data X and an "add +1" operation log arrives; the state becomes X+1. The state (X) is there and the change (the operation log) is there: that is a state machine. Among people, a consensus is a group agreeing on one or more things. In the computer world it means multiple nodes agreeing on some data, and agreeing on the order of multiple pieces of data. Consensus models: master-slave synchronization, majority rule.
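
The state-machine view in code: the same starting state plus the same ordered operation log always yields the same result, which is exactly what replicas must agree on (a toy sketch):

```java
import java.util.List;

// Toy replicated state machine: state X plus an ordered op log => next state.
class CounterStateMachine {
    private long state = 0;  // the state "X"

    // Replicas that apply the same log in the same order end up identical.
    void apply(List<Long> operationLog) {
        for (long delta : operationLog) {
            state += delta;  // e.g. the "add +1" operation takes X to X+1
        }
    }

    long state() { return state; }

    public static void main(String[] args) {
        CounterStateMachine replica = new CounterStateMachine();
        replica.apply(List.of(1L));          // X = 0, log = [+1]  =>  X = 1
        System.out.println(replica.state()); // 1
    }
}
```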

  • POW: Proof of Work
  • POS: Proof of Stake
  • DPOS: Delegated Proof of Stake. Each holder of the digital currency votes for the organizations or individuals they support, and those with the highest numbers of votes become the super nodes
  • Conclusion:

POW places a strong emphasis on decentralization; POS appears decentralized but in practice centralizes easily; DPOS has an obvious center and improves efficiency by accepting a degree of centralization.

Replication, sharding, and routing

  • Replication: Copy the same data to multiple nodes (master-slave, peer-to-peer)
  • Sharding: Storing different data on different nodes

To increase the read performance of the system, add slave nodes; to improve write performance, shard the data.

  • Routing:
  1. Hash sharding, for point queries: a hash function establishes the key-partition mapping (most KV databases support this method)

    1. Round Robin

    Commonly known as the hash-mod algorithm: H(key) = hash(key) mod K, where the physical machines are numbered 0 to K-1, key is the primary key of a record, and H(key) is the number of the physical machine that stores that record. The advantage is simplicity; the disadvantage is that adding or removing machines requires rehashing everything, so it lacks flexibility. In effect it fuses two concerns, physical machines and data shards, into one, which is exactly why it is inflexible.

    2. Virtual buckets

    Membase introduces virtual buckets between the records to be stored and the physical machines, forming a two-level mapping: key-partition uses a hash function, while partition-machine is managed with a lookup table. When a new machine is added, you only need to move some virtual buckets to it and update the partition-machine mapping, which provides flexibility

    3. Consistent hashing

    Consistent hashing is a distributed hash table (DHT) implementation algorithm. The hash value space is bent end to end into a ring; each machine is mapped into this space by hashing its IP and port. A key is located by walking the ring clockwise, either sequentially or via a finger table. Virtual nodes can be used to solve the load imbalance that consistent hashing would otherwise cause: one physical machine node is virtualized into several virtual nodes, which are mapped to different positions on the ring (see the sketch after this list).

  2. Range sharding, range query
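
The ring lookup and virtual nodes from the consistent-hashing item above, as a minimal sketch (CRC32 is an arbitrary illustrative hash; production systems usually pick a better-distributed one):

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Minimal consistent-hash ring with virtual nodes.
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Each physical node ("ip:port") occupies several ring positions; this is
    // what evens out the load imbalance mentioned above.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    // Directed lookup: the first virtual node clockwise from the key's
    // position, wrapping around the ring if necessary.
    String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```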

Local message table – transactional message scenario

First, the order system needs to create a new order, and the items associated with the order are those selected in the shopping cart.

Second, after the order is created successfully, the shopping cart system needs to remove the ordered items from the cart. These two data updates, creating the order and emptying the shopping cart, must be guaranteed to either both succeed or both fail. Emptying the cart, however, does not require as much consistency as, say, deducting coupons: it is perfectly acceptable for the cart to be emptied a few seconds after the order is created. We just need to make sure that, after a small delay, the final order data and shopping cart data are consistent.

Local message tables are well suited to this kind of distributed eventual-consistency problem. Let's look at how a local message table solves the data consistency problem between the order and the shopping cart.

The idea behind the local message table is this: when the order service receives an order request, it updates the order data using the order database's transaction as usual and, within that same database transaction, records a message locally. This message is a log entry for the "empty cart" operation. Since the log is recorded locally, no distributed problem is involved; it is an ordinary single-machine transaction, so the order database's transaction guarantees that the local message and the order data are recorded consistently. Once that is done, a success response can be returned to the client. An asynchronous service then reads the local message just recorded and calls the cart system's service to empty the cart; once the cart is emptied, it updates the local message's status to done. If the asynchronous cart-emptying operation fails, it can simply be retried. In the end, the order system and the shopping cart system are guaranteed to hold consistent data.
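
A sketch of the order-side write in plain JDBC; the table and column names are invented for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

// Local message table pattern: the order row and the "empty cart" message are
// written in ONE local transaction, so they either both exist or both don't.
class OrderWriter {
    private final DataSource dataSource;

    OrderWriter(DataSource dataSource) { this.dataSource = dataSource; }

    void createOrder(long orderId, long userId, String cartItemsJson) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement order = conn.prepareStatement(
                     "INSERT INTO orders(id, user_id) VALUES (?, ?)");
                 PreparedStatement message = conn.prepareStatement(
                     "INSERT INTO local_message(order_id, type, payload, status) "
                         + "VALUES (?, 'EMPTY_CART', ?, 'NEW')")) {
                order.setLong(1, orderId);
                order.setLong(2, userId);
                order.executeUpdate();
                message.setLong(1, orderId);
                message.setString(2, cartItemsJson);
                message.executeUpdate();
                conn.commit();    // single-machine transaction, no 2PC involved
            } catch (SQLException e) {
                conn.rollback();  // neither the order nor the message survives
                throw e;
            }
        }
    }
    // A separate relay thread polls local_message WHERE status = 'NEW',
    // calls the cart service, and marks the row 'DONE', retrying on failure.
}
```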

The failure modes this rules out are: the order is created but the cart is not cleared, or the order is not created but the cart is cleared. Creating the order and recording the message either both succeed or both fail; one succeeding while the other fails is not allowed.

Transactional messages are mainly suitable for scenarios where data is updated asynchronously and real-time consistency is not required. In the example we opened with, it is perfectly acceptable for the cart to be emptied a few seconds after the order is created, as long as the cart data eventually matches the order data.

Design your own reliable messaging platform

Message independent subsystem

Message statuses: to be confirmed, to be sent, completed

Interface design:

  • Interface for querying sender message status

  • The receiver ensures that the interface is idempotent

  • Foreground interface

  1. Send a message directly
  2. Store and send a message
  3. Reliable message sending:
  4. Pre-store the message – status "to be confirmed"
  5. Confirm and send the message – status "to be sent"
  6. Update the message status – "completed"
  • Operation interface
  1. Paging query message data
  2. Resend all dead messages in a message queue
  3. Resend the message according to MqMessageEntity
  4. Mark the message as dead
  5. Get the message by message ID
  • Scheduled Task Interface
  1. Delete the message based on the message ID
  2. Resend a message according to msgObjId
  • Timing task
  1. Message sender – message-status confirmation background thread: page through messages older than 5 minutes, ascending, with status "to be confirmed", and invoke the sender's message-status confirmation interface according to the business extension field. On success, confirm and send the message; on failure, delete the message
  2. Message receiver – message recovery background thread: page through messages older than 5 minutes, ascending, with status "to be sent"; check the number of sends so far (maximum 6) and, after the corresponding interval (3, 5, 10, 15, 30, 60 minutes), resend (resend a message according to msgObjId), as sketched below
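
The recovery thread's retry policy (at most 6 sends, at the growing intervals listed above) might look like this sketch:

```java
import java.time.Duration;

// Retry schedule for the message recovery task: 6 attempts, growing intervals.
final class RetryPolicy {
    private static final int[] INTERVAL_MINUTES = {3, 5, 10, 15, 30, 60};

    // Delay before the next send given how many sends already happened, or
    // null when the message has used all 6 attempts and should be marked dead.
    static Duration nextDelay(int sendsSoFar) {
        if (sendsSoFar >= INTERVAL_MINUTES.length) return null;
        return Duration.ofMinutes(INTERVAL_MINUTES[sendsSoFar]);
    }

    public static void main(String[] args) {
        // First resend after 3 minutes, second after 5, ... none after the sixth.
        for (int i = 0; i <= 6; i++) System.out.println(i + " -> " + nextDelay(i));
    }
}
```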
