• Distributed
    • Distributed lock
      • Database unique index
      • Redis SETNX command
      • Redis RedLock algorithm
      • Zookeeper ordered nodes
    • Distributed transactions
      • 2PC
      • Local message table
    • CAP
      • Consistency
      • Availability
      • Partition tolerance
      • Trade-off
    • BASE
      • Basically available
      • Soft state
      • Eventual consistency
    • Paxos
      • Execution process
      • Constraints
    • Raft
      • Single Candidate election
      • Multiple candidates
      • Data synchronization
    • References

Distributed lock

In the single-machine scenario, process synchronization can be achieved by using the language’s built-in lock. But in distributed scenarios, where the processes that need to be synchronized may be on different nodes, distributed locks are required.

Blocking locks are usually implemented using mutex:

  • A mutex value of 0 indicates that another process holds the lock, i.e. the locked state;
  • A mutex value of 1 indicates the unlocked state.

1 and 0 can be represented by an integer value, or by the presence or absence of some data.

Database unique index

A record is inserted into a table when the lock is acquired and deleted when the lock is released. The unique index guarantees that the record can be inserted only once, so the existence of the record can be used to determine whether the lock is held.
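
A minimal sketch of this approach (the locks table with a unique index on its name column is an illustrative assumption; SQLite is used here only for brevity):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # the PRIMARY KEY on name acts as the unique index
    with conn:
        conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY)")

    def acquire(name):
        # inserting the record acquires the lock; the unique index rejects a second insert
        try:
            with conn:
                conn.execute("INSERT INTO locks (name) VALUES (?)", (name,))
            return True
        except sqlite3.IntegrityError:
            return False  # another process already holds the lock

    def release(name):
        # deleting the record releases the lock
        with conn:
            conn.execute("DELETE FROM locks WHERE name = ?", (name,))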

There are the following problems:

  • The lock has no expiration time. If the unlock fails, other processes cannot obtain the lock.
  • The lock can only be a non-blocking lock: if the insert fails, an error is reported immediately and the acquisition cannot be retried.
  • The lock is not reentrant: a process that already holds the lock cannot acquire it again.

Redis SETNX command

Use the SETNX (SET if Not eXists) command to insert a key-value pair: it returns False if the key already exists and True otherwise.

The SETNX command is similar to the database unique index in that it guarantees that only one such key-value pair exists, so the existence of the key-value pair can be used to determine whether the lock is held.

The EXPIRE command sets an expiration time on the key-value pair, which avoids the problem of the database unique index implementation where a failed unlock leaves the lock held forever.
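
A minimal sketch with the redis-py client (assuming a local Redis instance; the SET command's NX and EX options combine SETNX and EXPIRE into one atomic command):

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost:6379

    def acquire(lock_name, token, ttl=10):
        # SET key value NX EX ttl: succeeds only if the key does not exist,
        # and attaches an expiration time atomically
        return r.set(lock_name, token, nx=True, ex=ttl) is not None

    def release(lock_name, token):
        # delete the key only if we still own it (a Lua script would be needed
        # to make this check-and-delete strictly atomic)
        if r.get(lock_name) == token.encode():
            r.delete(lock_name)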

Redis RedLock algorithm

Multiple instances of Redis are used to implement distributed locks to ensure availability in the event of a single point of failure.

  • Try to acquire the lock on N independent Redis instances;
  • The acquisition is considered successful only if the lock is obtained on a majority (N / 2 + 1) of the instances and the total time spent is less than the lock's expiration time;
  • If the acquisition fails, the lock is released on every instance.
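
A simplified sketch of this majority-based acquisition (the instance addresses, the ttl_ms parameter, and the token handling are illustrative assumptions; production code should use a vetted RedLock client):

    import time
    import uuid
    import redis

    # hypothetical independent Redis instances
    instances = [redis.Redis(port=p) for p in (6379, 6380, 6381)]

    def redlock_acquire(name, ttl_ms=10000):
        token = str(uuid.uuid4())
        start = time.monotonic()
        acquired = 0
        for r in instances:
            try:
                if r.set(name, token, nx=True, px=ttl_ms):
                    acquired += 1
            except redis.RedisError:
                pass  # tolerate failures of individual instances
        elapsed_ms = (time.monotonic() - start) * 1000
        # success requires a majority of instances and finishing within the lock lifetime
        if acquired >= len(instances) // 2 + 1 and elapsed_ms < ttl_ms:
            return token
        # otherwise release whatever was acquired, on every instance
        for r in instances:
            try:
                if r.get(name) == token.encode():
                    r.delete(name)
            except redis.RedisError:
                pass
        return None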

Zookeeper ordered nodes

1. Zookeeper abstract model

Zookeeper provides a tree-structured namespace. The parent of the /app1/p_1 node is /app1.


2. Node type

  • Permanent node: does not disappear when the session ends or times out;
  • Temporary node: Disappears if the session ends or times out;
  • Ordered node: a numeric suffix is appended to the node name, and the node names are ordered. For example, a generated ordered node might be /lock/node-0000000000, the next one /lock/node-0000000001, and so on.

3. Listener

A listener can be registered on a node; when the node's state changes, Zookeeper sends a notification to the client.

4. Distributed lock implementation

  • Create a lock directory /lock;
  • When a client wants to acquire the lock, it creates a temporary, ordered child node under /lock;
  • The client obtains the list of child nodes under /lock and checks whether the child node it created has the smallest sequence number in the list. If so, it considers the lock acquired. Otherwise, it watches the child node immediately before its own and repeats this step after receiving a change notification for that node, until the lock is acquired (see the sketch after this list).
  • Execute the business code, then delete the corresponding child node when done.
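
A simplified sketch of these steps using the kazoo Python client (assuming a Zookeeper server at 127.0.0.1:2181; kazoo also ships a ready-made Lock recipe that implements the same idea):

    import threading
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/lock")

    def acquire():
        # create a temporary (ephemeral), ordered (sequential) child node under /lock
        path = zk.create("/lock/node-", ephemeral=True, sequence=True)
        me = path.rsplit("/", 1)[-1]
        while True:
            children = sorted(zk.get_children("/lock"))
            if me == children[0]:
                return path  # smallest sequence number: lock acquired
            # otherwise watch only the child node immediately before ours
            prev = "/lock/" + children[children.index(me) - 1]
            event = threading.Event()
            if zk.exists(prev, watch=lambda _: event.set()):
                event.wait()

    def release(path):
        zk.delete(path)  # deleting the child node releases the lock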

5. Session timeout

If a session that holds the lock times out, the child node it created is deleted automatically (because it is a temporary node), and other sessions can then acquire the lock. As you can see, this implementation does not suffer from the unlock-failure problem of the database unique index implementation.

6. Herd effect

A node waiting for the lock only needs to watch the child node immediately before its own. If every node watched all child nodes instead, then whenever any child node changed state, every other node would be notified (the herd effect: one sheep moves and the rest rush after it), whereas we only want the node right after the changed child node to be notified.

Distributed transactions

A distributed transaction is a transaction whose operations are located on different nodes and that still needs to guarantee the ACID properties.

For example, in an order scenario, a distributed transaction is involved if the inventory and order are not on the same node.

Distributed locks and distributed transactions:

  • The key to the lock problem is mutual exclusion between process operations. For example, if multiple processes modify an account balance at the same time without mutual exclusion, the balance will end up incorrect.
  • The crux of the transaction problem is that a transaction involves a set of operations that need to meet ACID properties, such as atomic operations that require either all or none of these operations to be performed.

2PC

Two-Phase Commit (2PC) introduces a coordinator to coordinate the behavior of the participants and ultimately decide whether these participants actually commit the transaction.

1. Execution process

1.1 Preparation Phase

The coordinator asks the participant if the transaction executed successfully, and the participant sends back the result of the transaction execution. A query can be viewed as a vote that requires the consent of all participants.


1.2 Commit Phase

If the transaction executed successfully on every participant, the coordinator sends a notification telling the participants to commit the transaction; otherwise, it sends a notification telling them to roll back.

It is important to note that in the preparation phase, a participant executes the transaction but does not commit it yet. Commit or rollback happens only after the coordinator's notification is received during the commit phase.
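
A minimal sketch of the coordinator's logic for the two phases (the Participant interface with prepare/commit/rollback methods is a hypothetical abstraction for illustration):

    from typing import List, Protocol

    class Participant(Protocol):
        # hypothetical participant interface
        def prepare(self) -> bool: ...   # execute the transaction without committing; return the vote
        def commit(self) -> None: ...
        def rollback(self) -> None: ...

    def two_phase_commit(participants: List[Participant]) -> bool:
        # phase 1 (preparation): collect votes from all participants
        votes = []
        for p in participants:
            try:
                votes.append(p.prepare())
            except Exception:
                votes.append(False)  # treat a failure or timeout as a "no" vote
        # phase 2 (commit): commit only if every vote is "yes", otherwise roll back
        if all(votes):
            for p in participants:
                p.commit()
            return True
        for p in participants:
            p.rollback()
        return False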


2. Existing problems

2.1 Synchronization Blocking

All transaction participants are in a synchronous blocking wait state while waiting for responses from other participants and cannot perform other operations.

2.2 Single point problem

The coordinator plays a very large role in 2PC, and a failure can have a big impact. In particular, if a failure occurs during the commit phase, all participants will block and wait synchronously, unable to complete other operations.

2.3 Inconsistent data

During the commit phase, if a network failure occurs while the coordinator has sent only some of the commit messages, then only some of the participants receive the commit message and commit the transaction, leaving the system data inconsistent.

2.4 Too conservative

The failure of any node will result in the failure of the entire transaction, and there is no perfect fault tolerance mechanism.

Local message table

The local message table is in the same database as the business data table, so local transactions can be used to ensure that the transaction characteristics are met during operations on the two tables, and message queues are used to ensure final consistency.

  1. After one party of the distributed transaction has written its business data, it writes a message into the local message table; the local transaction guarantees that the message is written to the local message table.
  2. After that, the message in the local message table is forwarded to the message queue. If the forwarding succeeds, the message is deleted from the local message table. Otherwise, the message is forwarded again.
  3. The other party in the distributed transaction reads a message from the message queue and performs the operation in the message.
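
A minimal sketch of steps 1 and 2 above (the orders and local_messages tables, the queue.publish client, and the use of SQLite are illustrative assumptions):

    import json
    import sqlite3

    conn = sqlite3.connect("orders.db")
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, item TEXT, qty INTEGER)")
        conn.execute("CREATE TABLE IF NOT EXISTS local_messages (order_id TEXT PRIMARY KEY, payload TEXT)")

    def place_order(order_id, item, qty):
        # step 1: business write and message write share one local transaction
        with conn:
            conn.execute("INSERT INTO orders (id, item, qty) VALUES (?, ?, ?)",
                         (order_id, item, qty))
            conn.execute("INSERT INTO local_messages (order_id, payload) VALUES (?, ?)",
                         (order_id, json.dumps({"item": item, "qty": qty})))

    def forward_messages(queue):
        # step 2: forward pending messages to the message queue; delete them on success,
        # otherwise they stay in the table and are retried on the next run
        for order_id, payload in conn.execute(
                "SELECT order_id, payload FROM local_messages").fetchall():
            if queue.publish(payload):  # hypothetical message queue client
                with conn:
                    conn.execute("DELETE FROM local_messages WHERE order_id = ?", (order_id,))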


CAP

A distributed system can satisfy at most two of the following three: C (Consistency), A (Availability), and P (Partition tolerance).


Consistency

Consistency refers to whether multiple copies of the data can remain consistent. Under the consistency condition, a data update moves the system from one consistent state to another consistent state.

If, after a successful data update, all users can read the latest value, the system is considered to have strong consistency.

Availability

Availability refers to the ability of a distributed system to provide normal service in the face of various exceptions. It can be measured as the ratio of the available time to the total time of the system. Four nines indicate that the system is available 99.99% of the time.

Under availability conditions, the services provided by the system are required to be available all the time, and the results can always be returned within a limited time for each user’s operation request.

Partition tolerance

A network partition means that the nodes of a distributed system are split into multiple partitions; nodes within the same partition can communicate with each other, but different partitions cannot communicate.

Under the partition tolerance condition, a distributed system must still provide consistent and available services when any network partition failure occurs, unless the entire network environment fails.

Trade-off

Partition tolerance is essential in a distributed system, because the network must always be assumed to be unreliable. CAP theory is therefore really a trade-off between availability and consistency.

Availability and consistency conflict with each other and are difficult to satisfy at the same time. When synchronizing data among multiple nodes:

  • To ensure consistency (CP), unsynchronized nodes cannot be accessed and some availability is lost;
  • To ensure availability (AP), it is allowed to read data from all nodes, but the data may be inconsistent.

BASE

BASE is an abbreviation for the phrases Basically Available, Soft State, and Eventually Consistent.

BASE theory is the result of weighing consistency against availability in CAP. Its core idea is that even if strong consistency cannot be achieved, each application can adopt an appropriate approach, according to its own business characteristics, to reach eventual consistency.

Basically available

It means that when a failure occurs, the distributed system guarantees the availability of its core functions and allows a partial loss of availability.

For example, in order to ensure the stability of the shopping system, some consumers may be directed to a degraded page when e-commerce companies are promoting their products.

Soft state

It refers to allowing the existence of an intermediate state of data in the system and considering that the intermediate state will not affect the overall availability of the system, that is, allowing the synchronization between data copies of different nodes in the system to have a delay.

Eventual consistency

Eventual consistency emphasizes that all copies of the data in the system can eventually reach a consistent state after a period of synchronization.

ACID requires strong consistency and is typically used in traditional database systems. BASE, on the other hand, requires only eventual consistency, sacrificing strong consistency in exchange for availability, and is typically used in large distributed systems.

In a real distributed scenario, different business units and components have different requirements for consistency, so ACID and BASE tend to be used together.

Paxos

Paxos is used to solve the consensus problem, that is, to ensure that among the values proposed by multiple nodes, only one value is finally chosen.

There are three main types of nodes:

  • Proposer: proposes a value;
  • Acceptor: votes on each proposal;
  • Learner: is informed of the voting result and does not take part in the voting process.


Execution process

A proposal is specified to contain two fields: [n, v], where n is the sequence number (unique) and v is the proposal value.

1. Prepare phase

The following figure shows the initial stage of running the algorithm with two Proposers and three Acceptors; each Proposer sends a Prepare request to all Acceptors.


When an Acceptor receives a Prepare request containing the proposal [n1, v1] and it has not received any Prepare request before, it sends back a Prepare response, sets the currently received proposal to [n1, v1], and promises not to accept any proposal with a sequence number less than n1 in the future.

In the figure, Acceptor X has not received any proposal before it receives the Prepare request containing [n=2, v=8], so it sends back a Prepare response containing no previous proposal, sets the currently received proposal to [n=2, v=8], and promises not to accept any proposal with a sequence number less than 2 in the future. The other Acceptors behave similarly.


If an Acceptor receives a Prepare request containing the proposal [n2, v2] and it has previously received the proposal [n1, v1], then: if n1 > n2, it discards the request; otherwise, it sends back a Prepare response containing the previously received proposal [n1, v1], sets the currently received proposal to [n2, v2], and promises not to accept any proposal with a sequence number less than n2 in the future.

In the figure, when Acceptor Z receives Proposer A's Prepare request for [n=2, v=8], it discards the request because it has already received the proposal [n=4, v=5] and 4 > 2. When Acceptor X receives Proposer B's Prepare request for [n=4, v=5], it has previously received the proposal [n=2, v=8]; since 2 <= 4, it sends a Prepare response containing [n=2, v=8], sets the currently received proposal to [n=4, v=5], and promises not to accept any proposal with a sequence number less than 4 in the future. Acceptor Y behaves similarly.


2. Accept stage

A Proposer sends an Accept request when it has received Prepare responses from more than half of the Acceptors.

Proposer A sends the Accept request [n=2, v=8] after receiving two Prepare responses. This Accept request is discarded by all Acceptors, because by this point every Acceptor has promised not to accept proposals with a sequence number less than 4.

Proposer B also sends an Accept request after receiving two Prepare responses. Note that the v of its Accept request takes the value from the proposal with the largest sequence number that it received, which is v=8. So it sends the Accept request [n=4, v=8].


3. Learn stage

When an Acceptor receives an Accept request whose sequence number is greater than or equal to the minimum sequence number it has promised, it sends the accepted proposal to all Learners. When a Learner finds that a majority of Acceptors have accepted a proposal, that proposal's value has been chosen by Paxos.
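
The Acceptor's rules above can be sketched as a small state machine (a simplified illustration of classical Paxos: the Prepare handler only records the promise and returns any previously accepted proposal, the Accept handler records the proposal; messaging, Proposers, and Learners are omitted):

    class Acceptor:
        def __init__(self):
            self.promised_n = -1   # lowest sequence number this Acceptor will still consider
            self.accepted = None   # last accepted proposal as (n, v), or None

        def on_prepare(self, n):
            """Handle a Prepare request with sequence number n."""
            if n < self.promised_n:
                return None                 # discard: a higher number was already promised
            self.promised_n = n             # promise not to accept anything below n
            return self.accepted            # report the previously accepted proposal, if any

        def on_accept(self, n, v):
            """Handle an Accept request for proposal [n, v]."""
            if n < self.promised_n:
                return False                # would violate the promise, so reject
            self.promised_n = n
            self.accepted = (n, v)          # accepted; this is what gets reported to Learners
            return True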


Constraints

1. Correctness

Only one of the proposed values will take effect.

Because the Paxos protocol requires that every proposal that takes effect be accepted by a majority of Acceptors, and an Acceptor will not accept two different proposals, correctness is guaranteed.

2. Termination

In the end, some proposal will always take effect.

The Paxos protocol makes the proposals sent by Proposers converge toward the proposal that can be accepted by a majority of Acceptors, thereby guaranteeing termination.

Raft

Raft is also a distributed consensus protocol; it is mainly used to elect a leader (primary node).

  • Raft: Understandable Distributed Consensus

Single Candidate election

There are three types of nodes: Follower, Candidate, and Leader. The Leader periodically sends heartbeat packets to the Followers. Each Follower sets a random election timeout, usually 150 ms to 300 ms; if it does not receive a heartbeat packet from the Leader within this period, it becomes a Candidate and enters the election phase.

  • The following figure shows the initial stage of a distributed system, where there are only Followers and no Leader. Node A waits for a random election timeout without receiving a heartbeat packet from a Leader, so Node A enters the election phase.


  • Node A then sends a vote request to all the other nodes.


  • The other nodes respond to the request, and if more than half respond, the Candidate becomes the Leader.


  • The Leader then periodically sends heartbeat packets to the Follower. The Follower receives the heartbeat packets and starts the timer again.
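
The election flow above can be sketched roughly as follows (a single-threaded simplification; terms, log comparison, and the RPC transport are omitted, and the request_vote handler shown here is a hypothetical stand-in):

    import random
    import time

    class Node:
        def __init__(self, peers):
            self.role = "follower"
            self.peers = peers                                      # the other nodes in the cluster
            self.election_timeout = random.uniform(0.150, 0.300)   # 150 ms - 300 ms, per node
            self.last_heartbeat = time.monotonic()

        def on_heartbeat(self):
            # a heartbeat from the Leader resets the election timer
            self.last_heartbeat = time.monotonic()

        def timed_out(self):
            return time.monotonic() - self.last_heartbeat > self.election_timeout

        def request_vote(self, candidate):
            # simplified vote handler: grant the vote if we have not heard from a Leader either
            return self.timed_out()

        def tick(self):
            # called periodically; with no heartbeat, become a Candidate and request votes
            if self.role == "follower" and self.timed_out():
                self.role = "candidate"
                votes = 1 + sum(1 for peer in self.peers if peer.request_vote(self))
                if votes > (len(self.peers) + 1) // 2:              # more than half of the cluster
                    self.role = "leader"                             # and begin sending heartbeats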


Multiple candidates

  • If more than one Follower becomes a Candidate and receives the same number of votes, the vote needs to be restarted. For example, in the following figure, Node B and Node D both get two votes and need to start voting again.


  • Since each node sets different random election timeouts, the probability of multiple candidates getting the same vote next time is very low.


Data synchronization

  • Changes from the client are passed to the Leader. Note that the change has not yet been committed, but is just written to the log.


  • The Leader copies the changes to all followers.


  • The Leader waits until a majority of the Followers have written the change, and then commits it.


  • At this point, the Leader notifies all Followers to commit the change as well, after which the values on all nodes are consistent.
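
A rough sketch of this commit rule on the Leader side (the replication transport, follower acknowledgements, terms, and log indices are all simplified assumptions):

    class Follower:
        def __init__(self):
            self.log = []
            self.committed = 0

        def append_entry(self, change):
            self.log.append(change)   # write the change to the log and acknowledge the Leader
            return True

        def commit(self, upto):
            self.committed = upto     # apply entries up to this index

    class Leader:
        def __init__(self, followers):
            self.followers = followers
            self.log = []             # uncommitted and committed entries
            self.committed = 0        # number of committed entries

        def handle_client_change(self, change):
            # 1. append the change to the local log; it is not committed yet
            self.log.append(change)
            # 2. replicate the entry to all Followers and count acknowledgements
            acks = sum(1 for f in self.followers if f.append_entry(change))
            # 3. commit only once a majority of the cluster (Leader included) has the entry
            if acks + 1 > (len(self.followers) + 1) // 2:
                self.committed = len(self.log)
                # 4. tell the Followers to commit as well
                for f in self.followers:
                    f.commit(len(self.log))
                return True
            return False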


References

  • Ni Chao. From Paxos to ZooKeeper: Principles and Practice of Distributed Consistency. Publishing House of Electronics Industry, 2015.
  • Distributed locks with Redis
  • Discussion on Distributed Lock
  • Distributed lock based on Zookeeper
  • Talk about distributed transactions, and talk about solutions
  • Transaction processing in distributed systems
  • Deep understanding of distributed transactions
  • What is CAP theorem in distributed database system?
  • NEAT ALGORITHMS – PAXOS
  • Paxos By Example