1 Concepts

1.1 Model

Node

In an actual engineering project, a node is usually a process running on an operating system. In the model used in this article, a node is treated as a complete, indivisible whole; if a process is in fact composed of several relatively independent parts, it can be modeled as multiple nodes.

Anomalies

  1. Machine crashes: machine crashes are among the most common anomalies. In a large cluster, the daily probability of any given machine going down is roughly 1/1000. In practice, the recovery time of a crashed machine is usually assumed to be 24 hours, since manual intervention is generally required to restart it.
  2. Network anomalies: messages may be lost, and two nodes may become completely unable to communicate with each other, i.e. a "network partition" occurs. Messages may arrive out of order, with some probability of not reaching the destination node in the order they were sent, so mechanisms such as sequence numbers are used to handle out-of-order messages and keep invalid or expired messages from affecting the correctness of the system. Message data may also be corrupted. TCP is not a cure-all: although TCP gives the application layer reliable, connection-oriented transport, protocol design in a distributed system cannot assume that all communication carried over TCP is reliable. TCP can only guarantee the ordering of messages within a single TCP connection, not across different TCP connections.
  3. The three states of a distributed call: suppose node A issues a Remote Procedure Call (RPC) to node B, i.e. A sends a message to B, B performs some operation based on the message content and returns the result to A in another message. The outcome of this RPC then has three possible states: "success", "failure", and "timeout (unknown)", known as the three states of a distributed system.
  4. Loss of stored data: for a stateful node, losing its stored data means losing its state. Usually the state can only be restored by reading it back from other nodes that hold replicas.
  5. Exception-handling principle: a golden rule of exception handling, validated by extensive engineering practice, is that any abnormal situation considered at design time will eventually occur in the running system, while anomalies that do occur in operation may well not have been anticipated at design time. Therefore, unless the requirements explicitly allow it, no abnormal situation may be ignored in the system design.

1.2 Replicas

A replica (copy) refers to redundancy provided for data or services in a distributed system. Data replication means persisting the same data on different nodes; when one node loses its stored data, the data can be read back from a replica. Data replication is the only means of handling data-loss anomalies in a distributed system. The other kind of replica is the service replica, where multiple nodes provide the same service; such a service generally does not depend on local storage and obtains its data from other nodes.

Replica control protocols are the theoretical core of distributed systems.

Replica consistency

A distributed system uses a replica control protocol to ensure that, under certain constraints, data read from any replica of the system is the same; this property is called consistency. Replica consistency is a property of the distributed system as a whole, not of an individual replica.

  1. Strong consistency: any user or node can, at any time, read the most recently successfully updated data. Strong consistency is the strongest consistency requirement and also the hardest to achieve in practice.
  2. Monotonic consistency: at any time, once a user has read the value of some data after a certain update, that user will never read a value older than it. Monotonic consistency is weaker than strong consistency but very useful, since users generally only care about the consistency they observe from their own perspective, not about what other users observe (a client-side sketch follows this list).
  3. Session consistency: once a user has read the updated value of some data within a session, that user will not read any older value within the same session. Session consistency relaxes monotonic consistency further by introducing the notion of a session: it only guarantees monotonicity of data within a single session of a single user, and guarantees nothing across users or across different sessions of the same user. Many practical mechanisms correspond to the notion of a session, such as the session concept in PHP.
  4. Eventual consistency: once an update succeeds, the data on all replicas will eventually reach the same state, but the time needed to reach that state cannot be guaranteed. In an eventually consistent system, a user who always reads from the same replica obtains an effect similar to monotonic consistency, but once the user switches to a different replica no consistency guarantee holds.
  5. Weak consistency: once an update succeeds, users cannot be guaranteed to read the updated value within any bounded time, and even if the new value has been read on one replica there is no guarantee that it can be read on the others. Weakly consistent systems are generally hard to use in practice and require extra work on the application side to make the system usable.
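
As an illustration of how a client can obtain monotonic reads on top of replicas that individually offer only weaker guarantees, the sketch below tracks the highest version seen in the session and never returns anything older. The replica contents and the read_from_replica helper are hypothetical, used only to make the guarantee concrete.

```python
import random

# Hypothetical replica states: each replica returns (version, value).
# Replica 2 lags behind the others.
REPLICAS = [
    {"version": 3, "value": "v3"},
    {"version": 3, "value": "v3"},
    {"version": 2, "value": "v2"},  # stale replica
]

def read_from_replica(replica_id):
    """Pretend RPC: return (version, value) stored on one replica."""
    r = REPLICAS[replica_id]
    return r["version"], r["value"]

class MonotonicReader:
    """Client-side session that never returns data older than what it has seen."""
    def __init__(self):
        self.max_seen = -1          # highest version observed in this session
        self.last_value = None

    def read(self):
        rid = random.randrange(len(REPLICAS))
        version, value = read_from_replica(rid)
        if version >= self.max_seen:
            self.max_seen, self.last_value = version, value
        # If the chosen replica is stale, fall back to the freshest value
        # already observed, preserving monotonic-read consistency.
        return self.last_value

reader = MonotonicReader()
print([reader.read() for _ in range(5)])   # never goes back from "v3" to "v2"
```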

1.3 Metrics for measuring distributed systems

  1. Performance: the throughput of a system is the amount of data it can process in a given time, usually measured as total data processed per second; response latency is the time the system needs to complete a particular operation; concurrency is the system's ability to perform an operation for many requests at the same time, commonly measured in QPS (queries per second). These three performance metrics often constrain one another: a system that pursues high throughput usually finds it hard to achieve low latency, and when the average response time is long it is also hard to raise QPS.
  2. Availability: availability is the system's ability to keep providing correct service in the face of various anomalies. It can be measured by the ratio of the time the system is out of service to the time it is in service, or by the ratio of failed requests to successful ones for a given operation. Availability is an important metric of a distributed system; it measures the robustness of the system and reflects its fault tolerance.
  3. Scalability: scalability is the ability of a distributed system to improve performance (throughput, latency, concurrency), storage capacity, and computing capacity by adding machines to the cluster. A good distributed system strives for "linear scalability", where some metric of the system grows linearly with the number of machines in the cluster.
  4. Consistency: to improve availability, a distributed system inevitably uses replicas, which raises the problem of replica consistency. The stronger the consistency model, the easier the system is for users to reason about.

2 Principles of distributed systems

2.1 Data distribution mode

A distributed system, as the name implies, uses multiple computers working together to solve computation and storage problems that a single computer cannot handle. The biggest difference between a single-machine system and a distributed system lies in the scale of the problem, i.e. the amount of data to compute and store. To solve a single-machine problem with a distributed solution, the first issue is how to decompose the problem so that each machine in the distributed system is responsible for a subset of the original problem. How to partition the input data thus becomes the basic question of a distributed system.

Hashing

One way to distribute data is hashing: compute a hash of some data feature (for example, a user id) and map the hash value to machines, e.g. by taking it modulo the number of machines. The drawback of hash-based distribution is obvious: poor scalability. Once the cluster needs to grow, almost all data has to be migrated and redistributed. In engineering, when a hash-distributed system is expanded, the cluster size is therefore often doubled, and the hashes are recomputed so that each machine only needs to migrate half of its data to one corresponding new machine.
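
To make the rescaling cost concrete, the following sketch (with made-up keys) maps each key to a machine by hashing a feature and taking it modulo the cluster size, then counts how many keys move when one machine is added versus when the cluster is doubled.

```python
import hashlib

def machine_for(key: str, cluster_size: int) -> int:
    """Hash-distribute a key: machine = hash(feature) mod number of machines."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % cluster_size

keys = [f"user-{i}" for i in range(10000)]

def moved_fraction(old_n: int, new_n: int) -> float:
    moved = sum(machine_for(k, old_n) != machine_for(k, new_n) for k in keys)
    return moved / len(keys)

# Growing from 10 to 11 machines remaps roughly 90% of the keys,
# while doubling from 10 to 20 moves only about half of them.
print("10 -> 11 machines:", moved_fraction(10, 11))
print("10 -> 20 machines:", moved_fraction(10, 20))
```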

One idea for addressing the poor scalability of hashing is, instead of simply mapping hash values to machines by division, to treat the correspondence between hash values and machines as metadata managed by a dedicated metadata server. The number of hash buckets is made larger than the number of machines, so one machine serves the data of several buckets and buckets can be reassigned when the cluster grows. However, this requires maintaining a potentially large amount of metadata with a more complex mechanism. Another drawback of hash-based distribution is that "data skew" easily occurs once the values of the chosen feature are severely unevenly distributed.


Distribution by data range

Distribution by data range is another common scheme: data is divided into different intervals according to the range of its feature values, so that each server (or server group) in the cluster handles data in a different interval.

In engineering, to make load-balancing operations such as data migration easier, dynamic interval partitioning is often used so that the amount of data served in each interval stays roughly equal. When the amount of data in an interval grows too large, the interval is "split" into two, keeping the data in each interval under a relatively fixed threshold.
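
A minimal sketch of dynamic interval partitioning follows; the split points, server names, and the rule that the upper half of a split interval goes to a new server are illustrative assumptions, and in a real system this table would live on the metadata server.

```python
import bisect

# Hypothetical range table: sorted upper bounds of the intervals and the
# server (group) responsible for each interval.
split_points = [1000, 5000, 9000]   # intervals: (-inf,1000) [1000,5000) [5000,9000) [9000,+inf)
servers      = ["srv-A", "srv-B", "srv-C", "srv-D"]

def server_for(key: int) -> str:
    """Route a key to the server owning the interval that contains it."""
    return servers[bisect.bisect_right(split_points, key)]

def split_at(mid: int, new_server: str) -> None:
    """Split the interval containing `mid`: the lower half keeps its server,
    the upper half is assigned to `new_server`."""
    i = bisect.bisect_right(split_points, mid)   # index of the interval containing mid
    split_points.insert(i, mid)
    servers.insert(i + 1, new_server)

print(server_for(2500))                  # srv-B
split_at(3000, "srv-E")                  # interval [1000, 5000) grew too large; split it
print(server_for(2500), server_for(4000))  # srv-B srv-E
```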

In general, a dedicated server is needed to maintain the data distribution information in memory; this kind of information is a form of meta-information. For large-scale clusters the meta-information itself can become so large that a single machine cannot hold and maintain it alone, and multiple machines must be used as meta-information servers.

Distribution by data volume

Distribution by data volume is independent of any specific data feature. Instead, the data is viewed as a sequentially growing file that is divided into chunks of a fixed size, and the chunks are distributed to different servers. As with distribution by data range, distribution by data volume requires recording which servers hold which chunks and managing this mapping as metadata on a metadata server.
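
A minimal sketch, assuming a 64 MB chunk size and a plain dictionary standing in for the metadata server, of how an offset is mapped to a chunk and then to the servers holding it:

```python
CHUNK_SIZE = 64 * 2**20            # assumed 64 MB per chunk

# Hypothetical metadata kept by a metadata server: chunk index -> server list.
chunk_table = {}

def chunk_index(offset: int) -> int:
    """Data is viewed as one sequentially growing file cut into fixed-size chunks."""
    return offset // CHUNK_SIZE

def locate(offset: int):
    """Look up which servers hold the chunk containing `offset`."""
    return chunk_table.get(chunk_index(offset), [])

# A write at offset 200 MB lands in chunk 3 (offsets 192-256 MB).
chunk_table[chunk_index(200 * 2**20)] = ["srv-7", "srv-12", "srv-31"]
print(locate(200 * 2**20))
```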

Because it is unrelated to the actual data content, distribution by data volume generally does not suffer from data skew; data is always evenly split across the cluster. When a cluster needs rebalancing, only whole data blocks have to be migrated, and cluster expansion is not constrained: it suffices to move some data blocks to the newly added machines. The drawback of distributing by data volume is the relatively complex meta-information that must be managed, similar to distribution by range; when the cluster is large, the meta-information also becomes large, and managing it efficiently becomes a problem in its own right.

Consistent hashing

Consistent hashing is another data distribution method widely used in engineering. It was originally used as a data distribution algorithm in distributed hash tables (DHTs) in P2P networks. The basic form of consistent hashing uses a hash function whose output range forms a closed ring: the maximum output value of the hash function wraps around to the minimum value. Nodes are placed at random positions on the ring, and each node is responsible for the data whose hash values fall in the range going clockwise from its own position to the next node.

Consistent hashing requires the positions of the nodes on the ring to be managed as meta-information, which is more complex than plain hash distribution. However, a node's position is related only to the number of machines in the cluster, so the amount of meta-information is usually much smaller than with distribution by data range or by data volume.

For this reason, a common improvement introduces virtual nodes. Many virtual nodes are created when the system starts, their number generally far larger than the number of machines the cluster will ever contain, and they are placed evenly on the consistent-hashing ring; their role is the same as that of nodes in the basic algorithm. Each real node is assigned several virtual nodes. When operating on data, the hash of the data is first used to find the corresponding virtual node on the ring, and the metadata is then consulted to find the corresponding real node. Using virtual nodes has several advantages. First, when a node becomes unavailable, several virtual nodes become unavailable with it, so the resulting load is spread over several adjacent real nodes instead of a single neighbour. Likewise, when a new node joins, it can be assigned several virtual nodes and thus take over load from several existing nodes, making load balancing during expansion easier to achieve from a global point of view.
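
The sketch below implements the ring with virtual nodes, assuming MD5 as the hash function, 100 virtual nodes per physical node, and the convention from the text that a node owns the range going clockwise from its own position to the next node; node names are illustrative.

```python
import bisect
import hashlib

def h(value: str) -> int:
    """Hash onto a ring of 2**32 positions (the maximum wraps around to 0)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Each physical node is mapped to several virtual nodes on the ring;
    a key belongs to the virtual node whose range contains the key's hash."""
    def __init__(self, nodes, vnodes_per_node=100):
        self.ring = []                       # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes_per_node):
                self.ring.append((h(f"{node}#vn{i}"), node))
        self.ring.sort()
        self.positions = [pos for pos, _ in self.ring]

    def node_for(self, key: str) -> str:
        # Find the virtual node whose range [position, next position) contains
        # the key, i.e. the closest virtual node counter-clockwise, with wrap-around.
        i = bisect.bisect_right(self.positions, h(key)) - 1
        return self.ring[i][1]               # i == -1 wraps to the last ring entry

ring = ConsistentHashRing(["srv-A", "srv-B", "srv-C"])
print(ring.node_for("user-42"))
```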

Replicas and data distribution

The basic means of fault tolerance and of improving availability in a distributed system is the use of replicas, and how replicas are distributed mainly affects the system's scalability. A basic replica distribution strategy is machine-based: several machines are replicas of one another and hold identical data. This strategy works with all of the data distribution schemes above. Its advantage is simplicity; its disadvantages are low data-recovery efficiency and poor scalability.

Rather than using whole machines as the unit of replication, it is more appropriate to split the data into reasonably sized data segments and replicate at segment granularity. In practice the segments are usually kept close to equal in size and below some limit. Data segments go by many names: segment, fragment, chunk, partition, and so on. The choice of segments is directly tied to the data distribution scheme: for hash-based distribution, each hash bucket can serve as a segment, and to control segment size the number of buckets is usually made larger than the cluster size. Once data is divided into segments, replicas can be managed per segment, so replicas are no longer tied to machines and each machine can hold replicas of many different segments.

Once replica distribution is decoupled from machines, recovery after data loss becomes very efficient: if a machine loses its data, the replicas of its segments are spread over all the machines in the cluster rather than over a few fixed mirror machines, so the data can be copied back from the whole cluster in parallel, with each source machine contributing only a small amount of bandwidth. Even if every recovery source is limited to 1 MB/s, 100 participating machines give a recovery rate of 100 MB/s. Machine-independent replica distribution also helps cluster fault tolerance: when a machine goes down, the load it carried is naturally spread across the whole cluster, because its replicas are scattered everywhere. Finally, it helps cluster scaling: in theory, if the cluster has N machines and a new one is added, the new balance can be reached by migrating 1/N − 1/(N+1) of the data segments from each machine to the new machine, and because the data comes from all machines, this migration is also efficient. In engineering, fully per-segment replication increases the amount of metadata to manage and the difficulty of replica maintenance; a compromise is to group several segments into a segment group and manage replicas at group granularity, which keeps the replica granularity within a reasonable range.

Localized computing

In a distributed system, the distribution of data also deeply influences the distribution of computation. The compute node and the storage node holding its input data may reside on the same physical machine or on different ones; if they are on different machines, the data must be transferred over the network, which is costly and may make network bandwidth the bottleneck of the whole system. The alternative is to try to schedule computation onto the machines that store the data, which is called localized computation. Localized computation is an important optimization of computation scheduling and embodies a key idea of distributed scheduling: moving computation is cheaper than moving data.

Selection of data distribution mode

In engineering practice, the data distribution scheme should be chosen according to the requirements and the implementation complexity. Moreover, the schemes can be combined flexibly, often gaining the advantages of several approaches and achieving a better overall result.

For example, data skew can be addressed by combining hash distribution with distribution by data volume. Data is routed by the hash of the user id, but when the data volume of a particular user id becomes very large, that user's data would otherwise always land on a single machine. In that case the user's data is further divided, by volume, into multiple evenly sized segments based on a threshold, and these segments are spread across the cluster. Since most users do not exceed the threshold, the metadata only has to record the segment distribution of the users that do, which keeps the metadata small. This combination of hash distribution and distribution by data volume has been used in a real system with good results.
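
A sketch of the combined scheme follows; the machine count, the per-user threshold, and the oversized_users metadata table are illustrative assumptions.

```python
import hashlib

NUM_MACHINES = 16
SKEW_THRESHOLD = 1 * 2**30        # assumed 1 GB per-user threshold

# Metadata only for "oversized" users: user_id -> list of (segment_index, machine).
oversized_users = {
    "user-heavy": [(0, 3), (1, 9), (2, 14)],
}

def machine_for(user_id: str, segment_hint: int = 0) -> int:
    """Normal users are routed purely by hash; users whose data volume exceeded
    the threshold were split into segments recorded in the metadata table."""
    if user_id in oversized_users:
        segments = oversized_users[user_id]
        return segments[segment_hint % len(segments)][1]
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return digest % NUM_MACHINES

print(machine_for("user-123"))          # plain hash routing
print(machine_for("user-heavy", 2))     # routed via per-segment metadata
```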

2.2 Basic replica control protocols

A replica control protocol is a distributed protocol that controls the reading and writing of replica data according to a specific protocol flow, so that the replicas meet certain availability and consistency requirements. The replica control protocol must tolerate abnormal states so that the system retains a degree of availability, and at the same time it must provide a certain level of consistency. As the CAP principle (discussed in detail in Section 2.9) shows, no replica protocol can both satisfy strong consistency and remain available under arbitrary network failures. Practical replica control protocols are therefore always a trade-off among availability, consistency, and performance, made according to the specific requirements.

Replica control protocols can be divided into two main classes: centralized replica control protocols and decentralized replica control protocols.

Centralized replica control protocols

The basic idea of a centralized replica control protocol is that a central node coordinates the updating of replica data and maintains consistency among the replicas. The figure shows the general architecture of a centralized replica protocol. The advantage of this approach is that the protocol is relatively simple: all control related to the replicas is handled by the central node, and concurrency control is also performed there, so a distributed concurrency-control problem is reduced to a single-machine one. Concurrency control here means resolving conflicts such as write-write and read-write when multiple nodes need to modify the replicated data at the same time; on a single machine this is usually done with locks. Locking is also common for distributed concurrency control, but without centralized lock management a fully distributed lock system is needed, which makes the protocol much more complex. The drawback of centralized replica control is that the availability of the system depends on the central node: when the central node fails or communication with it is interrupted, the system loses some services (usually at least the update service), so the weakness of the centralized approach is a certain amount of downtime tied to the central node.

The primary-secondary protocol

In a primary-secondary protocol there are two kinds of replicas: exactly one replica is the primary and all other replicas are secondaries. The node holding the primary replica acts as the central node; it is responsible for data updates, concurrency control, and coordinating replica consistency.

Primary-secondary protocols generally have to solve four major problems: how data is updated, how data is read, how the primary replica is determined and switched, and how data is synchronized.

Basic data update process (a minimal sketch of this flow follows the list)
  1. Data updates are all coordinated by the primary node.
  2. An external node sends the update operation to the primary node.
  3. The primary node performs concurrency control, i.e. determines the order of concurrent update operations.
  4. The primary node sends the update operation to the secondary nodes.
  5. The primary decides whether the update succeeded based on the completion status of the secondary nodes and returns the result to the external node.
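
Below is a minimal, failure-free sketch of this update flow, with synchronous calls and in-memory node objects standing in for real RPCs; treating "all secondaries acknowledged" as the success criterion is one possible policy, not the only one.

```python
class Secondary:
    def __init__(self):
        self.data = {}

    def apply(self, seq, key, value):
        """Apply one update in the order decided by the primary and acknowledge it."""
        self.data[key] = value
        return True

class Primary:
    def __init__(self, secondaries):
        self.data = {}
        self.secondaries = secondaries
        self.seq = 0                     # concurrency control: one global update order

    def update(self, key, value):
        self.seq += 1                    # step 3: the primary decides the order
        acks = [s.apply(self.seq, key, value) for s in self.secondaries]   # step 4
        if all(acks):                    # step 5: decide success from secondary acks
            self.data[key] = value
            return "success"
        return "failure"

primary = Primary([Secondary(), Secondary()])
print(primary.update("k1", "v1"))        # step 2: an external node's update request
```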

In engineering practice, if the primary sends the data directly to N other replicas simultaneously, the update throughput of each secondary is limited by the primary's total outbound network bandwidth, at most 1/N of it. To solve this problem, some systems (for example GFS) relay the data: the primary sends the update to the first secondary, the first secondary forwards it to the second, and so on.

How data is read

How data is read is also closely tied to consistency. If only eventual consistency is required, reading any replica is enough. If session consistency is required, replicas can carry version numbers that are incremented on every update, and the version number is checked when a user reads a replica to ensure that the data the user reads increases monotonically within the session. The hard case for primary-secondary is achieving strong consistency.

  1. Since the update flow is controlled by the primary, the data on the primary replica is always the newest, so strong consistency can be achieved by always reading only the primary replica. If only the primary is read, the secondaries provide no read service. In practice, however, if replicas are maintained per data segment rather than bound to machines, letting only primary replicas serve reads does not waste machine resources in many scenarios.

Because the segments' replicas are spread across the cluster and the primary of each segment is also determined more or less randomly, every machine holds the primary replicas of some segments and secondary replicas of others, so every server in fact serves both reads and writes.

  2. The primary controls the availability of the secondary nodes. When the primary fails to update a secondary replica, it marks that secondary as unavailable, so users no longer read from it. An unavailable secondary can keep trying to synchronize with the primary and is marked available again once it has caught up. In this way every available replica, whether primary or secondary, is readable, and within a bounded time a secondary is either brought up to the same state as the primary or marked unavailable, which satisfies a high consistency requirement. This approach relies on a central metadata service to record which replicas are available and which are not; in a sense it improves the system's consistency by reducing its availability.

Primary replica determination and switchover

In primary-secondary protocols, another core issue is how the primary replica is determined. In particular, some mechanism is needed to switch the primary, for example when the machine hosting the original primary fails, so that a secondary replica becomes the new primary.

Typically, in a primary-secondary distributed system, the information about which replica is the primary is meta-information maintained by a dedicated metadata server. When an update is performed, the metadata server is first queried for the primary of the replica group, and the update then proceeds along the data update flow described above.

In a distributed system, reliably detecting a node anomaly takes a certain detection time, usually on the order of 10 seconds. That means that once the primary fails, it may take up to about 10 seconds before the failure is detected and the primary switchover can begin. During those 10 seconds the system has no primary and cannot serve updates, and if the system only reads from the primary it cannot even serve reads during that period. The main weakness of the primary-secondary protocol is therefore the downtime caused by primary switchover.

Data synchronization

Inconsistent secondary replicas need to be reconciled, i.e. synchronized with the primary.

There are generally three forms of inconsistency: 1. due to anomalies such as network partitions, the data on a secondary lags behind the data on the primary; 2. under some protocols the data on a secondary may be dirty and must be discarded; dirty data arises when a secondary performed a modification that the primary never committed, leaving the secondary's data incorrect; 3. the secondary is a newly added replica holding no data at all and must copy data from other replicas.

For the first case, a secondary that has fallen behind, the usual approach is to replay the primary's operation log (typically the redo log) to catch up with the primary's progress. For dirty data, the best practice is to design the distributed protocol so that dirty data cannot arise, or so that the probability of producing it is extremely low; then, when dirty data does appear, the affected replica can simply be discarded, which reduces the situation to a replica with no data at all. Alternatively, undo-log-based methods can be designed to remove the dirty data. For a secondary with no data at all, the usual approach is to copy the data of the primary replica directly, which is often much faster than replaying logs to catch up. However, the primary must continue to serve updates while the copy is in progress, which requires the primary to support snapshots: a snapshot of the replica data is taken at some point in time, the snapshot is copied, and the updates made after the snapshot was taken are then caught up through log replay.

Decentralized replica control protocols

In a decentralized replica control protocol there is no central node: all nodes are fully equal peers and reach agreement through negotiation among equals, so the decentralized protocol does not suffer from the service interruptions caused by a failed central node.

The biggest drawback of decentralized protocols is that their flows tend to be complex, especially when strong consistency is required, in which case the protocol becomes both intricate and hard to understand. Because of this complexity, the efficiency and performance of decentralized protocols are generally lower than those of centralized ones. A somewhat imperfect analogy: a centralized replica control protocol resembles an autocratic system, efficient but highly dependent on the central node, so that an anomaly of the central node affects the whole system; a decentralized protocol resembles a democratic system, less efficient because of collective negotiation among nodes, but an anomaly of an individual node has little impact on the system as a whole.

2.3 Lease mechanism

Lease is the most important distributed protocol and is widely used in various distributed systems.

A lease-based distributed cache system

The background of the basic problem is as follows: in a distributed system there is a central server node that stores and maintains some data, which is the metadata of the system. The other nodes in the system access the central server to read and modify the metadata. Because every operation in the system depends on the metadata, the central server becomes a performance bottleneck if every metadata read has to go to it. A metadata cache is therefore designed: metadata is cached on each node to reduce accesses to the central server and improve performance. On the other hand, correct operation of the system strictly depends on correct metadata, so the cached data on every node must always agree with the data on the central server and must never be stale, dirty data. Finally, the cache design should handle anomalies such as node crashes and network interruptions as well as possible and maximize the availability of the system.

A cache system is therefore designed using the lease mechanism; its basic principle is as follows. Whenever the central server sends data to a node, it also issues a lease for that data. Each lease has an expiration time, much like the expiry date on a credit card: the lease expires at a specific point in time, for example 12:00:10, after which it is invalid. The lease's validity is therefore independent of when the node receives it; it may even already be expired on arrival. Assume for now that the clocks of the central server and of all nodes are synchronized; the effect of clock skew on leases is discussed in the next section. The lease issued by the central server means that, within the lease's term, the server promises not to change the value of the corresponding data. On receiving the data and the lease, a node therefore adds the data to its local cache; once the lease expires, the node removes the data from the cache. When the central server wants to modify the data, it first blocks all new read requests, waits until all previously issued leases for that data have expired, and only then changes the value.

Process: a client node reads metadata based on the lease cache

  1. Check whether the metadata is in the local cache with a valid lease.
     1.1 If yes, return the metadata from the cache.
     1.2 If no (not cached, or the lease has expired), request the metadata from the central server node.
     1.2.1 On receiving the read request, the server returns the metadata together with a corresponding lease.
     1.2.2 Did the client receive the server's reply successfully?
     1.2.2.1 Failure or timeout: exit the process; the read fails and may be retried.
     1.2.2.2 Success: record the metadata together with its lease in local memory and return the metadata (a client-side sketch of this read path follows the list).
  2. Process: a client node modifies metadata based on the lease cache.
     2.1 A node sends a modification request to the server.
     2.2 On receiving the request, the server blocks all new read requests, i.e. it accepts them but does not return data.
     2.3 The server waits until all leases previously issued for this metadata have expired.
     2.4 The server modifies the metadata and returns success to the requesting node.
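
The following is a minimal client-side sketch of the read path above, assuming a hypothetical fetch_from_server RPC that returns the value together with the lease expiry time granted by the central server; the server-side wait-for-expiry step before a modification is only indicated in a comment.

```python
import time

class LeaseCache:
    """Client-side metadata cache driven by leases issued by a central server.
    `fetch_from_server` is a hypothetical RPC returning (value, lease_expiry)."""

    def __init__(self, fetch_from_server):
        self.fetch = fetch_from_server
        self.cache = {}                  # key -> (value, lease_expiry)

    def read(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]              # 1.1 cached and lease still valid
        value, lease_expiry = self.fetch(key)      # 1.2 go to the central server
        self.cache[key] = (value, lease_expiry)    # 1.2.2.2 cache value with its lease
        return value

# A toy central server: it promises not to change `key` before the lease expires.
# Before actually modifying store[key], such a server must wait until every
# lease it has issued for that key has expired.
LEASE_SECONDS = 10
store = {"quota": 100}

def server_fetch(key):
    return store[key], time.time() + LEASE_SECONDS

cache = LeaseCache(server_fetch)
print(cache.read("quota"))               # first read goes to the server
print(cache.read("quota"))               # second read is served from the local cache
```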

The mechanism above ensures that the cache on each node is consistent with the data on the central server. This is because whenever the central server sends out data it grants a corresponding lease, and during the lease term the server does not modify the data, so a client node can safely cache it for that term. The key to the lease mechanism's fault tolerance is this: once the server has sent out the data and lease, it does not matter whether the client actually received them, whether the client subsequently crashes, or whether the network later fails; as long as the server waits for the lease to expire, it can be sure that no client node still caches the data, and it can then modify the data without breaking cache consistency.

The basic procedure above has some performance and availability problems, but it can easily be optimized. Optimization 1: while modifying metadata, the server blocks all new read requests, so no read service is available. This is done to prevent new leases from being issued, which would let new client nodes hold leases and cache the data, creating a form of livelock in which the modification can never proceed. The fix is simple: once the server enters the modification flow, it answers read requests with the data only, without issuing a lease, so clients can still read the metadata during the modification but cannot cache it. A further optimization: when entering the modification flow, the server caps the validity of any lease it issues at the latest expiry time among the leases it has already issued. In this way clients can continue to cache the metadata after the server has entered the modification flow, while the server's wait for all leases to expire is not extended by newly issued leases.

Finally, note that the cache mechanism is different from the multi-replica mechanism. They are similar in that both store a piece of data on multiple nodes, but the cache mechanism is much simpler: cached data can be deleted or discarded at any time, and the only consequence of a cache miss is having to go back to the data source. A replica, by contrast, cannot be discarded at will; every lost replica degrades the quality of service.

Lease mechanism analysis

A lease is a promise made by an issuer that holds for a certain period of time. Once the issuer has issued a lease, it must keep the promise as long as the lease has not expired, regardless of whether the recipient actually received it and regardless of what state the recipient is in afterwards. The recipient, on the other hand, may rely on the issuer's promise during the lease term, but must stop relying on it once the lease expires.

The lease mechanism is highly fault tolerant. First, by introducing a validity period, it tolerates network anomalies well. Issuing a lease depends only on one-way communication from issuer to recipient: even if the recipient cannot send messages back to the issuer, lease issuance is unaffected. Because a lease is defined by an absolute expiry time, its meaning does not depend on when it was sent, so the same lease can be sent repeatedly; if sending occasionally fails, the issuer simply resends it. Once a lease has been accepted by the recipient, the mechanism no longer depends on network communication at all, and even a complete network outage does not affect it. Second, the lease mechanism tolerates node crashes well. If the issuer crashes, it is in no position to change its earlier promise, so the lease's correctness is not affected. If the issuer recovers its lease information after restarting, it can continue to honour its promises; if it cannot recover that information, it only needs to wait one maximum lease term, so that all previously issued leases have expired, and the mechanism remains intact.

For example, in the cache system of the previous section, if the server crashes, the metadata is certainly not modified; after the server recovers, it only needs to wait one maximum lease term, after which the cached information on all nodes is guaranteed to have been cleared. If the recipient crashes, the issuer needs no extra fault handling: it just waits for the lease to expire and then takes back its promise, in practice reclaiming whatever permission or identity it had granted. Finally, the lease mechanism does not depend on persistent storage. The issuer may persist the leases it has issued so that they remain valid across a restart, but this is only an optimization; as analysed above, even without persisting them, the issuer can wait one maximum lease term to invalidate all previously issued leases and preserve the correctness of the mechanism.

The lease mechanism depends on expiry times, which requires the clocks of issuer and recipient to be synchronized. If the issuer's clock is slower than the recipient's, the issuer may still consider a lease valid after the recipient has already treated it as expired; the recipient can handle this by requesting a new lease before expiry. If the issuer's clock is faster than the recipient's, the issuer may consider a lease expired while the recipient still treats it as valid; the issuer might then issue the lease to another node, breaking its promise and the correctness of the system. For this kind of clock skew, the usual practice is to make the issuer's notion of the validity period slightly longer than the recipient's, by more than the possible clock error.

Using the lease mechanism to determine node state

Distributed protocols rely on a globally consistent view of node state: once node Q considers node A to be failed, node A must also consider itself failed, so that A stops acting as the primary and the "dual primary" problem is avoided. There are two ways to deal with this. The first is to design the distributed protocol to tolerate "dual primary" errors, i.e. not to depend on a globally consistent view of node state, or to make the global view the result of negotiation among all nodes. The second is to use the lease mechanism. The first approach means abandoning centralized designs in favour of decentralized ones, which is beyond the scope of this section; below we focus on using the lease mechanism to determine node state.

The central node issues leases to the other nodes, and a node may serve normally only while it holds a valid lease. As in example 2.3.1, nodes A, B, and C periodically send a heartbeat to report their state; on receiving a heartbeat, node Q issues a lease confirming the node's state and allowing it to work normally for the lease term. Node Q can give the primary node a special lease indicating that it may act as primary. When Q wants to switch to a new primary, it only needs to wait for the previous primary's lease to expire; it can then safely issue a new lease to the new primary without creating a "dual primary" problem.
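
To make the primary-switch rule concrete, here is a minimal sketch of a lease issuer in the role of node Q. The class name, the 10-second term, and the heartbeat-driven renew call are assumptions for illustration, not a prescribed design.

```python
import time

class LeaseIssuer:
    """Central node Q: grants the primary role as a lease and never re-grants it
    to another node until the previous lease has expired (no 'dual primary')."""

    LEASE_SECONDS = 10                   # a commonly used, assumed value

    def __init__(self):
        self.primary = None
        self.lease_expiry = 0.0

    def grant_primary(self, node_id):
        now = time.time()
        if self.primary not in (None, node_id) and now < self.lease_expiry:
            return None                  # another node's lease is still valid: refuse
        self.primary = node_id
        self.lease_expiry = now + self.LEASE_SECONDS
        return self.lease_expiry         # node_id may act as primary until this time

    def renew(self, node_id):
        """Called when a heartbeat arrives from the current primary."""
        if node_id == self.primary:
            self.lease_expiry = time.time() + self.LEASE_SECONDS
            return self.lease_expiry
        return None

q = LeaseIssuer()
print(q.grant_primary("A"))              # A becomes primary
print(q.grant_primary("B"))              # refused while A's lease is still valid
```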

In a real system, having a single central node issue leases is risky: if that node crashes or the network fails, no node holds a lease and the whole system becomes unavailable. Real systems therefore use a small cluster of central nodes that are replicas of one another; this small cluster is highly available and provides the lease-issuing function. Both Chubby and ZooKeeper are based on this design.

Choosing the lease validity period

In engineering, the lease term is commonly chosen to be 10 seconds, a value validated by practice that can serve as a reference when selecting an appropriate term for a real system.

2.4 The Quorum mechanism

First agree on the following convention: an update operation (write) is a sequence of serial operations whose order is determined by some other mechanism (for example, by the primary in a primary-secondary architecture). Each update operation is denoted wi, where i is the monotonically increasing sequence number of the operation. After wi executes successfully, the replica data changes, and the resulting data version is denoted vi. Assume that every replica keeps the data of all historical versions.

write-all-read-one

Write-all-read-one (WARO) is the simplest replica control rule. As the name implies, a write goes to all replicas, and an update is considered successful only when it has succeeded on every replica.

Because an update can succeed only when it succeeds on all N replicas, the failure of any single replica makes updates fail and the update service unavailable; with N replicas, the update service cannot tolerate even one replica failure. On the other hand, as long as one of the N replicas is healthy the system can serve reads, so the read service can tolerate N-1 replica failures. Thus the availability of WARO reads is high, but the availability of updates is low: even though replicas are used, the availability of the update service is effectively the same as with no replicas at all.

Definition of Quorum

Under the Quorum mechanism, an update operation wi is called a "successfully committed update" once it has succeeded on W of the N replicas, and the corresponding data is called "successfully committed data". Choose R such that W + R > N, i.e. R > N - W. Because wi succeeded on only W replicas, R replicas must be read to be sure of seeing the updated data vi: since W + R > N, any set of R replicas must intersect the set of W replicas on which the update succeeded, so reading R replicas is guaranteed to include vi. As shown in Figure 2-10, the principle of the Quorum mechanism can be illustrated with a Venn diagram.

For example, a system has five replicas with W=3 and R=3. Initially the data on all five replicas is consistent, all at v1. An update w2 then succeeds on the first three replicas, so the replica states become (v2 v2 v2 v1 v1); at this point any set of three replicas must contain v2. By the definition above, setting W=N, R=1 yields WARO, so WARO is a special case of Quorum. As with WARO, we can analyse the availability of the Quorum mechanism; restrict the parameters to W+R=N+1. An update can succeed only when it succeeds on W replicas, so once N-W+1 replicas fail, the update cannot succeed on any W replicas and the update service is unavailable. On the other hand, once N-R+1 replicas fail, it can no longer be guaranteed that a readable set of R replicas intersects the W updated replicas, and the consistency of the read service degrades.
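
A small simulation can illustrate the overlap argument. The replica layout, version tuples, and random choice of W and R replicas below are illustrative assumptions; the point is only that with W + R > N a read always intersects the replicas of the last committed write.

```python
import random

N, W, R = 5, 3, 3                        # W + R > N guarantees read/write overlap

# Each replica stores a list of (version, value) pairs; all start at v1.
replicas = [[(1, "v1")] for _ in range(N)]

def write(version, value):
    """An update is successfully committed once it reaches any W replicas."""
    acked = random.sample(range(N), W)
    for i in acked:
        replicas[i].append((version, value))
    return acked

def read():
    """Reading any R replicas must intersect the W replicas of the last commit,
    so the highest version seen among them includes the committed data."""
    chosen = random.sample(range(N), R)
    return max(replicas[i][-1] for i in chosen)

write(2, "v2")                           # commits on 3 of the 5 replicas
print(read())                            # always sees (2, 'v2') because W + R > N
```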

To emphasize: the Quorum mechanism alone does not guarantee strong consistency, because by itself it cannot determine the latest successfully committed version number; determining that version is difficult unless the latest committed version number is managed as metadata by a dedicated metadata server or metadata cluster. The next subsection discusses the circumstances under which the latest successfully committed version can be determined through the Quorum mechanism alone.

The three parameters N, W, and R of the Quorum mechanism control the availability of the system and are also part of the system's service promise to its users: there are at most N replicas of the data, and success is returned to the user once W replicas have been updated. For a Quorum system with high consistency requirements, the system should also promise never to return uncommitted data at any time, i.e. to return only data that has succeeded on W replicas.

Reading the latest successfully committed data

The Quorum mechanism only requires an update to succeed on W of the N replicas, and reading R replicas is then guaranteed to include the latest successfully committed data. But because there may also be updates that did not commit successfully, reading R replicas alone does not always make clear which version is the latest committed one. For a strongly consistent Quorum system, the rule is: take the highest version among the R replicas read; if it appears on fewer than W of them, say on X replicas, continue reading other replicas. If W replicas with that version are eventually found, it is the latest successfully committed data; if it is certain that fewer than W replicas in the whole system hold that version, then the second-highest version among the R replicas is the latest successfully committed one. For example, after reading (v2 v1 v1), the remaining replicas are read: if they are (v2 v2), then v2 is the latest committed version; if they are (v2 v1) or (v1 v1), then v1 is the latest successfully committed version; and if reading the remaining two replicas times out or fails, it is impossible to determine which version is the latest successfully committed one.

As can be seen, when the Quorum mechanism is used on its own, determining the latest successfully committed version may require reading up to all N replicas (the initial R plus, in the worst case, every remaining replica), and the ability to determine it may be lost whenever any replica is unavailable. In practice, other techniques should therefore be used to avoid having to read the latest committed version through Quorum alone; for example, when Quorum is combined with the primary-secondary control protocol, the latest committed data can be read simply by reading the primary.

Selecting the primary replica with the Quorum mechanism

Data can be read in different ways according to the consistency requirement: if strong consistency is required and the latest successfully committed data must be readable immediately, one can either read only the primary replica or read using the method described above; if session consistency is required, reads can be directed to individual replicas selectively, based on the version number previously read; if only weak consistency is required, any replica can be read.

In the primary-secondary protocol, when the primary fails a new primary must be chosen, after which the secondary replicas synchronize with it. Normally the choice of the new primary is made by a central node. With the Quorum mechanism, the usual way of selecting the primary is analogous to reading data: the central node reads R replicas and chooses the replica with the highest version number among them as the new primary; the new primary then synchronizes data with at least W replicas before serving reads and writes as the primary. First, the replica with the highest version among the R read is guaranteed to contain the latest successfully committed data. Second, although it cannot be known whether that highest version was itself successfully committed, once the new primary has synchronized it to W replicas, that version's data becomes successfully committed data.

For example, in a system with N=5, W=3, R=3, suppose that at some moment the replicas' highest versions are (v2 v2 v1 v1 v1). Then v1 is the system's latest successfully committed data, while v2 is uncommitted intermediate data. Suppose the original primary fails at this moment and the central node performs a primary switchover. Whether this "intermediate" data is removed as dirty data or becomes effective as new data after synchronization depends entirely on whether it takes part in the election of the new primary. Consider the two cases separately.

First, as shown in Figure 2-12, if the central node successfully contacts three replicas and reads versions (v1 v1 v1), then any of them may be chosen as the primary. The new primary treats v1 as the latest successfully committed version and synchronizes the other replicas to it. When synchronizing with the first and second replicas, whose versions are higher than the primary's, their v2 data is dirty and can be handled as described in Section 2.2.2.4. In practice the new primary may also first synchronize with the other two replicas and start serving data, and later perform an update of its own that again produces a version numbered v2; if the system cannot guarantee that this later v2 is identical to the earlier v2, then when synchronizing with the first and second replicas the new primary must compare not only the version numbers but also the actual content of the update operations.

Second, if the central node successfully contacts three replicas and reads versions (v2 v1 v1), it selects the replica with version v2 as the new primary. Once the new primary has completed data synchronization with two other replicas, the number of replicas holding v2 reaches W, v2 becomes the latest successfully committed version, and the new primary can provide normal read and write service.

2.5 Log Technology

Log technology is one of the main techniques for recovering from crashes, and it originated in database systems. Strictly speaking, logging is not a distributed-system technique, but in distributed-systems practice it is widely used for crash recovery, and systems such as BigTable even store their logs in a distributed system to further improve fault tolerance.

Redo Log and Check point

Consider the design of a high-speed single-machine query system that keeps all data in memory for fast queries, where each update operation modifies a small part of the data (such as one key in a key-value store). The problem is to use log technology to give this in-memory query system crash recovery. Unlike database transactions, in this problem model every successful update takes effect immediately; it is as if each database transaction contained a single update operation that is auto-committed at once.

  • Redo Log (Flow 2.5.1)
  1. Append the result of the update operation (for example, for Set K1=1, the record K1=1) to the disk log file.
  2. Modify the data in memory according to the update operation.
  3. Return success for the update.

As can be seen from the redo log flow, what is written to the log is the result of the update operation (although undo logs are not discussed in this article, this is one of the differences from an undo log), and appending sequentially to the log file is efficient on storage devices such as disks that favour sequential writes.

Crash recovery with the redo log is very simple: the log just has to be "replayed".

Flow 2.5.2: redo log crash recovery

  1. Read the result of every update operation in the log file from the beginning, and apply these results to modify the data in memory.

As the recovery flow shows, only updates that were written to the log file can be recovered after a crash. This is why, in the redo log update flow, the log file is written before the in-memory data is modified. If the in-memory data were updated first, a user could immediately read the updated data; if the system then crashed between modifying memory and writing the log, the last update could not be recovered, yet the user might already have read the updated value, causing an inconsistency.
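
Below is a minimal sketch of Flow 2.5.1 and the corresponding recovery for an in-memory key-value store. The file name redo.log, the JSON record format, and the fsync-per-update policy are assumptions chosen for clarity, not requirements of the technique.

```python
import json
import os

LOG_PATH = "redo.log"                    # an assumed file name

class KVStore:
    """In-memory key-value store with a redo log for crash recovery."""

    def __init__(self, log_path=LOG_PATH):
        self.log_path = log_path
        self.mem = {}
        self._replay()                   # crash recovery: replay the redo log
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Flow 2.5.2: apply every logged update result from the start of the file."""
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as f:
                for line in f:
                    rec = json.loads(line)
                    self.mem[rec["key"]] = rec["value"]

    def set(self, key, value):
        # Step 1: append the result of the update to the log *before* touching memory.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # Step 2: modify the in-memory data; step 3: the update is now successful.
        self.mem[key] = value

store = KVStore()
store.set("k1", 1)
recovered = KVStore()                    # a new process after a "crash"
print(recovered.mem)                     # {'k1': 1}
```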

  • Check point

Under the simplified model, the idea of checkpoint technology is to dump the in-memory data to disk in full, in a form that is easy to reload, thereby reducing the amount of log data that has to be replayed during crash recovery.

Process: checkpoint

  1. Record "Begin Check Point" in the log file.
  2. Dump the in-memory data to disk in a form that is easy to reload.
  3. Record "End Check Point" in the log file.

During the checkpoint process, data updates can continue according to Flow 2.5.1. Whether data updated during this process is included in the dump is implementation-dependent: for example, if k1 has value v1 when the checkpoint begins and is changed to v2 at some point during the checkpoint, the value of k1 dumped to disk may be either v1 or v2.

Process: crash recovery based on the checkpoint

  1. Load the data dumped to disk into memory.
  2. Scan the log file backwards from the end to find the last "End Check Point" record.
  3. Find the latest "Begin Check Point" record preceding that "End Check Point", and replay all update logs written after this record (a sketch of this recovery follows the list).
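
The following sketch, under the assumption of a JSON-encoded log with explicit begin_checkpoint / end_checkpoint marker records, shows why replay starts from the "Begin Check Point" record: updates made while the dump was in progress may or may not be in the dump, and replaying them again is harmless.

```python
import json

def recover(dump, log_lines):
    """Checkpoint recovery sketch: `dump` is the data dumped at the last completed
    checkpoint, `log_lines` is the redo log as a list of JSON records."""
    mem = dict(dump)                                   # 1. load the dumped data

    # 2. scan from the end for the last "end_checkpoint" marker.
    last_end = max(i for i, line in enumerate(log_lines)
                   if json.loads(line).get("marker") == "end_checkpoint")
    # 3. find the matching "begin_checkpoint" before it ...
    begin = max(i for i in range(last_end)
                if json.loads(log_lines[i]).get("marker") == "begin_checkpoint")
    # ... and replay every update logged since that point.
    for line in log_lines[begin + 1:]:
        rec = json.loads(line)
        if "marker" not in rec:
            mem[rec["key"]] = rec["value"]
    return mem

log = [json.dumps(r) for r in [
    {"key": "k1", "value": "v1"},
    {"marker": "begin_checkpoint"},
    {"key": "k1", "value": "v2"},          # updated while the dump was in progress
    {"marker": "end_checkpoint"},
    {"key": "k2", "value": "v1"},
]]
print(recover({"k1": "v1"}, log))          # {'k1': 'v2', 'k2': 'v1'}
```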

  • No Undo/No Redo log

Now suppose the data is kept on disk and a batch of updates consists of several update operations that must either all take effect at once or not take effect at all. The 0/1 directory (no-undo/no-redo log) technique addresses this case.

In the 0/1 directory technique there are two directory structures, called directory 0 and directory 1, and a third structure called the master record; the directory currently in use is called the active directory, and the master record records whether directory 0 or directory 1 is active. Directory 0 or directory 1 records the position of each data item in the log file. Data updates in the 0/1 directory scheme are always performed on the inactive directory, and just before the new data takes effect, the 0/1 value in the master record is flipped to switch the active directory.

Process: 0/1 directory data update process

  1. Copy the entire active directory to the inactive directory.
  2. For each update operation, a new log entry is created to record the value after the operation, and the location of the corresponding data in the inactive directory is changed to the location of the new log entry.
  3. Atomic modification of the master record: Reverses the values in the master record so that the inactive directory takes effect.

The 0/1 directory update flow is very simple; by switching the master record between the 0 and 1 directories, a whole batch of modifications takes effect atomically. The 0/1 directory reduces the atomicity of a batch of transactional operations to the atomic switch of the master record: since atomically modifying multiple records is generally hard while atomically modifying a single record is achievable, the difficulty of the problem is reduced. The idea of the 0/1 directory is widely used in engineering and is not limited to the flow above; it can be used to switch between two data structures in memory, or between two file directories on disk.
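
A minimal in-memory sketch of the 0/1 directory idea follows; the log here is a Python list standing in for the on-disk log file, and the master record is a single integer whose assignment plays the role of the atomic switch.

```python
class ZeroOneDirectory:
    """Sketch of the 0/1 directory technique: two directories map keys to log
    positions; the one-value master record says which directory is active."""

    def __init__(self):
        self.log = []                    # append-only log of values
        self.dirs = [{}, {}]             # directory 0 and directory 1
        self.master = 0                  # master record: index of the active directory

    def read(self, key):
        pos = self.dirs[self.master].get(key)
        return None if pos is None else self.log[pos]

    def batch_update(self, updates):
        """All updates in the batch take effect atomically, or none do."""
        inactive = 1 - self.master
        self.dirs[inactive] = dict(self.dirs[self.master])  # 1. copy active -> inactive
        for key, value in updates.items():                  # 2. log new values, repoint inactive dir
            self.log.append(value)
            self.dirs[inactive][key] = len(self.log) - 1
        self.master = inactive                               # 3. atomic switch of the master record

d = ZeroOneDirectory()
d.batch_update({"k1": "v1", "k2": "v1"})
d.batch_update({"k1": "v2"})
print(d.read("k1"), d.read("k2"))        # v2 v1
```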

2.6 Two-phase Commit protocol

The two-phase commit protocol is a classic strongly consistent centralized replica control protocol. Although it has many problems in engineering, studying it gives a good understanding of several typical issues in distributed systems.

Process description

Two-phase commit is a typical "centralized replica control" protocol. The participating nodes are of two kinds: a central coordinator node and N participant nodes, where each participant node manages one of the database replicas described in the background above.

The idea of two-phase commit is simple. In the first phase, the coordinator asks all participants whether they can commit the transaction (the participants are asked to vote), and every participant sends its vote to the coordinator. In the second phase, the coordinator decides, based on all the votes, whether the transaction can be committed globally, and informs all participants so they can carry out the decision. During a two-phase commit, participants may not change their vote. The protocol allows a global commit only if every participant agrees to commit; if any participant votes to abort, the transaction must be abandoned.

Process: Two-phase commit coordinator process

  1. Write local log begin_commit and enter WAIT state.
  2. Send a prepare message to all participants.
  3. Wait for and receive the participants' responses to the prepare message.
     3.1 If a vote-abort message is received from any participant:
         3.1.1 Write the local global-abort log.
         3.1.2 Send the global-abort message to all participants.
         3.1.3 Enter the ABORT state.
     3.2 If vote-commit messages are received from all participants:
         3.2.1 Write the local global-commit log and enter the COMMIT state.
         3.2.2 Send the global-commit message to all participants.
  4. Wait for and receive the participants' acknowledgements of the global-abort or global-commit message. Once acknowledgements have been received from all participants, write the local end_transaction log; the process ends.

Process: Two-phase commit participant process

  1. Write the local init log and enter the INIT state.
  2. After receiving the prepare message from the coordinator:
     2.1 If the participant can commit the transaction:
         2.1.1 Write the local ready log and enter the READY state.
         2.1.2 Send the vote-commit message to the coordinator.
         2.1.3 Wait for the coordinator's message.
               2.1.3.1 If the global-abort message is received from the coordinator: write the local abort log, enter the ABORT state, and send a global-abort acknowledgement message to the coordinator.
               2.1.3.2 If the global-commit message is received from the coordinator: write the local commit log, enter the COMMIT state, and send a global-commit acknowledgement message to the coordinator.
     2.2 If the participant cannot commit the transaction:
         2.2.1 Write the local abort log and enter the ABORT state.
         2.2.2 Send the vote-abort message to the coordinator.
         2.2.3 For this participant, the process ends.
         2.2.4 If the global-abort message is later received from the coordinator, the participant may still acknowledge it.
  3. Whenever a global-abort or global-commit message is received from the coordinator, the participant sends the corresponding acknowledgement message, even if its own process has already ended. (A minimal sketch of both processes follows.)
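
A minimal single-process sketch of the message flow in the two processes above is given below. It omits real networking, timeouts and durable logging; the `Coordinator` and `Participant` classes and their in-memory `log` lists are illustrative stand-ins, not a production implementation.

```python
# Minimal single-process sketch of the two-phase commit message flow described in
# the two processes above. Real networking, timeouts and durable logs are omitted.

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.log = []                                # stand-in for the local log

    def on_prepare(self):
        if self.can_commit:
            self.log.append("ready")
            return "vote-commit"
        self.log.append("abort")
        return "vote-abort"

    def on_decision(self, decision):                 # "global-commit" / "global-abort"
        self.log.append(decision)
        return "ack"

class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = ["begin_commit"]                  # enter WAIT state

    def run(self):
        votes = [p.on_prepare() for p in self.participants]          # phase 1
        decision = ("global-commit" if all(v == "vote-commit" for v in votes)
                    else "global-abort")                              # phase 2
        self.log.append(decision)
        acks = [p.on_decision(decision) for p in self.participants]
        if all(a == "ack" for a in acks):
            self.log.append("end_transaction")
        return decision

# A single abstaining participant forces a global abort.
assert Coordinator([Participant(), Participant()]).run() == "global-commit"
assert Coordinator([Participant(), Participant(can_commit=False)]).run() == "global-abort"
```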

Exception handling

Outage recovery
  1. Coordinator outage recovery. After the coordinator recovers from an outage, it uses its log to determine its state before the crash. If the log ends with begin_commit, the coordinator was in the WAIT state. It may or may not have sent the prepare message, but it certainly has not sent a global-commit or global-abort message, so the global state of the transaction is still undetermined. In this case the coordinator can resend the prepare message and continue the two-phase commit process; even if some participants have already responded to the prepare message, resending it does not affect the consistency of the protocol. If the log ends with global-commit or global-abort, the coordinator was in the COMMIT or ABORT state before the outage. It simply reissues the global-commit or global-abort message to all participants and continues the two-phase commit process.
  2. Participant outage recovery. After a participant recovers from an outage, it uses its log to determine its state before the crash. If the log ends with init, the participant was in the INIT state and had not yet voted on the transaction; it can continue the process and wait for the prepare message from the coordinator. If the log ends with ready, the participant was in the READY state and had already voted to commit, but it is unknown whether the vote-commit message reached the coordinator before the crash. The participant can resend vote-commit to the coordinator and continue the protocol. If the log ends with commit or abort, the participant had already received the global-commit message (COMMIT state) or the global-abort message (ABORT state) from the coordinator. It is unknown whether the corresponding acknowledgement was sent, but since the coordinator keeps reissuing global-commit or global-abort until it is acknowledged, the participant only needs to acknowledge the message when it is received again; this does not affect the global consistency of the protocol. A sketch of these recovery rules follows.
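
The sketch below condenses the recovery rules above into two lookup functions over the last local log record; it is purely illustrative and assumes the simplified in-memory logs of the earlier two-phase commit sketch.

```python
# Minimal sketch of the log-based outage recovery rules above: the last record in
# the local log determines how the node resumes the protocol.

def coordinator_recovery(log):
    last = log[-1]
    if last == "begin_commit":          # WAIT state: global decision not yet made
        return "resend prepare to all participants"
    if last in ("global-commit", "global-abort"):
        return f"resend {last} to all participants"
    return "transaction already finished"          # e.g. end_transaction

def participant_recovery(log):
    last = log[-1]
    if last == "init":                  # not yet voted: wait for the coordinator
        return "wait for prepare"
    if last == "ready":                 # voted commit: the vote may have been lost
        return "resend vote-commit and wait for the global decision"
    return f"resend acknowledgement for {last}"    # commit / abort already decided

assert coordinator_recovery(["begin_commit"]) == "resend prepare to all participants"
assert participant_recovery(["init", "ready"]).startswith("resend vote-commit")
```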

Protocol analysis

The two-phase commit protocol is rarely used in engineering practice for the following reasons:

  1. The two-phase commit protocol has poor fault tolerance. As the analysis above shows, in some cases the protocol cannot make progress and the process state cannot be determined. In engineering, a good distributed protocol should always be able to make progress even when exceptions occur. Recall the Lease mechanism (2.3): once a Lease is issued, the Lease server can always determine whether the Lease is valid purely by time, regardless of any exception, and can reclaim the Lease permission simply by waiting for the Lease to expire; the Lease flow never blocks indefinitely. Compared with the simple and effective Lease mechanism, the two-phase commit protocol is more complex and has poor fault tolerance.
  2. The two-phase commit protocol has poor performance. In a successful commit, the coordinator and each participant must exchange at least four messages: prepare, vote-commit, global-commit, and the acknowledgement of global-commit. So many interactions degrade performance. Moreover, the coordinator must wait for the votes of all participants, so a single slow participant slows down the entire global process.

Although some improved two-phase commit protocols offer better fault tolerance and performance, the protocol is still rarely used in engineering, and its theoretical value is greater than its practical significance.

2.7 MVCC

MVCC (Multi-Version Concurrency Control) was also first proposed in database systems, but the idea is not limited to stand-alone database systems; it is equally effective in distributed systems.

MVCC is a technique that implements concurrency control by keeping multiple versions of the data. Its basic idea is to generate a new version of the data for each transaction; when reading, choosing which version to read makes it possible to see complete, consistent transaction results. With MVCC, each transaction updates the data based on an effective base version, and transactions can proceed in parallel, which produces a graph-like version history.

Suppose two transactions, A and B, are started concurrently on version 1 of the base data. Each transaction makes some local modifications (visible only to the transaction itself and not affecting the real data). Transaction A commits first, producing data version 2. Transaction C is then started on data version 2 and commits, producing data version 3. Finally, transaction B commits; at this point the result of B must be merged with the result of C. If there is no data conflict, that is, transaction B did not modify any variable modified by transaction A or C, then transaction B can commit; otherwise, transaction B fails to commit.

The MVCC flow is very similar to the flow of version control systems such as SVN; or rather, version control systems such as SVN use the MVCC idea. When a transaction makes local modifications based on a base data version, there are two ways to avoid affecting the real data. One is to copy all data of the base version and then modify the copy; SVN uses this method. The other is to record only the update operations of each transaction rather than the complete data; when the data is read, the update operations are applied to the base version to compute the result, similar to SVN's incremental commit. A minimal sketch follows.
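
Below is a minimal sketch of the MVCC idea, assuming each transaction records only its own updates against a base version while, for simplicity, every committed version is kept as a full snapshot; the `MVCCStore` class and its conflict rule (reject a transaction if any key it wrote has changed since its base version) are illustrative simplifications, not the implementation of any particular database.

```python
# Minimal sketch of the MVCC idea above: each transaction works against a base
# version, records only its own updates, and at commit time is rejected if another
# transaction has already changed one of the keys it touched.

class MVCCStore:
    def __init__(self):
        self.versions = [{}]                       # version 0 is the empty base data

    def begin(self):
        return {"base": len(self.versions) - 1, "writes": {}}

    def write(self, txn, key, value):
        txn["writes"][key] = value                 # local change, invisible to others

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]
        return self.versions[txn["base"]].get(key)

    def commit(self, txn):
        # conflict check: has any key we wrote been changed since our base version?
        base = self.versions[txn["base"]]
        for version in self.versions[txn["base"] + 1:]:
            if any(version.get(key) != base.get(key) for key in txn["writes"]):
                return False
        new_version = dict(self.versions[-1])      # build the next full snapshot
        new_version.update(txn["writes"])
        self.versions.append(new_version)
        return True

# Usage mirroring the example above: A and B both start from version 1; A commits
# first (version 2), C commits on top of A (version 3), and B commits only because
# it touched a different key than A and C.
store = MVCCStore()
t0 = store.begin(); store.write(t0, "x", 1); store.write(t0, "y", 1); store.commit(t0)
a = store.begin(); b = store.begin()
store.write(a, "x", 2); store.write(b, "y", 9)
assert store.commit(a) is True                     # transaction A -> version 2
c = store.begin(); store.write(c, "x", 3); assert store.commit(c) is True   # version 3
assert store.commit(b) is True                     # B only wrote "y": no conflict
```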

2.8 Paxos agreement

The Paxos protocol is one of the few decentralized, distributed protocols with strong consistency and high availability that has been proven in engineering practice. The Paxos protocol flow is complex, but its basic idea is not hard to understand and resembles a voting process in human society. In the Paxos protocol there is a set of completely peer participant nodes (called Acceptors), each of which votes on proposals; a proposal takes effect once it is approved by more than half of the nodes. Paxos can work as long as more than half of the nodes are functioning normally, and it tolerates exceptions such as outages and network partitions.

role

  • Proposer: a Proposer can propose a value as a numbered proposal. A value can be any operation, such as “set some variable to some value” or “set the current primary to some node”; these operations are all abstracted as values in the Paxos protocol. Different Proposers can propose different, even contradictory, values, for example one Proposer proposing “set variable X to 1” and another proposing “set variable X to 2”, but at most one value is approved in a single Paxos instance.
  • Acceptor: there are N Acceptors. A value proposed by a Proposer must be approved by more than half of the Acceptors (N/2+1) before it takes effect. Acceptors are completely independent of each other.
  • Learner: a Learner learns approved values. A value is approved once it has been accepted by more than half of the Acceptors, i.e. by W = N/2 + 1 Acceptors. Therefore a Learner needs to read at least N/2 + 1 Acceptors, and after reading the results of at most N Acceptors it can learn an approved value.

The three roles above are only logical; in practice, a single node may play all three roles.

process

The Paxos protocol proceeds round by round, and each round has a number. Each round of the protocol may or may not approve a value. If a value is approved in one round of Paxos, subsequent rounds can only approve that same value. The rounds together form a Paxos protocol instance, i.e. a single Paxos instance can approve only one value; this is an important embodiment of the strong consistency of the Paxos protocol. Each Paxos round is divided into two phases, a preparation phase and an approval phase, and in both phases the Proposer and the Acceptor each have their own processing flow.

Process: Proposer process

  1. Send the Prepare(b) message to all Acceptors. Here b is the round number of this Paxos round, which increases with each new round.
  2. If any Acceptor sends a Reject(B) message, this round of Paxos fails for this Proposer; set the round number b to B+1 and start again from step 1. (The following steps belong to the approval phase, in which the Proposer acts according to the Acceptor messages it has received.)
  3. If the Proposer receives Promise(b, v_i) messages from N/2+1 Acceptors (N is the total number of Acceptors; v_i is the value most recently approved by Acceptor i, together with the round in which it was approved):
     3.1 If every received v_i is empty, the Proposer chooses a value v and broadcasts Accept(b, v) to all Acceptors.
     3.2 Otherwise, among all received Promise(b, v_i) messages, the Proposer selects the value v with the largest round number and broadcasts Accept(b, v) to all Acceptors.
  4. If Nack(B) is received, set the round number b to B+1 and repeat from step 1.

Process: Acceptor process

  1. Preparation phase: receive a Prepare(b) message from a Proposer. Let B be the largest Paxos round number this Acceptor has seen so far. 1.1 If b > B, set B = b and reply Promise(b, V_B), where V_B is the value (if any) this Acceptor last approved, together with its round number; this is a promise not to accept proposals numbered less than b. 1.2 Otherwise, reply Reject(B).
  2. Approval phase: receive an Accept(b, v) message. 2.1 If b < B, reply Nack(B), indicating that this Acceptor has already promised a proposal with a larger number. 2.2 Otherwise, set B = b and V = v, meaning this Acceptor approves the value v, and broadcast the Accepted(b, v) message. (A minimal sketch of the Proposer and Acceptor processes follows.)
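
Below is a minimal single-instance sketch of the Proposer and Acceptor processes above, assuming synchronous, loss-free message delivery and leaving retries after Reject/Nack to the caller; the `Acceptor` class and `propose` function are illustrative, not a complete Paxos implementation (real Paxos must also handle message loss and concurrent Proposers).

```python
# Minimal single-instance sketch of the prepare/accept exchange above, with
# synchronous, loss-free message delivery.

class Acceptor:
    def __init__(self):
        self.promised_b = 0        # B: highest round number promised so far
        self.accepted = None       # (b, v) of the last value this Acceptor approved

    def on_prepare(self, b):
        if b > self.promised_b:
            self.promised_b = b
            return ("promise", b, self.accepted)      # Promise(b, last accepted value)
        return ("reject", self.promised_b)

    def on_accept(self, b, v):
        if b < self.promised_b:
            return ("nack", self.promised_b)
        self.promised_b = b
        self.accepted = (b, v)
        return ("accepted", b, v)

def propose(acceptors, b, my_value):
    """One round of the Proposer process; returns the approved value or None."""
    majority = len(acceptors) // 2 + 1
    promises = [r for r in (a.on_prepare(b) for a in acceptors) if r[0] == "promise"]
    if len(promises) < majority:                      # rejected: caller retries with b+1
        return None
    # choose the value from the highest-numbered accepted proposal, if any
    prior = [p[2] for p in promises if p[2] is not None]
    value = max(prior)[1] if prior else my_value
    acks = [a.on_accept(b, value) for a in acceptors]
    if sum(r[0] == "accepted" for r in acks) >= majority:
        return value                                  # value approved by a quorum
    return None
```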

example

In the basic example there are five Acceptors and one Proposer, and no network or node failures occur. We focus on the changes of the variables B and V at each Acceptor and the change of the round number b at the Proposer.

  1. Initial state.
  2. The Proposer sends Prepare(1) to all Acceptors; every Acceptor that receives Prepare(1) replies with Promise(1, NULL).
  3. The Proposer receives five Promise(1, NULL) messages, which is more than half, so it sends Accept(1, v1), where v1 is the value chosen by the Proposer.
  4. At this point v1 is a value approved by more than half of the Acceptors; if a Learner learns a value, it can only learn v1. (A short replay of this example using the earlier sketch follows.)
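
Assuming the minimal Paxos sketch given after the Acceptor process above, the example can be replayed as follows; note how a later, higher-numbered proposal can no longer change the approved value.

```python
# Replay of the example with the earlier sketch: five acceptors, one proposer,
# no failures; v1 is approved and a higher-numbered proposal cannot change it.
acceptors = [Acceptor() for _ in range(5)]
assert propose(acceptors, 1, "v1") == "v1"      # round 1 approves v1
assert propose(acceptors, 2, "v2") == "v1"      # round 2 must re-propose the approved v1
```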

Within the same Paxos instance, an approved value cannot be changed, even if a later Proposer initiates the protocol with a higher round number. The core of the Paxos protocol is that “an approved value cannot be changed”, and this is the basis of the correctness of the whole protocol.

The Paxos protocol was deliberately designed, and its design process is also the derivation process of the protocol. The Paxos protocol builds on the Quorum mechanism, choosing W = R = N/2 + 1. In simple terms, the protocol is a process in which a Proposer's update succeeds once it has updated more than half of the Acceptors. A Learner reads Acceptors in the Quorum manner: if a value is successfully read from more than half of the Acceptors, it is an approved value. The protocol avoids deadlock by introducing rounds, so that a proposal with a higher round number preempts proposals with lower round numbers. The key point of the design is how to satisfy the constraint that “only one value is approved in a single Paxos instance”.
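
As a quick check of the quorum arithmetic above (W = R = N/2 + 1, so any read quorum intersects any write quorum), here is a tiny illustrative snippet:

```python
# With W = R = N//2 + 1, read and write quorums always overlap (R + W > N),
# so a reader contacting R acceptors sees at least one copy of an approved value.
for N in range(1, 100):
    W = R = N // 2 + 1
    assert R + W > N
```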

2.9 CAP

The definition of CAP theory is very simple. The three letters C, A, and P represent three mutually conflicting properties of a distributed system:

  • Consistency: in CAP theory, consistency of replicas refers specifically to strong consistency (1.3.4);
  • Availability: the ability of the system to continue providing service when exceptions occur;
  • Partition tolerance: the ability of the system to handle network partition exceptions (1.1.4.2);

CAP theory states that it is impossible to design a distributed protocol that has all three CAP properties at once: 1) replicas under the protocol are always strongly consistent, 2) the service is always available, 3) the protocol tolerates any network partition anomaly. A distributed system protocol can only trade off among the three.

Just as the second law of thermodynamics shows that a perpetual motion machine cannot exist and that one should not try to design one, the significance of CAP theory is that we should not try to design a perfect system that has all three CAP properties, because such a system has been proven not to exist in theory.

  • Lease mechanism: Lease sacrifices some availability (A) in exchange for full consistency (C) and good partition tolerance (P).
  • Quorum mechanism: the Quorum mechanism makes a trade-off among the three CAP properties, with a certain degree of C and relatively good A and P; it is a fairly balanced distributed protocol.
  • Two-phase commit protocol: two-phase commit systems have full C, very poor A, and very poor P.
  • Paxos protocol: also a strongly consistent protocol, Paxos does much better than two-phase commit on the CAP trade-off. Paxos has full C, better A, and better P; its A and P properties are similar to those of Quorum, since the Paxos protocol itself contains the Quorum mechanism.