The CAP theorem is one of the most famous results in distributed systems, and a cornerstone of distributed systems theory.

The CAP theorem was first presented in 2000 by Eric Brewer, a professor at UC Berkeley, at the ACM PODC conference. Two years later it was formally proved by Seth Gilbert and Nancy Lynch at the Massachusetts Institute of Technology, and it has since become an accepted theorem in the field of distributed systems.

In this article, we will walk through the famous CAP theorem.

The CAP theorem is simple to state: a distributed system can satisfy at most two of C (consistency), A (availability), and P (partition tolerance). Let's take a look at what each of these means.


Consistency


Consistency in distributed systems means that the data on all nodes, or on all replicas, is the same: all nodes see the same data at the same time. This is not the same thing as consistency in database transactions. In our previous articles we described the various consistency models in distributed systems in detail; those who are interested can click here.

Four principles of database transactions

We can split consistency in two and explore it from the client side and the server side. The client does not care how the back end is implemented, nor about the performance of individual back-end nodes. It only cares that concurrent accesses return accurate, expected results. For example, if a user clicks to pay several times, the payment is made only once, and a balance query always returns the latest value no matter when it runs.
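The "pay only once" behavior described above is usually achieved with idempotency. Here is a minimal sketch, assuming a hypothetical `PaymentService` that deduplicates requests by an illustrative request ID (not any real payment API):

```python
# A toy sketch of client-facing consistency via idempotency:
# repeated submissions of the same payment request take effect only once.
# PaymentService and the request IDs are illustrative, not a real API.

class PaymentService:
    def __init__(self, balance):
        self.balance = balance
        self.processed = set()  # request IDs that have already been applied

    def pay(self, request_id, amount):
        # A replay of an already-seen request ID is ignored, so a user
        # double-clicking "pay" deducts the balance only once.
        if request_id in self.processed:
            return self.balance
        self.processed.add(request_id)
        self.balance -= amount
        return self.balance

svc = PaymentService(balance=100)
svc.pay("req-1", 30)   # first click: balance becomes 70
svc.pay("req-1", 30)   # duplicate click: still 70
print(svc.balance)     # 70
```

From the client's point of view this is exactly the guarantee described above: no matter how many times the same action is submitted, the observed result is the single expected one.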

The server side is concerned with propagating every data-changing request to all nodes and replicas promptly and accurately, while accounting for possible network and communication problems to ensure that no errors occur even in extreme cases.

In distributed systems, different models are designed for consistency under different conditions and requirements. As a quick summary, we can divide them into three categories:

  1. If data that has been successfully updated must take effect immediately, so that all subsequent reads return the latest result, this is strong consistency.

  2. If you can tolerate some cases where the latest data is not visible after an update, this is weak consistency.

  3. If you can tolerate not seeing the latest data for a period of time after an update, as long as the results are ultimately accurate, this is eventual consistency.

In the CAP theorem, the consistency that cannot always be satisfied is strong consistency.
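The difference between these models can be seen in a toy sketch. The `Primary` and `Follower` classes below are purely illustrative, not a real database: a primary replicates writes to a follower with a delay, so the follower serves stale reads until replication catches up:

```python
# A toy model contrasting strong and eventual consistency: writes land on
# a primary first and reach the follower only when replication runs.

class Primary:
    def __init__(self):
        self.data = {}
        self.log = []            # updates not yet shipped to followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, log):        # replication happens "later"
        for key, value in log:
            self.data[key] = value
        log.clear()

primary, follower = Primary(), Follower()
primary.write("x", "v1")

# Strong consistency would require every node to return v1 immediately.
# Before replication, the follower is stale (the weak-consistency window).
print(follower.data.get("x"))    # None -- stale read

# Once replication catches up, all replicas converge: eventual consistency.
follower.apply(primary.log)
print(follower.data.get("x"))    # v1
```

Strong consistency would force the write to block until the follower also had `v1`; eventual consistency accepts the temporary `None` in exchange for faster writes.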


Availability


Availability means that reads and writes always succeed: the system is always up and the service always responds normally.

A highly available distributed system must respond to every user request; the user experience must not suffer from failed accesses or response timeouts. In a distributed system, instability in any component, such as a database server, a load balancer, or a web server, can affect the availability of the whole system. To quantify availability, we usually use the metric of system downtime: the total amount of time the system is down in a year.

Taobao is said to achieve five nines, that is, 99.999% availability. That works out to no more than about five minutes of system downtime a year, which is very difficult to do.
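The downtime figures behind "N nines" are simple arithmetic, and a few lines of code make them concrete:

```python
# Convert an availability percentage into allowed downtime per year.
minutes_per_year = 365 * 24 * 60          # 525,600 minutes in a (non-leap) year

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    downtime = minutes_per_year * (1 - availability)
    print(f"{nines} nines ({availability:.3%}): "
          f"{downtime:.1f} minutes of downtime per year")

# Five nines works out to about 5.3 minutes of downtime per year.
```

This is why each extra nine is so expensive: it cuts the yearly downtime budget by a factor of ten.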


Partition Tolerance


Partition tolerance means: the system continues operating despite arbitrary message loss or failure of part of the system. In other words, the system can still provide consistent and available service even when some nodes fail or the network partitions.

Partition tolerance is closely related to scalability, because the larger a distributed system is, the more likely it is to experience machine crashes, network congestion, and so on. Remaining stable when these unexpected situations occur is a prerequisite for scaling the system. Many things can go wrong in a distributed system, from a few machines going down to internal network failures that split the whole cluster into parts that cannot communicate with each other. Partition tolerance is what keeps the system consistent and available even under these conditions.

For example, Alibaba regularly runs data-center power-failure drills, directly cutting the power to a data center to observe whether the system remains stable.


The Proof of the CAP Theorem


That covers C, A, and P; now let's try to prove why all three cannot hold at once.

To simplify the proof, assume the whole cluster has only two nodes, N1 and N2, as shown in the figure below:

N1 and N2 each run an application (A and B respectively) and a database. If the system satisfies consistency, the data in the N1 and N2 databases is identical. If it satisfies availability, a user gets a correct result whether they access N1 or N2. If it satisfies partition tolerance, the system keeps working whether N1 or N2 crashes or the communication between them fails.

Now assume an extreme case: at some moment, the network communication between N1 and N2 suddenly breaks. If the system is partition tolerant, it can clearly survive this failure. The question is: can consistency and availability both remain intact?

Let's run this as a thought experiment, with the connection between N1 and N2 broken:

A user sends a request to N1 to change the data, updating the database from V0 to V1. Because the network is disconnected, N2's database still holds V0. If a request now arrives at N2, N2 has no way to return the latest value V1. What should it do?

At this point there are only two options. The first is to go ahead and return the stale value V0 to the user. The second is to block and wait until network communication is restored and the data on N2 has been updated, then return it. Clearly, the former sacrifices consistency and the latter sacrifices availability.
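The two options can be sketched in code. This is a toy two-node model, not a real cluster, with the AP and CP read paths written as separate illustrative functions:

```python
# A toy model of the partitioned two-node cluster: N2 either answers with
# stale data (availability first) or refuses to answer until the partition
# heals (consistency first).

class Node:
    def __init__(self, value):
        self.value = value

n1, n2 = Node("V0"), Node("V0")
partitioned = True          # the link between N1 and N2 is down

n1.value = "V1"             # the update reaches N1 only

def read_ap(node):
    # Availability first: always answer, even if the data may be stale.
    return node.value

def read_cp(node, in_sync):
    # Consistency first: refuse to answer rather than return stale data.
    if not in_sync:
        raise TimeoutError("waiting for the partition to heal")
    return node.value

print(read_ap(n2))          # V0 -- available, but inconsistent with N1

try:
    read_cp(n2, in_sync=not partitioned)
except TimeoutError as err:
    print(err)              # consistent, but the read is unavailable
```

Either branch gives up exactly one letter of CAP, which is the heart of the proof.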

This example is simple, but the point is important: in a distributed system we cannot satisfy all three CAP properties at the same time, so one of them must be given up. That leaves exactly three possible combinations.

1. Discard A and retain CP

Such a system guarantees consistency and partition tolerance at the expense of availability. In other words, in extreme cases the system is allowed to be inaccessible. User experience is often sacrificed here: users wait until the system's data is consistent before service resumes.

For some systems, such as HBase and Redis, data consistency is essential; users will not want to use storage that does not meet their consistency requirements.

The same is true of ZooKeeper: whenever you access ZK, you get a consistent result. Its job is to keep the services under its management synchronized and consistent, so giving up consistency is clearly impossible. In extreme cases, however, ZK may drop some requests, and the client may need to retry to get a result.

2. Discard C and retain AP

This is the design of most distributed systems: guarantee high availability and partition tolerance, but sacrifice strong consistency. Examples include Taobao shopping and 12306 ticket purchasing. We said earlier that Taobao can achieve the very high level of five nines of annual availability, but it cannot also guarantee data consistency at all times.

We often run into this when buying tickets on 12306. When we click to buy, the system does not say the tickets are sold out; only after entering the verification code and proceeding to payment are we told that no tickets are left. This happens because the data had not yet reached consistency when we clicked to buy, and the shortage of remaining tickets was only detected during payment verification. This design sacrifices some user experience, but it keeps the system highly available and prevents users from being locked out or waiting a long time. It is a trade-off.

3. Discard P and retain CA

Sadly, this case barely exists. In a distributed system, network partitions are inevitable: if P is discarded, the distributed system itself is discarded, and CAP no longer applies. P is a prerequisite for any distributed system, so this combination does not occur in practice.

For example, traditional relational databases such as MySQL or Oracle guarantee consistency and availability, but they are not distributed systems. In this sense C, A, and P are not interchangeable: we cannot strengthen P by sacrificing C and A. Partition tolerance can only be improved by making the underlying infrastructure more stable, so it is not purely a software problem.


This concludes our introduction to the CAP theorem. In the end, we find that it is all about trade-offs; there is no perfect solution. Every architect designing a distributed system must weigh the actual characteristics of their own business scenario. When money is involved, for example, consistency is a must: even in extreme cases the relevant data cannot be wrong, even if users are temporarily unable to access the system. Getting this wrong would not only hurt a company's reputation but also cause plenty of other trouble.

If you feel you have gained something, please give this article a like~