This post was originally posted on my public account: TechFlow

In computer systems, consistency is a word that comes up constantly, and in many different contexts: you will find it everywhere from distributed systems to database transactions.

Earlier, when we introduced database transactions, we talked about transaction consistency. In a database, consistency is an end, not a means: the database controls the atomicity, isolation, and durability of transactions precisely in order to guarantee data consistency. Consistency there means the data agrees with our expectations, that is, every result is what we intended. In distributed systems, consistency carries other meanings: one is consistency between multiple replicas of the same data, the other is consistency across the phases of a multi-phase commit. Today let's start with replica consistency.

The reason this problem exists is simple: in a distributed system there are often multiple copies of the same data. If a single database handles all data requests, the four ACID properties are basically enough to guarantee consistency. Once the data is replicated across multiple copies, however, we have a synchronization problem: there is essentially no way to ensure that every machine, including every backup, is updated at the same instant. This is especially true when the machines are spread across a country or around the world. Because of network latency, even if I send the update to every machine at the same moment, there is no guarantee that every machine applies it at the same time. As long as there is a time difference, some machines will disagree with others. In other words, consistency in a distributed system mostly means data consistency across replicas.

This is actually a dilemma. To cope with the pressure of excessive traffic, we designed distributed systems; but a distributed system brings the problem of inconsistent replicas. We cannot tolerate wrong data, and we cannot give up the distributed system either, so the only option is to take measures that minimize the impact of the problem.

The various consistency models are exactly those measures. Let's start with the simplest one: strict consistency.


Strict consistency


Strict consistency is the ideal: whenever we request a piece of data, under any circumstances, we get the result of the most recent change to it. Unfortunately, strict consistency is impossible to achieve.

The reason it is impossible is simply that data synchronization between machines takes time; however small that time is, it is guaranteed to exist, and as long as it exists, strict consistency is impossible. Let's take a simple example. We have machines A and B. At time T, machine A modifies a piece of data, and machine B received a request to query that data one millisecond earlier. By the time B executes the query, A has already applied the modification, so what value should B return? The value before A's modification, or after? From machine A's point of view, B's query happened after its modification; from machine B's point of view, A's modification happened after it received the request. If we want to guarantee strict consistency, which answer is correct?

Of course, the example above is the most extreme case and can really only occur in theory, but analyzing the extreme case is enough to show that strict consistency cannot be achieved.


Strong consistency and weak consistency


The root cause of data inconsistency is the time difference between updates on different machines. When we update multiple machines, some are always updated before others, so it is difficult to keep them perfectly in sync. Depending on whether the data is updated synchronously or asynchronously, consistency can be divided into strong consistency and weak consistency.

With a synchronous update strategy, every request goes to the master node; after the master receives the data, it synchronously pushes it to the slave nodes. Only when all the slave nodes have been updated successfully does the master commit the new state so that it takes effect, and only then does it return a response telling the user the update succeeded.

Obviously, a synchronous update strategy gives us strong consistency. If we additionally add isolation on the master, for example the user cannot start the next update until the current one has finished, then data consistency is largely not a problem. But this approach also has major drawbacks. The biggest is cost: the update is synchronous and cannot return until every slave has been updated, which takes a long time. Another critical drawback is that if a slave goes down, the master can never finish updating all of the slaves, so subsequent requests can never complete, which is obviously unacceptable.
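To make the synchronous flow concrete, here is a minimal Python sketch (not from the original post; class and method names such as SyncMaster and apply are illustrative assumptions): the master only reports success after every replica has confirmed the update.

```python
# Minimal sketch of synchronous replication: the master acknowledges a write
# only after every replica has confirmed it. All names here are illustrative.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # confirm the update to the master


class SyncMaster:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        # Push the update to every replica and wait for all confirmations.
        acks = [r.apply(key, value) for r in self.replicas]
        if not all(acks):
            raise RuntimeError("replication failed; write not committed")
        # Only now does the write take effect and the user see a success reply.
        self.data[key] = value
        return "ok"


master = SyncMaster([Replica("a"), Replica("b")])
print(master.write("post:1", "hello"))  # "ok" only after both replicas ack
```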

Opposite strong consistency is weak consistency: instead of a synchronous strategy, we update the data asynchronously. The benefit is obvious: switching from synchronous to asynchronous greatly reduces latency. The problem is just as obvious: besides the issues asynchrony brings on its own, the replicas are updated at different times, so repeated reads can return inconsistent results.
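For contrast, here is an equally minimal sketch of asynchronous replication, again with illustrative names: the master acknowledges immediately and ships the update to replicas in the background, so a read from a replica can briefly return stale data.

```python
# Minimal sketch of asynchronous replication: the master acknowledges the write
# right away and replicates in the background. Replicas are plain dicts here.

import queue
import threading
import time


class AsyncMaster:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.log = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, key, value):
        self.data[key] = value
        self.log.put((key, value))   # ship the update in the background
        return "ok"                  # acknowledged before replicas catch up

    def _replicate(self):
        while True:
            key, value = self.log.get()
            time.sleep(0.1)          # simulated replication delay
            for replica in self.replicas:
                replica[key] = value


replica = {}
master = AsyncMaster([replica])
master.write("post:1", "hello")
print(replica.get("post:1"))         # likely None: the replica is still behind
time.sleep(0.3)
print(replica.get("post:1"))         # "hello" once replication has completed
```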

At bottom there are only these two consistency models for distributed systems, strong and weak. On top of these basic models, however, many optimizations exist for the specific problems that can arise. Generally speaking, distributed systems value performance more than consistency, so most practical consistency models are designed on the basis of weak consistency. Below are several classic optimizations of the weak consistency model.


Read-your-writes consistency


Read-your-writes consistency addresses a situation we run into all the time. For example, a user replies to a post in a forum, but when the user refreshes the page, the reply has disappeared, leaving the user wondering whether the reply actually went through.

The reason is simple: when the user refreshes, the read may hit a replica that has not yet received the reply, so the result will not contain the content the user just posted. In this case we need to guarantee read-your-writes consistency, that is, a user's reads are consistent with that user's own writes, so users always see their own updates immediately. For example, when we post to WeChat Moments, it does not matter whether our friends see the post instantly, but it absolutely must show up in our own timeline.

So how do you do that?

There are several ways to do this. One is to always go to the master database for certain reads; for example, whenever we read the replies to a post, we read them from the master. The problem with this is obvious: it can put too much pressure on the master. Another option is to set an update window: for a short period right after an update we read from the master by default, and after that we pick a slave that has recently caught up.

A better solution is simply to record the timestamp of the user's last update and carry it along with each request: any slave whose last replication time is earlier than this timestamp does not serve the request. In effect, only nodes that already contain the user's write can respond, which guarantees read-your-writes consistency.
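Here is a small sketch of that timestamp idea, under the assumption that each replica tracks the time of the last update it has applied (the field synced_at and the function route_read are hypothetical names, not from the post):

```python
# Minimal sketch of read-your-writes routing: the client carries the timestamp
# of its last write, and reads go only to replicas that are at least that fresh,
# falling back to the master otherwise.

import time


class Replica:
    def __init__(self, name, synced_at):
        self.name = name
        self.synced_at = synced_at  # time of the last update this replica applied


def route_read(replicas, master, last_write_ts):
    # Prefer a replica that already contains the user's latest write.
    for r in replicas:
        if r.synced_at >= last_write_ts:
            return r.name
    return master  # no replica is fresh enough, so fall back to the master


now = time.time()
replicas = [Replica("replica-1", now - 5), Replica("replica-2", now - 0.1)]

print(route_read(replicas, "master", now - 1))  # replica-2 is fresh enough
print(route_read(replicas, "master", now))      # only the master can serve it
```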


Monotonic reads


Monotonic reads address the most classic symptom of weak consistency, and the scenario is simple. Because the primary and secondary nodes apply updates at different times, a user who refreshes repeatedly may see a piece of data appear on one refresh, vanish on the next, and then reappear, like a supernatural event. I remember Weibo and Renren both had this problem in the early days; monotonic reads target exactly this scenario and guarantee it will not happen.

The solution is simple: compute a hash of the user ID and map that hash to a machine. No matter how many times the same user refreshes, the requests are routed to the same machine, which ensures the user never reads from a different replica each time and gets a jarring experience. Of course, this is only one solution; there are many others that we will not discuss here.
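A rough sketch of that routing rule, assuming a fixed list of replicas and a stable hash of the user ID (the function name pick_replica is mine):

```python
# Minimal sketch of monotonic reads via hashing: the same user ID always maps
# to the same replica, so repeated refreshes read from one consistent copy.

import hashlib


def pick_replica(user_id, replicas):
    # Use a stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]


replicas = ["replica-1", "replica-2", "replica-3"]
print(pick_replica("user-42", replicas))  # same replica on every refresh
print(pick_replica("user-42", replicas))  # identical result: reads stay monotonic
```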


Causal consistency


Causal consistency deals with logical cause-and-effect relationships between pieces of data. For example, users ask and answer questions on Zhihu: before an answer can exist, its question must exist, so the question must come first and the answers afterwards. However, the question and its answers are not necessarily stored on the same node; the question may well live on node A and an answer on node B. Because of replication delay, a reader might see only the answer and fail to find the corresponding question, which violates the causal order between the two.

One solution is to enforce a logical order when writing, so that reads can never observe a causal inversion. The difficulty is that many cause-and-effect relationships are not as obvious as a question and its answer; some implicit dependencies are hard to judge, so deeper techniques are needed. Interested readers can look up vector clocks for a deeper understanding.
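As a taste of that technique, here is a minimal vector clock sketch (the helper names increment, merge, and happened_before are my own, not from any particular library). Each node keeps a counter per node, and comparing two clocks tells us whether one event causally precedes the other or whether they are concurrent:

```python
# Minimal vector clock sketch: clocks are dicts mapping node name -> counter.

def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock


def merge(local, received):
    # Take the element-wise maximum when a message is received.
    return {n: max(local.get(n, 0), received.get(n, 0))
            for n in set(local) | set(received)}


def happened_before(a, b):
    # a -> b iff every counter in a is <= the one in b, and the clocks differ.
    return all(a.get(n, 0) <= b.get(n, 0) for n in a) and a != b


question = increment({}, "A")                 # the question is written on node A
answer = increment(merge({}, question), "B")  # the answer on node B, after seeing it
print(happened_before(question, answer))      # True: the question precedes the answer
print(happened_before(answer, question))      # False
```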


Eventual consistency


Despite the impressive-sounding name, eventual consistency is the weakest of all the distributed consistency models. It can be thought of as the "weakest" consistency with no extra optimization: it ignores all intermediate states entirely and only guarantees that, after some period of time with no new updates, the data in all replicas of the system will converge to the correct value.

It's like ordering at Starbucks. You don't know exactly when your drink will be ready; if you go to pick it up before it's done, you get the wrong result. But you know that, given enough time, you will get what you ordered. How long that takes has no fixed answer: it can range from a few hundredths of a second to several hours.

As hand-wavy as this sounds, there are a couple of questions to settle before we can trust it. First, how long is "a period of time", and can the system bound it? What if it drags on indefinitely? Second, since convergence only happens eventually, the replicas may hold different values before it completes; which one should prevail?

Fortunately, both questions can be answered. For the first: the system has no way of determining exactly how long convergence takes, but it can bound the maximum convergence time. This is a bit like half-life in physics: we cannot know exactly when an individual particle will decay, but we can determine how long it takes for half of a large enough number of particles to decay. The second question is easier. When multiple versions of the data exist, common practice is to pick the one with the larger timestamp, that is, the value written later.
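The "larger timestamp wins" rule can be sketched in a few lines, assuming each replica reports its value together with the timestamp of the write that produced it (the tuple layout and function name resolve are illustrative):

```python
# Minimal sketch of last-write-wins conflict resolution: keep the value whose
# write carries the larger timestamp.

def resolve(versions):
    # versions: list of (value, timestamp) pairs from different replicas
    return max(versions, key=lambda v: v[1])[0]


replica_versions = [
    ("old reply", 1700000000.0),
    ("new reply", 1700000005.0),
]
print(resolve(replica_versions))  # "new reply": the later write wins
```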

Eventual consistency may sound unreliable, but it preserves the system's concurrency as much as possible, which is why it is the most widely used consistency model in high-concurrency scenarios.

This concludes the tour of common consistency models in distributed systems. One of the tricky things about distributed systems is that the terminology can easily leave us confused; but once we understand the design philosophy behind a concept and the reason it appeared, it becomes interesting. I sincerely hope everyone can find their own fun in learning it.

If you like this article, please give it a follow or a like; your support is my biggest motivation.