What is a distributed consistency problem?

When we talk about the distributed consistency problem, we are really talking about a data consistency problem.

Data consistency

Data consistency is a concept from database systems. It usually refers to the correctness and integrity of the logical relationships between associated pieces of data.

In a centralized system, a single database handles all data requests, so we can guarantee the ACID properties of the data through transactions and locks.

Why is there a distributed consistency problem?

Distributed systems have many advantages, but because the system is deployed across multiple machines, data replication is inevitable (for example, remote disaster recovery for databases, or multi-site deployment). The demand for data replication in distributed systems comes from the following two needs:

  • Eliminate single point of failure

    Copying data to multiple machines in a distributed deployment eliminates the single point of failure and prevents the system from becoming unavailable when one machine goes down.

  • Improve system performance

    With load balancing, copies of the data distributed in different places can all serve requests, which improves system performance.

Although data replication improves system performance and availability, ensuring data consistency across multiple copies (the consistency problem) becomes a big challenge.

There is almost no way to update the data on all machines at exactly the same time. Because of network delay, even if we send the update request to every machine simultaneously, we cannot guarantee that every machine responds at the same time; some machines will lag behind, leaving their data temporarily inconsistent.

Distributed consistency model

Strong consistency

Any read returns the result of the most recent write;

This gives the best user experience, but it is difficult to achieve and often has a large impact on system performance.

Weak consistency

This level of consistency does not guarantee that data can be read immediately after it is written to the system; it only guarantees that the data becomes consistent after a certain amount of time (for example, at the level of seconds).

Weak consistency is divided into the following types:

  • Read-your-writes consistency

    A user always reads the result of their own writes, so users immediately see the content they have just updated.

    For example, when we post to WeChat Moments, it does not matter whether our friends see the post right away, but it must appear in our own feed immediately.

    Solution:

    Option 1: Always read certain items from the primary database. (Problem: this puts heavy pressure on the primary.)

    Option 2: Set an update window: during the most recent update period, read from the primary by default; once the window has passed, read from a replica that has recently caught up.

    Option 3: Record the timestamp of the user's last update and carry it with every request. Any replica whose last replication time is earlier than this timestamp does not serve the request.
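Option 3 can be sketched roughly as follows; the class and field names (`Replica`, `last_applied_ts`) are illustrative, not taken from any particular system:

```python
class Replica:
    """A hypothetical read replica that tracks how far replication
    has progressed via a last-applied timestamp (watermark)."""
    def __init__(self, name, last_applied_ts, data):
        self.name = name
        self.last_applied_ts = last_applied_ts
        self.data = data

def choose_replica(replicas, user_last_write_ts):
    """Return a replica that has already applied the user's own last
    write; return None so the caller falls back to the primary."""
    for r in replicas:
        if r.last_applied_ts >= user_last_write_ts:
            return r
    return None

# The user wrote at t=100; replica B (caught up to t=120) qualifies,
# while replica A (caught up only to t=90) is skipped.
replicas = [Replica("A", 90, {"post": "old"}),
            Replica("B", 120, {"post": "new"})]
picked = choose_replica(replicas, user_last_write_ts=100)
```

If no replica has caught up, the caller reads from the primary, which is exactly the fallback behavior Option 1 uses for everything.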

  • Monotonic read consistency

    A read must never return data older than what a previous read returned.

    Because the primary and replica nodes apply updates at different times, a user who refreshes repeatedly may see the new data at one moment and the old data again the next, as if the update had vanished like a supernatural event.

    Solution:

    A hash value is calculated from the user ID, and that hash is used to map the user to a fixed machine.

    No matter how many times the same user refreshes, they are always routed to the same machine. This ensures they never read stale content from another replica and get a jarring experience.
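The mapping described above can be sketched in a few lines; the replica names here are placeholders:

```python
import hashlib

def replica_for_user(user_id, replicas):
    """Map a user to a fixed replica by hashing the user ID, so
    repeated reads by the same user always hit the same machine."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    index = int(digest, 16) % len(replicas)
    return replicas[index]

replicas = ["replica-0", "replica-1", "replica-2"]
# The same user always maps to the same replica across calls.
first = replica_for_user("alice", replicas)
assert all(replica_for_user("alice", replicas) == first for _ in range(100))
```

Note that simple modulo hashing reshuffles users when the replica set changes; production systems typically use consistent hashing for that reason.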

  • Causal consistency

    If node A notifies node B after updating some data, node B’s subsequent access to and modification of that data is based on A’s updated value.

    Meanwhile, node C, which has no causal relationship with node A, is under no such restriction when it accesses that data.
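One way to sketch this guarantee is to have a replica hold back a write until the write it causally depends on has been applied. The version numbers and single-dependency tracking here are illustrative simplifications, not a full vector-clock implementation:

```python
class CausalReplica:
    """Sketch: a replica that applies a write only after the write it
    causally depends on (tracked by version number) has been applied."""
    def __init__(self):
        self.applied = set()   # versions already applied
        self.pending = []      # writes waiting on their dependency
        self.store = {}

    def receive(self, version, depends_on, key, value):
        self.pending.append((version, depends_on, key, value))
        self._drain()

    def _drain(self):
        # Repeatedly apply any pending write whose dependency is met.
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                version, depends_on, key, value = w
                if depends_on is None or depends_on in self.applied:
                    self.store[key] = value
                    self.applied.add(version)
                    self.pending.remove(w)
                    progress = True

# B's update (v2) causally depends on A's update (v1). Even if v2
# arrives first, it is buffered until v1 has been applied.
r = CausalReplica()
r.receive(version=2, depends_on=1, key="x", value="B's value")
r.receive(version=1, depends_on=None, key="x", value="A's value")
```

After both messages arrive, v1 is applied first and then v2, so readers never observe B's update without A's update that caused it.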

  • Eventual consistency

    Eventual consistency is the weakest of all distributed consistency models. It can be regarded as the weakest guarantee obtainable without any extra machinery.

    It means that we ignore all intermediate states and only require that, after some period of time with no new updates, the data in all copies in the system is correct and consistent.

    Because it constrains the system the least, it preserves the system's concurrency to the greatest extent, which makes it the most widely used consistency model in high-concurrency scenarios.

Distributed consistency solutions

As a difficult problem that cannot be ignored in distributed systems, distributed consistency has driven long-running research and exploration that has produced a large number of consistency protocols and algorithms, including the well-known 2PC (two-phase commit protocol), 3PC (three-phase commit protocol), and the Paxos algorithm.
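As a taste of what these protocols look like, here is a heavily simplified 2PC sketch. It omits timeouts, persistent logging, and coordinator failure recovery, all of which real implementations must handle:

```python
class Participant:
    """A hypothetical 2PC participant: it votes in the prepare phase,
    then commits or aborts as the coordinator decides."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1 vote: "yes" means the participant promises it can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: ask every participant to prepare and collect votes.
    if all(p.prepare() for p in participants):
        # Phase 2a: unanimous yes -> tell everyone to commit.
        for p in participants:
            p.commit()
        return "committed"
    # Phase 2b: any "no" vote -> tell everyone to abort.
    for p in participants:
        p.abort()
    return "aborted"

ok = two_phase_commit([Participant("db1"), Participant("db2")])
bad = two_phase_commit([Participant("db1"), Participant("db2", can_commit=False)])
```

The key property is that a transaction commits only with a unanimous "yes" vote; a single "no" (or, in practice, a timeout) aborts it everywhere.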