Redis implements distributed locking

Soon after the Redis author proposed the Redlock algorithm, it was publicly questioned by Martin Kleppmann, a well-known distributed systems expert from the University of Cambridge. Martin argued that the Redlock algorithm model is flawed, and wrote an article on how to design distributed locks, putting forward his own views.

Not to be outdone, Antirez, the Redis author, wrote a rebuttal explaining the design details of Redlock's algorithm model.

The debate sparked a very heated discussion online at the time. Both are experts in the field of distributed systems, yet they put forward many opposing arguments on the same issue, each with clear reasoning and solid evidence. Who is right and who is wrong? Let's first look at how Martin, the distributed systems expert, questioned Redlock.

Martin, a distributed systems expert, is skeptical of Redlock

In his article, he mainly elaborated four arguments:

Purposes of distributed locks

Martin says there are two purposes for using a distributed lock:

  1. Efficiency: use the mutual exclusion of the lock to avoid doing the same work (such as an “expensive” computing task) multiple times. Here, even if the lock occasionally fails, there are no “malignant” consequences; sending a duplicate email, for example, is harmless.

  2. Correctness: use the lock to prevent concurrent processes from interfering with each other. If the lock fails and multiple processes operate on the same data concurrently, the consequences are serious: data errors, permanent inconsistency, or data loss.

Martin believes that if you only care about efficiency, a standalone Redis is enough: even an occasional lock failure (crash, master/slave switch) has no serious consequences, and using Redlock is too heavyweight to be necessary.

And if you care about correctness, Martin thinks Redlock does not meet the safety requirements: it still has the lock-failure problem!

Problems with locks in distributed systems

Martin said that in a distributed system there are all kinds of exceptions, and they fall into three big categories — the three mountains every distributed system must face: NPC.

  • N: Network Delay
  • P: Process Pause
  • C: Clock Drift

Martin pointed out the Redlock security issue with an example of a process pause:

  1. Client 1 requests locks on nodes A, B, C, D, and E
  2. While acquiring the lock, client 1's process pauses (e.g., a long GC)
  3. The locks on all Redis nodes expire
  4. Client 2 acquires the locks on A, B, C, D, and E
  5. When the GC completes, client 1 receives the responses and believes it acquired the lock successfully
  6. Client 2 also believes it holds the lock, causing a “conflict”

According to Martin, process pauses can occur at any point in the program and the execution time is not controllable.

Note: even without a process pause, network delay or clock drift can cause the same problem for Redlock; Martin is only using the process pause as an example.

It is unreasonable to assume that the clock is correct

Redlock has an implicit assumption that all hosts' clocks are correct; if a clock is wrong, problems arise. For example:

  1. Client 1 has obtained the locks on nodes A, B, and C
  2. The clock on node C jumps forward, causing the lock to expire
  3. Client 2 obtains locks on nodes C, D, and E
  4. Both clients 1 and 2 now believe they hold a lock (conflict)

Martin argues that Redlock must “rely heavily” on the clocks of multiple nodes being synchronized: if any node's clock goes wrong, the model fails. And it is quite possible for a machine's clock to go wrong, for example:

  • A system administrator “manually modified” the machine clock
  • A big clock “jump” when the machine synchronizes with an NTP server

In conclusion, Martin believes that Redlock's algorithm is based on the “synchronous model”, and a large body of research shows that the synchronous-model assumption is problematic in distributed systems. In a messy distributed system, you cannot assume the system clock is correct, so you must be very careful with your assumptions.

Proposing the fencing token scheme to guarantee correctness

Accordingly, Martin proposed a scheme called the fencing token to guarantee the correctness of a distributed lock. (A true master not only finds problems, but also provides solutions.)

The model flow is as follows:

  1. When a client acquires the lock, the lock service issues a monotonically “incrementing” token
  2. The client carries this token when operating on the shared resource
  3. The shared resource rejects requests from “latecomers” whose token is smaller (a sketch follows this list)
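
To make this concrete, here is a minimal Python sketch of the resource side of the scheme; the StorageService class and its field names are hypothetical, purely for illustration:

```python
# Hypothetical resource server that rejects stale fencing tokens.
class StorageService:
    def __init__(self):
        self.max_token_seen = -1  # highest fencing token accepted so far
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A "latecomer" whose lock already expired carries a smaller token
        # than the current holder, so its write is rejected.
        if token <= self.max_token_seen:
            return False
        self.max_token_seen = token
        self.data[key] = value
        return True

store = StorageService()
store.write(33, "k", "v1")   # True:  token 33 accepted
store.write(34, "k", "v2")   # True:  token 34 accepted
store.write(33, "k", "v3")   # False: stale token 33 rejected
```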

This way, no matter which NPC exceptions occur, the distributed lock is secure because it is built on an “asynchronous model.”

Redlock does not offer a fencing token scheme, so it cannot guarantee safety.

He also said:

A good distributed lock, no matter what NPC exception occurs, either gives a result within the specified time or gives no result at all, but never gives a wrong result. In other words, the exceptions only affect the lock's “performance” (liveness), not its “correctness”.

Martin’s conclusion

  1. Redlock is neither fish nor fowl: if you prefer efficiency, Redlock is heavyweight and unnecessary; if you prefer correctness, Redlock is not safe.
  2. Unreasonable clock assumptions: the algorithm makes dangerous assumptions about the system clock (it assumes the clocks of multiple nodes are consistent); if those assumptions are not met, the lock fails.
  3. No guarantee of correctness: Redlock cannot provide anything like a fencing token, so it cannot solve the correctness problem. For correctness, use software backed by a consensus system, such as Zookeeper.

That's Martin's case against Redlock, and it looks well-reasoned. Here is how Redis author Antirez countered it.

Rebuttal by Redis author Antirez

There are three main points in the Redis author's rebuttal:

The clock problem

First, the Redis author saw right through the core of the issue: the clock.

Why does the Redis author explain the clock issue first? Because the later rebuttals rely on it.

Redlock does not require perfectly consistent clocks, only roughly consistent ones: an “error” is acceptable as long as it stays small relative to the lock's lease period. For example, with a 10-second lease, a clock error of tens of milliseconds between nodes is harmless. This is not a demanding requirement for clock accuracy, and it matches real-world environments.

In response to the “clock modification” issue, Antirez retorts:

  • Manually changing the clock: just don't do it — that's like manually editing a Raft log; nothing can work if you do that...
  • Clock jumps: with “proper operations” you can keep the machine clock from jumping too far (e.g., slewing in small adjustments rather than stepping)

On network delays and process pauses

Antirez also refuted the claim that network delays and process pauses can break Redlock.

Let's go back to Martin's example and suppose:

  1. Client 1 requests locks on nodes A, B, C, D, and E
  2. While acquiring the lock, client 1's process pauses (e.g., a long GC)
  3. The locks on all Redis nodes expire
  4. Client 2 acquires the locks on A, B, C, D, and E
  5. When the GC completes, client 1 receives the responses and believes it acquired the lock successfully
  6. Client 2 also believes it holds the lock, causing a “conflict”

Antirez retorts that this assumption is flawed and that Redlock can still guarantee lock safety. Remember the five steps of the Redlock process introduced earlier? Let's review them (a code sketch follows the list).

  1. The client gets “current timestamp T1”
  2. The client sends lock requests to the five Redis instances in turn (using the SET command mentioned above), each with a short per-request timeout (in milliseconds); if one instance fails to lock (network timeout, lock held by someone else, etc.), the client immediately moves on to the next instance
  3. If the client locks >= 3 instances (a majority), it gets “current timestamp T2” again; if the lock's lease period > T2 - T1, locking is considered successful, otherwise it is considered failed
  4. After locking successfully, the client operates on the shared resource
  5. On lock failure, or when the operation finishes, the client issues a release request to “all nodes” (the Lua-script release mentioned earlier)
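
For clarity, here is a minimal Python sketch of this flow, assuming the redis-py client and five standalone instances; the hostnames are made up:

```python
import time
import uuid

import redis  # assumes the redis-py client library

# Five independent Redis masters (hostnames are made up)
NODES = [redis.Redis(host=h, port=6379, socket_timeout=0.05)
         for h in ("node-a", "node-b", "node-c", "node-d", "node-e")]

# Release only if the lock still holds our value (the Lua script from earlier)
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""

def release(key, value):
    # Step 5: release on ALL nodes, ignoring per-node failures
    for node in NODES:
        try:
            node.eval(RELEASE_SCRIPT, 1, key, value)
        except redis.RedisError:
            pass

def acquire(key, ttl_ms):
    value = str(uuid.uuid4())    # random value identifying this lock holder
    t1 = time.monotonic()        # step 1: timestamp T1
    locked = 0
    for node in NODES:           # step 2: lock each instance in turn
        try:
            if node.set(key, value, nx=True, px=ttl_ms):
                locked += 1
        except redis.RedisError:
            pass                 # timeout or failure: try the next node
    elapsed_ms = (time.monotonic() - t1) * 1000
    # Step 3: majority locked AND the time spent is within the lease period
    if locked >= len(NODES) // 2 + 1 and elapsed_ms < ttl_ms:
        return value             # lock held; operate, then call release()
    release(key, value)          # step 5: failed, release on all nodes
    return None
```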

Note that the key is steps 1-3. Why, in step 3, does the client need to get “current timestamp T2” again after locking successfully? Because T2 - T1 must be compared against the lock's expiration time.

The Redis author stresses that if a time-consuming anomaly — network delay, process pause, and so on — occurs during steps 1-3, it is detected in step 3 via T2 - T1: if the elapsed time exceeds the lock's expiration time, locking is considered failed, and the client simply releases the lock on all nodes.

If the network delay or process pause happens after step 3 — that is, after the client has confirmed it acquired the lock — and something goes wrong while it operates the shared resource, causing the lock to fail, then this is not just Redlock's problem: any other lock service, Zookeeper included, has the same issue. So it is out of scope.

So the Redis author concludes:

  • Any time-consuming anomaly the client experiences before acquiring the lock (steps 1-3) is detected by Redlock in step 3
  • If an NPC exception occurs after the client has acquired the lock, neither Redlock nor Zookeeper can do anything about it

Therefore, the Redis author believes that, on the premise that the clock is correct, Redlock can guarantee correctness.

Questioning the fencing token mechanism

Antirez also questioned the fencing token mechanism proposed by the other side, raising two points.

First, the scheme requires the shared-resource server itself to be able to reject requests carrying an old token.

If the shared-resource server is MySQL, the client must carry the incrementing token from the lock service and use it as a condition when updating a row, which relies on MySQL's “transaction isolation”.

```sql
-- Both clients must rely on transactions and isolation for this to work.
-- Note the token condition in the UPDATE:
UPDATE table T SET val = $new_val WHERE id = $id AND current_token < $token
```

But if what you are protecting is a file written to disk, or an HTTP request to another service, this scheme has nothing to offer; it places extra requirements on the resource server you are operating on.

Furthermore, if the resource server can already enforce mutual exclusion on its own, why bother with a distributed lock at all?

So the Redis author argues that the scheme is untenable.

Redlock “implements” the fencing token

Second, even though Redlock does not provide a fencing token, the random value (UUID) that Redlock does provide can achieve the same effect.

  1. The client acquires the lock with Redlock
  2. Before operating on the shared resource, the client records the lock's VALUE on the resource it is about to operate on
  3. The client runs its business logic, and when finally modifying the shared resource, checks whether the mark is still the same as before

Again using MySQL as an example, the implementation looks like this:

  1. The client acquires the lock with Redlock
  2. Before modifying a row in the MySQL table, the client writes the lock's VALUE into a field of that row (for example, a current_token field)
  3. The client runs its business logic
  4. When finally modifying the row, it uses the lock's VALUE as an update condition:

```sql
UPDATE table T SET val = $new_val WHERE id = $id AND current_token = $redlock_value
```

As you can see, this scheme also relies on MySQL's transaction mechanism and achieves the same effect as the fencing token the other side described.

Operation ordering

But a careful reader may notice that the two schemes are not fully equivalent. With the fencing token Martin described, because the token increases monotonically, the resource server can reject requests carrying a smaller token, guaranteeing the “order” of operations; Redlock's random value provides no such ordering.

Antirez gives a different — and, I find, very interesting — answer to this. He argues that the essence of a distributed lock is “mutual exclusion”: as long as, of two concurrent clients, one is guaranteed to succeed and the other to fail, there is no need to care about “ordering” at all.

Martin kept pressing this ordering issue in his critique, but the Redis author holds a different view.

To sum up, the Redis author concludes:

  1. He agrees that “clock jumps” affect Redlock, but believes such jumps can be avoided with proper infrastructure and operations.
  2. If an NPC exception occurs before step 3 of Redlock, the lock remains correct; if it occurs after step 3, it is not only Redlock that breaks — every other distributed lock service does too — so it is out of scope.

Distributed lock based on Zookeeper

In his article, Martin recommends Zookeeper for distributed locking, arguing that it is safer.

How does Zookeeper implement distributed locks?

  1. Clients 1 and 2 both attempt to create an “ephemeral node”, e.g. /lock
  2. Whichever arrives first — say client 1 — locks successfully; client 2 fails to lock
  3. Client 1 operates on the shared resource
  4. Client 1 deletes the /lock node, releasing the lock (see the sketch below)
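
For illustration, a minimal Python sketch of this flow, assuming the kazoo Zookeeper client; the host address and lock path are made up:

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()  # opens the session that heartbeats will keep alive

def try_lock() -> bool:
    try:
        # Ephemeral node: deleted automatically if our session expires
        zk.create("/lock", b"", ephemeral=True)
        return True
    except NodeExistsError:
        return False  # another client holds the lock

def unlock() -> None:
    zk.delete("/lock")
```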

Unlike Redis, Zookeeper does not need to manage lock expiration times. It relies on the “ephemeral node”: after client 1 acquires the lock, it keeps holding the lock as long as its connection stays alive. If client 1 crashes unexpectedly, the ephemeral node is deleted automatically, guaranteeing that the lock is released.

Holding the lock

After client 1 creates the ephemeral node, how does Zookeeper ensure that the client keeps holding the lock?

The reason is that client 1 maintains a Session with the Zookeeper server, kept alive by the client's periodic heartbeats.

If Zookeeper stops receiving heartbeats from the client for long enough, the Session expires and the ephemeral node is deleted.

The NPC problem in Zookeeper's distributed lock

Zookeeper's distributed lock also suffers from NPC problems. Take a process pause as an example:

  1. Client 1 creates the ephemeral node /lock successfully and acquires the lock
  2. Client 1's process pauses for a long time
  3. Client 1 fails to send heartbeats to Zookeeper, and Zookeeper deletes the ephemeral node
  4. Client 2 creates the ephemeral node /lock successfully and acquires the lock
  5. When client 1's process resumes, it still believes it holds the lock (conflict)

So even Zookeeper cannot ensure safety in the face of process pauses or abnormal network delays.

This is what the Redis author mentioned earlier in his rebuttal: if a process pause or network delay occurs after the client has acquired the lock, it is not only Redlock that has a problem, but other locking services have similar problems.

Here we can draw a conclusion:

A distributed lock, in extreme cases, is not necessarily secure.

Advantages and disadvantages of the Zookeeper distributed lock

Advantages of Zookeeper:

  • There is no need to manage lock expiration times
  • The watch mechanism: when locking fails, a client can watch the lock node and wait for its release instead of busy-polling (see the sketch below)
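
For illustration, a hedged sketch of that watch-based waiting, again assuming kazoo; the callback name is made up, and exists() registers a one-shot watch that fires when /lock changes:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def on_lock_change(event):
    # Hypothetical callback: fired once when /lock changes or is deleted;
    # this is where a waiting client would retry the lock.
    print("lock node changed:", event)

# If /lock exists, we are now waiting on the watch instead of busy-polling
if zk.exists("/lock", watch=on_lock_change):
    print("lock is held; waiting for the watch to fire")
```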

But its disadvantages are:

  • Performance is not as good as Redis
  • Higher deployment and O&M costs
  • If the client is disconnected from Zookeeper for too long, the lock is released

Should I use Redlock or not?

As mentioned above, Redlock will only work if the clock is “correct”, and if you can guarantee this, you can use it.

But keeping the clock correct is not easy:

  • First, at the hardware level, clock drift is unavoidable: CPU temperature, machine load, and chip material can all cause the clock to drift.
  • Second, human error is hard to eliminate completely: an operator brute-forcing a clock change can compromise the system's correctness.

So my personal take on Redlock is to avoid it: it performs worse than standalone Redis and costs more to deploy. For distributed locking, I recommend the master-slave + Sentinel setup first.

How, then, do we guarantee correctness?

Use distributed locks correctly

As we have seen, no distributed lock can guarantee complete correctness, so the recommended way to use one is:

  1. Use the distributed lock at the upper layer for “mutual exclusion”. Although the lock may fail in extreme cases, it blocks the vast majority of concurrent requests at the top, greatly reducing pressure on the resource layer.

  2. However, for business that demands absolutely correct data, add a final safeguard at the resource layer itself, so that the system stays correct even in extreme cases (for example, an optimistic version check, sketched below).
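
To make the second point concrete, here is a hedged Python sketch of a resource-layer safeguard: an optimistic version check on the final write, independent of the lock. The table, columns, and connection details are hypothetical, and it assumes the pymysql driver:

```python
import pymysql  # assumes the pymysql driver; connection details are made up

conn = pymysql.connect(host="127.0.0.1", user="app",
                       password="secret", database="shop")

def safe_update(row_id, new_val, expected_version):
    with conn.cursor() as cur:
        # The write succeeds only if nobody has modified the row since we
        # read it -- even if our distributed lock silently expired meanwhile.
        affected = cur.execute(
            "UPDATE t SET val = %s, version = version + 1 "
            "WHERE id = %s AND version = %s",
            (new_val, row_id, expected_version),
        )
    conn.commit()
    return affected == 1  # False: lost the race; re-read and retry
```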