The origin of distributed locks

Distributed applications often encounter concurrency problems when performing logical processing.

For example, suppose an operation needs to modify a user’s state: it first reads the state, modifies it in memory, and then writes it back. If multiple threads perform this operation at the same time, concurrency problems arise, because the read-modify-write sequence is not atomic.

Atomic operations are operations that cannot be interrupted by thread scheduling: once started, they run to completion without any context switch to another thread.

In this case, a distributed lock is used to limit the concurrent execution of the program and ensure that only one process can modify the user’s state at a time.

Distributed locking is a useful primitive in many scenarios. What these scenarios have in common, as in the example above, is that different processes must access a shared resource in a mutually exclusive way.

There are many distributed lock libraries, and many blog posts describing how to implement a distributed lock manager (DLM), but each library takes a different approach: many trade reliability for simplicity, while others use slightly more complex designs.

Some reference implementations: redis.cn/topics/dist…

Redis implementation of distributed lock

The goal of a Redis distributed lock is essentially to occupy a “pit” (a key) in Redis; when other processes try to occupy it, they either give up or retry later.

The pit is usually claimed with the setnx (SET if Not eXists) command, which succeeds for only one client: first come, first occupied. The client then calls del to release the pit when it is done.
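This basic setnx/del flow can be sketched in Python. The `FakeRedis` class below is a minimal in-memory stand-in invented for illustration, not a real Redis client (with a real server you would call redis-py’s `setnx` and `delete`, which have the same shape):

```python
class FakeRedis:
    """Minimal in-memory stand-in for a Redis client (illustration only)."""
    def __init__(self):
        self.store = {}

    def setnx(self, key, value):
        # SET if Not eXists: only the first caller succeeds.
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def delete(self, key):
        self.store.pop(key, None)

r = FakeRedis()

def do_critical_work(client_id):
    if not r.setnx("lock:code", client_id):
        return False              # pit already occupied: give up or retry later
    try:
        pass                      # ... exclusive work on the shared resource ...
    finally:
        r.delete("lock:code")     # release the pit
    return True
```

Only the first caller’s setnx returns true; every other client must give up or retry, which is exactly the mutual exclusion the lock provides.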

That is the overall idea, but practical use runs into a series of problems. Working through them one by one leads to a reasonably good solution.

The deadlock problem

del never gets to run

If an exception is thrown in the middle of the critical section, the del command is never called and the lock is never released: a deadlock.

How to solve

After acquiring the lock, we add an expiration time to it, such as 5 seconds, so that even if an exception occurs in the middle, the lock is automatically released after 5 seconds.

expire never gets to run

But this logic still has a problem. If the server process dies between setnx and expire, perhaps because the machine lost power or the process was killed, expire never executes and the result is again a deadlock.

How to solve

The root of the problem is that setnx and expire are two separate commands, not a single atomic command.

If the two commands could execute together, there would be no problem. You might think of a Redis transaction (MULTI/EXEC) as a solution, but it cannot work here, because expire depends on the result of setnx: if setnx does not grab the lock, expire must not run. A transaction has no if-else branching; it executes in one go, all or nothing.

To solve this problem, the Redis open source community produced a crop of libraries dedicated to distributed locks. Their implementations are complex enough that novice users need real effort to understand them, and using a distributed lock meant pulling in such a library on top of Jedis or redis-py.

To clean up this mess, the Redis author added extended parameters to the SET command in Redis 2.8, so that setnx and expire can be executed as one command, which put an end to the chaos of distributed lock libraries.

> set lock:code true nx ex 60 ... 
> do something critical ... 
> del lock:code 

The SET … NX EX form above combines setnx and expire into a single atomic command, and it is the core of the Redis distributed lock.
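The semantics of SET key value NX EX can be sketched with a small in-memory simulation (`FakeRedis` and its TTL bookkeeping are invented for illustration; with redis-py the real call is roughly `r.set(key, value, nx=True, ex=60)`):

```python
import time

class FakeRedis:
    """In-memory sketch of SET key value NX EX seconds (illustration only)."""
    def __init__(self):
        self.store = {}                       # key -> (value, expires_at)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        entry = self.store.get(key)
        if nx and entry and entry[1] > now:   # key exists and is not expired
            return False
        self.store[key] = (value, now + ex if ex else float("inf"))
        return True

r = FakeRedis()
assert r.set("lock:code", "client-1", nx=True, ex=0.05)      # acquired, with TTL
assert not r.set("lock:code", "client-2", nx=True, ex=0.05)  # still held
time.sleep(0.06)                                             # let the TTL elapse
assert r.set("lock:code", "client-2", nx=True, ex=0.05)      # auto-released
```

Because the existence check and the TTL are applied in one step, there is no window in which the lock exists without an expiration.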

Timeout problems

A Redis distributed lock cannot by itself solve the timeout problem: trouble occurs whenever the logic between acquiring and releasing the lock runs so long that it exceeds the lock’s expiration time.

Lock expiry causes concurrency

If thread A acquires the lock with an expiration of 30 seconds but its execution takes longer than 30 seconds, the lock expires and is automatically released, and thread B then acquires it.

Thread A and thread B now execute the critical section concurrently.

Concurrent execution of A and B is exactly what the lock is supposed to prevent. There are two ways to mitigate this:

  • Set the expiration time long enough to guarantee that the code finishes before the lock is released (but this causes its own problems, such as a long wait for the lock if the holder crashes)
  • Attach a watchdog (daemon) thread to the client holding the lock, which extends the lifetime of a lock that is about to expire but has not yet been released
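The second option, a watchdog that keeps renewing the TTL while the holder is alive (roughly what the Redisson library’s watchdog does), can be sketched as follows; `FakeRedis` and the intervals are invented for illustration:

```python
import threading
import time

class FakeRedis:
    """In-memory sketch with TTLs; only what renewal needs (illustration only)."""
    def __init__(self):
        self.store = {}                    # key -> (value, expires_at)
        self._mutex = threading.Lock()

    def set_nx_ex(self, key, value, ex):
        with self._mutex:
            now = time.monotonic()
            entry = self.store.get(key)
            if entry and entry[1] > now:   # key exists and has not expired
                return False
            self.store[key] = (value, now + ex)
            return True

    def expire_if_held(self, key, value, ex):
        # Renew the TTL only if this client still holds the lock.
        with self._mutex:
            entry = self.store.get(key)
            if entry and entry[0] == value and entry[1] > time.monotonic():
                self.store[key] = (value, time.monotonic() + ex)
                return True
            return False

def watchdog(r, key, value, ex, stop):
    # Wake up well before the TTL elapses and push the expiry forward.
    while not stop.wait(ex / 3):
        if not r.expire_if_held(key, value, ex):
            break                          # lock lost: stop renewing

r = FakeRedis()
assert r.set_nx_ex("lock:code", "client-1", ex=0.3)
stop = threading.Event()
t = threading.Thread(target=watchdog, args=(r, "lock:code", "client-1", 0.3, stop))
t.start()
time.sleep(0.7)                            # far longer than the original TTL...
still_held = not r.set_nx_ex("lock:code", "client-2", ex=0.3)
stop.set()
t.join()
assert still_held                          # ...yet the watchdog kept the lock alive
```

The watchdog stops renewing as soon as the value check fails, so it never resurrects a lock that has already passed to another client.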

Releasing someone else’s lock

Again, thread A acquires the lock with a 30-second expiration but runs longer than 30 seconds; the lock expires and is automatically released, and thread B acquires it.

Thread A then calls the del command to release the lock, but the lock it deletes now belongs to thread B, whose work is not yet finished: thread A has released thread B’s lock.

How to solve

Store an identifier for the current client in the value. Before deleting the key, verify that the value matches, i.e. that the lock is still held by this client. A UUID can serve as the identifier, and a Lua script performs the verify-and-delete as one atomic step.

// lock
String uuid = UUID.randomUUID().toString().replaceAll("-", "");
SET key uuid NX EX 30

// do something ...

// unlock (Lua script)
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end

Lua scripts can guarantee atomic execution of multiple instructions in a row

The above code block is the current mainstream Redis distributed lock solution
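The verify-then-delete step is what the Lua script makes atomic. In the in-memory sketch below (all names invented for illustration), a Python mutex plays the role of Lua’s atomicity, and the scenario from the previous section, A’s expired lock being re-taken by B, shows why the value check matters:

```python
import threading
import uuid

class FakeRedis:
    """In-memory sketch of the lock/unlock flow above (illustration only)."""
    def __init__(self):
        self.store = {}
        self._mutex = threading.Lock()   # stands in for the Lua script's atomicity

    def set_nx(self, key, value):
        with self._mutex:
            if key in self.store:
                return False
            self.store[key] = value
            return True

    def del_if_value_matches(self, key, value):
        # Equivalent of: if redis.call('get', KEYS[1]) == ARGV[1] then del(...)
        with self._mutex:
            if self.store.get(key) == value:
                del self.store[key]
                return 1
            return 0

r = FakeRedis()
token_a = uuid.uuid4().hex
assert r.set_nx("lock:code", token_a)

r.store.pop("lock:code")                 # simulate the lock expiring mid-work
token_b = uuid.uuid4().hex
assert r.set_nx("lock:code", token_b)    # client B re-acquires it

# Client A's late unlock is now a no-op instead of deleting B's lock:
assert r.del_if_value_matches("lock:code", token_a) == 0
assert r.del_if_value_matches("lock:code", token_b) == 1
```

With a plain del instead of the value check, A’s late unlock would have silently destroyed B’s lock.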

We lock with the SET key value NX EX second command, which solves the deadlock problem, and we release with a Lua script that checks the unique value first, which guarantees no client releases another client’s lock and mitigates the timeout problem to some extent.

| Problem  | Why                                                            | Solution                                                      |
| -------- | -------------------------------------------------------------- | ------------------------------------------------------------- |
| Deadlock | setnx and expire are not atomic                                 | Use the SET key value NX EX second command                    |
| Timeout  | Execution runs too long, or the wrong client releases the lock  | Set an appropriate expiration time and make the value unique  |

This usage is fine under a simple single Redis node, but in actual production, you still face the following problems…

Failover breaks the lock

To ensure the high availability of Redis, master-slave architecture, sentry architecture, and cluster architecture have been developed.

Whether master-slave, sentinel, or cluster, the principle is the same: data is asynchronously replicated from the master node to the slave nodes, and when the master fails, a slave takes over.

Suppose a client successfully acquires a lock on the master, but the master crashes before the lock is replicated to a slave. The slave is promoted to master, and since the new master has no record of the lock, another client requesting the same lock is granted it immediately. The same lock is now held by two clients at once, and safety is lost.

Because of this, Antirez, the author of Redis, proposed a more advanced distributed lock implementation method based on distributed environment: Redlock.

Redlock algorithm

In a distributed Redis environment, assume there are N Redis masters. These nodes are completely independent: no master-slave replication, no cluster coordination of any kind. Locks are acquired and released on each instance using the same method as for a single instance. Say N = 5: we run five Redis instances on five separate servers so that they do not all fail at once.

To get the lock, the client should do the following:

  • Gets the current Unix time in milliseconds.
  • Try to get locks from five instances in turn, using the same key and a unique value (such as UUID). When requesting a lock from Redis, the client should set a network connection and response timeout that is less than the lock expiration time. For example, if your lock expires automatically in 10 seconds, the timeout should be between 5 and 50 milliseconds. This prevents the client from waiting for a response when Redis has already failed on the server. If the server does not respond within the specified time, the client should try to obtain the lock from another Redis instance as soon as possible.
  • The client computes how long acquiring the lock took by subtracting the time recorded in step 1 from the current time. The lock is acquired successfully if and only if it was obtained on a majority of the nodes (N/2+1, here 3) and the total time spent acquiring it is less than the lock’s expiration time.
  • If the lock was obtained, its true validity time equals the original validity time minus the time spent acquiring it (as computed in step 3).
  • If acquiring the lock fails for any reason (fewer than N/2+1 instances locked, or the acquisition took too long), the client should release the lock on all instances, even those where it did not succeed. Otherwise a node that granted the lock, but whose reply the client never received, would keep the key and block reacquisition until it expires.
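The steps above can be sketched against N independent (here fake) nodes; `FakeNode` and `redlock_acquire` are invented for illustration and omit the per-node network timeouts and retries a real implementation needs:

```python
import time
import uuid

class FakeNode:
    """One independent Redis master, reduced to SET-NX/DEL (illustration only)."""
    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def del_if(self, key, value):
        if self.store.get(key) == value:
            del self.store[key]

def redlock_acquire(nodes, key, ttl):
    token = uuid.uuid4().hex
    start = time.monotonic()                                  # step 1: note the time
    acquired = sum(1 for n in nodes if n.set_nx(key, token))  # step 2: try each node
    elapsed = time.monotonic() - start                        # step 3: time spent locking
    validity = ttl - elapsed                                  # step 4: remaining validity
    if acquired >= len(nodes) // 2 + 1 and validity > 0:
        return token, validity
    for n in nodes:                                           # step 5: failed, unlock all
        n.del_if(key, token)
    return None, 0

nodes = [FakeNode() for _ in range(5)]
nodes[0].store["lock:code"] = "other"        # one node already holds the key
token, validity = redlock_acquire(nodes, "lock:code", ttl=10.0)
assert token is not None                     # 4 of 5 nodes >= majority of 3
assert 0 < validity <= 10.0
```

Note that a failed acquisition cleans up on every node, including those where set_nx never succeeded, which is harmless because the value check makes the delete a no-op there.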

But this implementation is also controversial:

  • It is wasteful in actual production, because it requires several completely independent master nodes (single-point instances, not masters within one cluster) dedicated to locking
  • There are problems with the design itself

Is a Redis-based distributed lock safe? (part 1) mp.weixin.qq.com/s?__biz=MzA…

Is a Redis-based distributed lock safe? (part 2) mp.weixin.qq.com/s?__biz=MzA…

Conclusion

Redis can implement distributed locks to a certain extent, but the result is flawed and controversial.

Depending on what your lock is used for:

  • For efficiency: coordinating clients to avoid duplicated work. If the lock occasionally fails, the worst case is that some operation runs more than once, with no other adverse consequence, such as sending the same email twice.
  • For correctness. Lock invalidation should never happen under any circumstances, as this could mean data inconsistency, data loss, file corruption, or other serious problems.

Options:

  • If distributed locking is used for efficiency and occasional lock failure is acceptable, then a single Redis node is sufficient, simple, and efficient; Redlock is overkill.
  • If distributed locking is for correctness, then don’t use Redlock. It is not a strong enough algorithm: it is built on an asynchronous model whose assumptions about timing are dangerous, and it has no mechanism for issuing fencing tokens. What should be used instead? Martin Kleppmann argues for a solution like ZooKeeper, or a database that supports transactions.

Redis is known for its high performance, but using it to implement distributed locks that correctly handle concurrency is genuinely hard.

Redis distributed locks can only be used as a means to mitigate concurrency; to solve a concurrency problem completely, you still need other mechanisms, such as database transactions or constraints.

xiaomi-info.github.io/2019/12/17/…

redis.cn/topics/dist…

Distributed lock based on Redis juejin.cn/post/684490…

Redis distributed lock implementation www.cnblogs.com/zhili/p/red…

Redisson: N ways to take a Redis distributed lock mp.weixin.qq.com/s/8uhYult2h…

Redlock: the ultimate Redis distributed lock implementation mp.weixin.qq.com/s?__biz=MzU…

Martin.kleppmann.com/2016/02/08/…

antirez.com/news/101