In the case of single node, thread safety is controlled by synchronization state. In distributed applications, distributed locks are needed to control the correct execution of programs without being affected by concurrency problems.

In a single node, synchronization needs to be controlled by changes in the state of a resource that can be accessed by concurrent threads. In distributed applications, concurrency problems are controlled using a key in Redis that is accessible to all nodes of the application.

Single-node Redis distributed lock

setnx

The setnx command will put the key into Redis if it does not exist, and will not set it if it does.

>setnx lock:distributed true
OK
...
other code
...
>del lock:distributed
Copy the code

The problem with this method is that when other Code is executed, the program fails to execute the DEL command, and the key is not released, resulting in a deadlock.

setnx then expire

To resolve deadlocks, it seems at first glance to use expire to set a timeout for the key.

>setnx lock:distributed true
OK
>expire lock:distributed 5
...
other code
...
>del lock:distributed
Copy the code

This is still problematic because setnx and EXPIRE are not atomic operations and exceptions can occur before the EXPIRE statement is executed. Deadlocks still occur.

set and expire

To solve the problem of non-atomic operations being interrupted, the atomic instruction of SETnx combined with expire was added to Redis 2.8.

>set lock:distributed true ex 5 nx
OK
...
other code
...
>del lock:distributed
Copy the code

This ensures atomicity of the lock and set valid time operation, but there are still problems.

Suppose the business code execution time between locking and releasing the lock exceeds the set valid time, at which point the lock will be released due to timeout. This can lead to two things:

  1. After node B obtains the lock, node A releases the lock on node B.
  2. Other nodes acquire locks, and concurrency problems can occur when critical section code is executed.

Fixed lock release by other threads

Because each node uses the same key during lock, there may be cases where the timeout node releases the lock of the current locked node. In this case, you can set a random value for the locked key. Before deleting the key, check whether the current value of the key is equal to the random value.

val = Random.nextInt();
if( redis.set(key,val,true,5) ){
	...
	other code
	...
	value = redis.get(key);
	if(val == value){ redis.delete(key); }}Copy the code

The above code implements the logic of deleting based on random values, but fetching a value until the delete instruction is not an atomic instruction can still have concurrency problems. Lua scripts are used because lua scripts ensure that multiple instruction atoms are executed consecutively.

if redis.call("get",KEYS[1]) == ARGV[1] then
    return redis.call("del",KEYS[1])
else
    return 0
end
Copy the code

This avoids the problem of locks being released by other threads.

Critical section concurrency problem

The nature of concurrency problems in critical section code is that the execution time of business code is longer than the lock expiration time.

We can refresh the lock time periodically to ensure that the business code is executed within the lock expiration time.

private volatile boolean isFlushExpiration = true;

while(redis.set(lock, val, NOT_EXIST, SECONDS, 20)){
    Thread thread = new Thread(new FlushExpirationTherad());
	thread.setDeamon(true); thread.start(); . other code ... } isFlushExpiration =false;
String deleteScript = "if redis.call("get",KEYS[1]) == ARGV[1] then" 
    + "return redis.call("del",KEYS[1])"
    + "else return 0 end";
redis.eval(deleteScript,1,key,val);
    

private class FlushExpirationTherad implements Runnable{
    @Override
	public void run(a){
        while(isFlushExpiration){
            String checkAndExpireScript = "if redis.call('get', KEYS[1]) == ARGV[1] then " +
                        "return redis.call('expire',KEYS[1],ARGV[2]) " +
                        "else return 0 end";
            redis.eval(checkAndExpireScript,1,key,val,"20");
            // Check for completion every 10 seconds
            Thread.sleep(10); }}}Copy the code

This implementation uses a thread to periodically monitor whether the client has completed execution. The heartbeat detection mechanism can also be implemented on the server (Zookeeper) to ensure service completion.

Therefore, there are three key issues to realize single-node Redis distributed lock:

  1. Get lock and set timeout as atomic operation (Redis2.8Started supported)
  2. Set a random string to ensure that locks are released only on their own (to the corresponding lock)keySet random values)
  3. Determine and release locks must be implemented as atomic operations (luaScript implementation)

Multi-node Redis distributed lock

To ensure high availability, projects are typically configured with Redis clusters in case all clients are unable to acquire locks after a single-node Redis failure.

In a cluster environment, Redis has the failover mechanism. When the Master node goes down, asynchronous Master/slave replication begins, which can occur in the following situations:

  1. Client A gets itMasterLocks on nodes.
  2. MasterNode down, storage lockkeyNot synchronized yetSlaveOn.
  3. SlaveUpgrade the node toMasterNode.
  4. Client B from the newMasterThe lock for the same resource was obtained on the node.

In this case, the security of the lock will be broken. Antirez, the author of Redis, designed the Redlock algorithm for this problem.

Redlock algorithm

When the Redlock algorithm obtains the lock, the client performs the following steps:

  1. Get the current time (start).
  2. N in orderRedisThe node requests a lock. Request locks in a manner associated with a single nodeRedisThe lock is obtained in the same way. In order to make sure that at some pointRedisWhen the node is unavailable, the algorithm can continue to run. The timeout period for obtaining the lock must be set. Ensure that the timeout period is much smaller than the valid period of the lock. This ensures that the client is addressing aRedisAfter a node fails to acquire the lock, the next node can be tried immediately.
  3. Calculate how long it takes to acquire the lock (consumeTime = end-start). If the client is from mostRedisIf the node (>= N/2 + 1) successfully obtains the lock and the lock duration does not exceed the lock validity period, the client considers that the lock is successfully obtained. Otherwise, the lock fails to be obtained.
  4. If the lock is eventually acquired successfully, the lock duration should be reset to the original lock duration minusconsumeTime.
  5. If the lock ultimately fails to be acquired, the client should immediately request allRedisThe node initiates a lock release request.

To release a lock, all Redis nodes need to release the lock, regardless of whether the node successfully acquired the lock. Because the client may successfully obtain the lock from the Redis node, but the communication fails when the node notifies the client, the client will consider that the node fails to lock.

Redlock algorithm achieves higher availability and does not have failover failure problem. However, if a node crashes and restarts, the security of the lock is still affected. Suppose there are five Redis nodes A, B, C, D and E:

  1. Client A obtains the locks of nodes A, B, and C, but fails to obtain the locks of nodes D and E.
  2. Node C crashes and restarts, but the lock added by client A on node C is not persisted and is lost after the restart
  3. After node C is restarted, client B locks C, D, and E, and the lock succeeds.

In this case, both clients A and B have acquired the lock to access the same resource.

The lock loss on node C in step 2 here can be caused by a number of reasons. By default, Redis’s AOF persistence is write to disk once per second (fsync), in which case 1 second of data may be lost. We could also set fsync to fire for every operation, which would affect performance, but even this setting could cause the operation to fail to write due to operating system problems.

In order to solve the lock failure problem caused by node restart, Antirez proposed the concept of delayed restart. That is, when a node crashes, it does not restart immediately, but restarts after the expiration of all keys related to distributed locks. In this way, the existing locks will not be affected after the node is restarted.

Some of the episode

There was a heated debate about Redlock’s security between distributed systems expert Martin Kleppmann and Redis author Antirez. For more on this debate, check out this article on whether Redis based distributed locking is safe. In the end, it is concluded that Redlock is reasonable in the application requiring efficiency, so the Java version of Redlock Redission can be used in Java projects to control multi-node access to shared resources. But there are still extreme cases where Redlock is insecure, and we should be aware of its security deficiencies and consequences. If you need to pursue correctness further, you can use Zookeeper distributed locks.

A link to the

  • Is Distributed Locking based on Redis secure