Pitfalls and optimizations in distributed locking practice

This post documents performance and optimization issues in distributed locking that are rarely covered elsewhere on the web; the code here is not copied from online examples.

This post only discusses using Redis as a distributed lock. It does not consider other Redis concerns, such as whether it is a standalone node, how to run a master-slave cluster or sentinel, or how Redis differs from distributed middleware implemented with ZooKeeper and other distributed consensus algorithms. I chose Redis mainly for its performance and coding simplicity; the CP-versus-AP trade-off was not a selection criterion.

When a service moves from a single deployment to a distributed cluster, business logic that touches the database (or anything else sensitive to concurrency) can suffer lost updates and data inconsistency because of careless code or overlooked race conditions. I ran into this at work. For example: request A queries the database and sees an available count of 100; request B queries at the same time and also sees 100. Request A subtracts 1 and writes 99; request B then subtracts 1 from its stale value and also writes 99. Request A's update has been lost, costing the company money. Under higher concurrency the damage is worse: if request B is delayed, other requests may have decremented the count ten or more times in between, yet B still writes 99, and over time this causes huge losses.
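
The race above can be reproduced deterministically in a few lines. This is a minimal single-threaded sketch of the interleaving (the names are illustrative, not from the original system): both "requests" read before either writes, so the second write silently discards the first.

```python
# Deterministic illustration of the lost-update race: two requests both
# read the same starting count, then each writes back its own result.
def lost_update_demo():
    available = {"count": 100}

    # Request A and request B both read 100 before either writes.
    read_a = available["count"]
    read_b = available["count"]

    # A writes 99; B then overwrites it with its own stale 99.
    available["count"] = read_a - 1
    available["count"] = read_b - 1

    # Two decrements happened, but the count only dropped by one.
    return available["count"]
```

Running `lost_update_demo()` returns 99 even though two decrements were issued, which is exactly the lost update described above.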

Among the solutions described in various technical documents, I chose Redis to implement the distributed lock. The general architecture is as follows.

General principle of implementation:

  1. When a request comes in, a Lua script stores a string in Redis with a 10 s expiration, using the lock's namespace as the key and the request ID as the value
  2. The lock is released after the request is processed. If the node fails, the key is deleted automatically when it expires, so there is no "deadlock"
  3. When multiple requests arrive at the same time, the one that wins the lock executes first; the losers wait a random interval before competing again, which prevents the thundering-herd effect of many requests retrying at the same instant
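
Steps 1 and 2 can be sketched as follows. This is a hypothetical mock, not the production code: a real deployment would use redis-py's `r.set(key, value, nx=True, px=10_000)` for acquisition and run the release check as a Lua script so it is atomic; here an in-memory dict stands in for Redis so the sketch runs standalone.

```python
import time
import uuid

# In-memory stand-in for Redis: key -> (value, expiry timestamp).
store = {}

# The release logic as it would look in Lua (compare the stored value
# to our request ID, and delete only if we still own the lock):
RELEASE_LUA = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def acquire(key, ttl_ms=10_000):
    """Equivalent of SET key request_id NX PX ttl_ms; returns the
    request ID on success, or None if the lock is already held."""
    now = time.monotonic()
    val, exp = store.get(key, (None, 0))
    if val is not None and exp > now:
        return None  # lock held and not yet expired
    request_id = str(uuid.uuid4())
    store[key] = (request_id, now + ttl_ms / 1000)
    return request_id

def release(key, request_id):
    """Mirrors RELEASE_LUA: delete the key only if we still own it,
    so a slow node cannot delete another request's lock."""
    val, _ = store.get(key, (None, 0))
    if val == request_id:
        del store[key]
        return True
    return False
```

The value-check on release matters: without it, a request whose lock expired could delete a lock that a later request now holds.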

Now the problems begin

The above architecture does directly solve the lost-update problem: every request on every node takes the lock for the duration of its operation. But read-read and read-write are all serialized behind this one big global lock, which blocks every request. System performance degrades sharply, and the higher the concurrency, the fiercer the contention.

With a ramp-up of 1 second and 2 loops, 57,720 lock waits were recorded

With a ramp-up of 1 second and 2 loops, 74,180 lock waits were recorded

CPU usage

The tests support the following conclusions

  1. Locking solves the lost-update problem
  2. Locking causes a sharp drop in throughput, down to as little as 7 requests/s
  3. Locking causes heavy lock contention; at peak concurrency a single request competed for the lock nearly 100 times

Optimization one

The current lock is clearly pessimistic: every request locks before doing anything. It can be changed to an optimistic scheme: first check whether there is a conflicting request, and if there is no contention, operate directly; only fall back to locking when there is, similar to a CAS mechanism. This removes the cost of locking on the uncontended path.
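
A minimal sketch of the optimistic idea, using a version number as the CAS token. Everything here is illustrative: in a real system the compare-and-set would be one atomic operation (e.g. `UPDATE ... WHERE version = ?` in SQL, or Redis `WATCH`/`MULTI`), whereas this in-memory version only shows the read-check-retry shape.

```python
# Hypothetical record with a version counter acting as the CAS token.
record = {"count": 100, "version": 0}

def cas_update(expected_version, new_count):
    """Succeed only if the version is unchanged since the caller's read.
    In production this check-and-write must be a single atomic step."""
    if record["version"] != expected_version:
        return False  # conflict: caller must re-read and retry
    record["count"] = new_count
    record["version"] += 1
    return True

def decrement_with_retry(max_retries=3):
    """Optimistic path: no lock is taken on the read; a conflicting
    write simply forces a re-read and another attempt."""
    for _ in range(max_retries):
        snapshot = dict(record)
        if cas_update(snapshot["version"], snapshot["count"] - 1):
            return True
    return False
```

Under low contention most attempts succeed on the first try, which is exactly where this beats the always-lock-first design.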

Optimization two

Refine the lock granularity, following InnoDB's row-lock design. This depends on the specific business scenario: in ours, requests from the same user can be grouped by an index or primary-key ID, and that ID is used to name the lock's keyspace. This greatly narrows each lock, so requests from different users no longer conflict with each other.
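
The keyspace-naming idea can be as simple as this hypothetical helper (the `lock:` prefix and `resource` name are made up for illustration): scoping the lock key to a user's primary key means two different users can never contend for the same lock.

```python
# Hypothetical key-naming helper: one lock per (resource, user) pair
# instead of one global lock for everything.
def lock_key(resource, user_id):
    return f"lock:{resource}:{user_id}"
```

With this, `lock_key("quota", 42)` and `lock_key("quota", 43)` are distinct Redis keys, so only concurrent requests from the *same* user serialize behind a lock.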

Optimization three

Narrow the random waiting interval. When a request fails to acquire the lock, how long should it wait before competing again? Waiting too long is wasted time; waiting too short a time aggravates contention and repeated failures. My initial range was 50-200 ms; after a series of JMeter stress tests, I found that a 10-50 ms range produced fewer competing retries and shorter overall waits.
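
The retry loop with bounded random backoff might look like this sketch. `try_lock` is a stand-in for the real acquire call; the 10-50 ms window matches the range the stress tests settled on, and the attempt cap is an assumption added here so the loop cannot spin forever.

```python
import random
import time

def acquire_with_backoff(try_lock, attempts=10, low_ms=10, high_ms=50):
    """Call try_lock up to `attempts` times, sleeping a uniformly
    random 10-50 ms between failures to spread out retries."""
    for _ in range(attempts):
        token = try_lock()
        if token is not None:
            return token
        time.sleep(random.uniform(low_ms, high_ms) / 1000)
    return None  # give up; caller decides whether to fail the request
```

The randomness is the point: if all losers slept a fixed interval, they would all wake and collide again at the same instant.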

Optimization four

From the SQL side: can we fold such accounting operations into a single statement, instead of reading the value out in application code and then adding or subtracting? That way MySQL's InnoDB row lock serializes the update directly, reducing the blocking the business layer imposes on requests.
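
A runnable illustration of the single-statement approach, using SQLite in memory as a stand-in for MySQL/InnoDB (table and column names are made up). The arithmetic and the guard live entirely in one `UPDATE`, so the database's own row lock serializes concurrent decrements and no stale read can be written back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quota (id INTEGER PRIMARY KEY, remaining INTEGER)")
conn.execute("INSERT INTO quota VALUES (1, 100)")
conn.commit()

def consume(conn, quota_id):
    """Atomic decrement with an in-statement floor check: there is no
    read-modify-write in application code to race against."""
    cur = conn.execute(
        "UPDATE quota SET remaining = remaining - 1 "
        "WHERE id = ? AND remaining > 0",
        (quota_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # False once the quota is exhausted
```

The `remaining > 0` predicate also prevents the count from going negative under contention, something the read-then-write version cannot guarantee.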

There are many further optimizations, and you can also adopt a third-party distributed-lock framework that provides features such as reentrant locks. Optimize for your specific scenario: solutions found online cannot be applied by rote, or you will add code intrusion without actually improving performance.