Abstract: In a single-process system, when multiple threads can change a variable at the same time, the variable or the block of code must be synchronized so that it is modified serially, eliminating concurrent changes. Synchronization is essentially achieved by locking.

This article is shared from Huawei Cloud Community “Can’t Use Distributed Lock? Start from Zero Based on ETCD Distributed Lock”, the original author: AOHO.

Why do we need distributed locks?

In a single-process system, when more than one thread can change a variable at the same time, the variable or the block of code must be synchronized so that it is modified serially, eliminating concurrent changes. Synchronization is essentially done through locks. To guarantee that only one thread at a time executes a given block of code, we need to set a marker somewhere that every thread can see: a thread sets the marker when none is present, and any subsequent thread that finds the marker already set waits until the thread owning it finishes the synchronized block and removes the marker, then tries to set it again.

However, in a distributed environment, data consistency has always been a hard problem. A distributed environment is more complex than a single-process one; the biggest difference is that it is not merely multi-threaded but multi-process. Since multiple threads share heap memory, they can simply use a location in memory to store the marker. Processes, however, may not even run on the same physical machine, so the marker must be stored somewhere that all processes can see.

A common scenario is the flash-sale (“seckill”) scenario, where the order service deploys multiple service instances. Say there are four items in stock, the first user buys three, and the second user buys two. Ideally the first user’s purchase succeeds and the second user is told the purchase failed, or vice versa. What can actually happen, however, is that both users read an inventory of 4: the first user buys 3, but before that update to the inventory is written, the second user places an order for 2 and writes an inventory of 2, so five items are sold out of a stock of four, a business logic error.

In the above scenario, the inventory of goods is a shared variable, and under high concurrency access to the resource must be mutually exclusive. In a stand-alone environment, a language such as Java offers many concurrency-related APIs, but they are of no use in a distributed scenario: a distributed system is both multi-threaded and multi-process, and it is spread across different machines, so the synchronized and Lock constructs lose their locking effect. The APIs a language provides by itself cannot implement distributed locking, so we have to find other ways.

Common lock schemes are as follows:

  • Distributed locking based on a database
  • Distributed locking based on ZooKeeper
  • Distributed locking based on a cache, such as Redis or ETCD

Let’s briefly introduce how each of these locks is implemented, and then focus on implementing a lock with ETCD.

Database-based locking

There are two ways to implement a database-based lock: one based on database table records, and the other based on the database’s exclusive locks.

Based on inserting and deleting database table records

The simplest way is to create a lock table containing fields such as the method name and a timestamp.

A record is inserted into the table when a method needs to be locked. Note that the method name column carries a uniqueness constraint: if multiple requests are submitted to the database at the same time, the database guarantees that only one insert succeeds, so we can consider the thread whose operation succeeded to hold the lock on that method and let it execute the method body. Upon completion, the record is deleted to release the lock.
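
As a minimal sketch of this scheme in Go, assuming a table named method_lock with a UNIQUE index on method_name and the go-sql-driver/mysql driver (both the table layout and the function names are illustrative assumptions, not prescribed by the article):

package dblock

import (
    "database/sql"

    _ "github.com/go-sql-driver/mysql" // assumed driver choice
)

// TryLock attempts to acquire the lock by inserting a record. Because
// method_name carries a UNIQUE constraint, only one concurrent insert
// can succeed; the losers get a duplicate-key error.
func TryLock(db *sql.DB, methodName string) bool {
    _, err := db.Exec(
        "INSERT INTO method_lock (method_name, created_at) VALUES (?, NOW())",
        methodName)
    return err == nil // a duplicate-key error means the lock is already held
}

// Unlock releases the lock by deleting the record.
func Unlock(db *sql.DB, methodName string) error {
    _, err := db.Exec(
        "DELETE FROM method_lock WHERE method_name = ?", methodName)
    return err
}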

The above scheme can be optimized. For example, use a master-slave database with two-way synchronization between the two; once the master fails, quickly switch the application service over to the slave. In addition, the host and thread information of the machine currently holding the lock can be recorded, so that the next time the lock is requested the database is queried first: if the stored host and thread information match the current machine and thread, the lock is granted to the thread directly, which makes the lock reentrant.
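
Continuing the sketch above, and assuming the table also has host_info and thread_info columns (again an illustrative layout), the reentrant check might look like this:

// TryReentrantLock grants the lock directly when the stored host and
// thread information match the caller; otherwise it falls back to the
// insert-based acquisition.
func TryReentrantLock(db *sql.DB, method, host, threadID string) (bool, error) {
    var h, t string
    err := db.QueryRow(
        "SELECT host_info, thread_info FROM method_lock WHERE method_name = ?",
        method).Scan(&h, &t)
    if err == nil {
        return h == host && t == threadID, nil // same owner: reentrant
    }
    if err != sql.ErrNoRows {
        return false, err
    }
    // no record yet: try the normal insert-based acquisition
    _, err = db.Exec(
        "INSERT INTO method_lock (method_name, host_info, thread_info) VALUES (?, ?, ?)",
        method, host, threadID)
    return err == nil, nil
}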

Database-based exclusive locking

We can also implement distributed locking through the database’s exclusive locks. Based on MySQL’s InnoDB engine, the lock operation can be implemented as follows:

public void lock() {
    connection.setAutoCommit(false);
    int count = 0;
    while (count < 4) {
        try {
            // the query takes an exclusive lock on the matching record
            ResultSet result = statement.executeQuery(
                "SELECT * FROM lock WHERE lock_name = 'xxx' FOR UPDATE");
            if (result.next()) {
                // the lock was obtained; return and run the business logic
                return;
            }
        } catch (Exception e) {
            // ignore and retry
        }
        // the lock was not obtained; sleep for a second and try again
        sleep(1000);
        count++;
    }
    throw new LockException();
}

Appending FOR UPDATE to the query makes the database add an exclusive lock to the matching record during the query. Once a record holds an exclusive lock, no other thread can add an exclusive lock to it. The others, which did not acquire the lock, block on the SELECT statement, with one of two possible outcomes: they acquire the lock before the timeout, or they do not.

The thread that acquires the exclusive lock thereby acquires the distributed lock. It can then execute the business logic and release the lock once the business is done.

Summary of database-based locks

Both of the above methods rely on a table in the database: one determines whether a lock exists from the presence of a record in the table, while the other relies on the database’s exclusive locks. The advantage is that they use an existing relational database directly and are simple and easy to understand. The downsides are the overhead of operating the database, performance issues, and the SQL execution timeouts that must be taken into account.

Based on ZooKeeper

ZooKeeper’s ephemeral and sequential node features enable distributed locking.

When a lock is requested for a method, a unique ephemeral sequential node is created in ZooKeeper under the directory of the node assigned to that method. To acquire the lock, a client only needs to determine whether its node has the smallest sequence number among the ordered nodes. When the business logic completes, the lock is released simply by deleting the ephemeral node. This approach also avoids the deadlock that would arise if a service outage left behind a lock that could never be released.

Netflix has open-sourced Curator, a ZooKeeper client framework, which you can look into yourself. The InterProcessMutex provided by Curator is an implementation of distributed locking: its acquire method acquires the lock and its release method releases it. In addition, issues such as lock release, blocking locks, and reentrant locks can be effectively resolved with it.

For the implementation of blocking locks, the client can create a sequential node in ZooKeeper and bind a Watch listener to it. Once a node changes, ZooKeeper notifies the client, which then checks whether the node it created is the smallest of all the current nodes. If so, the client obtains the lock and executes the business logic, as in the sketch below.
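
A rough sketch of the acquisition step, assuming the github.com/go-zookeeper/zk client and a pre-existing parent node /lock (the function and node names are illustrative):

package zklock

import (
    "sort"

    "github.com/go-zookeeper/zk"
)

// TryAcquire creates an ephemeral sequential node under /lock and
// reports whether it holds the smallest sequence number, i.e. the lock.
// The caller must keep conn open while holding the lock: closing the
// session deletes the ephemeral node and thereby releases the lock.
func TryAcquire(conn *zk.Conn) (bool, string, error) {
    path, err := conn.Create("/lock/node-", []byte{},
        zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
    if err != nil {
        return false, "", err
    }
    children, _, err := conn.Children("/lock")
    if err != nil {
        return false, "", err
    }
    sort.Strings(children)
    // the lock belongs to the node with the smallest sequence number
    return "/lock/"+children[0] == path, path, nil
}

Here conn would come from zk.Connect([]string{"localhost:2181"}, 5*time.Second); a full blocking lock would additionally watch its predecessor node and retry when notified rather than polling.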

ZooKeeper’s implementation of distributed locking also has some drawbacks: it may not perform as well as a cache-based implementation, because every acquisition and release of the lock requires dynamically creating and destroying ephemeral nodes.

In addition, creating and deleting a node in ZooKeeper can only be done by the Leader node, which then synchronizes the data to the other nodes in the cluster. In a distributed environment, network jitter is inevitable and may interrupt the session between a client and the ZooKeeper cluster. The ZooKeeper server will then think the client has died and delete its ephemeral node, allowing another client to acquire the distributed lock while the first still believes it holds it, an inconsistency.

Distributed locking based on a cache

Compared with a database-based distributed lock, a cache-based implementation performs somewhat better and is accessed much faster, and many caches can be deployed in clusters to avoid a single point of failure. There are several kinds of cache-based locks, for example memcached and Redis; the rest of this article focuses on implementing distributed locks based on ETCD.

Distributed locking is implemented through ETCD TXN

Implementing distributed locking through ETCD also needs to meet the requirements of consistency, mutual exclusion, and reliability. ETCD’s transactions (TXN), leases (Lease), and the Watch listening feature make it possible to implement such a distributed lock.

Analysis of the approach

The transactional features of ETCD help us achieve consistency and mutual exclusion. An ETCD transaction is an If-Then-Else construct. The If clause checks whether the specified key already exists on the ETCD server by testing whether the key’s create_revision is 0: a key that has not been created yet has create_revision 0, and once it exists its create_revision is non-zero. If the If condition is satisfied, the Then branch performs the PUT operation that takes the lock; otherwise the Else branch returns the result that the lock grab failed. Besides checking whether a single key was created successfully, you can also create keys sharing a common prefix and compare the revisions of those keys to decide which request the distributed lock should belong to.
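
Expressed with the etcd clientv3 API, the If-Then-Else becomes the following sketch (the key name "lock" and value "g" match the test code later in this article):

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// tryAcquire runs one lock-grabbing transaction: if create_revision of
// the key is 0 the key does not exist, so the Then branch claims it
// under the given lease; otherwise the Else branch reads the holder.
func tryAcquire(kv clientv3.KV, leaseId clientv3.LeaseID) (bool, error) {
    txn := kv.Txn(context.TODO())
    txn.If(clientv3.Compare(clientv3.CreateRevision("lock"), "=", 0)).
        Then(clientv3.OpPut("lock", "g", clientv3.WithLease(leaseId))).
        Else(clientv3.OpGet("lock"))
    resp, err := txn.Commit()
    if err != nil {
        return false, err
    }
    return resp.Succeeded, nil
}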

After a client obtains the distributed lock, if an exception occurs it still needs to release the lock in time. So we need a lease: we specify a lease time when applying for the distributed lock, and when the lease expires the lock is released automatically, guaranteeing the availability of the business. Is that enough? Not quite: if the client performs a time-consuming operation and the lease expires before the operation completes, another request obtains the distributed lock while the first is still working, causing inconsistency. In this case the lease must be renewed, that is, refreshed periodically, so that the client keeps a heartbeat with the ETCD server.
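
In clientv3 terms, this is a lease grant plus KeepAlive, roughly as follows (a sketch with imports as in the previous snippet; the 5-second TTL matches the test code below):

// grantAndKeepAlive creates a 5-second lease and starts auto-renewal.
// Cancelling the returned context stops the heartbeat, after which the
// lease, and any lock key bound to it, expires on its own.
func grantAndKeepAlive(client *clientv3.Client) (clientv3.LeaseID, context.CancelFunc, error) {
    lease := clientv3.NewLease(client)
    grantResp, err := lease.Grant(context.TODO(), 5)
    if err != nil {
        return 0, nil, err
    }
    ctx, cancel := context.WithCancel(context.TODO())
    keepRespChan, err := lease.KeepAlive(ctx, grantResp.ID)
    if err != nil {
        cancel()
        return 0, nil, err
    }
    go func() {
        // each reply on the channel is one successful renewal; the
        // channel closes once the lease can no longer be kept alive
        for range keepRespChan {
        }
    }()
    return grantResp.ID, cancel, nil
}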

The specific implementation

Based on the idea of the above analysis, we draw the flowchart of implementing ETCD distributed lock, as shown below:

The ETCD distributed lock is implemented in the Go language; the test code is as follows:

import (
    "context"
    "fmt"
    "testing"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func TestLock(t *testing.T) {
    // client configuration
    config := clientv3.Config{
        Endpoints:   []string{"localhost:2379"},
        DialTimeout: 5 * time.Second,
    }
    // establish a connection
    client, err := clientv3.New(config)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer client.Close()

    // 1. create a 5-second lease for the lock
    lease := clientv3.NewLease(client)
    leaseGrantResp, err := lease.Grant(context.TODO(), 5)
    if err != nil {
        panic(err)
    }
    leaseId := leaseGrantResp.ID

    // 2. auto renewal: use a cancellable context, mainly so that the
    //    renewal can be stopped when we exit
    ctx, cancelFunc := context.WithCancel(context.TODO())

    // 3. release the lease on exit; revoking it deletes the lock key
    defer cancelFunc()
    defer lease.Revoke(context.TODO(), leaseId)

    keepRespChan, err := lease.KeepAlive(ctx, leaseId)
    if err != nil {
        panic(err)
    }
    // consume the renewal replies; the channel closes once the lease expires
    go func() {
        for keepResp := range keepRespChan {
            fmt.Println("automatic renewal reply:", keepResp.ID)
        }
        fmt.Println("the lease has expired")
    }()

    // 4. create the transaction: if the key does not exist
    //    (create_revision == 0), set it; otherwise the lock grab fails
    kv := clientv3.NewKV(client)
    txn := kv.Txn(context.TODO())
    txn.If(clientv3.Compare(clientv3.CreateRevision("lock"), "=", 0)).
        Then(clientv3.OpPut("lock", "g", clientv3.WithLease(leaseId))).
        Else(clientv3.OpGet("lock"))

    // commit the transaction
    txnResp, err := txn.Commit()
    if err != nil {
        panic(err)
    }
    if !txnResp.Succeeded {
        fmt.Println("the lock is held by:",
            string(txnResp.Responses[0].GetResponseRange().Kvs[0].Value))
        return
    }

    // the lock was acquired: execute the business logic
    fmt.Println("processing task")
    time.Sleep(5 * time.Second)
}

The expected execution results are as follows:

--- PASS: TestLock (5.10s)
PASS

In summary, the implementation of the ETCD distributed lock described above is divided into four steps:

  • The client initializes and establishes the connection;
  • Create a lease, automatically renew the lease;
  • Create transaction, obtain lock;
  • Execute the business logic and finally release the lock.

When creating the lease, we create a cancellable one, mainly so that it can be released when the program exits. The steps that release the lock are in the defer statements above: when the deferred calls run, the lease is revoked, and the key bound to it, that is, the distributed lock, is deleted.

Summary

This article mainly introduced distributed locking based on ETCD. It first introduced the background and necessity of distributed locks: a distributed architecture, unlike a monolithic one, involves calls among multiple instances of multiple services, and across processes the concurrency primitives of the programming language itself cannot guarantee data consistency, so distributed locks are needed to coordinate operations on shared resources in a distributed environment. It then introduced two ways of implementing distributed locking on a database: inserting and deleting table records, and the database’s exclusive locks. ZooKeeper’s ephemeral and sequential node features can also implement distributed locking; both of these approaches have more or less performance and stability drawbacks.

The article then focused on the ETCD-based implementation, which builds the distributed lock on ETCD’s characteristics using transactions (TXN), leases, and the Watch listening feature.

In our example above, when the lock grab fails the client simply returns. So how can other clients acquire the lock quickly once it is released, or once the client holding it exits abnormally? The code above can be improved with ETCD’s Watch feature, as in the sketch below; you can also try it yourself.
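
One hedged sketch of that improvement: a client that lost the lock grab watches the "lock" key and returns when a DELETE event arrives, after which the caller can retry the lock-grabbing transaction (imports as in the test code above; the retry policy itself is left open):

// waitForUnlock blocks until the lock key is deleted, i.e. until the
// holder releases the lock or its lease expires.
func waitForUnlock(client *clientv3.Client) {
    watchChan := client.Watch(context.TODO(), "lock")
    for watchResp := range watchChan {
        for _, ev := range watchResp.Events {
            if ev.Type == clientv3.EventTypeDelete {
                return // the lock is free: retry the transaction
            }
        }
    }
}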
