1. Business background

Many user-facing Internet services maintain user data on the back end, and the Fast Application Center is no exception. The Fast Application Center allows users to add fast applications to their favorites, records each user's favorites list on the server, and associates the favorited fast application package names with the user's account ID (OpenID).

To keep the favorites list in the Fast Application Center consistent with the favorite status shown in the fast application Menubar, we also record the binding between the user's OpenID and the client's local identifier (local_identifier). Since the Menubar belongs to the fast application engine and is independent of the Fast Application Center, it cannot obtain the OpenID through the account system; it can only obtain the client's local_identifier. The mapping between the two is therefore the only way to keep their states in sync.

In the concrete implementation, a synchronization operation is triggered when the user starts the Fast Application Center: the client submits the OpenID and the client's local identifier to the server for binding. If the OpenID does not exist, it is inserted into the database; otherwise, the local_identifier field of the corresponding row is updated (because the user may log in to the same vivo account on two different phones). In subsequent business flows, we can then look up the local_identifier for a given OpenID, or vice versa.

However, some time after the code went live, we noticed many duplicate OpenID records in the t_account table. Given the binding logic described above, this should not happen in theory. Fortunately, the duplicate data does not affect the update and query scenarios, because we added LIMIT 1 to the query SQL, so update and query operations for an OpenID only ever touch the record with the smallest id.

2. Analyzing and locating the problem

Although the redundant data has no impact on the actual business, such an obvious data problem is certainly intolerable, so we set out to track down its cause.

The first thing to look at is the data itself. A rough scan of the t_account table showed that about 3% of OpenIDs were duplicated. In other words, duplicate inserts were occasional, and most requests were handled correctly as expected. We re-read the code and confirmed that there were no obvious logic errors in the implementation.

Let's take a closer look at the data. We selected several duplicated OpenIDs and queried their records. The number of repetitions varied: some OpenIDs were duplicated only once, others more. We did, however, find a more valuable clue: rows with the same OpenID were created at exactly the same time, and their auto-increment ids were consecutive.

We guessed that the problem was caused by concurrent requests. We simulated concurrent client calls to the interface, and duplicate inserts did occur, further confirming this hypothesis. But if the client logic is that each user synchronizes once at startup, why would concurrent requests with the same OpenID occur at all?

In reality, code does not run in the ideal conditions we imagine. Unstable factors such as network conditions and server load are common, and they can cause a client request to "fail". The failure may not be a real one: the request may simply take too long, exceed the timeout set by the client, be judged a failure, and be sent again by the retry mechanism. The same request can therefore end up being submitted several times. If some of these requests are blocked somewhere along the way (for example, when the server is overloaded and requests wait in a buffer queue until a worker thread is free), then once the blockage clears, several copies of the request may be processed concurrently within a very short window.

This is a typical concurrency conflict, which can be abstracted simply as: how do we avoid writing duplicate data under concurrency? Many common business scenarios run into the same problem, such as not allowing two users to register the same user name.

In general, the most intuitive way to handle this kind of problem is to query first and allow the insert only when the data does not already exist in the database.

Obviously, this process is fine from the perspective of a single request. But when requests are concurrent, request A and request B both run the query first, both find that no row exists, and both then perform the insert, which finally leads to the concurrency conflict.

3. Exploring feasible solutions

Now that the problem has been identified, it’s time to start looking for a solution. Faced with this situation, we usually have two choices: let the database handle it, or let the application handle it.

3.1 Database level processing — unique indexes

When using MySQL with the InnoDB storage engine, we can use a unique index to guarantee that the values in a column are unique. Obviously, we did not create a unique index on the open_id column of the t_account table to begin with. To add one now, we can use the following ALTER TABLE statement.

ALTER TABLE t_account ADD UNIQUE uk_open_id (open_id);

Once the open_id column has a unique index, when concurrency occurs, one of request A and request B will insert first, and the other will receive an error like the one below. This guarantees that only one record with open_id = xxx exists in the t_account table.

Error Code: 1062. Duplicate entry 'xxx' for key 'uk_open_id'

3.2 Application level processing — distributed locking

The other approach is to rely not on the underlying database for a uniqueness guarantee, but on the application's own code logic to avoid the concurrency conflict. An application-layer guarantee is actually the more universal solution; after all, we cannot assume that every data persistence component a system uses has built-in uniqueness detection.

So how do we do that? In short: serialization. We run into duplicate data because the "check whether the data exists" and "insert the data" actions are separate. Since the two steps are not atomic, two different requests can both pass the first step. If we combine the two actions into a single atomic operation, we can avoid the conflict, and this is where we need a lock to make the code block atomic.
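Within a single process, this idea can be illustrated with a minimal sketch (hypothetical code, not the actual service): ConcurrentHashMap.putIfAbsent fuses the existence check and the insert into one atomic step, so no two callers can both succeed for the same key.

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal in-process sketch: putIfAbsent combines "check whether the key
// exists" and "insert it" into a single atomic operation, so two
// concurrent callers can never both insert the same openId.
public class AtomicInsertSketch {

    private static final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Returns true only for the first caller with a given openId.
    public static boolean insertIfAbsent(String openId, String localIdentifier) {
        return store.putIfAbsent(openId, localIdentifier) == null;
    }

    public static void main(String[] args) {
        System.out.println(insertIfAbsent("openId-1", "device-A")); // prints "true"
        System.out.println(insertIfAbsent("openId-1", "device-B")); // prints "false"
    }
}
```

Of course, this only serializes callers within one JVM; since our service runs on multiple nodes, an in-process structure like this is not enough, which is exactly why a distributed lock is needed.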

The most familiar locking mechanism in the Java language is the synchronized keyword.

public synchronized void submit(String openId, String localIdentifier) {
    Account account = accountDao.find(openId);
    if (account == null) {
        // insert
    } else {
        // update
    }
}

But it's not that simple. Remember, our program is deployed not on a single server but on multiple nodes, so the concurrency here is not just between threads but between processes. A lock at the Java language level therefore cannot solve this synchronization problem; what we need here is a distributed lock.

3.3 Tradeoffs between the two solutions

Based on the analysis above, both schemes seem feasible, but in the end we chose the distributed-locking scheme. Why not the first option, when all it requires is an index?

Because the online data already contains duplicates in the open_id column, adding the unique index would simply fail. To add it, we must first clean up the existing duplicates. But with the online program running all the time, new duplicates can keep appearing. Could we find a window of low user activity in which to clean up and create the unique index before new duplicates are inserted? In principle, yes, but such a fix requires coordination among operations, DBAs, and development, and given the characteristics of the business, the most suitable window is the middle of the night. Even with such a drastic fix, there is no 100% guarantee that no new duplicate rows will be inserted between the data cleanup and the index creation. So a fix based on a unique index looks good at first glance but is actually rather cumbersome.

In fact, the most appropriate time to create a unique index is during the initial design of the system, which avoids the duplicate data problem altogether. But the die is cast, and in the current situation we chose the more operable distributed-locking solution. With this option, we can first release new code that uses a distributed lock to block new duplicate inserts, and then clean up the existing duplicate data, so only one code change and one release are needed. Of course, once the problem is resolved, we can reconsider adding a unique index to the table.

So next, let’s look at how the distributed locking based solution can be implemented. Let’s start with a review of distributed locking.

4. Overview of distributed locking

4.1 What features does a distributed lock need?

  • In a distributed system, only one thread on one machine can acquire a lock at a time.

  • Highly available lock acquisition and release;

  • High performance lock acquisition and release;

  • Reentrancy;

  • A lock-expiration mechanism to prevent deadlocks;

  • Blocking and non-blocking acquisition.

4.2 How can distributed locks be implemented?

There are three common ways to implement a distributed lock:

  • Based on a database;

  • Based on ZooKeeper;

  • Based on Redis.

4.2.1 Implementation method based on database

The database-based implementation creates a lock table directly and implements locking and unlocking by manipulating rows in that table. Taking MySQL as an example, we first create a lock table with a unique index on method_name:

CREATE TABLE `myLock` (
  `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `method_name` varchar(100) NOT NULL DEFAULT '' COMMENT 'locked method name',
  `value` varchar(1024) NOT NULL COMMENT 'lock information',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uidx_method_name` (`method_name`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='methods in locking';

We can then lock and unlock data by inserting and deleting data:

# lock
insert into myLock(method_name, value) values ('m1', '1');

# unlock
delete from myLock where method_name = 'm1';

The database-based approach is simple, but has some obvious problems:

  • There is no lock expiration time; if unlocking fails, the lock record stays in the database forever, resulting in deadlock;

  • The lock is non-reentrant, because there is no way to tell whether the requester is the thread currently holding the lock;

  • The database is a single point of failure; if it goes down, the locking mechanism breaks down completely.

4.2.2 Implementation method based on Zookeeper

ZooKeeper is an open-source component that provides consistency services for distributed applications. Its data model is a hierarchical, file-system-like tree of directories, and node names within the same directory must be unique.

ZooKeeper nodes (ZNodes) are classified into four types:

  • Persistent node (the node still exists after the session disconnects)

  • Persistent sequential node

  • Temporary (ephemeral) node (the node is deleted after the session disconnects)

  • Temporary sequential node

When a new znode is created as a sequential node, ZooKeeper sets the znode's path by appending a 10-digit sequence number to the original name. For example, if a znode with the path /mynode is created as a sequential node, ZooKeeper changes the path to /mynode0000000001 and sets the next sequence number to 0000000002; the counter is maintained by the parent node. If two sequential nodes are created at the same time, ZooKeeper never assigns the same number to both znodes.
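The path numbering can be sketched as follows (a simplified, hypothetical illustration: in real ZooKeeper the counter is kept by the server on the parent znode, not by client code):

```java
// Simplified illustration of ZooKeeper-style sequential node paths: the
// parent keeps a counter and appends it as a zero-padded 10-digit suffix.
// (In real ZooKeeper the counter lives on the server, not in client code.)
public class SequentialPathSketch {

    private int nextSequence = 1; // maintained by the parent node in ZooKeeper

    public String create(String path) {
        return String.format("%s%010d", path, nextSequence++);
    }

    public static void main(String[] args) {
        SequentialPathSketch parent = new SequentialPathSketch();
        System.out.println(parent.create("/mynode")); // prints "/mynode0000000001"
        System.out.println(parent.create("/mynode")); // prints "/mynode0000000002"
    }
}
```

The zero-padded suffix is what makes lexical ordering of child paths match creation order.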

Based on the features of ZooKeeper, distributed locking can be implemented as follows:

  • Create a directory myLock;

  • Thread A acquires the lock by creating a temporary sequential node under the myLock directory;

  • Thread A then gets all child nodes of the myLock directory and looks for siblings with a smaller sequence number; if there are none, its own sequence number is the smallest and it holds the lock;

  • Thread B gets all child nodes, finds it is not the smallest, and sets a watch on the node whose sequence number is just below its own;

  • When thread A finishes processing, it deletes its own node; thread B receives the watch event, checks whether it is now the smallest node, and if so acquires the lock.

Because the lock nodes are temporary, the lock is released automatically when the thread holding it crashes, which avoids deadlock. We can also implement blocking behavior through the node-queueing and watch mechanism, and implement reentrancy by carrying a thread identifier in the znode. Meanwhile, the high availability of the ZooKeeper cluster guarantees the availability of the distributed lock. However, because nodes are frequently created and deleted, ZooKeeper's performance is not as good as that of Redis.
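As a hypothetical illustration of the queueing rule in the steps above (a real implementation would use a ZooKeeper client such as Apache Curator), the "smallest sequence number holds the lock" logic can be simulated in memory:

```java
import java.util.TreeSet;

// In-memory simulation of the ZooKeeper lock queue: each contender creates
// a sequential "node" (here just a number), and whichever live node has
// the smallest sequence number currently holds the lock.
public class ZkLockQueueSketch {

    private final TreeSet<Long> nodes = new TreeSet<>();
    private long nextSequence = 1;

    // Simulates creating a temporary sequential node; returns its sequence.
    public synchronized long enqueue() {
        long seq = nextSequence++;
        nodes.add(seq);
        return seq;
    }

    // A contender holds the lock iff no sibling has a smaller sequence.
    public synchronized boolean holdsLock(long seq) {
        return !nodes.isEmpty() && nodes.first() == seq;
    }

    // Simulates deleting one's own node; in real ZooKeeper this also happens
    // automatically when a crashed holder's session expires.
    public synchronized void release(long seq) {
        nodes.remove(seq);
    }

    public static void main(String[] args) {
        ZkLockQueueSketch queue = new ZkLockQueueSketch();
        long a = queue.enqueue();
        long b = queue.enqueue();
        System.out.println(queue.holdsLock(a)); // prints "true": A is the smallest node
        System.out.println(queue.holdsLock(b)); // prints "false": B waits for A
        queue.release(a);
        System.out.println(queue.holdsLock(b)); // prints "true": B now holds the lock
    }
}
```

Watching only the next-smaller node, rather than all nodes, is what keeps real ZooKeeper lock implementations from a thundering herd on every release.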

4.2.3 Implementation method based on Redis

Redis is an open-source key-value database. Because it operates entirely in memory, its performance is very high, and it is often used as a cache.

The core principle of a Redis-based distributed lock is: try to set a specific key, and if the set succeeds (the key did not exist before), that is equivalent to acquiring the lock. At the same time, give the key an expiration time to avoid a deadlock if the thread exits before releasing the lock. After finishing the synchronized task, the thread releases the lock explicitly with a delete command.

One thing needs special attention here: how to acquire the lock and set the expiration time. Some implementations use setnx followed by expire, but this is problematic: suppose the thread acquires the lock with setnx but crashes before calling expire; the lock can then never be released. Of course, we can combine the two commands in a single Lua script so that both are committed atomically.

In fact, we can simply use the set command, which performs setnx and sets the expiration time in a single command, to complete the lock operation:

SET key value [EX seconds] [PX milliseconds] NX

The unlock operation only requires:

DEL key

5. A solution based on a Redis distributed lock

In this case, we use the redis-based distributed locking method.

5.1 Java implementation of distributed locking

Because the project uses the Jedis framework and the online Redis is deployed in cluster mode, we encapsulated a RedisLock class on top of redis.clients.jedis.JedisCluster, providing lock and unlock interfaces.

public class RedisLock {

    private static final String LOCK_SUCCESS = "OK";
    private static final String LOCK_VALUE = "lock";
    private static final int EXPIRE_SECONDS = 3;

    @Autowired
    protected JedisCluster jedisCluster;

    public boolean lock(String openId) {
        String redisKey = this.formatRedisKey(openId);
        String ok = jedisCluster.set(redisKey, LOCK_VALUE, "NX", "EX", EXPIRE_SECONDS);
        return LOCK_SUCCESS.equals(ok);
    }

    public void unlock(String openId) {
        String redisKey = this.formatRedisKey(openId);
        jedisCluster.del(redisKey);
    }

    private String formatRedisKey(String openId) {
        return "keyPrefix:" + openId;
    }
}

In the concrete implementation, we set the expiration time to 3 seconds, because the locked task is only a simple database query and insert, and the server and database are deployed in the same machine room. Under normal circumstances, 3 seconds is more than enough for the code to finish executing.

In fact, the implementation above is a rudimentary version of a Redis distributed lock. It does not consider reentrancy or prevent the lock from being released by another process, but it is sufficient for this business scenario. To extend it to more general scenarios, we could put a unique identifier of the current process into the value and check that identifier when acquiring and releasing the lock, which yields a safer and more reliable Redis distributed lock implementation.
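That idea of carrying an identity in the value and checking it on release can be sketched with an in-memory stand-in for Redis (hypothetical code; with real Redis, the compare-and-delete on release must itself be atomic, typically via a Lua script, and key expiration is omitted here for brevity):

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a safer lock: each acquisition stores a unique token as the
// value, and release succeeds only if the caller still owns that token,
// so one process cannot delete a lock held by another.
public class TokenLockSketch {

    // In-memory stand-in for Redis; key expiration is omitted for brevity.
    private final ConcurrentHashMap<String, String> redis = new ConcurrentHashMap<>();

    // Returns a token on success, or null if the lock is already held.
    public String lock(String key) {
        String token = UUID.randomUUID().toString();
        return redis.putIfAbsent(key, token) == null ? token : null;
    }

    // remove(key, value) is an atomic compare-and-delete, mirroring the
    // "GET, compare, DEL" pattern done in a Lua script with real Redis.
    public boolean unlock(String key, String token) {
        return token != null && redis.remove(key, token);
    }

    public static void main(String[] args) {
        TokenLockSketch lock = new TokenLockSketch();
        String token = lock.lock("account:openId-1");
        System.out.println(lock.unlock("account:openId-1", "wrong-token")); // prints "false"
        System.out.println(lock.unlock("account:openId-1", token));         // prints "true"
    }
}
```

A plain GET followed by DEL would reintroduce exactly the check-then-act race this article is about, which is why the release must be a single atomic step.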

Of course, frameworks such as Redisson also provide fairly complete, well-encapsulated Redis distributed lock implementations, and in demanding business scenarios I recommend using such a framework directly. Since this article focuses on troubleshooting and the approach to solving the problem, it does not go into the implementation details of Redisson's distributed lock; interested readers can find plenty of material online.

5.2 Improved code logic

Now we can use the wrapped RedisLock to improve the original code.

public class AccountService {

    @Autowired
    private RedisLock redisLock;

    public void submit(String openId, String localIdentifier) {
        // If concurrent requests arrive for the same openId and this thread
        // does not grab the lock, the request is discarded directly
        if (!redisLock.lock(openId)) {
            return;
        }

        // The lock is acquired; execute the user data synchronization logic
        try {
            Account account = accountDao.find(openId);
            if (account == null) {
                // insert
            } else {
                // update
            }
        } finally {
            // release the lock
            redisLock.unlock(openId);
        }
    }
}

5.3 Data Clearing

A quick final word on the finishing touches. Because of the large volume of duplicate data, cleaning it up by hand was not feasible. We therefore wrote a scheduled task that cleans up 1,000 duplicated OpenIDs per minute, avoiding the performance impact that a short burst of query and delete operations would have on the database. Once we confirmed that the duplicate data had been completely cleaned up, we stopped the scheduled task and removed this code in the next version iteration.
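As a hypothetical sketch of the batch selection logic (the real task ran equivalent SQL against MySQL on a schedule), the rule "keep the smallest id for each OpenID, delete the rest, at most N rows per run" looks like this:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of one cleanup batch: given (id, openId) rows ordered by id,
// collect the ids of duplicate rows (any row whose openId already appeared
// with a smaller id), capped at batchSize so each run stays cheap.
public class DedupBatchSketch {

    public static List<Long> idsToDelete(SortedMap<Long, String> rowsByIdAsc, int batchSize) {
        Set<String> seen = new HashSet<>();
        List<Long> toDelete = new ArrayList<>();
        for (Map.Entry<Long, String> row : rowsByIdAsc.entrySet()) {
            if (toDelete.size() >= batchSize) {
                break; // leave the rest for the next scheduled run
            }
            if (!seen.add(row.getValue())) {
                toDelete.add(row.getKey()); // this openId already has a smaller id
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        SortedMap<Long, String> rows = new TreeMap<>();
        rows.put(1L, "openId-A");
        rows.put(2L, "openId-A"); // duplicate of id 1
        rows.put(3L, "openId-B");
        rows.put(4L, "openId-B"); // duplicate of id 3
        System.out.println(idsToDelete(rows, 10)); // prints "[2, 4]"
        System.out.println(idsToDelete(rows, 1));  // prints "[2]"
    }
}
```

Capping each batch keeps the delete load small per run, matching the once-a-minute schedule described above.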

6. Summary

Problems of all kinds inevitably come up in day-to-day development. We have to learn to analyze them step by step and find the root cause, then look for feasible solutions within our knowledge, and carefully weigh the pros and cons of each option, so that the problem is finally solved efficiently.