E-commerce incident: 100 bottles of Feitian Moutai oversold due to improper use of a Redis distributed lock

Using distributed locks based on Redis is nothing new these days.

This article analyzes a real incident in one of our projects that was caused by a Redis distributed lock, and describes how we fixed it. In that project, flash-sale orders are placed under the protection of a distributed lock.

On one occasion, the operations team ran a flash sale of Feitian Moutai with 100 bottles in stock, and we ended up overselling by 100 bottles! Keep in mind how scarce Feitian Moutai is!

The incident was classified as P0, a major incident, and we simply had to accept that. Afterwards, the CTO named me to lead the investigation and the fix.

All right, let's go.

The scene of the incident

After some digging, we learned that this flash-sale endpoint had never oversold before. So why did it oversell this time?

The difference is that the earlier products were not scarce, whereas this event was selling Feitian Moutai. The tracking data showed that nearly every metric had roughly doubled, which gives an idea of how intense the event was. Without further ado, here is the core code, with the confidential parts reduced to pseudo-code comments:

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    try {
        // Acquire the lock: set the key only if absent, with a 10-second expiration
        Boolean lockFlag = redisTemplate.opsForValue().setIfAbsent(key, "val", 10, TimeUnit.SECONDS);
        if (Boolean.TRUE.equals(lockFlag)) {
            // Call the user service over HTTP for user-specific validation
            // Verify user activity

            // Check the inventory (a read followed by a separate write)
            Object stock = redisTemplate.opsForHash().get(key + ":info", "stock");
            assert stock != null;
            if (Integer.parseInt(stock.toString()) <= 0) {
                // Business exception: out of stock
            } else {
                redisTemplate.opsForHash().increment(key + ":info", "stock", -1);
                // Generate the order
                // Publish the order creation success event
                // Build response VO
            }
        }
    } finally {
        // Release the lock
        stringRedisTemplate.delete(key);
        // Build response VO
    }
    return response;
}

In the code above, the distributed lock expires after 10 s, which was meant to leave the business logic enough time to run. The try-finally block ensures that the lock is released promptly, and the business code also checks the inventory itself. It looks safe, doesn't it? Not so fast; let's keep analyzing.

The cause of the incident

The event attracted a large number of new users who downloaded and registered on our app. Among them were many bonus hunters (the so-called "wool party"), who used professional tooling to register new accounts in bulk in order to grab promotions and place fake orders.

Our user system was, of course, prepared in advance: it had integrated Alibaba Cloud human-machine verification, three-factor authentication, and our in-house risk-control system, and between them these measures blocked a large number of illegitimate users. Kudos for that. For the same reason, however, the user service was running under sustained high load.

As soon as the flash sale began, a flood of user verification requests hit the user service. The user service gateway suffered a brief spike in response latency, and some requests took more than 10 s to return. Because our HTTP request timeout was set to 30 s, the flash-sale endpoint stayed blocked in the user check. After 10 s the distributed lock expired, so a new incoming request could acquire the lock; in other words, the lock was overwritten.

When the blocked requests finally finished, they executed the lock-release logic and released locks that now belonged to other threads, which let yet more requests acquire the lock. This is a vicious cycle. At that point the only remaining safeguard was the inventory check, but the inventory check was not atomic: it used a get-then-compare approach. And so the overselling tragedy happened.

Incident analysis

A careful analysis shows that the flash-sale endpoint has serious safety hazards under high concurrency, concentrated in the following three areas:

No fault tolerance for failures in dependent systems

The user service was under strain and its gateway responded slowly, but the endpoint had no way of handling this. That was the trigger for the overselling.
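
One way to handle a slow dependency is to bound the user-service call so that it can never outlive the lock. The sketch below is only an illustration under assumptions: the class name, the URL http://user-service.internal/check, and the 2 s and 1 s budgets are hypothetical and do not come from the original project; only the 10 s lock expiration is taken from the code above. It uses the JDK's java.net.http client (Java 11 or later).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class UserCheckTimeouts {
    // Lock expiration used by the flash-sale endpoint (from the article)
    static final Duration LOCK_TTL = Duration.ofSeconds(10);
    // Assumed budget for the user check; must stay well below LOCK_TTL
    static final Duration USER_CHECK_TIMEOUT = Duration.ofSeconds(2);

    /** Client with a short connect timeout (the 1 s value is an assumption). */
    static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(1))
                .build();
    }

    /** Request to a hypothetical user-service endpoint, bounded by USER_CHECK_TIMEOUT. */
    static HttpRequest userCheckRequest(long userId) {
        URI uri = URI.create("http://user-service.internal/check?userId=" + userId);
        return HttpRequest.newBuilder(uri)
                .timeout(USER_CHECK_TIMEOUT)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = userCheckRequest(42L);
        System.out.println(request.uri() + " timeout=" + request.timeout().orElseThrow());
    }
}
```

With a budget like this, a stalled user check fails with an HttpTimeoutException while the lock is still held, instead of returning after the lock has already expired.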

A distributed lock that looks safe but is not safe at all

We acquire the lock with SET key value [EX seconds] [PX milliseconds] [NX|XX], but if thread A runs for so long that the lock expires before A releases it, thread B can acquire the lock at that point.

When thread A finishes, it releases the lock, but what it actually releases is thread B's lock. Thread C can now acquire the lock, and when thread B finishes and releases the lock, it actually releases the lock set by thread C. This is the direct cause of the overselling.
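
The A, B, C sequence can be replayed with a small in-memory model. This is a sketch, not the project's code: the class ExpiringLockTimeline and its methods are hypothetical, time is advanced by hand instead of using a real clock, and the 12 s delay is an assumed value for a request stuck in the user check.

```java
final class ExpiringLockTimeline {
    private String holder;          // null means the key does not exist
    private long expiresAtMillis;   // time at which the key expires
    private long nowMillis;         // simulated clock

    /** Mimics SET key value NX with an expiration: succeeds only if the key is absent or expired. */
    boolean tryLock(String who, long ttlMillis) {
        if (holder != null && nowMillis < expiresAtMillis) {
            return false;
        }
        holder = who;
        expiresAtMillis = nowMillis + ttlMillis;
        return true;
    }

    /** Mimics the unconditional DEL in the original finally block. */
    void unlockBlindly() {
        holder = null;
    }

    /** Moves the simulated clock forward. */
    void advance(long millis) {
        nowMillis += millis;
    }

    /** Current lock holder, or null if the lock is free or expired. */
    String holder() {
        return (holder != null && nowMillis < expiresAtMillis) ? holder : null;
    }

    public static void main(String[] args) {
        ExpiringLockTimeline lock = new ExpiringLockTimeline();
        System.out.println("A locks: " + lock.tryLock("A", 10_000)); // A locks: true
        lock.advance(12_000);  // A is stuck in the user check for 12 s; its lock expires
        System.out.println("B locks: " + lock.tryLock("B", 10_000)); // B locks: true
        lock.unlockBlindly();  // A finishes and deletes the key, which is now B's lock
        System.out.println("C locks: " + lock.tryLock("C", 10_000)); // C locks: true
    }
}
```

At the end of the run, B and C both believe they hold the lock, which is exactly the state in which only the inventory check stands between the sale and overselling.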

Non-atomic inventory check

Because the inventory check is not atomic, its result is unreliable under concurrency: two requests can both read a positive stock before either one decrements it. This is the root cause of the overselling.
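
The race can be replayed deterministically. In the sketch below, an in-memory counter stands in for the Redis stock field, and the class and method names are hypothetical. It hard-codes one bad interleaving of two requests rather than relying on thread timing.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CheckThenActRace {
    /**
     * Replays one bad interleaving: requests A and B both read the stock
     * before either of them writes. Returns the number of orders created.
     */
    static int ordersWithSeparateCheck(int initialStock) {
        AtomicInteger stock = new AtomicInteger(initialStock); // stands in for the Redis field
        int orders = 0;

        int seenByA = stock.get(); // A reads the stock
        int seenByB = stock.get(); // B reads the stock before A has written

        if (seenByA > 0) {         // A's check passes
            stock.decrementAndGet();
            orders++;
        }
        if (seenByB > 0) {         // B's check passes on a stale value
            stock.decrementAndGet();
            orders++;
        }
        return orders;
    }

    public static void main(String[] args) {
        // One bottle left, yet two orders are created
        System.out.println(ordersWithSeparateCheck(1)); // prints 2
    }
}
```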

The analysis above shows that the underlying problem is that the inventory check relies entirely on the distributed lock. As long as the lock's SET and DEL behave normally, the inventory check works. Once the lock is no longer safe and reliable, the inventory check offers no protection.

The solution

Now that we know why, we can fix it.

Implementing a relatively safe distributed lock

Definition of relatively safe: every SET is paired one-to-one with its own DEL, so a request never deletes a lock that is currently held by someone else. In practice, even with SET and DEL paired one-to-one, business safety is still not fully guaranteed.

This is because the lock's expiration time is always bounded, so a holder can still overrun it. The only ways around that are to set no expiration at all or a very long one, and both cause other problems, so neither is a sensible option.

To implement a relatively safe distributed lock, we must rely on the key's value. When releasing the lock, the uniqueness of the value ensures that we never delete a lock we do not own. We implement an atomic get-and-compare with a Lua script, as follows:

public void safedUnLock(String key, String val) {
    // Delete the key only if it still holds our value; Redis runs the script atomically
    String luaScript = "local expected = ARGV[1] local curr = redis.call('get', KEYS[1]) "
            + "if expected == curr then redis.call('del', KEYS[1]) end return 'OK'";
    RedisScript<String> redisScript = RedisScript.of(luaScript);
    // execute() takes the script arguments as varargs, so val is passed directly
    redisTemplate.execute(redisScript, Collections.singletonList(key), val);
}

The Lua script performs the comparison and the delete as one atomic step, which makes the unlock safe.
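
The ownership rule can be tried out without Redis. The sketch below is an in-memory analogue, not the project's DistributedLocker: the class TokenLockSketch and its methods are hypothetical, ConcurrentHashMap.putIfAbsent plays the role of SET with NX, ConcurrentHashMap.remove(key, value) plays the role of the atomic compare-and-delete, and expire() is a test hook that stands in for the expiration firing.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class TokenLockSketch {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    /** Like SET key token NX: returns the owner token on success, null if the lock is held. */
    String tryLock(String key) {
        String token = UUID.randomUUID().toString();
        return store.putIfAbsent(key, token) == null ? token : null;
    }

    /** Like the compare-and-delete script: removes the key only if it still holds this token. */
    boolean unlock(String key, String token) {
        return token != null && store.remove(key, token);
    }

    /** Test hook standing in for the key's expiration. */
    void expire(String key) {
        store.remove(key);
    }

    public static void main(String[] args) {
        TokenLockSketch locks = new TokenLockSketch();
        String tokenA = locks.tryLock("key:1");
        locks.expire("key:1");                 // A overruns; its lock expires
        String tokenB = locks.tryLock("key:1"); // B acquires the lock
        System.out.println(locks.unlock("key:1", tokenA)); // false: A cannot delete B's lock
        System.out.println(locks.unlock("key:1", tokenB)); // true: B releases its own lock
    }
}
```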

Implementing a safe inventory check

Looking more closely at concurrency, operations such as get-then-compare or read-then-write are not atomic. To make them atomic, we could again use a Lua script.

In our case, however, each order in the sale can only buy one bottle, so we do not need a Lua script; we can rely on the atomicity of Redis's own commands. Here is why:

// The increment runs atomically inside Redis, which returns the resulting value
Long currStock = redisTemplate.opsForHash().increment("key", "stock", -1);

As you can see, the separate inventory check in the original code was entirely superfluous.
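
To see why a single atomic decrement is enough, the sketch below runs many concurrent buyers against one counter. It is an in-memory stand-in under assumptions: an AtomicLong plays the role of the Redis stock field, and the class name, the pool size of 16, and the figure of 1000 buyers are illustrative choices.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class AtomicDeductionDemo {
    /** Runs `buyers` concurrent purchase attempts and returns the number of orders created. */
    static int runSale(int initialStock, int buyers) {
        AtomicLong stock = new AtomicLong(initialStock); // stands in for the Redis stock field
        AtomicInteger orders = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < buyers; i++) {
            pool.execute(() -> {
                // Decrement first, then judge by the value the decrement returned
                if (stock.decrementAndGet() >= 0) {
                    orders.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("sale did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return orders.get();
    }

    public static void main(String[] args) {
        System.out.println(runSale(100, 1000)); // prints 100
    }
}
```

Each decrement returns a distinct value, so exactly as many buyers as there are bottles see a result of zero or more. The counter itself ends up negative, which is harmless here but worth remembering if the remaining stock is displayed anywhere.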

Improved code

After the analysis above, we decided to create a dedicated DistributedLocker class to handle distributed locking.

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    // A unique value per request identifies the owner of the lock
    String val = UUID.randomUUID().toString();
    try {
        Boolean lockFlag = distributedLocker.lock(key, val, 10, TimeUnit.SECONDS);
        if (!Boolean.TRUE.equals(lockFlag)) {
            // Business exception: failed to acquire the lock
        }

        // Verify user activity
        // Check the inventory based on the atomicity of Redis itself
        Long currStock = stringRedisTemplate.opsForHash().increment(key + ":info", "stock", -1);
        if (currStock < 0) { // The stock was already used up before this request
            // Business exception: out of stock
            log.error("[Snap order] No stock");
        } else {
            // Generate the order
            // Publish the order creation success event
            // Build the response
        }
    } finally {
        // Deletes the key only if it still holds val, so other requests' locks are untouched
        distributedLocker.safedUnLock(key, val);
        // Build the response
    }
    return response;
}

Deep thinking

Is the distributed lock still necessary?

After the improvement, you may notice that Redis's atomic deduction alone is enough to guarantee that the inventory is never oversold. That is correct.

But without this layer of locking, every request would run through the whole business logic. Because that logic depends on other systems, the pressure on them would increase, bringing more performance loss and service instability: the loss outweighs the gain. The distributed lock intercepts part of the traffic and so shields those systems to some extent.

Choosing a distributed lock

Some people suggested using RedLock to implement the distributed lock. RedLock is more reliable, but at the cost of some performance.

In this scenario, the extra reliability is worth far less than the performance it would cost. For scenarios with very high reliability requirements, RedLock is a reasonable choice.

Rethinking the distributed lock

The bug had to be fixed and released urgently, so we rolled out the optimized code as soon as it had passed load testing in our test environment. The optimization proved successful: performance improved slightly, and there was no overselling even when the distributed lock failed. But is there still room for improvement? Yes!

Since the service is deployed as a cluster, we can split the inventory evenly across the servers in the cluster and notify each server of its share with a broadcast message. The gateway layer hashes the user ID to decide which server receives each request. Inventory deduction and checking can then be done in each application's local cache, which improves performance further.
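
The two steps described above, splitting the stock and routing by user ID, can be sketched as follows. The class StockSharding, its method names, and the choice of a simple modulo hash are assumptions for illustration; the article does not say which hash the gateway uses.

```java
import java.util.Arrays;

final class StockSharding {
    /** Splits the stock as evenly as possible; the first (total % servers) servers get one extra. */
    static int[] splitStock(int totalStock, int servers) {
        int[] shares = new int[servers];
        for (int i = 0; i < servers; i++) {
            shares[i] = totalStock / servers + (i < totalStock % servers ? 1 : 0);
        }
        return shares;
    }

    /** Picks a server index from the user ID; floorMod keeps the result non-negative. */
    static int route(long userId, int servers) {
        return Math.floorMod(Long.hashCode(userId), servers);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitStock(100, 3))); // prints [34, 33, 33]
        System.out.println(route(42L, 4));                       // prints 2
    }
}
```

Because the same user ID always maps to the same server, each server only ever deducts from its own share. One consequence of this design is that a server can sell out its share while others still have stock.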

// Pre-initialized by message; ConcurrentHashMap provides efficient thread safety
private static ConcurrentHashMap<Long, Boolean> SECKILL_FLAG_MAP = new ConcurrentHashMap<>();
// Pre-set by message. AtomicInteger is itself atomic, so a plain HashMap is used here
// (assuming the map is fully populated before any request reads it)
private static Map<Long, AtomicInteger> SECKILL_STOCK_MAP = new HashMap<>();

...

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    Long seckillId = request.getSeckillId();
    if (!SECKILL_FLAG_MAP.get(seckillId)) {
        // Business exception: the sale is closed or sold out
    }
    // Verify user activity
    // Check the inventory
    if (SECKILL_STOCK_MAP.get(seckillId).decrementAndGet() < 0) {
        SECKILL_FLAG_MAP.put(seckillId, false);
        // Business exception: out of stock
    }
    // Generate the order
    // Publish the order creation success event
    // Build the response
    return response;
}

With this approach we no longer depend on Redis at all, and both performance and safety improve further. Of course, this solution does not account for complex scenarios such as machines being dynamically added to or removed from the cluster. If those must be handled, it is better to go straight to a distributed-lock solution.

Conclusion

Overselling scarce goods is a major incident. If the oversold quantity is large, it can even have a serious business and social impact on the platform.

This incident made me realize that no line of code in a project should be taken lightly; otherwise, under certain conditions, code that normally works fine can turn into a deadly killer.

As developers, we must think a design through thoroughly. And how do we learn to think things through? Only by continuing to learn!

Source | urlify.cn/MVBvmy