P0 major accident: overselling 100 bottles of Feitian Maotai, the whole project team was in a panic ~

Preface

Distributed locks based on Redis are nothing new these days. This article analyzes an accident caused by a Redis distributed lock in one of our real projects and walks through the fix.

Background: order placement for flash sales in our project is protected by a distributed lock. One day, operations ran a flash sale of Feitian Maotai with 100 bottles in stock, and it was oversold! Bear in mind how scarce Feitian Maotai is!! The accident was classified as a P0 major accident, and all we could do was accept it. Afterwards, the CTO named me to lead the response. OK, charge ~

The scene of the accident

After some investigation, we learned that this flash-sale interface had never oversold before. So why did it oversell this time? The earlier products were not scarce, whereas this campaign sold Feitian Maotai. Our tracking data showed that nearly every metric had roughly doubled, so you can imagine how intense the campaign was! Without further ado, here is the core code, with the confidential parts replaced by pseudo-code...

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    try {
        // Every request writes the same fixed value "val", so the lock carries no owner information
        Boolean lockFlag = redisTemplate.opsForValue().setIfAbsent(key, "val", 10, TimeUnit.SECONDS);
        if (Boolean.TRUE.equals(lockFlag)) {
            // user check (HTTP request to the user service)

            // inventory check: read, compare, then decrement in separate steps (not atomic)
            Object stock = redisTemplate.opsForHash().get(key + ":info", "stock");
            assert stock != null;
            if (Integer.parseInt(stock.toString()) <= 0) {
                // business exception
            } else {
                redisTemplate.opsForHash().increment(key + ":info", "stock", -1);
                // generate order
                // publish order-created event
                // build response VO
            }
        }
    } finally {
        // release the lock (unconditionally, whoever currently holds it)
        redisTemplate.delete(key);
        // build response VO
    }
    return response;
}

In the code above, the distributed lock expires after 10 s, which was meant to give the business logic enough time to run. A try-finally block ensures that the lock is released promptly. The business code also checks the inventory. It looks safe, doesn't it? Don't worry, let's keep analyzing...

The cause of the accident

The campaign attracted a large number of new users to download and register in our app. Many of them were bonus hunters (the so-called "wool party") who used professional tooling to mass-register accounts, collect promotions, and place fake orders. Of course, our user system was prepared: Alibaba Cloud human-machine verification, three-factor authentication, and our self-developed risk control system were all brought to bear and blocked a large number of illegitimate users. I can't help but give it a thumbs up.

But because of this, the user service was under a sustained high load. The moment the sale began, a flood of user verification requests hit the user service. The user service gateway briefly slowed down, and some requests took longer than 10 s to respond. Because our HTTP request timeout was set to 30 s, the interface stayed blocked at the user check. After 10 s the distributed lock expired, so a new request could acquire the lock, which means the lock was effectively overwritten. When the blocked requests finally finished, they ran the lock-release logic and released locks that belonged to other threads, which let still more requests acquire the lock.

It's a vicious cycle. At that point, we could only rely on the inventory check, but the inventory check was not atomic: it read the stock, compared it, and then decremented it in separate steps. So the tragedy of overselling happened ~~~
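To make the non-atomic check concrete, here is a minimal sketch that interleaves two requests by hand instead of using real threads, so the outcome is deterministic. The class and method names are hypothetical, and an in-memory AtomicInteger stands in for the Redis stock field; it illustrates the read-compare-decrement pattern described above and is not the project's code.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CheckThenActOversell {
    /** Interleaves two requests by hand: both read the stock before either one writes. */
    static int ordersForLastBottle() {
        // Assumption: an in-memory counter stands in for the Redis "stock" field; one bottle left
        AtomicInteger stock = new AtomicInteger(1);
        int orders = 0;

        int seenByA = stock.get(); // request A reads 1
        int seenByB = stock.get(); // request B also reads 1, before A has decremented

        if (seenByA > 0) {         // A's check passes
            stock.decrementAndGet();
            orders++;
        }
        if (seenByB > 0) {         // B's check passes on a stale value
            stock.decrementAndGet();
            orders++;
        }
        return orders;             // two orders were created for one bottle
    }

    public static void main(String[] args) {
        System.out.println("orders created for the last bottle: " + ordersForLastBottle());
    }
}
```

Each individual operation here is atomic, yet the sequence as a whole is not, which is exactly why the check only worked while the lock serialized the requests.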

Accident analysis

After careful analysis, we found that the flash-sale interface has serious safety hazards under high concurrency, mainly in the following three areas:

No fault tolerance for risks in other systems: the user service was strained and the gateway responded slowly, but we had no handling for that. This was the trigger of the overselling.
The seemingly safe distributed lock was not actually safe: we used set key value [EX seconds] [PX milliseconds] [NX|XX], but if thread A runs so long that the lock expires before A releases it, thread B can acquire the lock. When thread A finishes and releases the lock, it actually releases thread B's lock. Thread C can then acquire the lock, and when thread B finishes and releases the lock, it actually releases the lock set by thread C. This was the direct cause of the overselling.
Non-atomic inventory check: a non-atomic inventory check produces inaccurate results under concurrency. This was the root cause of the overselling.

From the analysis above, the underlying problem is that the inventory check depends heavily on the distributed lock. As long as the lock's SET and DEL behave normally, the inventory check is fine. But once the distributed lock is no longer safe and reliable, the inventory check is useless.
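The thread A/B/C sequence can be replayed with a small in-memory model. This is only a sketch under stated assumptions: the class NaiveLockTimeline, its manual clock, and the specific timestamps (12, 15, 16 seconds) are hypothetical and chosen for illustration; only the 10 s TTL and the unconditional delete come from the code shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a Redis key with a TTL; time is a manual counter so the run is deterministic. */
class NaiveLockTimeline {
    private final Map<String, Long> expiresAt = new HashMap<>(); // key -> expiry time in seconds
    long now = 0; // hypothetical logical clock, in seconds

    /** Mimics SET key val NX EX ttl: succeeds only if the key is absent or already expired. */
    boolean tryLock(String key, long ttlSeconds) {
        Long expiry = expiresAt.get(key);
        if (expiry != null && expiry > now) {
            return false;
        }
        expiresAt.put(key, now + ttlSeconds);
        return true;
    }

    /** Mimics an unconditional DEL: removes the key no matter who set it. */
    void unlock(String key) {
        expiresAt.remove(key);
    }

    /** Replays the incident and returns how many requests are inside the critical section at the end. */
    static int replay() {
        NaiveLockTimeline redis = new NaiveLockTimeline();
        int holders = 0;

        redis.now = 0;
        if (redis.tryLock("key:1", 10)) holders++; // A acquires, then stalls in the user check
        redis.now = 12;
        if (redis.tryLock("key:1", 10)) holders++; // the lock expired at t=10, so B acquires
        redis.now = 15;
        redis.unlock("key:1");                     // A finishes and deletes B's lock
        holders--;                                 // A leaves the critical section
        redis.now = 16;
        if (redis.tryLock("key:1", 10)) holders++; // C acquires while B is still working
        return holders;                            // B and C are inside at the same time
    }

    public static void main(String[] args) {
        System.out.println("requests inside the critical section: " + replay());
    }
}
```

At the end of the replay two requests believe they hold the lock, so mutual exclusion is gone and only the inventory check stands between the requests and the stock.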

The solution

Now that we know why, we can fix it.

Implement relatively secure distributed locks

Definition of "relatively safe": every SET is paired with exactly one DEL, and a DEL never removes a lock set by someone else. In practice, even when SET and DEL are paired one to one, business safety still cannot be fully guaranteed, because the lock's expiration time is always bounded. Not setting an expiration, or setting a very long one, causes other problems, so that is not a sensible option either. To implement a relatively safe distributed lock, we must rely on the key's value: when releasing the lock, the uniqueness of the value ensures that we do not delete another holder's lock. We implement an atomic get-and-compare with a Lua script, as follows:

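public void safedUnLock(String key, String val) {
    // Delete the key only if it still holds our value; GET and DEL run atomically inside one script.
    // "in" is a reserved word in Lua, so the local variable is named "expected".
    String luaScript = "local expected = ARGV[1] "
            + "local curr = redis.call('get', KEYS[1]) "
            + "if expected == curr then redis.call('del', KEYS[1]) end "
            + "return 'OK'";
    RedisScript<String> redisScript = RedisScript.of(luaScript, String.class);
    redisTemplate.execute(redisScript, Collections.singletonList(key), val);
}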

With this Lua script, the lock is released safely.
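The effect of the owner check can be seen without Redis in a small in-memory analogy. This is a sketch, not the project's DistributedLocker: the class OwnerCheckedLock and the token strings are hypothetical, and ConcurrentHashMap.remove(key, value) plays the role of the Lua script's compare-and-delete.

```java
import java.util.concurrent.ConcurrentHashMap;

class OwnerCheckedLock {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    /** Like SET key token NX: succeeds only if nobody holds the key. */
    boolean lock(String key, String token) {
        return store.putIfAbsent(key, token) == null;
    }

    /** Simulates the TTL elapsing on the server. */
    void expire(String key) {
        store.remove(key);
    }

    /** Removes the key only if it still maps to this token; compare and delete happen atomically. */
    boolean safeUnlock(String key, String token) {
        return store.remove(key, token);
    }

    /** Replays the A/B/C timeline; returns true if C was kept out while B held the lock. */
    static boolean replay() {
        OwnerCheckedLock locks = new OwnerCheckedLock();
        boolean aLocked = locks.lock("key:1", "token-A");
        locks.expire("key:1");                                    // A overruns its TTL
        boolean bLocked = locks.lock("key:1", "token-B");
        boolean aReleased = locks.safeUnlock("key:1", "token-A"); // stale token: nothing is deleted
        boolean cLocked = locks.lock("key:1", "token-C");         // B still holds the key
        boolean bReleased = locks.safeUnlock("key:1", "token-B");
        return aLocked && bLocked && !aReleased && !cLocked && bReleased;
    }

    public static void main(String[] args) {
        System.out.println("C kept out while B held the lock: " + replay());
    }
}
```

Note that A and B still overlap after the expiry; the owner check only stops the chain reaction in which each late release frees someone else's lock. That is why the lock is called relatively safe rather than safe.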

Implement safe inventory check

A closer look at concurrency shows that operations like get-and-compare or read-and-save are not atomic. To make them atomic, we could again use a Lua script. In our case, however, each order in the campaign can only buy one bottle, so we do not need Lua and can rely on the atomicity of Redis's own commands instead. The reason is this:

// Redis returns the value after the increment, and the operation is atomic
Long currStock = redisTemplate.opsForHash().increment("key", "stock", -1);

See? The separate inventory check in the code was entirely superfluous.
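As a rough illustration of why a single atomic decrement is enough, the sketch below lets many concurrent buyers compete for a fixed stock. It makes some assumptions: an in-memory AtomicLong stands in for the Redis hash field, and the class name, thread pool size, and buyer count are all hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class AtomicDeductionDemo {
    /** Runs `buyers` concurrent purchase attempts against `initialStock` and returns the orders created. */
    static int sell(int initialStock, int buyers) {
        AtomicLong stock = new AtomicLong(initialStock); // stands in for the Redis "stock" field
        AtomicInteger orders = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < buyers; i++) {
            pool.execute(() -> {
                long remaining = stock.decrementAndGet(); // decrement and read the result in one atomic step
                if (remaining >= 0) {
                    orders.incrementAndGet();             // only a non-negative result creates an order
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("buyers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return orders.get();
    }

    public static void main(String[] args) {
        System.out.println("orders created: " + sell(100, 1000));
    }
}
```

Because each buyer decides based on the value returned by its own decrement, exactly as many orders are created as there were bottles. The counter does go negative for the losing buyers, which is harmless here but worth knowing if the stock value is displayed anywhere.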

Improved code

After the above analysis, we decided to create a new DistributedLocker class specifically to handle distributed locks.

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    // A unique value per request identifies the lock owner
    String val = UUID.randomUUID().toString();
    try {
        Boolean lockFlag = distributedLocker.lock(key, val, 10, TimeUnit.SECONDS);
        if (!Boolean.TRUE.equals(lockFlag)) {
            // business exception
        }

        // user activity check
        // inventory check, guaranteed by the atomicity of Redis itself
        Long currStock = stringRedisTemplate.opsForHash().increment(key + ":info", "stock", -1);
        if (currStock < 0) { // the stock was already sold out before this request
            // business exception
            log.error("no stock");
        } else {
            // generate order
            // publish order-created event
            // build response
        }
    } finally {
        distributedLocker.safedUnLock(key, val);
    }
    return response;
}

Deep thinking

Distributed locks are necessary

After the improvement, you may notice that Redis's atomic deduction alone is enough to guarantee that inventory is never oversold. That's right. But without the lock layer, every request runs through the full business logic, and because that logic depends on other systems, the pressure on those systems grows. The resulting performance loss and service instability are not worth it. A distributed lock intercepts part of the traffic before it gets that far.

Selection of distributed lock

Some have proposed RedLock for implementing distributed locks. RedLock is more reliable, but at the cost of some performance. In this scenario, the gain in reliability is not worth the loss in performance. For scenarios with very high reliability requirements, RedLock can be used.

Think again about distributed locks

The bug needed an urgent fix, so we made the optimization, stress-tested it in our test environment, and released it immediately. The optimization proved successful: performance improved slightly, and there was no overselling even when the distributed lock failed. But is there still room for improvement? Yes! Since the service is deployed as a cluster, we can split the inventory evenly across the servers in the cluster and notify each server by broadcast. The gateway layer hashes the user ID to decide which server receives the request. Inventory deduction and checking can then be done in the application's local cache. Performance improves even further!

// Pre-initialized via broadcast message
private static ConcurrentHashMap<Long, Boolean> SECKILL_FLAG_MAP = new ConcurrentHashMap<>();
// Pre-set via broadcast message
private static Map<Long, AtomicInteger> SECKILL_STOCK_MAP = new HashMap<>();
...
public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    Long seckillId = request.getSeckillId();
    if (!SECKILL_FLAG_MAP.get(seckillId)) {
        // business exception
    }
    // user activity check
    // inventory check
    if (SECKILL_STOCK_MAP.get(seckillId).decrementAndGet() < 0) {
        SECKILL_FLAG_MAP.put(seckillId, false);
        // business exception
    }
    // generate order
    // publish order-created event
    // build response
    return response;
}

With this approach, we don't need to rely on Redis at all, and both performance and safety improve further! Of course, this solution does not consider complex scenarios such as dynamically scaling machines out or in. If those must be considered, it is better to go straight to a distributed lock solution.
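The two mechanical pieces of this idea, splitting the stock and routing users, might look roughly like the sketch below. The article does not show how the inventory is split or how the gateway hashes the user ID, so the class LocalStockPartition, the remainder rule, and the hash choice are all assumptions made for illustration.

```java
import java.util.Arrays;

class LocalStockPartition {
    /** Splits totalStock as evenly as possible; the first (totalStock % servers) servers get one extra unit. */
    static int[] split(int totalStock, int servers) {
        int[] shares = new int[servers];
        for (int i = 0; i < servers; i++) {
            shares[i] = totalStock / servers + (i < totalStock % servers ? 1 : 0);
        }
        return shares;
    }

    /** Picks a server index for a user; floorMod keeps the index non-negative even for negative hashes. */
    static int route(long userId, int servers) {
        return Math.floorMod(Long.hashCode(userId), servers);
    }

    public static void main(String[] args) {
        System.out.println("shares: " + Arrays.toString(split(100, 3)));
        System.out.println("user 42 goes to server " + route(42L, 3));
    }
}
```

A fixed split like this also shows the trade-off mentioned above: if one server's users are less active, its share can stay unsold while another server has already sold out, and changing the number of servers mid-sale would require redistributing what is left.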

Conclusion

Overselling scarce goods is definitely a major accident. If the oversold quantity is large, it can even bring very serious business and social consequences for the platform. This accident made me realize that no line of code in a project should be taken lightly; otherwise, in some scenarios, code that normally works fine becomes a deadly killer! As a developer, you must think a design through thoroughly. How do you learn to think it through? Only by continuing to learn!

Source: juejin.cn/post/6854573212831842311
