The whole project team's performance rating was docked over a P0 critical incident caused by a Redis distributed lock. Be careful!

Background

Flash-sale orders in our project are protected by a distributed lock. On one occasion the operations team ran a flash sale for Feitian (Flying) Moutai with a stock of 100 bottles, and it was oversold!

Consider how scarce Feitian Moutai is! The incident was classified as a P0 major incident, and we just had to accept it. The whole project team's performance rating was docked.

After the incident, the CTO named me to lead the investigation. Fine, let's go.

The scene of the accident

After looking into it, I learned that this flash-sale endpoint had never shown this problem before. So why was it oversold this time?

The reason is that the earlier products were not scarce, whereas this sale was for Feitian Moutai. Analysis of the tracking data showed that nearly every metric had roughly doubled, so you can imagine how intense the sale was.

Without further ado, here is the core code, with the confidential parts replaced by pseudo-code:

    public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
        SeckillActivityRequestVO response;
        String key = "key:" + request.getSeckillId();
        try {
            // Acquire the lock with a fixed value and a 10-second expiry
            Boolean lockFlag = redisTemplate.opsForValue().setIfAbsent(key, "val", 10, TimeUnit.SECONDS);
            if (lockFlag) {
                // user verification through the user service happens here (elided)

                // inventory check: get, then compare (not atomic)
                Object stock = redisTemplate.opsForHash().get(key + ":info", "stock");
                assert stock != null;
                if (Integer.parseInt(stock.toString()) <= 0) {
                    // business exception
                } else {
                    redisTemplate.opsForHash().increment(key + ":info", "stock", -1);
                    // generate order
                    // publish order-created event
                    // build response VO
                }
            }
        } finally {
            // release the lock (unconditionally, whoever currently holds it)
            stringRedisTemplate.delete(key);
            // build response VO
        }
        return response;
    }

In the code above, the distributed lock expires after 10 seconds, which was meant to leave the business logic enough time to run. A try-finally statement ensures the lock is released promptly.

The business code also verifies the inventory internally. It looks so safe! Don't worry, let's keep analyzing.

The cause of the accident

The sale attracted a large number of new users to download and register our app. Among them were many bonus hunters (the "wool party") who used professional tools to register new accounts in bulk, grab promotions, and place fake orders.

Of course, our user system was prepared in advance: it was connected to Alibaba Cloud human-machine verification, three-factor authentication, and our in-house risk control system, which together blocked a large number of illegitimate users.

I can't help but give a thumbs-up here. But for that very reason, the user service was under high load the whole time.

The moment the flash sale began, a flood of user verification requests hit the user service.

As a result, the user service gateway briefly responded slowly. Some requests took longer than 10 seconds to respond, while the HTTP response timeout was set to 30 seconds.

This left the endpoint blocked at the user verification step. After 10 seconds the distributed lock expired, a new request came in and was able to acquire the lock, so the lock was effectively overwritten.

When the blocked requests finished, they ran the lock release logic and released locks held by other threads, which let still more requests compete for the lock. It was a vicious cycle.

At that point we could only rely on the inventory check, but the inventory check was not atomic: it used a get-and-compare approach. And so the oversell tragedy happened.

Accident analysis

Careful analysis shows that the flash-sale endpoint carried serious risks under high concurrency, concentrated in the following three areas:

① No fault tolerance for risks in other systems

The user service was under strain and the gateway responded slowly, but we had no way to handle that. This was the trigger for the oversell.
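One possible form of such fault tolerance is to bound the downstream call with a timeout shorter than the lock's expiry, so a slow user service fails the request instead of outliving the lock. The sketch below is only an illustration in plain Java: the `BoundedCall` class, the `callWithTimeout` helper, and the fallback value are hypothetical names, not part of the original project.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper: runs a dependency call with a hard time budget.
class BoundedCall {
    /** Returns the call's result, or fallback if it fails or exceeds timeoutMs. */
    static <T> T callWithTimeout(Supplier<T> call, long timeoutMs, T fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<T> task = call::get;
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);              // stop waiting on the slow dependency
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                  // the dependency itself failed
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a 10-second lock, the budget for user verification would be chosen well below 10 seconds (the exact figure is a judgment call), and a `false` fallback would reject the request rather than let it proceed without verification.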

② The seemingly safe distributed lock was not safe at all

Although we used SET key value [EX seconds] [PX milliseconds] [NX|XX], if thread A runs for a long time and does not release in time, the lock expires and thread B can then acquire it.

When thread A finishes and releases the lock, it actually releases thread B's lock. Thread C can then acquire the lock, and when thread B finishes and releases, it actually releases the lock set by thread C. This was the direct cause of the oversell.
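The A/B/C sequence can be replayed without Redis. The sketch below uses a small in-memory stand-in for a key with a TTL and a hand-advanced clock; `ExpiringLockSim` and its methods are hypothetical and only mimic the behavior of SET with NX and an expiry, and of DEL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a Redis key with a TTL; time is advanced by hand.
class ExpiringLockSim {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> expiresAt = new HashMap<>();
    private long nowMs = 0;

    void advance(long ms) { nowMs += ms; }

    private void evictIfExpired(String key) {
        Long deadline = expiresAt.get(key);
        if (deadline != null && nowMs >= deadline) {
            values.remove(key);
            expiresAt.remove(key);
        }
    }

    /** Mimics SET key value NX PX ttlMs. */
    boolean setIfAbsent(String key, String value, long ttlMs) {
        evictIfExpired(key);
        if (values.containsKey(key)) return false;
        values.put(key, value);
        expiresAt.put(key, nowMs + ttlMs);
        return true;
    }

    /** Mimics DEL key: removes the key no matter who set it. */
    void delete(String key) {
        values.remove(key);
        expiresAt.remove(key);
    }

    /** Replays the timeline and returns the requests inside the critical section at the end. */
    static List<String> replayIncident() {
        ExpiringLockSim redis = new ExpiringLockSim();
        String key = "key:1";
        List<String> active = new ArrayList<>();
        if (redis.setIfAbsent(key, "val", 10_000)) active.add("A"); // A acquires the lock
        redis.advance(12_000);                                      // A is stuck; the lock expires
        if (redis.setIfAbsent(key, "val", 10_000)) active.add("B"); // B acquires the lock
        redis.delete(key);                                          // A finishes and deletes B's lock
        active.remove("A");
        if (redis.setIfAbsent(key, "val", 10_000)) active.add("C"); // C acquires while B is still running
        return active;
    }
}
```

In this replay, B and C end up in the critical section together even though each of them acquired the lock "successfully".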

③ Non-atomic inventory check

A non-atomic inventory check gives inaccurate results under concurrency. This was the root cause of the oversell.
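To make the failure concrete, here is a minimal sketch that writes out one unlucky interleaving of two requests by hand, with a local counter standing in for the Redis stock field. The class and method names are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: two requests run get-then-decrement with one unit left.
class CheckThenActDemo {
    /** Returns how many units were sold from a stock of 1. */
    static int interleavedGetThenDecrement() {
        AtomicInteger stock = new AtomicInteger(1);
        int seenByA = stock.get();   // request A reads stock = 1
        int seenByB = stock.get();   // request B also reads 1, before A has written
        int sold = 0;
        if (seenByA > 0) { stock.decrementAndGet(); sold++; }  // A passes the check and sells
        if (seenByB > 0) { stock.decrementAndGet(); sold++; }  // B passes on a stale read and sells too
        return sold;
    }
}
```

Both requests pass the check because each compares against a value read before the other's write, so two units are sold from a stock of one.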

From the analysis above, the root of the problem is that the inventory check depended heavily on the distributed lock. As long as the lock's SET and DEL behave normally, the inventory check works fine.

But once the distributed lock is no longer safe and reliable, the inventory check offers no protection.

The solution

Now that we know the causes, we can fix them.

Implement a relatively safe distributed lock

Definition of "relatively safe": every SET is paired with its own DEL, so a request never deletes a lock that belongs to someone else.

In practice, even with SET and DEL paired one-to-one, business safety still cannot be guaranteed.

That is because the lock's expiration time is always bounded. Not setting one, or setting a very long one, causes other problems, so that is not a sensible option.

To implement a relatively safe distributed lock, we must rely on the key's value. When releasing the lock, the uniqueness of the value ensures that a lock held by another request is not deleted.

We implement an atomic get-and-compare with a Lua script, as follows:

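    public void safedUnLock(String key, String val) {
        // Delete the key only if it still holds this caller's value; the
        // comparison and the delete run atomically inside Redis.
        // ("in" is a reserved word in Lua, so the argument is named "expected".)
        String luaScript = "local expected = ARGV[1] "
                + "local curr = redis.call('get', KEYS[1]) "
                + "if expected == curr then redis.call('del', KEYS[1]) end "
                + "return 'OK'";
        RedisScript<String> redisScript = RedisScript.of(luaScript, String.class);
        // Pass val directly as the script argument (execute takes varargs)
        redisTemplate.execute(redisScript, Collections.singletonList(key), val);
    }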

The Lua script makes the unlock safe.
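The same compare-then-delete idea can be shown without Redis. As a rough in-process analogy (not the project's DistributedLocker), `ConcurrentHashMap.remove(key, value)` removes an entry only if it still maps to the given value, which is the guarantee the Lua script provides. The class name and the `expire` helper below are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-process analogy of a lock that is released only by its owner.
class OwnerCheckedLock {
    private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

    /** Acquires the lock if nobody holds it; token identifies the owner. */
    boolean lock(String key, String token) {
        return locks.putIfAbsent(key, token) == null;
    }

    /** Releases the lock only if it is still owned by token (atomic compare-and-remove). */
    boolean unlock(String key, String token) {
        return locks.remove(key, token);
    }

    /** Stands in for the TTL firing while the owner is still busy. */
    void expire(String key) {
        locks.remove(key);
    }
}
```

For example, if request A's lock expires and request B acquires it with a different token, `unlock("key:1", "token-A")` returns false and B's lock stays in place.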

Implement a safe inventory check

Looking more closely at concurrency, operations such as get-and-compare or read-and-save are not atomic. To make them atomic, we could also use a Lua script.

In our case, however, each flash-sale order can only be for one bottle, so we do not need a Lua script and can rely on the atomicity of Redis's own operations.

As follows:

    // The hash increment is atomic in Redis and returns the value after the deduction
    Long currStock = redisTemplate.opsForHash().increment("key", "stock", -1);

As you can see, the separate get-and-compare inventory check in the original code was completely superfluous.
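The effect of deducting first and checking the returned value can be tried locally. In the sketch below an `AtomicInteger` stands in for the Redis hash field and `decrementAndGet` plays the role of the atomic increment by -1; the class name, pool size, and buyer count are arbitrary choices for the demonstration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: many buyers deduct atomically, then check the result.
class AtomicDeductDemo {
    /** Returns the number of orders created when buyers compete for initialStock units. */
    static int runSale(int initialStock, int buyers) {
        AtomicInteger stock = new AtomicInteger(initialStock);
        AtomicInteger orders = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        CountDownLatch done = new CountDownLatch(buyers);
        for (int i = 0; i < buyers; i++) {
            pool.execute(() -> {
                try {
                    // Deduct first; only a non-negative result means a unit was really available
                    if (stock.decrementAndGet() >= 0) {
                        orders.incrementAndGet();
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return orders.get();
    }
}
```

Calling `AtomicDeductDemo.runSale(100, 1000)` creates exactly 100 orders no matter how the threads interleave, because each decrement hands every buyer a distinct value. The counter itself does go negative, which is harmless here since only the returned value is checked.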

Improved code

After the analysis above, we decided to create a dedicated DistributedLocker class for handling distributed locks:

    public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
        SeckillActivityRequestVO response;
        String key = "key:" + request.getSeckillId();
        // A unique value per request identifies the lock owner
        String val = UUID.randomUUID().toString();
        try {
            Boolean lockFlag = distributedLocker.lock(key, val, 10, TimeUnit.SECONDS);
            if (!lockFlag) {
                // business exception
            }

            // user activity check

            // inventory check, guaranteed by the atomicity of Redis itself
            Long currStock = stringRedisTemplate.opsForHash().increment(key + ":info", "stock", -1);
            if (currStock < 0) {
                // business exception
                log.error("no stock");
            } else {
                // generate order
                // publish order-created event
                // build response
            }
        } finally {
            // release only the lock this request owns
            distributedLocker.safedUnLock(key, val);
        }
        return response;
    }

Deep thinking

① Is the distributed lock necessary?

After the improvement, you can see that Redis's atomic deduction alone is enough to ensure the inventory is not oversold.

That's right. But without this layer of locking, every request would run through the whole business logic, and because that logic depends on other systems, the pressure on them would increase.

That would add performance overhead and service instability, and the loss would outweigh the gain. The distributed lock intercepts part of the traffic.

② Choosing a distributed lock

Some have suggested using RedLock to implement the distributed lock. RedLock is more reliable, but at the cost of some performance.

In this scenario, that gain in reliability is worth far less than the performance it costs. RedLock can be used in scenarios with very high reliability requirements.
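For a rough sense of where the extra cost comes from: as described in the Redis documentation, RedLock tries to acquire the lock on N independent Redis nodes and treats it as held only if a majority succeed and some validity time remains after subtracting the time spent acquiring and a clock-drift allowance. The sketch below captures just that arithmetic; the class name and the sample numbers are assumptions, and it is not a RedLock client.

```java
// Hypothetical helper showing only RedLock's acquisition arithmetic.
class RedlockMath {
    /** Majority of n nodes. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    /** Time the lock can still be relied on after acquisition. */
    static long validityMs(long ttlMs, long elapsedMs, long driftMs) {
        return ttlMs - elapsedMs - driftMs;
    }

    /** The lock counts as acquired only with a majority and positive remaining validity. */
    static boolean acquired(int n, int successes, long ttlMs, long elapsedMs, long driftMs) {
        return successes >= quorum(n) && validityMs(ttlMs, elapsedMs, driftMs) > 0;
    }
}
```

With five nodes, each lock attempt therefore involves up to five round trips instead of one, which is the performance price mentioned above.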

③ Thinking again: is the distributed lock necessary?

The bug needed an urgent fix and release, so we optimized the code, stress-tested it in the test environment, and deployed it immediately.

The optimization proved successful: performance improved slightly, and no oversell occurred even when the distributed lock failed.

But is there still room for improvement? Yes! Since the service is deployed as a cluster, we can split the inventory evenly across the servers in the cluster and notify each server by broadcast.

The gateway layer hashes the user ID to decide which server each request goes to. Inventory deduction and checking can then be done against the application's local cache.
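A minimal sketch of the two pieces described here, assuming a fixed number of servers: one helper splits the total stock into per-server shares and another maps a user ID to a server index. The class and method names are hypothetical, and a real gateway may use a different hash.

```java
// Hypothetical helpers for splitting stock and routing users across a fixed cluster.
class StockSharding {
    /** Splits total into servers shares that differ by at most one unit. */
    static int[] splitStock(int total, int servers) {
        int[] shares = new int[servers];
        for (int i = 0; i < servers; i++) {
            // The first (total % servers) servers each take one extra unit
            shares[i] = total / servers + (i < total % servers ? 1 : 0);
        }
        return shares;
    }

    /** Maps a user ID to a server index in [0, servers). */
    static int route(long userId, int servers) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(Long.hashCode(userId), servers);
    }
}
```

For 100 bottles on three servers, `splitStock(100, 3)` yields shares of 34, 33, and 33, and the same user ID always lands on the same server as long as the server count does not change.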

This improves performance further:

    // Initialized in advance via message; a ConcurrentHashMap is used
    private static ConcurrentHashMap<Long, Boolean> SECKILL_FLAG_MAP = new ConcurrentHashMap<>();
    // Set in advance via message; a plain HashMap is used
    private static Map<Long, AtomicInteger> SECKILL_STOCK_MAP = new HashMap<>();

    ...

    public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
        SeckillActivityRequestVO response;
        Long seckillId = request.getSeckillId();
        if (!SECKILL_FLAG_MAP.get(seckillId)) {
            // business exception
        }
        // user activity check
        // inventory check
        if (SECKILL_STOCK_MAP.get(seckillId).decrementAndGet() < 0) {
            SECKILL_FLAG_MAP.put(seckillId, false);
            // business exception
        }
        // generate order
        // publish order-created event
        // build response
        return response;
    }

With this approach we no longer depend on Redis at all, and both performance and safety improve further!

Of course, this solution does not account for complex scenarios such as dynamically scaling the machines out and in. If those need to be considered, it is better to go straight to a distributed lock solution.

Conclusion

Overselling scarce goods is a major incident. If the oversold quantity is large, it can have a very serious business and social impact on the platform.

This incident made me realize that no line of code in a project should be taken lightly. Otherwise, in certain scenarios, code that normally works fine can become a deadly killer!

As developers, we must think a design through thoroughly. How do you learn to think a plan through? Only by continuing to learn!
