Preface

Recently, on a whim, I thought back to an offline seckill (flash sale) event our company held a while ago that did not go well, and decided to study how to optimize a seckill system. There were only 200-odd members at that event, yet because of our lack of experience the app pages went blank and froze during the seckill for a variety of reasons. The business department practically wanted to throw their phones in our developers' faces… As a recent graduate and new employee, I didn't dare speak up. Now that the business is gradually expanding, it is time to redesign the seckill system……

Problem analysis

Experienced readers may assume the company's technology must be weak if a mere 200+ members could white-screen and freeze the system. In fact the architecture is quite good: a senior engineer built out a full set of Spring Cloud components, all on fairly recent versions. The seckill failed simply because we had no prior experience with this kind of event: a lot of the code was not written with it in mind, and it hit the database far too many times. Although there were only 200 members, each of them crossed several microservices on the way from opening the seckill page, to clicking a seckill product, to placing the order, so the number of database accesses was far more than 200.

Code analysis showed that our company's seckill actually followed the same flow as an ordinary order, just with an extra seckill ID field. The ordering flow runs all kinds of work inside one transaction: validation, cross-service calls, locking inventory, inserting order rows, logistics and delivery records…… and so on. That many cross-service calls and database round trips obviously cannot handle the instantaneous traffic of a seckill. The rough flow looks like the picture below.

The above is the rough ordering flow; in reality it is even more complicated, since inside each microservice other services are called and the database is accessed… Each user occupies at least one thread, so when users flood in, threads pile up while server resources stay fixed. Once resources run out, later requests queue up waiting for the server to free resources, and users perceive this as the app freezing. And once it feels frozen, users are very likely to go back and refresh the page, or click "submit order" again, which sends the same requests all over again, hits the database again, and makes everything freeze harder. Every request that touches the database takes longer to process, so the server falls further behind, the front end stops responding, and the user's screen locks up.

Another key problem was that the seckill product query also went to the database, and because of a peculiarity of our business, members in different regions see different products, which made the seckill product query even slower and system throughput lower, eventually hanging users just like the ordering flow above. The bottom line: raise system throughput and get the server to release resources as quickly as possible.

In view of the above: for a seckill system facing huge instantaneous traffic like this, the core problems in the system architecture are:

  • The single-responsibility principle is not followed: the seckill function lives inside the order service, so if the seckill load is too high, the normal ordering business is affected as well
  • Seckill orders follow the same flow as ordinary orders: the full traffic surge rushes straight at inventory locking, so many invalid requests tie up resources
  • Seckill order links are not protected, giving professional scalping teams an opening to exploit
  • A large number of operations hit the database directly, causing frequent disk IO, low system throughput, and possibly even database failure

These are the core problems; solve them and you basically have a decent seckill system. What follows is a point-by-point analysis and solution:

Single responsibility of services

Everyone knows what a seckill means: instantaneous traffic. If the seckill function is bundled into the order service and the seckill eats too many resources, or outright crashes the service, normal ordering goes down with it. So the seckill must be deployed as its own microservice.

Handling heavy traffic

The huge traffic is not just ordinary requests from ordinary users; it also includes frequent clicking, malicious users, malicious attacks, and so on. Handled badly, there is a good chance requests never even reach inventory deduction because the microservice cluster cannot cope. Traffic of this scale must be rate-limited. Common rate-limiting layers are as follows:

Front-end rate limiting

We want to shave off as much unnecessary traffic as possible before it ever leaves the browser. The buy button is greyed out and unclickable before the event starts; after it starts, the frequency with which users can click it is throttled. This removes many unnecessary requests from normal users.

Nginx rate limiting

Front-end limiting only stops normal users; some malicious users know enough development to grab the URL from the page and replay requests directly. Here we can use Nginx to limit the number of requests per second from a single IP:

limit_req_zone $binary_remote_addr zone=one:10m rate=20r/s;   # at most 20 requests per second from one client IP

(The zone only takes effect once a limit_req directive references it, e.g. limit_req zone=one burst=5; inside the server or location block that serves the seckill pages; the burst value is tunable.)

Gateway rate limiting

Estimate the maximum number of requests the existing gateway and seckill clusters can support, and limit traffic to that. Spring Cloud Gateway can rate-limit by IP, by user, or by interface; here we limit by the seckill interface. Common gateway rate-limiting algorithms are the leaky bucket and the token bucket. We choose the token bucket, because it tolerates short bursts of instantaneous traffic. Gateway ships a built-in filter factory for this, RequestRateLimiterGatewayFilterFactory, which depends on Redis:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis-reactive</artifactId>
</dependency>

Configuration:

server:
  port: 40000
spring:
  cloud:
    gateway:
      routes:
        - id: sec_kill_route
          uri: lb://hosjoy-b2b-seckill
          predicates:
            - Path=/seckill/**
          filters:
            - name: RequestRateLimiter
              args:
                key-resolver: '#{@apiKeyResolver}'        # rate-limiting key bean fetched from the Spring container
                redis-rate-limiter.replenishRate: 1000    # tokens added per second (placeholder; the original value was lost in formatting)
                redis-rate-limiter.burstCapacity: 3000    # total token capacity (max requests allowed in one second)
  application:
    name: hosjoy-b2b-gateway
  redis:
    host: localhost
    port: 6379
    database: 0

Code:

@Bean
KeyResolver apiKeyResolver() {
    // rate-limit key: the request path, i.e. limit per interface
    return exchange -> Mono.just(exchange.getRequest().getPath().value());
}
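For intuition, the token bucket the gateway filter implements can be sketched in a few lines (replenishRate corresponds to tokens added per second, burstCapacity to the bucket size). This is an illustration of the algorithm only, not the filter's actual code:

```java
/** Minimal token bucket: refills `rate` tokens per second, up to `capacity`. */
class TokenBucket {
    private final long capacity;     // like burstCapacity
    private final double rate;       // like replenishRate (tokens per second)
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double rate) {
        this.capacity = capacity;
        this.rate = rate;
        this.tokens = capacity;      // bucket starts full, allowing an initial burst
        this.lastRefillNanos = System.nanoTime();
    }

    /** Take one token if available; otherwise reject the request. */
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // refill proportionally to elapsed time, capped at capacity
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * rate);
        lastRefillNanos = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

A burst up to `capacity` passes immediately; sustained traffic is capped at `rate` requests per second, which is exactly why the token bucket suits seckill spikes better than a strict leaky bucket.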

If the company already uses Alibaba's Sentinel component, Sentinel can also do rate limiting at the gateway layer; it is powerful and makes monitoring, circuit breaking, and degradation convenient. Very handy!

Cluster instance expansion

There is no traffic problem that more service instances cannot solve; if there is, add more instances…… Scaling out the number of instances to match the heavy traffic achieves, in a roundabout way, the same effect as rate limiting.

Responding to malicious requests

You have probably heard of the "professional teams" that make a living scalping Moutai and other hot products on other people's behalf. Such a team not only controls many IPs, it may control many accounts! They buy normal users' accounts cheaply through various channels precisely to slip past risk-control systems.

Against such a "professional team", rate limiting alone cannot fully solve the problem. Think about it: it is entirely possible that these malicious requests get through and grab the goods while ordinary users get nothing, which is a serious problem. Here we can take two measures:

Seckill link encryption

In the seckill controller's parameters we normally accept the event ID and the product ID. We can add one more parameter: a password bound to the product. Processing continues only if the password matches; otherwise the recorded count of password failures is incremented. The password is generated randomly when the seckill product is put on the shelf, so not even the developers of the feature know it!

@PostMapping("/sec-kill")
public void secKill(@RequestParam("secId") Long secId,
                    @RequestParam("productId") Long productId,
                    @RequestParam("password") String password) {
    SecProductResponse secProduct =
            (SecProductResponse) redisTemplate.opsForValue().get("secId:productId");
    if (secProduct.getPassword().equals(password)) {
        //...
    } else {
        // one more failed password attempt for this user
        stringRedisTemplate.opsForValue().increment("black:secId:productId:userId");
    }
}
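The per-product password can be generated at shelf time. A minimal sketch — the class name, alphabet, and length are illustrative assumptions, not from the original code:

```java
import java.security.SecureRandom;

class SeckillPassword {
    // alphabet omits easily-confused characters (0/O, 1/l/I); purely an illustrative choice
    private static final String ALPHABET =
            "ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnpqrstuvwxyz23456789";
    private static final SecureRandom RND = new SecureRandom();

    /** Generate the random per-product password when the product is put on the shelf. */
    static String generate(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RND.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

Because SecureRandom is used and the value is written straight into Redis alongside the product, nobody needs to see it before the event starts.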

Blacklist filtering

Check at the gateway layer whether the user is on the blacklist; if so, end the request there.

String s = stringRedisTemplate.opsForValue().get("secId:black:userId");
if (s != null && Integer.parseInt(s) >= maxCount) {
    return;
}

Prevent repeat purchases

Seckill events generally limit each user to one purchase; once a user has bought, they must not buy again. The front end enforces this, but the back end must enforce it too, to stop the professionals. We can do it with Redis:

Boolean flag = stringRedisTemplate.opsForValue()
        .setIfAbsent("secId:productId:userId", "1", time, TimeUnit.SECONDS);
if (Boolean.TRUE.equals(flag)) {
    //...
}
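setIfAbsent works as the repeat-purchase guard because it is an atomic "first writer wins" operation. A local sketch of the same semantics, with a ConcurrentHashMap standing in for Redis (all names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

class RepeatBuyGuard {
    // stands in for Redis: putIfAbsent has the same atomic semantics as setIfAbsent
    private static final ConcurrentHashMap<String, String> REDIS = new ConcurrentHashMap<>();

    /** Returns true only for the first purchase attempt with this (event, product, user) key. */
    static boolean firstPurchase(long secId, long productId, long userId) {
        String key = secId + ":" + productId + ":" + userId;
        return REDIS.putIfAbsent(key, "1") == null;
    }
}
```

Even if the same user fires two requests in the same millisecond, exactly one of the two putIfAbsent calls wins, so a "check then mark" race is impossible.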

Oversell control and inventory deduction

Overselling seckill goods is a disaster, because seckill goods are attractive by design: they are sold at a rock-bottom seckill price to draw traffic, sometimes at a loss. Oversell, and the company or the merchant loses money. To control overselling we can lock around inventory deduction, but it must be a distributed lock; a local lock cannot coordinate a cluster (for the details, see Redis distributed lock && Redisson). A distributed lock does prevent concurrent oversell during deduction, but it lowers system throughput. Our old flow deducted inventory across services, under distributed locks, against the database, which is hopeless for seckill traffic volumes; if the seckill interface is reworked to avoid cross-service calls and database access, a distributed lock could also solve the problem. Still, a lock is a lock, and it consumes resources.

Is there a solution without distributed locking? Look at the problem from another angle: seckill stock is deliberately limited, far smaller than the normal sellable stock. We can preload the seckill stock count into Redis and pre-deduct stock there; once the stock is exhausted, every further request is simply told the seckill is over and the goods are gone. That way we let through exactly as many requests as there are units of stock, and all the remaining invalid requests are turned away. This not only prevents overselling but also limits traffic, so throughput is far higher than the original rush-in-and-queue deduction model. We can implement it with a Redis distributed semaphore, using Redisson:

RSemaphore semaphore = redissonClient.getSemaphore("secId:productId");
boolean success = false;
try {
    // wait at most 50 ms for one permit (one unit of stock)
    success = semaphore.tryAcquire(1, 50, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
    log.error("acquiring stock permit interrupted", e);
    return;
}
if (success) {
    // generate the order number
    // send a message to MQ
}

In fact the distributed semaphore is itself a kind of distributed lock, but its performance is extremely high: acquiring a permit takes roughly 0-1 ms and barely affects system throughput.
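To see why permit-based pre-deduction cannot oversell, here is a single-process sketch with java.util.concurrent.Semaphore standing in for Redisson's RSemaphore (same tryAcquire contract): 1,000 buyers compete for 400 units and exactly 400 succeed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class NoOversellDemo {
    /** `buyers` concurrent requests each try to pre-deduct one unit; returns units actually sold. */
    static int rush(int stock, int buyers) throws InterruptedException {
        Semaphore permits = new Semaphore(stock);     // stands in for the Redis semaphore
        AtomicInteger sold = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(64);
        CountDownLatch done = new CountDownLatch(buyers);
        for (int i = 0; i < buyers; i++) {
            pool.execute(() -> {
                // pre-deduct: succeed only while permits remain, never below zero
                if (permits.tryAcquire()) sold.incrementAndGet();
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return sold.get();
    }
}
```

However many threads race, the permit count can never go negative, so sales can never exceed stock, and the 600 losers are rejected in microseconds without ever touching the order service.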

Traffic peak shaving

With the layers above in place, the number of requests that reach order creation equals the seckill stock. Suppose 1 million people fight over 400 bottles of Moutai: 400 requests go on to create orders. If the seckill interface created those 400 orders directly, each with its chain of business processing and concurrent database access, we would in effect be back in the original mode: the seckill interface touching the database, very low throughput, and possibly a downed database. The seckill interface should only ever touch Redis. So why use a message queue here? MQ shaves the peak: order creation is consumed smoothly at the order service's own pace, spreading the spike over time. Note that message queues can lose messages under heavy concurrency; see the solutions for RabbitMQ reliability, duplicate consumption, ordering, and message backlog.
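The shape of MQ peak shaving can be sketched locally with a BlockingQueue standing in for RabbitMQ: the spike lands in the queue instantly, while a single consumer creates orders serially at its own pace (all names here are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

class PeakShavingDemo {
    /** Burst `orders` messages into the queue at once; one consumer drains them steadily. */
    static int run(int orders) throws Exception {
        BlockingQueue<String> mq = new LinkedBlockingQueue<>();  // stands in for RabbitMQ
        ExecutorService consumer = Executors.newSingleThreadExecutor();
        Future<Integer> created = consumer.submit(() -> {
            int n = 0;
            while (!mq.take().equals("EOF")) {
                n++;  // createOrder(msg): serial, steady load on the database
            }
            return n;
        });
        for (int i = 0; i < orders; i++) {
            mq.put("order-" + i);  // the instantaneous seckill spike lands here, not on the DB
        }
        mq.put("EOF");
        int n = created.get();
        consumer.shutdown();
        return n;
    }
}
```

The database only ever sees one order creation at a time; the queue absorbs the burst, which is exactly what RabbitMQ does for the order service here.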

Database read/write splitting and sharding

Generally the measures above already give a decent seckill system. But if the company's data volume is large and the business very complex, even MQ-throttled asynchronous database access may not be enough. Then introduce read/write separation, putting reads and writes on separate databases to relieve database pressure effectively, and shard the database into multiple tables and databases to raise single-table concurrency and disk IO read/write performance.

With the problems above solved, the seckill flow is basically settled. The pseudocode above is very simple, and a real implementation is not complex either; the point is a sound design: block and filter the requests that should be blocked and filtered, and never touch the database where you don't have to. Let's walk through these steps in detail with the flow charts below.

Putting products on the shelf / inventory rollback

Putting products on the shelf is actually very simple: we just write the information we need into Redis. But businesses differ between companies. Ours, for example, is a B → B → C model, and seckill products and events are regional, meaning an event may only run in certain areas. So we must store a copy of the seckill event information for every permitted region, roughly as follows:

// event information
redisTemplate.opsForValue().set("province:cityId:secId", "data");
// product information
redisTemplate.opsForValue().set("secId:productId", "data");
// inventory information
redisTemplate.opsForValue().set("stock:secId:productId:password", "data");

That's a lot of Redis keys, you might say…… Indeed, with more events and more regions, many keys are needed. Let's do the math. China had roughly 293 cities as of 2016; call it 300. Suppose there are 30 seckill events over three days, each with 10 products. Then the keys required are: city-event keys + event-product keys + event-inventory keys = 300 × 30 + 30 × 10 + 30 × 10 = 9,600.

Scanning the three days before and after and multiplying by three, that is still under 30,000 keys. Does that count as a lot? Here is what Redis itself says about key capacity:

Redis can handle up to 2^32 keys, and was tested in practice to handle at least 250 million keys per instance. Every hash, list, set, and sorted set, can hold 2^32 elements. In other words your limit is likely the available memory in your system.

The official line: Redis can theoretically store 2^32 keys, and in practice a single instance has been tested to at least 250 million keys. The last sentence says it all: your real limit is the memory available on your system…… and that is just one Redis instance. So don't underestimate Redis; the official site claims extremely high performance, roughly 110,000 reads and 81,000 writes per second. If an Internet company today barely uses a cache middleware as capable as Redis, its user base is probably small. One caveat: once a business leans heavily on Redis as cache middleware, at least these things must be guarded against: cache avalanche, cache penetration, cache breakdown, and data consistency.

Because a seckill can end without selling out (a bit embarrassing, admittedly……), we must also consider returning the unsold inventory to the stock table after each event ends.

As shown in the figure above, a scheduled task periodically scans for events starting in the next three days and puts them on the shelf, taking care not to shelve the same event twice. Shelving mainly means saving the information above into Redis and sending a delayed message timed for each event's end. The consumer of that message checks the semaphore: if it is not zero, the event did not sell out and the remaining stock must be returned, after which the semaphore key in Redis is deleted.

Seckill product query

Seckill queries are frequent and heavy, so never query product information from the database: all query operations go to Redis. Be careful not to return the product's password field before the event starts.

One thing to note here: because the buy button is greyed out before the event begins, the client must request the product password from the server one second before the seckill starts. Say 100,000 people are waiting to pounce: that is 100,000 requests to the server. That is actually fine — if 100,000 people are going to buy, the seckill interface itself will receive 100,000 requests anyway, so the key issue is not the request count but staggering the peak. The front end must not have 100,000 clients fire in the very same millisecond. If the event starts at 2021-05-01 00:00:00, clients can fetch the password at 23:59:58 or 23:59:59 the day before, staggered at millisecond granularity: 1 s = 1000 ms, so the front end can spread the 100,000 requests across a 1000-2000 ms window. They then no longer land in the same millisecond, the server pressure is far smaller, and since the server answers from Redis the response time should be about 10-20 ms. Once the client has the password, it checks whether the seckill start time has arrived: if so it enables the button immediately; if not, it enables it when the time comes.
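The effect of staggering can be quantified with a quick simulation (illustrative, not from the original): spreading 100,000 password requests uniformly over a 1,000 ms window cuts the worst per-millisecond load from 100,000 down to a little over the 100-per-ms average.

```java
import java.util.Random;

class JitterDemo {
    /** Spread `clients` requests uniformly over `windowMs` milliseconds;
     *  returns the busiest single millisecond's request count. */
    static int maxPerMillisecond(int clients, int windowMs, long seed) {
        int[] perMs = new int[windowMs];
        Random rnd = new Random(seed);
        for (int i = 0; i < clients; i++) {
            perMs[rnd.nextInt(windowMs)]++;  // each client picks a random send offset in the window
        }
        int max = 0;
        for (int c : perMs) max = Math.max(max, c);
        return max;
    }
}
```

A window of 1 ms means everyone fires together (worst case 100,000); a 1,000 ms window brings the worst millisecond down near the 100-request average, which Redis answers comfortably.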

The seckill flow

Below is a detailed diagram of the seckill flow, walking through the problems to consider and the solution at each step, in order.

Pseudo-code for seckill process:

@Autowired
private RedisTemplate<String, Object> redisTemplate;

/**
 * Seckill flow
 */
@PostMapping("/sec-kill")
public void secKill(@RequestParam("secId") Long secId,
                    @RequestParam("productId") Long productId,
                    @RequestParam("password") String password) {
    SecResponse sec = (SecResponse) redisTemplate.opsForValue().get("secId");
    LocalDateTime now = LocalDateTime.now();
    // the event must currently be in progress
    if (now.isAfter(sec.getStartTime()) && now.isBefore(sec.getEndTime())) {
        SecProductResponse secProduct =
                (SecProductResponse) redisTemplate.opsForValue().get("secId:productId");
        // verify the product password
        if (secProduct.getPassword().equals(password)) {
            // the "already bought" flag outlives the event by a random margin
            Duration duration = Duration.between(sec.getStartTime(), sec.getEndTime());
            int random = (int) (Math.random() * 100);
            long period = duration.getSeconds() + random;
            // atomically check and mark "already bought"
            Boolean flag = stringRedisTemplate.opsForValue()
                    .setIfAbsent("secId:productId:userId", "1", period, TimeUnit.SECONDS);
            if (flag != null && flag) {
                RSemaphore semaphore = redissonClient.getSemaphore("secId:productId:password");
                try {
                    // pre-deduct stock: wait at most 50 ms for `num` permits (num = quantity wanted)
                    boolean acquire = semaphore.tryAcquire(num, 50, TimeUnit.MILLISECONDS);
                    if (acquire) {
                        String orderNo = generateOrderNo(); // generate the order number
                        // send the order message to MQ
                        rabbitTemplate.convertAndSend("hosjoy-b2b-secKill", "routingKey", "data");
                    } else {
                        // no stock acquired: remove the "already bought" flag
                        // (leaving it would also be a problem)
                        stringRedisTemplate.delete("secId:productId:userId");
                    }
                } catch (InterruptedException e) {
                    // ignored in this sketch
                }
            }
        } else {
            // wrong password: count one more failure toward the blacklist
            stringRedisTemplate.opsForValue().increment("black:secId:productId:userId");
        }
    }
}

That is the general seckill flow, and the design I thought was pretty good. After finishing it I showed it to a colleague (ahem — okay, I wanted to show off a bit, shh!). It turned out my design differed from his in two main ways:

  • The Redis data structure used to hold the inventory
  • When a seckill is considered successful

He uses a Redis List to store inventory: with 100 units, leftPush 100 product IDs, then deduct stock by popping. I compared the distributed semaphore with the List: both work and both are convenient to use. There is also INCR/DECR, but by default these only suit buying a single unit per seckill. If the business allows buying several units, List and DECR both need a distributed lock on top, which lowers system throughput: a List can only pop one element at a time, and DECR can deduct several at once but can go negative. Suppose user A wants 5 units while only 4 remain, and user B wants 2: in theory A should fail and B should succeed. Without a distributed lock, A's DECR drives the stock to -1, and B fails too. That's a real problem. So…… the distributed semaphore wins!
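The failure mode described above is easy to reproduce locally, with AtomicInteger standing in for DECR and java.util.concurrent.Semaphore standing in for the distributed semaphore:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class StockStructureDemo {
    /** DECR-style deduction: an unguarded subtraction can go negative (oversold bookkeeping). */
    static int decrStyle(int stock, int want) {
        AtomicInteger s = new AtomicInteger(stock);
        return s.addAndGet(-want);   // e.g. 4 - 5 = -1: A "succeeds" into negative stock
    }

    /** Semaphore-style deduction on shared stock: each request is all-or-nothing, never negative. */
    static boolean[] semaphoreStyle(int stock, int wantA, int wantB) {
        Semaphore s = new Semaphore(stock);
        // each tryAcquire either deducts the whole requested quantity atomically or deducts nothing
        return new boolean[] { s.tryAcquire(wantA), s.tryAcquire(wantB) };
    }
}
```

With the semaphore, A's oversized request fails without touching the stock, so B's smaller request still goes through — exactly the behavior the DECR approach cannot give without an extra distributed lock.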

The other difference: after a successful grab he has a queueing step that checks for repeat purchases, whereas I check up front with setIfAbsent. The real difference is not the mechanism but when the check happens.

My design idea

In my design, once a user successfully acquires the semaphore, the seckill has effectively succeeded. But we should not tell the user immediately; better to let the phone keep spinning and report success after 1-2 seconds, because MQ needs some time to deliver the message to the consumer. If the user is told "success" immediately while the order is still being generated, the experience may be bad. After 1-2 seconds the MQ message has been consumed and the order created successfully; at that moment the user sees "seckill succeeded" and the order actually exists. NICE!

You might ask: what if order creation on the consumer side throws an error? That absolutely must be considered; MQ consumption is uncertain. Of course I considered it: on failure, retry first; if it still fails after three retries, a code problem is blocking order creation, so record the error, investigate manually, and restore the user's order by hand. After all this is low-probability; there won't be a pile of failed orders, right? Besides, the user did win the stock in the seckill service; as far as I'm concerned they succeeded, and if the order fails for some other reason, I generate it manually to guarantee eventual consistency. Otherwise how do you explain it to the user?

My colleague's design idea

But my colleague said it should not be designed that way: grabbing the semaphore should only mean the user has earned a chance at the seckill, and actual success depends on whether the order service consumes the message successfully. If consumption fails, roll the seckill stock back into Redis and let other users grab it, because business validation may fail and the user may not be eligible to buy. I have to admit he always thinks problems through; I have grown a lot working on projects with him. A genuinely strong engineer!

That said, my original design did not consider that business validation might find a user ineligible to buy. Why would there be users not eligible to buy…… what business scenario is that? Why show them the seckill at all if they can't buy? Still, when you think about it, deciding the seckill result based on whether order creation succeeds has its own problems.

Problems with his plan

  • Suppose 1 million people fight over 400 bottles of Moutai; once all are grabbed, remaining users are told the goods are gone. But if the order service fails to consume messages 399 and 400, those orders are rolled back and the stock returns to Redis. If the cause was failed business validation, it would be better to keep such users from seeing the event at all, or to tell them they are ineligible before they grab a seckill slot
  • Messages 399 and 400 get rolled back maybe 3-5 seconds later. The normal users who just saw "sold out" may already have left, so the goods may well end up under-sold
  • If the error was a code problem rather than failed validation, the user's order is rolled back anyway. That seems rough: it is plainly a system fault, yet the user takes the blame……
  • And a user whose order was rolled back for a code problem might return to the seckill page, see stock remaining, grab again, fail again, grab again, fail again…… round and round. Heartbreaking…… though the probability is small

Reading this you may think: wow, this blogger has some nerve, picking at other people's flaws while ignoring his own problems.

Emmm, how could I be that kind of person……

The problem with my plan

  • Someone has to keep an eye on the seckill event. The probability of an error is small, but if the order service fails, someone must generate/restore the order as soon as possible, which costs manpower. And if the order is restored and the user never pays, that manpower was wasted…
  • In my logic a user counts as having succeeded without paying, which the leadership may not accept. If users had to pay first, with seckill success confirmed only on completed payment and the order generated afterwards, the leadership would surely approve…… It sounds workable, but I have not studied whether the implementation details hold up; after all, Tmall and Taobao both take payment only after the order is created. To be revisited in version two.

Personally I think every scheme has its limitations and problems; there is no perfect design. We can only pick the scheme that best fits the actual business, discuss it with colleagues, and keep optimizing on that basis.

My work experience is still shallow; corrections to any mistakes in this article are very welcome. Let's learn together!

If this post helped you, please like and follow. Your support is what keeps me writing!