Seckill (flash sale) is a well-worn topic by now, but with the annual Double 11 around the corner, and having noticed that big companies such as Alibaba and Tencent have in fact been asking about this scenario quite frequently lately, I decided it was worth writing up.

In terms of scale, seckills can be divided into large and small ones. A large seckill refers to special occasions such as Double 11: huge scale, ultra-low prices, and massive traffic. A small seckill usually refers to timed activities configured by merchants themselves, with items going on sale at a designated time. By form, it can also be divided into single-slot and multi-slot seckills. In this article we are generally talking about a single-slot, large-scale seckill.

A seckill design has to face several points of pressure and difficulty:

  1. How do we keep the system stable under high traffic and high concurrency? If peak QPS reaches hundreds of thousands, how should the system be designed so that it does not collapse under the load?

  2. How do we guarantee eventual consistency of the data? For example, inventory must not be oversold; the loss from overselling falls on either the merchant or the platform, and users certainly won't take the blame, so oversell and a 3.25 performance review is pretty much booked for you this year.

Of course, for an activity of this size we also need to take statistical analysis of the data into account; we can't just finish the activity and have no idea how it actually performed.

System architecture

Suppose the estimated peak QPS of this year's Double 11 is 500,000 (I'm pulling that number out of thin air), and from our usual experience a single 8C8G machine can handle roughly 1,000 QPS. In theory, then, 500 machines would be enough to absorb the load, so can we just throw machines at it and be done? If that's really how you think, the exit is out the door and to the right.

Traffic filtering

Essentially, a huge number of users take part in a seckill, but the number of items is limited and only a few users will actually grab one, so the first step is to filter out most of the invalid traffic.

  1. Grey out the button on the front-end page before the activity starts, so that invalid clicks do not generate traffic ahead of time.
  2. Add a captcha or a quiz question on the front end to prevent an instantaneous ultra-high spike; this staggers the peak very effectively. Captchas are quite varied nowadays; some question banks even use primary-school-level problems and are updated frequently, which makes them hard to brute-force. Yes, there are human captcha-solving farms, but they still take time, unlike a machine hammering your interface without limit.
  3. Activity validation. Since this is an activity, do a layer of validation and interception on participating users, participation conditions, user whitelists, and so on. There are also checks on the user's client type, IP address, number of participations, and blacklists. For example, if the activity targets the APP client, users on other clients are blocked based on the request parameters; the number of times a user participates can be verified against IP, MAC address, device ID, and user ID; and the blacklist intercepts abnormal users, such as known freeloaders, based on past activity history.
  4. Illegal request interception. If a user can still bypass all the limits above, well, hats off to them. For example, the Double 11 activity starts with a quiz at midnight; a normal person needs at least a second to answer, and even someone who has been single for 30 years can hardly get below 0.5 seconds, so requests that arrive exactly at 0:00 or within 0.5 seconds can simply be intercepted.
  5. Rate limiting. Suppose we are selling 10,000 items in the seckill, we have 10 servers, and a single server handles 1,000 QPS; in theory the stock can be gone within a second. For microservices we can configure rate limiting so that the remaining invalid traffic does not put unnecessary pressure on the database. There is also a "luck-based" limiting method: randomly offset each request's start time within an agreed window, with a different offset per request, so the spike is spread out (see the sketch after this list).
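
Here is a minimal sketch of both ideas, assuming Guava's RateLimiter is on the classpath; the 1,000-permit figure reuses the single-machine QPS assumed earlier, and the 200 ms random window is an arbitrary choice for illustration, not part of the original design.

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.ThreadLocalRandom;

// Minimal per-instance limiting sketch. Both the 1000 QPS figure and the
// 200 ms random window are illustrative assumptions.
public class SeckillGate {
    // Guava RateLimiter: roughly 1000 permits per second on this instance.
    private final RateLimiter limiter = RateLimiter.create(1000);

    /** Returns true if the request may proceed to the seckill service. */
    public boolean allow() throws InterruptedException {
        // "Luck-based" smoothing: hold each request for a random 0-200 ms
        // so the instant-zero spike is spread out a little.
        Thread.sleep(ThreadLocalRandom.current().nextLong(200));
        // Anything beyond the configured rate is rejected immediately;
        // the front end can show a "too busy, try again" page.
        return limiter.tryAcquire();
    }
}
```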

Performance optimization

After the filtering above, perhaps 90% of the invalid requests are gone, and the remaining valid traffic puts far less pressure on the system. The next step is to optimize the system's performance.

  1. Static pages. The items involved in a seckill are generally known in advance, so the activity pages can be rendered statically and cached on a CDN. Suppose a page is 300 KB; with 10 million users that is roughly 3 TB of traffic. If all of those requests hit the back-end servers and database, the pressure is easy to imagine. With a CDN, user requests never reach the origin servers, which greatly reduces the load.
  2. Activity preheating. Keep a separate activity inventory rather than sharing the regular commodity inventory service. Load the activity inventory into Redis before the activity starts so that all availability queries go through the cache, and finalize the actual inventory deduction later depending on the situation (a preheating sketch follows this list).
  3. Independent deployment. If resources allow, deploy a separate environment just for the seckill. Logic that is useless for this scenario can be stripped out of that environment, for example coupons, red envelopes, and bonus points after placing an order, or those can be issued asynchronously in one batch after the activity ends. This is just an example; there is a lot of business code that is irrelevant to a seckill and can be removed, which improves performance.
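
As a concrete illustration of point 2, here is a minimal preheating sketch using Jedis; the key layout "seckill:stock:{skuId}" and the localhost connection details are assumptions made for the example.

```java
import redis.clients.jedis.Jedis;

// Minimal inventory-preheating sketch with Jedis. The key naming scheme and the
// localhost connection are illustrative assumptions.
public class StockPreheater {
    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Load the activity-only inventory into Redis before the activity opens. */
    public void preheat(long skuId, long activityStock) {
        jedis.set("seckill:stock:" + skuId, String.valueOf(activityStock));
    }

    /**
     * Cache-side availability check used during the activity; the database is
     * only touched when an order is actually written.
     */
    public boolean hasStock(long skuId) {
        String value = jedis.get("seckill:stock:" + skuId);
        return value != null && Long.parseLong(value) > 0;
    }
}
```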

After these two steps, the traffic should end up shaped like a funnel.

Overselling

Beyond service stability under high concurrency and high traffic, the remaining core issue is how to make sure inventory is not oversold, that is, how to guarantee eventual consistency. Broadly speaking, there are two ways to combine ordering and stock deduction:

  1. Deduct inventory when the order is placed. This is the most conventional and most common practice. However, during an activity it can run into the situation described in the second point below.

  2. Deduct inventory only on payment. I came across this design in the hotel industry: after cheap rooms were released, scalpers grabbed the stock so normal users could not place orders, and the scalpers then resold at a slightly higher price for a profit. That is why some activities only deduct stock after a successful payment. However, this approach is complicated to implement and can produce a large number of invalid orders, so it is not a good fit for a seckill scenario.

For a seckill, I suggest deducting inventory at order time; it is simpler to implement and the more common practice.

Plan

  1. First check whether the inventory cached in Redis is sufficient.
  2. Deduct the inventory before writing the order data, so that orders are never created without stock and no overselling occurs.
  3. When deducting, deduct the database inventory first and then the Redis inventory, within the same transaction, so that whichever step throws, the whole thing rolls back. One remaining issue is that Redis may actually succeed but return a failure because of a network problem; the transaction then rolls back and the database and cache become inconsistent. The result is that we sell slightly less than we could, and the leftover stock can be rolled into the next round of the seckill (a sketch of this flow follows).
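
A minimal sketch of that flow, assuming Spring-managed transactions; StockMapper, OrderMapper, and RedisStock are hypothetical interfaces standing in for whatever data-access layer you actually use.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical data-access interfaces; stand-ins for your real mappers/clients.
interface StockMapper { int decrementIfPositive(long skuId); } // UPDATE ... SET stock = stock - 1 WHERE sku_id = ? AND stock > 0
interface OrderMapper { void insertOrder(long skuId, long userId); }
interface RedisStock  { boolean hasStock(long skuId); void decrement(long skuId); }

@Service
public class SeckillOrderService {
    private final StockMapper stockMapper;
    private final OrderMapper orderMapper;
    private final RedisStock redisStock;

    public SeckillOrderService(StockMapper stockMapper, OrderMapper orderMapper, RedisStock redisStock) {
        this.stockMapper = stockMapper;
        this.orderMapper = orderMapper;
        this.redisStock = redisStock;
    }

    @Transactional
    public void placeOrder(long skuId, long userId) {
        // 1. Cheap pre-check against the cache; reject early if already empty.
        if (!redisStock.hasStock(skuId)) {
            throw new IllegalStateException("sold out");
        }
        // 2. Deduct the database stock first; the conditional UPDATE affects 0 rows
        //    once stock hits zero, so the count can never go negative (no overselling).
        if (stockMapper.decrementIfPositive(skuId) == 0) {
            throw new IllegalStateException("sold out");
        }
        // 3. Write the order inside the same transaction.
        orderMapper.insertOrder(skuId, userId);
        // 4. Deduct Redis last. If this throws (e.g. a network error), the whole
        //    transaction rolls back; worst case we sell slightly less, never more.
        redisStock.decrement(skuId);
    }
}
```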

This solves the problem to an extent, but other problems remain. For example, when a large number of requests all try to update the same inventory row, the lock contention caused by row locking makes the database TPS drop sharply, and performance no longer meets requirements.

Another approach is queueing at the service layer: for the same commodity ID, that is, for a single inventory row in the database, build an in-memory queue and serialize the deductions, which relieves the concurrency pressure on the database to a certain extent (see the sketch below).
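
A minimal sketch of this service-layer queueing, assuming one single-threaded executor per commodity ID; the class and method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal per-SKU queueing sketch: deductions for the same inventory row are
// serialized in memory instead of piling up on the database row lock.
public class PerSkuDeductionQueue {
    private final Map<Long, ExecutorService> queues = new ConcurrentHashMap<>();

    /** Tasks submitted for the same skuId run strictly one at a time. */
    public Future<Boolean> submit(long skuId, Callable<Boolean> deduction) {
        ExecutorService queue = queues.computeIfAbsent(
                skuId, id -> Executors.newSingleThreadExecutor());
        return queue.submit(deduction);
    }
}
```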

Quality assurance

To keep the system stable and stop your service from being taken down, some quality safeguards have to be in place.

  1. Circuit breaking, rate limiting, and degradation, the old trio. Base the rate limits on load-testing results; Sentinel or Hystrix can be used, and both the front end and the back end should have degradation switches (a Sentinel-style sketch follows this list).
  2. Monitoring, all of it: QPS monitoring, container monitoring, CPU, cache, and I/O monitoring, and so on.
  3. Rehearsal. A large seckill has to be rehearsed beforehand; you cannot just go live blind.
  4. Reconciliation and contingency plans. After the event, reconcile the stock and order amounts and quantities: was anything oversold? Are the amounts correct? All of this is necessary, and contingency plans let you degrade in an emergency.
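
For point 1, here is a minimal Sentinel-flavoured sketch; the resource name "seckill:order" and the 1,000-QPS threshold are assumptions, and in practice the threshold should come from the load tests mentioned above.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

import java.util.Collections;

// Sentinel flow-rule sketch; resource name and threshold are illustrative.
public class SeckillFlowControl {

    /** Register a QPS flow rule for the order resource (threshold from load tests). */
    public static void initRules() {
        FlowRule rule = new FlowRule();
        rule.setResource("seckill:order");
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        rule.setCount(1000);
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }

    /** Wrap the order logic; return false (degrade) when the limit is exceeded. */
    public static boolean tryOrder(Runnable orderLogic) {
        Entry entry = null;
        try {
            entry = SphU.entry("seckill:order");
            orderLogic.run();
            return true;
        } catch (BlockException e) {
            // Over the limit: fall back to a "system busy, please retry" response.
            return false;
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```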

Data statistics

The activity is over; how should the data be tallied?

  1. Front-end event tracking (buried points).
  2. Data dashboards: combining the back-end services with the monitoring system gives an at-a-glance view of the activity's dashboards and metrics.
  3. Offline data analysis: after the activity, the data can be synchronized to an offline data warehouse for further analysis and statistics.

Conclusion

In general, in the face of huge traffic, the approach is to first filter out the invalid traffic through various conditions, then optimize the performance of what remains, for example with static pages and preheated inventory, and possibly isolate the activity from other environments through independent deployment. Finally, you need to solve cache consistency and inventory overselling under high concurrency, so that the flood of concurrent requests does not crush your database.

A complete activity is an end-to-end chain from the front end to the back end; the rehearsal beforehand and the data analysis afterwards are both essential links.