A record of a lesson learned the hard way…

Background

Recently I was in charge of a ticket-snatching requirement at my company; it was my first time dealing with a task involving instantaneous high concurrency. First, the overall functional background:

  • The company has a relatively large App platform, which serves as the entrance for snatching tickets for this event
  • The event lasts ten days, with 1,000 tickets released each day; tickets not claimed that day accumulate to the next day
  • A user can grab one, two, or three tickets at a time
  • Given the many uncertainties around event traffic during the epidemic, neither the product side nor TM was optimistic about the traffic

That is the general background of this requirement.

Requirements analysis

With the requirement in hand, without further ado, I got straight to work.

For a requirement whose description contains words like "ticket snatching" and "flash sale", my first reaction was to deduct inventory with Redis. I then confirmed with TM what concurrency the event might see, and the answer was that supporting 300 concurrent requests would be enough. (To be honest, I didn't have a very concrete feel for that concurrency number at the time; I just heard TM's relaxed tone and felt it wasn't a big figure. Later, I paid the price for my ignorance.)

Implementation

So I decided not to introduce Redis and to simply deduct inventory in MySQL.

Start by creating an inventory table (some non-key fields omitted):

CREATE TABLE `ticket` (
  `id` int(32) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `inventory` int(32) NOT NULL COMMENT 'inventory',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Ticket Inventory Table';

Since the company uses MyBatis-Plus, updates come with its built-in optimistic locking, so there was no need to worry about concurrent deductions leaving the inventory count wrong.

Then I naively kept the inventory in a single row and added a retry mechanism: a for loop of up to one thousand attempts, re-querying after every failed update; if all one thousand updates failed, the deduction was treated as a failure.
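For context, a minimal sketch of that first single-row approach, assuming a MyBatis-Plus entity with a @Version field and the optimistic-lock interceptor registered, so that updateById() carries optimistic locking (the entity, mapper, and method names here are hypothetical):

private static final int MAX_RETRIES = 1000;

public boolean deductInventory(Long ticketId, int numbers) {
    // Retry up to 1,000 times: re-query, check stock, then attempt an optimistic-lock update
    for (int i = 0; i < MAX_RETRIES; i++) {
        TicketDO ticket = ticketMapper.selectById(ticketId);
        if (ticket == null || ticket.getInventory() < numbers) {
            return false; // not enough stock left
        }
        ticket.setInventory(ticket.getInventory() - numbers);
        // With @Version configured, updateById() appends "AND version = ?" and
        // affects 0 rows if another thread updated the row first
        if (ticketMapper.updateById(ticket) > 0) {
            return true;
        }
    }
    return false; // every retry lost the race
}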

Then came functional testing; the functionality was of course fine, since the logic is very simple.

After functional testing, the interface should normally have moved straight on to stress testing. At that point there was still about a week before launch, so if the stress test turned up any problems there would be time to fix them and relax. But then things happened: the requirements changed a bit, and the leader felt it was inappropriate to put the ticket-snatching business inside the original App platform. The original App was already coupled enough, and this one-off business shouldn't be coupled into it as well, lest it affect the existing services.

So the feature was split out and the already-written ticket-snatching business was moved onto a new framework. All that grinding dragged the stress test to the day before launch!

Stress test

Preparing for the first round of stress testing

The ops colleagues kicked off the stress test program at 300 concurrent requests, and the application died…

Analyzing the cause

When a single inventory row is updated concurrently, the achievable concurrency is actually very low, because the row is locked for the duration of each update. Once concurrency goes up, large numbers of requests queue up waiting for the lock, response times stretch out, and a pile of database connections time out, dragging the whole application down.

So what now?

There are two options:

  • Switch to Redis

  • Split the inventory rows in MySQL

The first option is of course the right one if time allows. But there was a problem: no time! If grabbing a ticket always deducted exactly one, switching to Redis would easily have been doable in time, but the requirement allows one, two, or three tickets per grab, and implementing that in Redis looked like a fairly large change. To avoid putting the launch at risk, the only option was to split the inventory.

So I ground through the change: 1,000 tickets, split into 100 inventory records of 10 each.
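Seeding the split stock is trivial; a rough sketch, assuming the TicketInventoryDO entity used further down and MyBatis-Plus's IService.saveBatch():

// Split 1,000 tickets into 100 rows of 10 each (the numbers from this requirement)
List<TicketInventoryDO> rows = new ArrayList<>();
for (int i = 0; i < 100; i++) {
    TicketInventoryDO row = new TicketInventoryDO();
    row.setInventory(10);
    rows.add(row);
}
// saveBatch() inserts the rows in batches
this.saveBatch(rows);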

When deducting, query all records with inventory > 0 and update one of them. But because the inventory is now split, a new situation appears: a single record may not hold enough stock even though the total is sufficient, so extra logic is needed to check whether any single record can cover the deduction and to spread the deduction across several records otherwise.
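The exact code of that first split-inventory version isn't worth reproducing, but its shape was roughly the sketch below, assuming MyBatis-Plus's IService.list() and QueryWrapper; the key point is that it always worked through the rows in the order the query returned them:

// Query every record that still has stock (default ordering)
List<TicketInventoryDO> ticketInventories = this.list(
        new QueryWrapper<TicketInventoryDO>().gt("inventory", 0));
// Take the first record that can cover the whole deduction on its own
for (TicketInventoryDO row : ticketInventories) {
    if (row.getInventory() >= numbers) {
        row.setInventory(row.getInventory() - numbers);
        if (!this.updateById(row)) {
            // If the update fails, throw an exception and roll back
        }
        return;
    }
}
// Otherwise check the total and drain the leading rows one by one
// (same structure as the improved code shown later, just without the random pick)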

Tested; the functionality was fine.

Preparing for the second stress test

We started at 50 concurrent requests. The application didn't hang, but RT was still high, so TPS was low.

Next we split the inventory records further, into 1,000 rows with 1 ticket each. Pressed again at 100 concurrent requests, and the application died again…

Analyzing the cause

Through monitoring, the ops colleagues found there were still a lot of row locks. Splitting the stock into many records and routing deductions across them does raise the achievable concurrency, but it still can't avoid lots of threads hitting the database to update inventory. And because the inventory query had a default ordering, when threads went to update they mostly ended up competing for the same records at the front of the result set. Once the cause was clear, the fix was easy:

// Is there any single record whose stock can cover the whole deduction?
List<TicketInventoryDO> ticketInventoryForOne = ticketInventories.stream()
        .filter(e -> e.getInventory() >= numbers)
        .collect(Collectors.toList());
if (ticketInventoryForOne.size() > 0) {
    // Pick one of those records at random
    int index = random.nextInt(ticketInventoryForOne.size());
    TicketInventoryDO ticketInventory = ticketInventoryForOne.get(index);
    ticketInventory.setInventory(ticketInventory.getInventory() - numbers);
    if (!this.updateById(ticketInventory)) {
        // If the update fails, throw an exception and roll back
    }
} else {
    // Check whether the total inventory is sufficient
    int inventory = ticketInventories.stream().mapToInt(TicketInventoryDO::getInventory).sum();
    int reduceNum = numbers;
    // Set of random indexes already used, so the same record is not picked twice
    Set<Integer> indexSet = new HashSet<>();
    if (inventory >= numbers) {
        while (true) {
            // Inventory reduction completed
            if (reduceNum == 0) {
                break;
            }
            // Pick a random record that has not been used yet
            int index = -1;
            while (true) {
                index = random.nextInt(ticketInventories.size());
                if (!indexSet.contains(index)) {
                    indexSet.add(index);
                    break;
                }
            }
            TicketInventoryDO ticketInventory = ticketInventories.get(index);
            // If this record's stock is greater than or equal to the amount still to deduct
            if (ticketInventory.getInventory() >= reduceNum) {
                ticketInventory.setInventory(ticketInventory.getInventory() - reduceNum);
                reduceNum = 0;
            } else {
                // If this record's stock is smaller than the amount still to deduct
                reduceNum = reduceNum - ticketInventory.getInventory();
                ticketInventory.setInventory(0);
            }
            if (!this.updateById(ticketInventory)) {
                // If the update fails, throw an exception and roll back
            }
        }
    } else {
        // The total inventory is insufficient
    }
}

The purpose of this logic is to randomly select a record from the query result set to update, so as to avoid a large number of threads competing for the same record lock.

Tested; the functionality was fine.

Preparing for the third stress test

We ran 50 concurrent requests first.

You can see that RT is significantly lower, TPS is up, and there are no database timeouts.

Then we pressed again at 100 concurrent requests.

RT went up, TPS came down, and there were a few database connection timeouts; it felt like this approach had hit its bottleneck. Since it was already around 1:30 in the morning, the leader judged that this level of concurrency was enough for the event and that the interface plus rate limiting should hold. The main thing was that the first real ticket rush had to be supported early the next morning, so the stress testing ended there.

That was well below the 300 concurrent requests TM had asked for. I was genuinely ashamed. I still remember how relaxed and carefree I was when I chatted with TM about this event's traffic back when I took on the requirement.

After launch, the first few days went smoothly (apart from an inventory-initialization mistake on day one that oversold 20 tickets… a carelessness problem, a silly mistake, and I deserved the scolding for it). Then, on the last day of ticket snatching, a flood of traffic suddenly poured in and the application was knocked over… cue a flurry of frantic fixes…

The post-mortem found that, at the time, rate limiting had been added only to the ticket-snatching interface. It did protect that interface from being flattened by traffic, but this was a whole application, and an application has more than one interface. Toward the end of the day, with the event already underway, large numbers of on-site users were querying their own ticket information. The ticket-snatching interface started timing out before its rate limit was even reached (its resources were being eaten by those other requests), the avalanche effect kicked in, more and more interfaces timed out, and the application went down yet again…
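To make that concrete, the limiting we had was roughly of the kind sketched below (Guava's RateLimiter is used here purely as a stand-in; it is not necessarily what we ran). It caps the snatching endpoint itself but does nothing about the query traffic sharing the same threads and connection pool:

// A stand-in limiter on the snatching endpoint only; the rate is illustrative
private final RateLimiter grabLimiter = RateLimiter.create(100);

public String grabTicket(Long userId, int numbers) {
    // Reject immediately when over the limit, protecting just this one endpoint
    if (!grabLimiter.tryAcquire()) {
        return "Too many requests, please try again";
    }
    // ... deduct inventory as above ...
    // Meanwhile the "query my tickets" endpoint shares the same Tomcat threads and
    // database pool, so heavy query traffic can still starve this endpoint
    // long before the limit here is ever reached.
    return "ok";
}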

On top of that, a lot of slow SQL made the inventory queries very slow. Many users clicked to grab a ticket, saw no response from the interface, backed out, re-entered the page, and grabbed again, which left some users holding more than one allocation: the idempotence check, which was done against the database, no longer worked, because the data from the first request simply hadn't been written yet. That idempotence check will be discussed in a follow-up article.

PS: There is another ticket-snatching interface that deducts inventory directly in Redis, and it sailed past 200 concurrent requests in its stress test. That interface, however, deducts tickets strictly one at a time and its stock does not accumulate, which makes it much easier to implement with Redis. The general idea is:

Create a key in Redis and give it an expiration time (set it according to your needs; if it holds a single day's stock, 24 hours works). On every grab, call getAndIncrement() on Spring Data Redis's RedisAtomicLong to bump the counter; once the counter exceeds the inventory, return "no tickets". The code is as follows:

/**
 * Redis auto-increment.
 *
 * @param key      key
 * @param liveTime expiration time (hours)
 * @return the counter value
 */
private Long incr(String key, long liveTime) {
    RedisAtomicLong entityIdCounter = new RedisAtomicLong(key, redisTemplate.getConnectionFactory());
    long increment = entityIdCounter.getAndIncrement();

    // Set the expiration time initially
    if (increment == 0 && liveTime > 0) {
        entityIdCounter.expire(liveTime, TimeUnit.HOURS);
    }
    return increment;
}
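And a hypothetical caller for that helper, just to make the "general idea" concrete: getAndIncrement() returns the value before the increment, so the first N calls see 0 through N-1 and win a ticket, and anything at or beyond the daily inventory is told there are no tickets left.

// Hypothetical usage of incr(); the key naming is illustrative
public boolean tryGrabOneTicket(String day, long dailyInventory) {
    long sequence = incr("ticket:stock:" + day, 24);
    // Sequences 0 .. dailyInventory-1 win a ticket; anything later is sold out
    return sequence < dailyInventory;
}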

(A follow-up will look into how to use Redis to deduct multiple tickets at a time, and how to handle the requirement that unclaimed tickets accumulate to the next day, so that the whole deduction can be done in one shot.)

Conclusion

Having been through the baptism of this ticket rush, my respect for concurrency has deepened a bit more… Everything you read beforehand is just theory; you have to actually live through something to really grow:

  • Really think through whether the architecture is reasonable (for heavy-traffic services, whether to split them out)
  • Do stress testing as early as possible (I know many programmers are confident in their code, but you don't know whether your code will hold up until it has been through a stress test, and the closer you get to launch, the costlier any change becomes)
  • A feature rushed out the door is 99.99% certain to have problems
  • Rate limiting on one interface must take into account the resources consumed by the application's other functions
  • Prefer Redis for inventory deduction under high concurrency (in other words, skip the relational database without hesitation)