Background

Hello, I'm Brother Tong.

Last night, on the subway ride home from work, my boss suddenly called: slow responses in System B's production environment were affecting System A. Tens of thousands of couriers couldn't accept orders, and about 300,000 orders were stuck.

I got home around 8:30 and jumped on the call right away.

Restart

When I joined the call, colleagues were already helping to troubleshoot. As the saying goes, a restart solves 80% of problems, and if a restart doesn't solve it, you just haven't restarted enough times. Bah, no: if a restart doesn't solve it, you really do have to dig in and locate the problem.

As it turned out, a round of load testing after the restart didn't help: at 1,000 concurrent requests, the average response time was still 3-4 seconds, across several runs in a row.

Upgrade the configuration

The restart didn't work, so we moved to phase two: upgrade the configuration. Two 4-core 8 GB instances were upgraded to six 8-core 16 GB instances, and the database configuration was doubled as well. If a problem can be solved with money, we generally don't invest too much manpower ^^

As it turned out, the upgrade didn't help much either: at 1,000 concurrent requests, the average response time in the load test was still 3-4 seconds.

That’s interesting.

At this point, yours truly, Brother Tong, stepped in.

Check the monitoring

I got online and looked at the monitoring: the instances' CPU, memory, disk, network I/O, and JVM heap usage all seemed fine. A real headache.

Local load test

We split into two groups: one to set up a local load test, the other to keep analyzing. The local load test showed that in the local environment, on a single machine with 1,000 concurrent requests, there was no problem whatsoever; the average response time stayed within a few hundred milliseconds.

So it seemed the service itself really wasn't the problem.

Code walkthrough

With no better option, we pulled up the code and a group of us went through it together while the developer walked us through the business logic. Of course, he got roundly scolded along the way: what broken code is this? In fact, before Brother Tong got involved, the team had already gone through the code once, and in one place the Redis SCAN command had been changed to KEYS *, burying a pit there. But that's not the main issue right now; we'll come back to it later.

After going through the code, we found a lot of Redis operations, including a for loop issuing Redis GET commands one by one; the rest were conventional database operations, all hitting indexes. So the preliminary screening: the database should be fine, and the main problem was likely concentrated on the Redis side, which was being called far too frequently.
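For illustration only (this is not the actual project code, and the key layout here is made up), the pattern we saw looked roughly like the sketch below: one GET per element inside a loop, so a single incoming request fans out into a great many round trips to Redis.

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;

public class OrderCacheReader {

    // Hypothetical sketch of the pattern found in the walkthrough:
    // one GET per id inside a loop means one network round trip per element,
    // so a single incoming request turns into many Redis calls.
    public List<String> loadOrders(Jedis jedis, List<String> orderIds) {
        List<String> values = new ArrayList<>();
        for (String id : orderIds) {
            values.add(jedis.get("order:" + id)); // the "order:" prefix is made up
        }
        return values;
    }
}
```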

Add logging

After going through the code, apart from the scan-to-KEYS * change (which I didn't know about at the time), there was basically nothing wrong, so we added logging in small segments, restarted the service, and ran another load test.

Of course, nothing changed. Time to analyze the logs.

From the logs we could see that the Redis calls were sometimes very fast and sometimes very slow, which looked like an undersized connection pool: one batch of requests goes first while another batch waits for an idle Redis connection.

Increase the number of Redis connections

Spring Boot's default Redis client ships with a small default connection pool, so we bumped up the number of connections, restarted the service, and ran another round of load testing.
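For reference, here is a minimal sketch of what such a pool-size tweak can look like with Jedis (the client used in production later in this story); the numbers are illustrative placeholders, not the values we actually used.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolSetup {

    public static JedisPool buildPool(String host, int port) {
        JedisPoolConfig config = new JedisPoolConfig();
        // Illustrative numbers only: raise the cap on concurrent connections
        // so requests spend less time waiting for an idle connection.
        config.setMaxTotal(1000);        // max connections handed out at once
        config.setMaxIdle(200);          // connections kept open while idle
        config.setMaxWaitMillis(2000);   // how long a caller waits for a free connection
        return new JedisPool(config, host, port);
    }

    public static void main(String[] args) {
        try (JedisPool pool = buildPool("127.0.0.1", 6379);
             Jedis jedis = pool.getResource()) {
            System.out.println(jedis.ping()); // quick smoke test: expect "PONG"
        }
    }
}
```

With Spring Boot's auto-configuration, the equivalent knobs are exposed as the spring.redis.jedis.pool.* / spring.redis.lettuce.pool.* properties.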

The average response time dropped from 3-4 seconds to 2-3 seconds, which isn't a dramatic improvement. We kept increasing the connection count: with 1,000 concurrent requests and many Redis operations per request, there was bound to be waiting. This time we cranked the connection count all the way up to 1,000, restarted the service, and ran another load test.

As it turned out, no significant improvement.

Check the logs again

At this point, with no better ideas, we went back to the logs and looked at the timing of the Redis operations. We found that 99% of the GET operations returned very quickly, basically within 0.5 milliseconds, but there were always a few that took 800-900 milliseconds to return.

We thought we were done with the Redis side of things.

However, after several more rounds of load testing, the response time still wouldn't come down.

Feeling pretty helpless, and with it already past 3 o'clock in the morning, the leaders said: get Huawei Cloud on the line.

Huawei Cloud Troubleshooting

In the end we pulled the relevant Huawei Cloud people into the call to troubleshoot together. Of course they were reluctant, but hey, we're the ones paying ^^

The Huawei Cloud contact pulled in their Redis experts to help us look at the Redis metrics, and it finally turned out that the Redis instance's bandwidth was maxed out, which triggered its traffic-limiting mechanism.

They temporarily tripled the Redis bandwidth, and we ran another load test.

Holy crap, the average response time dropped to 200-300 milliseconds!!!!

Holy crap indeed. That's a nasty little pit: you go ahead and throttle us, yet when the bandwidth is maxed out you don't even raise an alarm about it.

This is a real pain in the ass.

At this point, we thought the problem was solved, and the leaders went off to bed.

On to production

Now that the cause had been found, it was time to load test production.

We asked the Huawei Cloud experts to triple the bandwidth of the production Redis instance as well.

We pulled a hotfix branch from the production commit, turned off signing, restarted the service, and ran a round of tests.

Damn. Production was even worse: the average response time was 5-6 seconds.

In the test environment we had already changed the connection pool configuration, but production was still on Jedis, so we made the same change there and ran another round.

It didn't really help. Still 5-6 seconds.

What a pain-in-the-ass problem.

Check the monitoring again

We checked the Redis monitoring in Huawei Cloud; this time the bandwidth and flow control looked normal.

This time the CPU was the anomaly: Redis's CPU shot straight to 100% during the load test, which slowed the application's responses.

Wake up the Huawei Cloud Redis experts again

It was already past four in the morning and we had no ideas left. Huawei Cloud Redis experts, up you get again!

We woke up Huawei Cloud's Redis experts again to help us analyze things on their end, and they found that 140,000 SCAN operations had been executed within 10 minutes.

The evils of SCAN

I asked the developers where SCAN was used and found that every request called SCAN to fetch keys starting with a certain prefix, 1,000 entries per SCAN call. Checking the total number of keys in Redis, there were about 110,000, which means a single request needed on the order of a hundred SCAN calls (roughly 110,000 keys divided by 1,000 per call), and we had 1,000 concurrent requests. As we know, SCAN and KEYS * both have to walk the entire keyspace, which burns CPU, so 140,000 SCAN operations sent the CPU straight through the roof.
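To make that cost concrete, here is a minimal sketch (using Jedis; the prefix is hypothetical) of the per-request pattern. Note that MATCH only filters what the server returns; SCAN still has to walk the whole keyspace cursor by cursor, so every request pays for a pass over all ~110,000 keys.

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

public class PrefixScan {

    // Sketch of the per-request pattern: SCAN with MATCH still iterates the
    // entire keyspace; the filter only limits which keys are sent back.
    public static List<String> keysWithPrefix(Jedis jedis, String prefix) {
        List<String> keys = new ArrayList<>();
        ScanParams params = new ScanParams().match(prefix + "*").count(1000);
        String cursor = ScanParams.SCAN_POINTER_START; // "0"
        do {
            ScanResult<String> page = jedis.scan(cursor, params);
            keys.addAll(page.getResult());
            cursor = page.getCursor();
        } while (!"0".equals(cursor));
        // With ~110,000 keys and COUNT 1000, this loop runs on the order of a
        // hundred times for every single incoming request.
        return keys;
    }
}
```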

Why didn't the CPU spike in the test environment?

Comparing the total number of keys in the test and production environments: the test environment had only about 900 keys, so each request needed only a single SCAN (or KEYS *) call, and there was no problem at all.

Why did the production environment have so many keys?

We asked the developer why there were so many keys in production and why no expiration time had been set.

The developer said an expiration was being set, but that part of the code had been written by another colleague. We opened the code, and it really was magical code; it's not convenient for me to post the specifics, but it decided whether to set the expiration time based on certain conditions, and after analysis, in most cases the expiration time was never actually set.
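The real code isn't reproduced here; the snippet below is a purely hypothetical sketch of the general anti-pattern, where the value is written unconditionally but the TTL is applied only inside a branch, so most keys end up with no expiration at all.

```java
import redis.clients.jedis.Jedis;

public class ConditionalExpiry {

    // Hypothetical sketch of the anti-pattern, not the project's actual code:
    // the key is always written, but the TTL is set only on one branch,
    // so most writes leave behind keys that never expire.
    public static void saveOrder(Jedis jedis, String orderId, String payload,
                                 boolean isTemporary) {
        String key = "order:" + orderId;  // hypothetical key layout
        jedis.set(key, payload);          // value stored with no TTL
        if (isTemporary) {                // rarely true in practice
            jedis.expire(key, 3600);      // expiration applied only here
        }
    }
}
```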

Current solution

By this time it was already 4:30 in the morning. Although we were still quite fired up, the leaders decided to leave things alone for now: System A had suspended its calls to System B, so System B's traffic was essentially zero, and we would fix the problem in two stages during the day.

First, clean up the data in the production Redis, keeping only the small portion that is actually needed.

Second, move the keys that share a given prefix into a hash, so the scan scope shrinks from the whole keyspace to a single hash.
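As a rough illustration of that second step (key and hash names are hypothetical), the prefixed string keys can be folded into a single hash, so a request reads one hash directly instead of SCANning the entire keyspace:

```java
import java.util.Map;

import redis.clients.jedis.Jedis;

public class HashRefactor {

    // Hypothetical sketch: instead of many top-level keys like "order:<id>"
    // that force a prefix SCAN over the whole keyspace, keep them as fields
    // of one hash keyed by the shared prefix.
    public static void saveOrder(Jedis jedis, String orderId, String payload) {
        jedis.hset("orders", orderId, payload);
    }

    public static Map<String, String> loadAllOrders(Jedis jedis) {
        // Cost is proportional to the size of this hash,
        // not to the total number of keys in Redis.
        return jedis.hgetAll("orders");
    }
}
```

If the hash itself grows very large, HSCAN over just that hash is still far cheaper than a keyspace-wide SCAN.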

OK, that wraps up this production incident investigation. Brother Tong will keep following up on the aftermath.

Conclusion

This production incident was a little different from previous ones, and can be summarized as follows:

  1. In the past, the problems were the application service's own CPU, memory, disk, or JVM; Redis bandwidth limits and throttling were a first for us.
  2. We are not yet well versed in a lot of things on Huawei Cloud, including its monitoring metrics, and still need to explore them gradually;
  3. The KEYS and SCAN commands must be disabled in Redis, and most keys should be given an expiration time! (A small sketch follows below.)
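As a small sketch of that last point (key name and TTL are made up for the example), writing values with SETEX guarantees that every key carries an expiration from the moment it is created:

```java
import redis.clients.jedis.Jedis;

public class WriteWithTtl {

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            // SETEX stores the value and the TTL in one atomic command,
            // so the key can never linger without an expiration.
            jedis.setex("session:demo", 86400, "some-payload"); // 86400 s = 1 day
        }
    }
}
```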

Well, that's about it for this incident. Brother Tong will keep following up if anything new comes up. Of course, it would be better if nothing new comes up ^^