Scenario

During the National Day holiday, the online WeChat terminal system could not be used normally. The logs showed error code 42001, meaning the accessToken stored in Redis was invalid, which caused the system failure. After the stale accessToken cached in Redis was deleted, the system returned to normal.

Troubleshooting process

First, at the code level, once a token is obtained it is cached for 6000 seconds, while the official WeChat documentation says the returned accessToken is valid for 7200 seconds. The token in our cache should therefore always expire before WeChat invalidates it. So why did the token in this system become invalid?
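For reference, the caching logic described above is roughly the following shape; the Jedis calls and key name here are assumptions for illustration, not the project's actual code:

    // Sketch: cache the accessToken for 6000 seconds, shorter than the
    // 7200-second validity stated in the official WeChat documentation, so the
    // cached copy should expire before WeChat itself invalidates the token.
    import redis.clients.jedis.Jedis;

    public class AccessTokenCache {
        private static final int CACHE_SECONDS = 6000; // < 7200s official validity

        public String getToken(Jedis jedis, String appId) {
            String key = "wx:access_token:" + appId;    // hypothetical key format
            String token = jedis.get(key);
            if (token == null) {
                token = requestTokenFromWeChat(appId);   // HTTP call to the WeChat API
                jedis.setex(key, CACHE_SECONDS, token);  // value and TTL written together
            }
            return token;
        }

        private String requestTokenFromWeChat(String appId) {
            return "new-token"; // placeholder for the real WeChat API call
        }
    }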

1. Whitelist issues

The first step was to find the exceptions in the logs (PS: we use Grafana for logging; it is very convenient and worth looking into, official site: grafana.com).

At first we suspected that the server sending the request was not on the WeChat IP whitelist. However, when we checked the WeChat Official Account platform, this server's IP was indeed configured there, which raised a new question: why did some of the server's token requests succeed while others failed?

Observing further, the logs above showed that tokens were not being re-acquired at roughly 6000-second intervals; in some cases a token was re-acquired only ten-odd minutes after the previous one. Was something wrong with our Redis? Had someone manually deleted the key in Redis, or did some other code path remove the token?

First, I searched the code and found nothing that deletes the token in Redis. Manual operation was also ruled out, since this is a production server.

Finally, looking at it from the business side: our system serves multiple WeChat Official Account platforms, and that resolves the questions above. The whitelist error occurred because a newly added platform had not added our server IP to its whitelist, and the odd acquisition frequency was simply multiple platforms each fetching their own token within a relatively short window.

Of course, this only explains the whitelist and frequency questions. Why the token became invalid for no apparent reason still had to be investigated.

2. Distributed lock problem

According to the colleague who fixed the problem by deleting the token, the system started failing around 1:00 p.m. on the 7th. He deleted the expired token at about 1:30 p.m. and the system returned to normal. He then checked the log records of token acquisition on the 7th, which showed the following.

AppID 194 obtained its first token of the 7th at 1:32 p.m.; before that there was no record of any token acquisition on the 7th at all. By rights, with a 6000-second expiry, the colleague who deleted the token should not have found any token in Redis at 1:30 p.m. He deleted it without looking closely at the token, its remaining TTL and so on. So how had this token survived in Redis for so long?

Let's take a look at what happened on the 6th, the day before the 7th.

At 8 o'clock in the evening of the 6th, appID 194 issued 12 token-acquisition requests within the same second, all with different thread numbers, which means multiple users hit the code at the same moment while the token in Redis had just expired. But the token-acquisition code block is protected by a Redis distributed lock, so by that logic there should never be multiple token requests within the same second.
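For context, the intended behavior of that code block is roughly the following: only the caller that wins the lock refreshes the token, so the other requests in that second should never reach the WeChat API. This continues the hypothetical sketch from earlier and uses assumed names, not the project's actual lock code:

    // Sketch of the intended flow: acquire the distributed lock before refreshing,
    // so at most one caller per expiry window actually requests a new token.
    public String refreshWithLock(Jedis jedis, String appId) {
        String tokenKey = "wx:access_token:" + appId;        // hypothetical key names
        String lockKey  = "wx:access_token:lock:" + appId;

        if (jedis.setnx(lockKey, "1") == 1) {                 // try to take the lock
            jedis.expire(lockKey, 10);                        // safety-net TTL on the lock
            try {
                String token = requestTokenFromWeChat(appId);
                jedis.setex(tokenKey, 6000, token);
                return token;
            } finally {
                jedis.del(lockKey);                           // release only our own lock
            }
        }
        // Lost the race: another caller is refreshing, so just reread the cache.
        return jedis.get(tokenKey);
    }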

Unless there was a problem with Redis itself. Our online Redis uses Alibaba Cloud's master-slave mode, one master and one slave. We checked the Redis metrics and alarms around 8 o'clock: no alarm was triggered, and there was only a brief spike of about 5.x percent.

There was no master/slave failover alarm either, so the only remaining suspect was replication delay. But the locking command, setNx, is a write command, and all writes go to the master, so there should be no replication-delay issue there. Or did the delay strike after the lock was acquired, when the cached value was read from the slave and not found?

Thread 753 was the first to get the lock and request the token, and it did not finish the acquisition until 15 s 872 ms. In the meantime, though, the subsequent 11 requests went straight into the token-acquisition code, which pointed to the real issue: our lock was not actually locking, and it let the later requests walk right in.

But even so, with the lock not working, the token stored in Redis might not be a valid one, yet it should still expire automatically after 6000 seconds. Why was there no token request at all after 8 o'clock? The problem only broke out the next day, my colleague deleted the token from Redis, and the problem was solved. Why did the token stored at 8 o'clock remain in Redis for so long?

First of all, for historical reasons, the framework uses ace cache to set the token.

Setting the value and setting the expiry time are not one atomic operation but two separate steps. We wondered whether Redis jitter during this period caused the second step to fail: the value was written, but no expiry time was set.

We searched the log system for error logs from this period, but found none related to Redis.

Finally we took a closer look at the implementation code: int time = jedis.ttl(key).intValue() + expire; When the framework sets a value, it takes the key's remaining TTL and adds the new expiry time on top of it.
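Reconstructing the framework's set logic from that line, the behavior looks roughly like this (a sketch of the described ace cache code, with surrounding names assumed):

    // Sketch of the framework's cache-set behavior as described above.
    // Each call adds the configured expire time on top of the key's remaining TTL,
    // so 12 concurrent writes of a 6000-second token can stack up to ~72000 seconds.
    public void set(Jedis jedis, String key, String value, int expire) {
        // ttl() returns the remaining seconds, -1 if no expiry, -2 if the key is missing
        int time = jedis.ttl(key).intValue() + expire;
        jedis.set(key, value);      // step 1: write the value
        jedis.expire(key, time);    // step 2: set the accumulated expiry (not atomic with step 1)
    }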

In other words, after 12 such writes, 12 * 6000 = 72000 seconds: the key would only expire after 20 hours!

So this token sat in our Redis for 20 hours, even though WeChat actually invalidates it after 2 hours. That explains why the problem broke out around noon the next day, and why my colleague could still see a token in Redis and delete it: the token he saw had actually been stored the previous evening.

At this point we are close to the truth and everything falls into place: the distributed lock did not hold, which allowed concurrent token requests, and the expiry times accumulated across those writes, a chain reaction.

Why did the distributed lock not lock?

The question of why the token was stored for so long has been answered, but why did the distributed lock fail?

Was the setNx code not atomic? Or, since this is a master-slave architecture, was there some data-synchronization problem?

We used the Redis MONITOR command to watch the commands received by Redis:
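One way to watch that command stream from Java is Jedis's MONITOR support (a throwaway sketch; running redis-cli monitor on the server works just as well):

    // Sketch: print every command Redis receives, to verify what the lock code
    // actually sends (e.g. whether SETNX and EXPIRE arrive as separate commands).
    Jedis jedis = new Jedis("localhost", 6379);
    jedis.monitor(new JedisMonitor() {
        @Override
        public void onCommand(String command) {
            System.out.println(command);   // one line per command received by the server
        }
    });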

It turned out that this code did not have an atomicity problem.

Since the operation was atomic, could Redis itself be the problem, causing the lock not to hold? We had reached the point of suspecting the middleware itself.

Finally, a problem described in some reference material gave me a hint: roughly, what happens when several callers request the lock at the same time. At that point I looked at the lock code:
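The snippet itself is not preserved in the text, so the following is only a reconstruction of the shape described just below, not the exact source:

    // Reconstruction of the flawed lock usage: the finally block deletes the lock
    // whether or not this caller actually acquired it.
    public boolean tryDoWithLock(Jedis jedis, String lockKey, Runnable action) {
        try {
            if (jedis.setnx(lockKey, "1") == 1) {   // acquire attempt
                jedis.expire(lockKey, 10);
                action.run();                        // protected section
                return true;
            }
            return false;                            // did not get the lock
        } finally {
            jedis.del(lockKey);                      // BUG: releases someone else's lock too
        }
    }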

This is the problem; if you look at it carefully, you might see it immediately.

Yes: the finally block deletes the lock directly, regardless of whether the current thread ever acquired it.

That is to say, student A comes in, gets the lock, and goes off to do his own thing.

Student B then comes in, finds he cannot get the lock, and on his way out deletes the lock directly.

Student C comes in next, sees that nobody is holding the lock, and goes in and takes it.

But student A is actually still using it,

and his lock has already been deleted by B.

And so on, over and over:

whoever gets the lock has it deleted by someone else almost immediately,

so at any given moment many callers hold the "lock" at once and nothing is being controlled.

On top of that, every set accumulated the expiry time, and that is what caused the problem.

The correct handling is to use a marker that records whether the current caller obtained the lock: only the caller that acquired the lock may release it, and callers that did not get the lock have no right to delete it, otherwise there is chaos. The way it was written before, it was essentially not locked at all.
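A minimal sketch of that approach, assuming a random per-caller marker and a small Lua script so the check and the delete happen atomically (Jedis 3.x style calls; the names are illustrative):

    // Sketch: only the caller that set the lock value may delete it.
    import java.util.Collections;
    import java.util.UUID;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.params.SetParams;

    public class OwnedLock {
        public String tryLock(Jedis jedis, String lockKey, int ttlSeconds) {
            String owner = UUID.randomUUID().toString();          // ownership marker
            String ok = jedis.set(lockKey, owner,
                    SetParams.setParams().nx().ex(ttlSeconds));   // value + TTL set atomically
            return "OK".equals(ok) ? owner : null;
        }

        public void unlock(Jedis jedis, String lockKey, String owner) {
            // Delete only if the stored value still matches this caller's marker.
            String script =
                "if redis.call('get', KEYS[1]) == ARGV[1] then " +
                "  return redis.call('del', KEYS[1]) " +
                "else return 0 end";
            jedis.eval(script, Collections.singletonList(lockKey),
                       Collections.singletonList(owner));
        }
    }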

To borrow the idea from Java concurrent programming: a robust application should respond to many concurrent users the same way it responds to a single user.

Conclusion

This problem was mainly caused by the distributed lock implemented in our own business code: it skipped the ownership check when releasing the lock, and that mistake snowballed.

A side note on jedis.ttl(key): if the key does not exist, it returns -2, so for an already-expired token the framework computes -2 + 6000 = 5998 seconds.
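As a quick check of that return value (a throwaway snippet, not project code):

    // ttl() on a missing key returns -2, so the framework's accumulated expiry
    // for an already-expired token is normally -2 + 6000 = 5998 seconds,
    // close enough to 6000 that the bug went unnoticed in the single-caller case.
    Jedis jedis = new Jedis("localhost", 6379);
    jedis.del("demo:key");                   // make sure the key does not exist
    long ttl = jedis.ttl("demo:key");        // returns -2
    System.out.println(ttl + 6000);          // prints 5998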

This incident improved our ability to troubleshoot problems across the whole stack and made us more aware of concurrency. It also reminded us to dig deeper into source code and into other people's design ideas and principles, so that related problems can be located and resolved quickly.

Here is my own implementation of a distributed lock: github.com/qiaomengnan… , suitable for standalone Redis, or for one-master-one-slave and one-master-many-slave Redis setups.

I also recommend klock, a distributed lock built on Redisson, with friendly support for Redis modes such as Cluster and Sentinel: github.com/kekingcn/sp…

And the heavyweight recommendation, Redisson: github.com/redisson/re… , a very complete distributed lock framework.