Zheng Jimin joined the Domestic Hotel Quotation Center team in August 2019 and is mainly responsible for quotation-related system development and architecture optimization. He has a strong interest in high concurrency and high availability, with experience building highly available distributed systems handling tens of millions of orders per day. He enjoys studying algorithms, has twice reached the Asian regional round of the ACM-ICPC programming contest, and won first prize in Qunar's first Hackathon.


Background

In September 2019, we experienced several cache-related failures in a row:

1. A DBA operations error wiped the core basic data stored in Redis. Because quotations could no longer be served normally, an ATP failure (a sudden drop in order volume) occurred. It then took half an hour to write the data back into Redis through a scheduled task before the fault was recovered.

2. Crawler traffic on the PC side reached the back end and exhausted the application's Redis connection pool. A large number of synchronous Redis requests then each waited 500 ms to obtain a connection, which in turn filled up the application's Tomcat thread pool, dragged the service down, and left the PC-side business unavailable. The Redis server itself was under no pressure at all.

There were other similar cache-related failures that are not listed here. During the failure reviews we realized that a number of core scenarios use Redis caching both as a core dependency and as storage, and that we had done nothing in those scenarios to prevent or handle Redis problems.

Since our core business relies heavily on Redis, we carried out a dedicated cache governance effort to prevent similar failures from recurring and to be prepared when they do occur.


Governance plan

1. High availability governance: This is the most important part, and it has nothing to do with the high-availability deployment of Redis itself. The core principle is that the availability of the business should not depend entirely on the components it uses: a component failure must not mean a business failure.

First, prefer quick recovery. In general this approach suits basic data sets: the data can be rebuilt in a short time through scheduled tasks or manually triggered interfaces, with hotspot data recovered first. "Short time" here means within 2 minutes for scenarios that can affect ATP, and within 10 minutes for scenarios that affect user experience.

Second, consider keeping multiple copies of core data. Core business scenarios can cache the data in different Redis namespaces/clusters or in multiple cache components (Redis + Tair, Redis + local memory), so that when one copy fails the scenario can recover quickly by switching to the alternate cache (see the sketch at the end of this item).

Next, consider manual degradation: bypass the cache entirely, or replace it with another data channel. Lossless degradation is preferred; lossy degradation can be considered when necessary.

Finally, we also found "application A writes, applications B and C read" situations in the system. These require the upstream and downstream teams to agree on the final plan; we recommend quick recovery for them, and the consuming applications can make additional preparations as needed.
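As a rough illustration of the multi-copy idea above, here is a minimal sketch (class names and read logic are hypothetical, not our production code) of a read path that tries the primary Redis namespace first and falls back to a backup cache when the primary fails or has been disabled by an ops switch:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

/**
 * Hypothetical sketch of a "multiple copies" read path: try the primary
 * Redis namespace first, then fall back to a backup copy when the primary
 * is unavailable or a manual switch has been flipped.
 */
public class MultiCopyCacheReader {

    private final JedisPool primaryPool;   // e.g. the main quotation namespace
    private final JedisPool backupPool;    // e.g. a separate namespace/cluster
    private volatile boolean primaryEnabled = true;  // manual degradation switch

    public MultiCopyCacheReader(JedisPool primaryPool, JedisPool backupPool) {
        this.primaryPool = primaryPool;
        this.backupPool = backupPool;
    }

    public String get(String key) {
        if (primaryEnabled) {
            try (Jedis jedis = primaryPool.getResource()) {
                String value = jedis.get(key);
                if (value != null) {
                    return value;
                }
            } catch (Exception e) {
                // Primary cache failed: log and fall through to the backup copy.
            }
        }
        try (Jedis jedis = backupPool.getResource()) {
            return jedis.get(key);   // may still be null; the caller decides what to do
        } catch (Exception e) {
            // Both copies failed: let the caller degrade (e.g. query the DB or a Dubbo service).
            return null;
        }
    }

    /** Flipped by an ops switch when the primary namespace is known to be bad. */
    public void disablePrimary() {
        this.primaryEnabled = false;
    }
}
```

A real implementation would also have to keep both copies written on every update, which this sketch omits.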

2. Parameter tuning: this mainly means reviewing the timeout values and connection/thread counts in the Redis client configuration.

From the monitoring we found that in most scenarios, reading and updating the cache through Redis takes on the order of a few milliseconds (including the time to obtain a connection). In practice, however, many scenarios never configured these parameters deliberately; a lot of them were simply copied from some old example with timeouts of several hundred milliseconds.

We therefore required every Redis configuration in the applications to be reviewed and set to reasonable values.
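The sketch below shows the kind of settings this review targets, using the Jedis client as an example; the host, port and concrete numbers are illustrative assumptions, not our production values:

```java
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

/**
 * Illustrative Redis client configuration. Normal calls finish in a few
 * milliseconds, so timeouts copied from an old example (hundreds of ms)
 * only amplify failures by holding Tomcat threads.
 */
public class RedisClientConfig {

    public static JedisPool buildPool() {
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(64);      // enough connections for peak concurrency
        poolConfig.setMaxIdle(32);
        poolConfig.setMinIdle(8);
        // Fail fast instead of queueing behind a dead or slow Redis: a request
        // that cannot get a connection quickly should error out rather than
        // block a Tomcat thread for hundreds of milliseconds.
        poolConfig.setMaxWaitMillis(50);

        // Connection/socket timeout in milliseconds, sized to the observed
        // few-millisecond call latency plus headroom.
        int timeoutMillis = 50;
        return new JedisPool(poolConfig, "redis.example.internal", 6379, timeoutMillis);
    }
}
```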

3. Additional governance details

1) Replace Memcached with Redis: our Redis component has more advantages than Memcached, the main one being that cache operations and maintenance are handled by the company's DBAs.

2) Unify the configuration file format: at present the online systems have many differently formatted configuration files, which makes the right one troublesome to locate. During a failure, the relevant configuration must be found quickly.

3) Improve monitoring: make sure the Redis call volume and latency for each business scenario (including the volume and latency of exceptions) can be found in the monitoring system (a sketch of such instrumentation follows this list).
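The following sketch shows one way such per-scenario instrumentation could look; the Metrics facade is a placeholder standing in for whatever monitoring client is actually used:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/**
 * Hypothetical per-scenario Redis instrumentation: record call volume and
 * latency for each business scenario, with failures counted separately so
 * a failing Redis shows up in the dashboards immediately.
 */
public class MonitoredCacheAccess {

    public <T> T call(String scenario, Supplier<T> redisCall) {
        long start = System.nanoTime();
        try {
            T result = redisCall.get();
            record(scenario + ".success", start);
            return result;
        } catch (RuntimeException e) {
            record(scenario + ".error", start);
            throw e;
        }
    }

    private void record(String name, long startNanos) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        Metrics.count(name);                  // placeholder monitoring facade
        Metrics.recordLatency(name, elapsedMs);
    }

    /** Placeholder for the real monitoring client. */
    static class Metrics {
        static void count(String name) { /* report a counter */ }
        static void recordLatency(String name, long ms) { /* report a timer */ }
    }
}
```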


The governance process

1. Sort out the caches used by the core scenarios. This step mainly covers which core scenarios use the cache, the impact of a failure, and the data volume involved.

2. Determine the overall governance plan. Once a general plan exists, each system owner in the group reviews its details and adds anything that was overlooked. This can be done in parallel with the sorting-out step, and the final plan is settled once that step is complete.

3. Review the governance details of each scenario. The development, application, and QA leads review the details for each scenario and agree on clear standards.

4. Develop and self-test each scenario according to the agreed governance plan. If problems are found along the way, changes are discussed and the revised plan is implemented.

5. During development, compile an emergency manual organized by failure scenario and by application.

6. Test and rehearse. Developers walk QA through the governance scenarios and solutions again, and QA verifies them against the compiled manual and rehearses them in the beta environment.

7. Go live and build monitoring dashboards. After the code is online, build per-application monitoring dashboards for daily drills and for quick inspection during a failure.

8. Drill online. During low-traffic periods, the steps in the emergency manual are verified online, problem points are fixed, and the drill is repeated until it meets expectations.


Achievements and Summary

At present, most P1 systems in the group have completed cache governance and drills, which took more than 60 person-days. The developers involved dug into many details of Redis and deepened their understanding of it.

The sorting-out done at the start of cache governance deepened the team's understanding of the systems, and the resulting wiki has been very helpful to other colleagues and to newcomers.

The monitoring dashboards produced by cache governance help locate faults quickly during routine inspections.

The emergency manual produced by cache governance can greatly shorten the duration of an actual failure.

A recent case is worth noting: hotel picture data in the basic data group's DB was accidentally written with dirty values, which indirectly caused dirty data in our Redis (this part of the cache is populated by user requests). After the DB data was restored, we used the reserved degradation switch to fetch picture data directly from the basic service via its Dubbo interface instead of the cache, and the fault was recovered within 1 minute of flipping the switch.
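A minimal sketch of that kind of reserved degradation path is shown below; the interface and key names are hypothetical, and in practice the switch would be driven by a config center rather than a setter:

```java
/**
 * Hypothetical degradation path: when the cache is known to contain dirty
 * data, a switch makes the read path bypass Redis and call the basic
 * service's Dubbo interface (the source of truth) instead.
 */
public class HotelPictureReader {

    /** Minimal cache abstraction; could be backed by the multi-copy reader sketched earlier. */
    public interface Cache {
        String get(String key);
    }

    /** Hypothetical Dubbo interface exposed by the basic data service. */
    public interface HotelPictureService {
        String queryPictureData(String hotelId);
    }

    private final Cache cache;
    private final HotelPictureService pictureService;
    private volatile boolean bypassCache = false;   // flipped via an ops switch

    public HotelPictureReader(Cache cache, HotelPictureService pictureService) {
        this.cache = cache;
        this.pictureService = pictureService;
    }

    public String getPictureData(String hotelId) {
        if (!bypassCache) {
            String cached = cache.get("hotel:picture:" + hotelId);
            if (cached != null) {
                return cached;
            }
        }
        // Degraded (or cache-miss) path: read from the source of truth over Dubbo.
        return pictureService.queryPictureData(hotelId);
    }

    public void setBypassCache(boolean bypassCache) {
        this.bypassCache = bypassCache;
    }
}
```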