Summary: A must-read for Redis developers. Learn the lesson!

  • A Redis KEYS command triggered an avalanche on an RDS database. RDS went down twice, resulting in the loss of millions of funds
  • Author: Chen Haoxiang

Reproduced by Fundebug with authorization; copyright belongs to the original author.

Production incidents have been frequent lately. On September 19, 2018, SF Express had a production database deletion incident, which I will not go into here.

Here I will describe the recent incidents at our company, how to avoid them, and how we handled the optimization afterwards.

There are many indirect causes: technology could not keep up with business growth (going from millions to tens of millions per day is a big leap), system optimization had low priority at the company, and there was a shortage of developers.

The first outage

At one point on September 13, 2018, the RDS instance used by one of the company’s service-oriented projects saw its connections surge and its CPU reach 100%, and it rejected all service requests from other applications.

The whole process is as follows:

  • Monitoring raised an alarm that RDS CPU usage had exceeded 80%, and the DBA stepped in and prepared to kill slow SQL
  • Within 1 minute, no obviously blocking SQL had been found and the CPU kept rising to 99%
  • Within 5 minutes, a large number of applications raised alarms and refused service, RDS monitoring showed a large number of slow SQL statements, and we contacted the database provider for assistance
  • Within 8 minutes, a primary/standby switchover was performed (this hurts services, but there was no other option because the problem could not be located)
  • Within 9 minutes, some services recovered, but the backlog of callback messages for some service orders exceeded 200,000, and the CPU usage of the standby database kept rising
  • Within 15 minutes, the CPU usage of the standby database exceeded 97% and services were interrupted again. We switched back to the primary database and applied traffic limiting
  • At 20 minutes, the traffic entry points of some secondary applications were closed
  • Within 25 minutes, the CPU usage of the primary database returned to normal
  • Within 30 minutes, the rate-limited applications were gradually re-enabled
  • Within 35 minutes, all applications had returned to normal
  • Next, an emergency team was set up with the database provider to urgently optimize SQL that might be slow. This may have fixed some slow SQL, but the specific problem was not located, which set the stage for the outage that recurred a few days later

Impact of the accident

The services of the service-oriented project were unavailable for tens of minutes, resulting in hundreds of thousands fewer orders and the loss of millions of funds.

Cause analysis

At the time, the specific cause was not identified, but the following factors were considered possible contributors to the outage.

The business of the service-oriented project is growing very fast. At peak, database QPS exceeds 35,000 and the system is under high load.

If several full-table-scan SQL statements run at the same time during peak hours, database pressure rises sharply, application timeouts increase, front-end requests time out, users retry, and traffic surges, forming an avalanche effect.

The main causes are the poor SQL query performance of some old projects and their use of the primary database, which puts more pressure on it. Database QPS is too high, but the caching scheme had not been implemented for lack of staff, so fixing slow SQL should be given higher priority.

Improvement plan

  • Create a separate database account for each application and use it strictly according to the specifications
  • Implement the cache optimization plan immediately. Handle slow SQL first, and deal with the slow SQL already discovered (query time over 1 s) in one concentrated effort
  • Upgrade the database configuration
  • Migrate non-core business to the new RDS instance

The second outage

Since the cause of the previous outage was not found, this outage was to be expected.

September 19, 2018: same recipe, same taste. On the same RDS instance, the CPU spiked to 100%, followed by denial of service and an outage. With the experience of the first outage, a direct primary/standby switchover restored all business within a few seconds, but the outage still seriously affected the company’s business and image.

Cause analysis

When business resumed, the company called an emergency review meeting, which, of course, I was not senior enough to attend. The company’s senior management, senior technical architects, DBAs, and project leaders met together.

At this meeting, after checking the logs of each project and the backend monitoring data, they found that when the CPU of that RDS database soared, the memory usage of one Redis instance had climbed to nearly 100% and then dropped sharply. The first outage showed the same pattern.

The next step was to contact the database provider and pull every command run on that Redis instance in the last week. It turned out that a keys *…* command had been run at that point in time. An engineer at the company had run a fuzzy-match keys command to clear useless keys, without considering that a fuzzy-match keys * blocks Redis. Redis was blocked and its CPU surged, so every call chain timed out and hung, waiting out the few seconds during which Redis was blocked. All the request traffic fell through to the RDS database, causing an avalanche and bringing the database down.

Improvement plan

  • All production operations must go through O&M, and the O&M department will progressively and promptly revoke all direct permissions
  • Add Redis instances to separate workloads
  • If you need pattern matching with the keys command, use the scan command instead
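
As a rough illustration of the last point, here is a minimal sketch of clearing keys by pattern with SCAN instead of KEYS. It assumes a client object that behaves like redis-py (providing scan_iter and unlink); the function name, pattern, and batch size are made up for this example and are not what our engineer ran.

```python
from typing import Any, List


def delete_matching(client: Any, pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching `pattern` in small batches; return how many were removed.

    `client` is assumed to behave like a redis-py client, i.e. to provide
    scan_iter(match=..., count=...) and unlink(*names).
    """
    removed = 0
    batch: List[Any] = []
    # SCAN walks the keyspace one cursor step at a time, so other commands
    # can run between steps instead of waiting for a single O(N) KEYS call.
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            removed += client.unlink(*batch)
            batch = []
    # Flush whatever is left over after the scan finishes.
    if batch:
        removed += client.unlink(*batch)
    return removed
```

With redis-py this might be called as delete_matching(redis.Redis(), "tmp:*"). UNLINK requires Redis 4.0 or later; on older servers client.delete(*batch) would be the substitute. The total work is still proportional to the number of keys, it is just spread over many short commands.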

Conclusion

The two accidents in this case were caused entirely by human operation. If the engineer had read the Redis development specification, he would have seen that keys should be disabled. In addition, commands run in production must first be evaluated by operations. Presumably the engineer was a senior employee with permission to operate directly.

In addition, the company’s business is growing really fast and technology cannot keep up, which is very dangerous and greatly increases the probability of downtime.

When business volume was small, the engineer’s operation would have caused no problem at all, since there was little concurrency. But now the business volume has doubled and doubled again, while the technical capacity has not grown as fast.

On the other hand, the company is short of technical staff. Most of them build new features while maintaining old projects, far fewer work on refactoring and optimization, and project optimization has low priority, which is also a major cause. Similar situations are likely to happen again, and building the new service-oriented architecture is urgent.

Finally, you cannot be too careful with any command you run in production, because you cannot afford an accident caused by a single symbol.

Redis development recommendations

Some Redis development specifications and suggestions are attached below.

1. Separate hot and cold data. Do not put all data into Redis

Although Redis supports persistence, all Redis data is stored in memory, which is expensive. It is recommended to store only high-frequency hot data [QPS greater than 5000] in Redis, and to keep low-frequency cold data in disk-based storage such as MySQL/ElasticSearch. This not only saves memory cost; Redis also operates faster and more efficiently on a smaller data set!

2. Store different service data separately

Do not put unrelated business data into one Redis instance. It is recommended that each new business apply for a separate instance. Because Redis processes commands in a single thread, separate storage reduces interference between different services and improves response speed. It also keeps a single instance’s in-memory data from growing too large, and lets service recover faster when something goes wrong! In practice, the biggest bottleneck of Redis is usually the CPU. Because it works in a single thread, it easily saturates one logical CPU. You can use a Redis proxy or a distributed solution to improve the CPU utilization of Redis.

3. Set the timeout period for the Key to be stored

If your application uses Redis as a cache, make sure to set a timeout on the keys stored in it. If you do not, these keys will occupy server memory indefinitely, which is a huge waste, and memory usage will keep growing over time until it reaches the server’s limit. In addition, the timeout length should be based on a comprehensive evaluation of the business; longer is not better!
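
One way to make this hard to forget is to route every cache write through a small helper that refuses to store a key without an expiry. The following is only a sketch under assumptions: the client is expected to behave like redis-py (whose set() accepts an ex argument in seconds), and cache_set and DEFAULT_TTL_SECONDS are hypothetical names.

```python
from typing import Any

DEFAULT_TTL_SECONDS = 3600  # hypothetical default; choose it per business case


def cache_set(client: Any, key: str, value: Any,
              ttl_seconds: int = DEFAULT_TTL_SECONDS) -> None:
    """Write a cache entry that always carries an expiry.

    `client` is assumed to behave like a redis-py client.
    """
    if ttl_seconds <= 0:
        raise ValueError("cache keys must have a positive TTL")
    # `ex` sets the expiry in seconds, so the key cannot live forever.
    client.set(key, value, ex=ttl_seconds)
```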

4. Compress and store large text data that must be stored

When writing large text (over 500 bytes) to Redis, be sure to compress it before storing! Large text in Redis not only consumes a lot of memory; under heavy traffic it can easily saturate the network card, making every service on the server unavailable and triggering an avalanche effect that brings down each system!
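
A minimal sketch of this idea, using zlib from the Python standard library. The one-byte marker that tells compressed values from raw ones is an assumption of this example, not a Redis feature, and the helper names are hypothetical.

```python
import zlib

COMPRESS_THRESHOLD = 500  # bytes, the threshold suggested above
RAW = b"\x00"             # marker: payload stored as-is
ZLIB = b"\x01"            # marker: payload is zlib-compressed


def encode_value(text: str) -> bytes:
    """Return the bytes to store in Redis, compressing large text."""
    data = text.encode("utf-8")
    if len(data) > COMPRESS_THRESHOLD:
        return ZLIB + zlib.compress(data)
    return RAW + data


def decode_value(blob: bytes) -> str:
    """Reverse encode_value on bytes read back from Redis."""
    marker, payload = blob[:1], blob[1:]
    if marker == ZLIB:
        payload = zlib.decompress(payload)
    elif marker != RAW:
        raise ValueError("unknown encoding marker")
    return payload.decode("utf-8")
```

Small values skip compression because the overhead would outweigh the savings. How much is saved on large values depends entirely on how compressible the text is.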

5. Do not use Keys pattern matching on production Redis

Redis processes commands in a single thread. With a large number of keys in production, the keys command is very inefficient [time complexity is O(N)]. Once executed, it seriously blocks the normal requests of other commands, and under high QPS it can directly crash the Redis service. If you have this kind of requirement, use the scan command instead.

6. Reliable message queuing services

Redis List is often used for message queue services. Suppose the consumer program crashes right after it retrieves a message from the queue. Since the message has already been removed but was not processed, it is effectively lost, which may result in business data loss or inconsistent business state.

To avoid this, Redis provides the RPOPLPUSH command, which atomically removes a message from the main message queue and inserts it into a backup queue. The consumer removes the message from the backup queue only after it completes its normal processing logic. You can also run a daemon that, when it finds expired messages in the backup queue, puts them back into the main message queue so that other consumer programs can continue processing them.
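
The consumer side of this pattern can be sketched roughly as follows. The client is assumed to behave like redis-py (rpoplpush and lrem with the name, count, value argument order), and the queue names are hypothetical. The daemon that requeues stale messages is left out, since it needs some way to timestamp messages that the text does not specify.

```python
from typing import Any, Callable, Optional

MAIN_QUEUE = "queue:orders"            # hypothetical key names
BACKUP_QUEUE = "queue:orders:backup"


def consume_one(client: Any, handler: Callable[[Any], None]) -> Optional[Any]:
    """Move one message to the backup queue, process it, then acknowledge it.

    `client` is assumed to behave like a redis-py client.
    """
    # RPOPLPUSH pops from the tail of the main queue and pushes onto the
    # head of the backup queue atomically; it returns None when empty.
    message = client.rpoplpush(MAIN_QUEUE, BACKUP_QUEUE)
    if message is None:
        return None
    # If the handler raises or the process dies here, the message is still
    # in the backup queue and can be recovered later.
    handler(message)
    # Acknowledge: remove one occurrence of the message from the backup queue.
    client.lrem(BACKUP_QUEUE, 1, message)
    return message
```

Since Redis 6.2, RPOPLPUSH is deprecated in favor of LMOVE, which can express the same move, so newer deployments may prefer it.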

7. Be cautious with full operations on collection structures such as Hash and Set

When a HASH structure is used to store object attributes, there are only a limited number of fields at the beginning, and HGETALL is often used to fetch all members, which is very efficient. However, as the business develops, the fields expand to dozens or even hundreds. If HGETALL is still used at that point, efficiency drops sharply and the network adapter is frequently saturated [time complexity O(N)]. In this case, it is recommended to split the hash into multiple Hash structures based on business needs. Or, if most operations fetch all the attributes, you can serialize all the attributes and store them as a STRING! The same applies when you use SMEMBERS on the SET structure type!
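
Besides splitting the hash, a caller that only needs a few attributes can ask for exactly those fields instead of the whole hash. This is a small sketch using HMGET, which is my own suggestion rather than something from the specification above; the client is assumed to behave like redis-py and get_fields is a hypothetical helper name.

```python
from typing import Any, Dict, Sequence


def get_fields(client: Any, key: str, fields: Sequence[str]) -> Dict[str, Any]:
    """Fetch only the named hash fields instead of the whole hash.

    `client` is assumed to behave like a redis-py client.
    """
    # HMGET returns values in the same order as the requested fields,
    # with None for fields that do not exist in the hash.
    values = client.hmget(key, list(fields))
    return dict(zip(fields, values))
```

The cost of HMGET grows with the number of fields requested rather than with the size of the hash, so the reply stays small even when the hash keeps gaining attributes.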

8. Use different data structure types based on service scenarios

Redis currently supports many data structure types: String, Hash, List, Set, Sorted Set, Bitmap, HyperLogLog, geospatial indexes, and so on. Choose an appropriate type according to the business scenario.

Common examples: String can be used as a plain K-V store or a counter; Hash can represent an object, such as a product or a broker, that carries multiple attributes; List can be used for message queues, fan/follow lists, etc.; Set can be used for recommendations; Sorted Set can be used for leaderboards and so on!

9. Naming conventions

Although Redis supports multiple databases (16 by default, more can be configured), every database other than the default database 0 requires an extra request (SELECT) to switch to it. So it’s probably smarter to use prefixes as namespaces.

In addition, when using prefixes as namespaces to separate different keys, it is best to define them in global configuration in the program and avoid writing prefixes directly throughout the code, which is hard to maintain.

For example: system name:service name:service data:other

Note, however, that key names should not be too long; keep them as clear and easy to understand as possible. You need to weigh this trade-off yourself.
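
A minimal sketch of keeping the prefix in one place and building every key through a single helper. The prefix value, the length limit, and the make_key name are all hypothetical and only illustrate the convention.

```python
KEY_PREFIX = "shop:order"  # hypothetical "system:service" prefix, defined once
MAX_KEY_LENGTH = 64        # hypothetical limit to keep key names short


def make_key(*parts: object) -> str:
    """Build a namespaced key such as 'shop:order:detail:42'."""
    key = ":".join([KEY_PREFIX, *(str(p) for p in parts)])
    if len(key) > MAX_KEY_LENGTH:
        raise ValueError(f"key too long ({len(key)} chars): {key}")
    return key
```

Callers then write make_key("detail", order_id) instead of concatenating strings themselves, so renaming the prefix is a one-line change.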

10. Do not run the monitor command online

Do not use the monitor command in the production environment. Under high concurrency, the monitor command may cause memory usage to balloon and degrade Redis performance.

11. Disable large strings

Core clusters forbid string values of 1MB or more (although Redis supports strings up to 512MB). If a 1MB value is written 10 times per second, that alone is 10MB of network I/O per second.
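
Such a rule is easier to enforce in the client than to police afterwards. The guard below is only a sketch: the 1MB limit comes from the text above, while the function names are hypothetical and the check itself is an assumption about how one might apply the rule.

```python
MAX_VALUE_BYTES = 1024 * 1024  # the 1MB limit from the rule above


def check_value_size(value: bytes) -> bytes:
    """Return the value unchanged, or raise if it is too large to store."""
    if len(value) >= MAX_VALUE_BYTES:
        raise ValueError(f"value of {len(value)} bytes exceeds the 1MB limit")
    return value


def write_traffic_per_second(value_bytes: int, writes_per_second: int) -> int:
    """Network bytes per second generated by writing one value repeatedly."""
    return value_bytes * writes_per_second
```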

12. Redis capacity

It is recommended to keep the memory size of a single instance under 10 to 20GB, and to limit the number of keys in a Redis instance to 10 million. If a single instance holds too many keys, expired keys may not be reclaimed in time.

13. Reliability

Monitor the health status of Redis regularly: use Redis health monitoring tools to periodically collect the output of Redis info. Use connection pooling (long-lived connections and automatic reconnection) for client connections whenever possible.
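
As a rough idea of what a periodic check could look at, here is a sketch that reads the info output and the key count and returns warnings. The client is assumed to behave like redis-py (info() returning a dict and dbsize() returning an integer); the thresholds and the health_warnings name are hypothetical.

```python
from typing import Any, Dict, List

MAX_MEMORY_RATIO = 0.8   # hypothetical alert threshold
MAX_KEYS = 10_000_000    # key-count guideline from item 12


def health_warnings(client: Any) -> List[str]:
    """Return human-readable warnings based on INFO and DBSIZE.

    `client` is assumed to behave like a redis-py client.
    """
    info: Dict[str, Any] = client.info()
    warnings: List[str] = []
    used = info.get("used_memory", 0)
    limit = info.get("maxmemory", 0)  # 0 means no memory limit is configured
    if limit and used / limit > MAX_MEMORY_RATIO:
        warnings.append(f"memory at {used / limit:.0%} of maxmemory")
    keys = client.dbsize()
    if keys > MAX_KEYS:
        warnings.append(f"{keys} keys exceeds the {MAX_KEYS} guideline")
    return warnings
```

A scheduler or monitoring agent would call this every so often and forward any non-empty result to the alerting channel.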