Production incidents have been frequent across the Internet industry lately. SF Express had a production database deletion incident on September 19, 2018, which I will not go into here.

Here I will describe the recent incidents at our company, how they could have been avoided, and how we handled the follow-up optimization.

There were many indirect causes: technology did not keep up with business growth (going from millions to tens of millions a day is a big leap), system optimization was given low priority, and the company was short of developers.

The first outage

At one point on September 13, 2018, the number of connections to the RDS instance behind one of the company’s service-oriented projects soared, its CPU usage rose to 100%, and all requests from other applications were rejected. The sequence of events was as follows:

1. Monitoring raised an alarm when the CPU usage of RDS exceeded 80%. The DBA stepped in and prepared to kill slow SQL.

2. Within 1 minute, no obviously blocking SQL had been found, and CPU usage kept rising to 99%.

3. Within 5 minutes, a large number of applications raised alarms and requests were being rejected. RDS monitoring showed a large number of slow SQL statements, so we contacted the database service provider for help.

4. Within 8 minutes, we switched over from the primary to the standby database (this disrupts service, but we had no way to locate the problem).

5. Within 9 minutes, some services recovered, but the backlog of callback messages for some service orders exceeded 200,000, and CPU usage on the standby database kept rising.

6. Within 15 minutes, CPU usage on the standby database exceeded 97% and services were interrupted again.

7. Within 20 minutes, the traffic entry points for secondary applications were closed.

8. Within 25 minutes, CPU usage on the primary database recovered.

9. Within 30 minutes, the throttled applications were re-enabled step by step.

10. Within 35 minutes, all applications were restored.

11. Afterwards, we set up an emergency team with the database service provider to urgently optimize the slow SQL we could find. That may have fixed some slow queries, but it did not locate the actual cause, and another outage followed a few days later.

The accident impact

One service-oriented project was unavailable for tens of minutes, which cost hundreds of thousands of orders and millions in lost funds.

Cause analysis

The specific cause was not identified at the time, but the following factors could all have contributed to the outage. The business behind this service-oriented project was growing very fast: at peak, database QPS exceeded 35,000 and the system was under heavy load. If several full-table-scan queries run at the same time during peak hours, database pressure rises sharply, backend applications time out, front-end applications time out in turn, users retry, and traffic surges, producing an avalanche effect. The main suspects were poorly performing SQL queries in some old projects, which ran against the primary database and therefore hit it harder. Database QPS was too high, yet the caching plan had not been implemented for lack of staff, so fixing slow SQL needed a higher priority.

The improved scheme

1. Create a database account for each application and use the account strictly according to the specifications

2. Implement the cache optimization plan immediately. Deal with slow SQL first, handling in one batch all the slow queries already identified (query time over 1 s).

3. Upgrade the database configuration

4. Migrate non-core business to the new RDS instance

The second outage

Because the cause of the first outage was never found, this one was predictable. On September 19, 2018 it was the same “recipe” with the same “flavor”: on the same RDS instance, CPU spiked to 100%, followed by rejected requests and an outage. With the experience of the first time, a direct primary/standby switchover restored all services within a few seconds, but the company’s business and image were still seriously affected.

Cause analysis

Once service had resumed, the company called an emergency review meeting, which of course I was too junior to attend. Senior management, senior technical architects, DBAs, and project leaders took part. After going through each project’s logs and the back-end monitoring data, it was found that at the moment the CPU of that RDS database soared, the memory usage of one Redis instance was close to 100% and then dropped sharply. The first outage showed a similar pattern. The next step was to ask the database service provider to pull every command executed on that Redis instance over the previous week, and it turned out that a keys *…* command had been run at that point in time. An engineer had used keys with a wildcard pattern to clean up unused keys, without considering that keys blocks Redis while it walks the entire keyspace. Redis was blocked for several seconds and its CPU surged, so every call chain that depended on it timed out and stalled while waiting for Redis to respond again. All of that request traffic then fell through to the RDS database, causing an avalanche that brought the database down.

The improved scheme

1. All operations on production must be approved by O&M before they are carried out, and the O&M department will progressively but quickly revoke all other permissions

2. Add Redis instances and split the data across them

3. If you need what the keys command does, use the scan command instead
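As a rough sketch of item 3, the cleanup the engineer attempted could be done incrementally with SCAN. This assumes a client object with redis-py style `scan` and `unlink` methods; the function name, the pattern, and the batch size are illustrative and not the company’s actual tooling.

```python
from typing import Any


def delete_by_pattern(client: Any, pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching `pattern` in small batches instead of one blocking KEYS call.

    `client` is assumed to expose redis-py style `scan` and `unlink` methods.
    """
    deleted = 0
    cursor = 0
    while True:
        # SCAN returns the next cursor plus a (possibly empty) batch of keys.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_size)
        if keys:
            # UNLINK (Redis 4.0+) reclaims memory outside the main thread.
            deleted += client.unlink(*keys)
        if cursor == 0:  # a cursor of 0 means the iteration is complete
            break
    return deleted
```

Each SCAN call does only a bounded amount of work, so other commands get served between batches. With redis-py this might be called as `delete_by_pattern(redis.Redis(), "tmp:*")`, where `tmp:*` is a made-up pattern.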

Conclusion

Both outages in this case were caused entirely by human operation. Had the engineer read the Redis development guidelines, he would have seen that they advise disabling keys. In addition, any command run in production must be reviewed before it is executed. My guess is that the engineer was a long-serving employee who had the permissions and simply ran the command directly.

The company’s business is also growing very quickly while its technology lags behind, which is very dangerous and greatly increases the probability of an outage. When the business was small, what the engineer did would have been perfectly fine, because concurrency was low. Now the business has multiplied many times over, and the technology has not scaled as fast. The technical team is understaffed, and most people are busy maintaining old projects and building new features, so few hands are left for refactoring and optimization. Optimization work is given low priority, which is another major cause. Similar incidents are likely to happen again, and building the new service-oriented architecture is urgent.

And finally, you cannot be too careful with any command you run in production, because you cannot afford an incident caused by a single character you typed.

Redis development recommendations

To close, here are some Redis development guidelines and suggestions.

1. Separate hot and cold data. Do not put all data into Redis

Although Redis supports persistence, it keeps all of its data in memory, which is expensive. Based on your business, store only hot, high-frequency data in Redis [QPS greater than 5000]. Low-frequency data can live in MySQL, Elasticsearch, or another disk-based store. This saves memory cost, and with less data Redis also operates faster and more efficiently.

2. Store different service data separately

Do not put unrelated business data into a single Redis instance; new business should apply for its own instance. Because Redis processes commands on a single thread, separate storage reduces interference between services and improves response times. It also keeps any single instance’s memory from growing too large, so service can be restored faster after a failure. In practice, the biggest bottleneck of Redis is usually the CPU: because it is single-threaded, it easily saturates one logical core. A Redis proxy or a distributed setup lets the workload use more cores.

3. Set an expiration time on stored keys

If your application uses Redis as a cache, make sure to set an expiration time on the keys you store. Otherwise those keys keep occupying memory, which is a huge waste, and memory usage keeps growing over time until it reaches the server’s limit. The expiration time should be chosen from an overall assessment of the business; longer is not better.
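A minimal sketch of this rule, assuming a redis-py style client whose `set` accepts an `ex` argument (expiration in seconds). The function name, the `loader` callback, and the default TTL are hypothetical.

```python
import json
from typing import Any, Callable


def get_with_cache(client: Any, key: str, loader: Callable[[], dict],
                   ttl_seconds: int = 300) -> dict:
    """Read-through cache in which every key written carries an expiration time."""
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    value = loader()  # fall back to the database (hypothetical callable)
    # `ex` sets the TTL in seconds, so the key cannot stay in memory forever.
    client.set(key, json.dumps(value), ex=ttl_seconds)
    return value
```

Routing all cache writes through one helper like this makes it hard to store a key without a TTL by accident.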

4. Compress and store large text data that must be stored

When writing large text (over 500 bytes) to Redis, be sure to compress it first. Large text values not only consume a great deal of memory; under heavy page views they can also saturate the network card, making every service on that server unavailable and triggering an avalanche that brings down other systems.

5. Do not use keys pattern matching on production Redis

Redis processes commands on a single thread. With a large number of keys in production, the keys command is very inefficient [time complexity O(N)]. Once executed, it blocks the normal requests of other commands, and under high QPS it can directly bring the Redis service down. If you have this kind of requirement, use the scan command instead.
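One possible way to apply this, using only the Python standard library. The two-byte markers `z:` and `r:` are an invented convention so the reader can tell compressed values from raw ones; the 500-byte threshold comes from the text above.

```python
import zlib

COMPRESS_THRESHOLD = 500  # bytes; the threshold suggested in the text


def encode_value(text: str) -> bytes:
    """Compress values above the threshold; tag every value with a marker."""
    raw = text.encode("utf-8")
    if len(raw) > COMPRESS_THRESHOLD:
        return b"z:" + zlib.compress(raw)
    return b"r:" + raw


def decode_value(blob: bytes) -> str:
    """Reverse encode_value using the marker prefix."""
    marker, payload = blob[:2], blob[2:]
    if marker == b"z:":
        payload = zlib.decompress(payload)
    return payload.decode("utf-8")
```

The encoded bytes are what would be written to Redis. Small values skip compression because the CPU cost is not worth it for them.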

6. Reliable message queuing services

Redis lists are often used as message queues. Suppose a consumer crashes right after it retrieves a message from the queue: the message has already been removed but was never processed, so it is lost, which may lead to lost business data or inconsistent business state. To avoid this, Redis provides the RPOPLPUSH command, which atomically removes a message from the main queue and inserts it into a backup queue; the consumer removes it from the backup queue only after its normal processing has completed. You can also run a daemon that puts messages back on the main queue when it finds they have sat in the backup queue too long, so that other consumers can continue processing them.

7. Be careful with full operations on collection structures such as Hash and Set

When a Hash is used to store an object’s attributes, there are only a few fields at first, and HGETALL is often used to fetch all members, which is efficient enough. As the business grows, however, the fields can expand to hundreds or even thousands. HGETALL then becomes much slower and frequently saturates the network card [time complexity O(N)]. In that case, split the hash into several hashes by business area. Alternatively, if most operations need all the attributes anyway, serialize them into a single String value. The same applies to SMEMBERS on the Set type.
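The consumer side of this pattern might look roughly like the following. It assumes a redis-py style client with `rpoplpush(src, dst)` and `lrem(name, count, value)`; the queue names and the `handler` callback are made up for illustration, and the recovery daemon is not shown.

```python
from typing import Any, Callable, Optional

MAIN_QUEUE = "orders:pending"        # hypothetical queue names
BACKUP_QUEUE = "orders:processing"


def consume_one(client: Any, handler: Callable[[bytes], None]) -> Optional[bytes]:
    """Process a single message without losing it if the handler crashes."""
    # Atomically move the message from the main queue to the backup queue.
    message = client.rpoplpush(MAIN_QUEUE, BACKUP_QUEUE)
    if message is None:
        return None  # the main queue is empty
    handler(message)  # if this raises, the message stays in the backup queue
    # Remove one copy of the message from the backup queue after success.
    client.lrem(BACKUP_QUEUE, 1, message)
    return message
```

Note that since Redis 6.2 RPOPLPUSH is documented as deprecated in favor of LMOVE, which offers the same atomic move.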
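Where only a few attributes are needed, one alternative to HGETALL is to request just those fields. The sketch below assumes a redis-py style client with `hmget(name, keys)`; the helper name is hypothetical.

```python
from typing import Any, Dict, List, Optional


def get_fields(client: Any, key: str, fields: List[str]) -> Dict[str, Optional[bytes]]:
    """Fetch only the requested hash fields instead of the whole hash."""
    # One round trip; the cost grows with the number of requested fields,
    # not with the total size of the hash.
    values = client.hmget(key, fields)
    return dict(zip(fields, values))
```

Fields that do not exist come back as `None`, so callers can tell a missing attribute from an empty one.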

8. Use different data structure types based on service scenarios

Redis currently supports many data structure types: String, Hash, List, Set, Sorted Set, Bitmap, HyperLogLog, geospatial indexes, and so on. Choose the appropriate type for each business scenario. Common examples: a String can serve as a plain key-value pair or a counter; a Hash can represent an object, such as a product or a broker, that carries multiple attributes; a List can be used for message queues and fan/follow lists; a Set can be used for recommendations; a Sorted Set can be used for leaderboards.

9. Naming conventions

Although Redis supports multiple databases (16 by default, more can be configured), every database other than the default database 0 requires an extra SELECT request. Using prefixes as namespaces is therefore probably the smarter choice. When you do, define the prefixes in global configuration rather than hard-coding them throughout the code, which is hard to maintain. For example: system name:business name:business data:other. Note that key names should not be too long while staying as clear and readable as possible; you have to weigh this yourself.
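To illustrate, key construction can be centralized in one small helper so the prefix lives in a single place. The prefix value, the length limit, and the function name below are all hypothetical.

```python
KEY_PREFIX = "shop"      # hypothetical system name; load from global config in practice
MAX_KEY_LENGTH = 64      # illustrative limit to keep keys short


def build_key(business: str, entity: str, identifier: object) -> str:
    """Build a namespaced key such as 'shop:order:detail:42'."""
    key = ":".join([KEY_PREFIX, business, entity, str(identifier)])
    if len(key) > MAX_KEY_LENGTH:
        raise ValueError(f"key too long ({len(key)} chars): {key}")
    return key
```

For instance, `build_key("order", "detail", 42)` yields `shop:order:detail:42`, and changing the system prefix later means editing one constant.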

10. Do not run the monitor command online

Do not use the monitor command in production. Under high concurrency, monitor may cause memory usage to balloon and degrade Redis performance.

11. Disable large strings

The core cluster prohibits string values of 1 MB or more (although Redis supports strings of up to 512 MB). If a 1 MB value is written 10 times per second, that alone generates 10 MB/s of network I/O.

12. Redis capacity

It is recommended to keep the memory size of a single instance within 10 to 20 GB, and to limit the number of keys in a single Redis instance to 10 million. If one instance holds too many keys, expired keys may not be reclaimed in time.

13. Reliability

Monitor the health of Redis regularly: use a Redis health monitoring tool to collect the output of the Redis info command at regular intervals. Use a connection pool for client connections whenever possible (persistent connections and automatic reconnection).
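A periodic check could be as simple as the sketch below. It assumes a redis-py style client whose `info()` returns a dict containing the `used_memory`, `maxmemory`, and `connected_clients` fields of the INFO command; the thresholds and the function name are illustrative only.

```python
from typing import Any, Dict, List


def check_health(client: Any, memory_ratio_limit: float = 0.8,
                 client_limit: int = 5000) -> List[str]:
    """Return a list of warnings derived from the INFO output."""
    info: Dict[str, Any] = client.info()
    warnings = []
    used, limit = info.get("used_memory", 0), info.get("maxmemory", 0)
    # maxmemory is 0 when no limit is configured, so skip the ratio then.
    if limit and used / limit > memory_ratio_limit:
        warnings.append(f"memory at {used / limit:.0%} of maxmemory")
    if info.get("connected_clients", 0) > client_limit:
        warnings.append(f"{info['connected_clients']} clients connected")
    return warnings
```

With redis-py, the client passed in would typically be built on a shared pool, for example `redis.Redis(connection_pool=redis.ConnectionPool(host="localhost"))`, so that connections are reused rather than opened per request.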