Our company made a mess of things the other day. Right at the climax of an event, a crowd of users opened the app at the same time, all waiting for the results. The sudden spike of traffic slowed the service to a crawl and tripped the circuit breaker, so for about ten minutes everyone who opened the app to see the results was instead force-fed a wild-parkour video from three days ago. The person in charge of content was so angry they started cursing.

Although I am not in charge of that business line, I talked with the people involved about what happened, and it got me thinking, hence this article.

  


1. Why cache?

Why does caching come up so often when we talk about high-concurrency requests? The most immediate reason is that disk I/O and network I/O are hundreds of times slower than memory I/O.

Let's do a simple calculation. Suppose reading a piece of data from the database's disk takes 0.1 s and moving it through the switch takes 0.05 s; then each request takes at least 0.15 s (in reality disk and network I/O are not that slow, these are just example numbers), and a database server handling requests one at a time can serve only about 7 per second. If the same data sits in local memory and takes only 10 µs to read, the server can respond to 100,000 requests per second.
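If you want the arithmetic spelled out, here is the same back-of-the-envelope calculation in a few lines of Python (the numbers are the illustrative ones from above, not benchmarks):

```python
# Back-of-the-envelope throughput comparison (illustrative numbers, not a benchmark)
disk_io = 0.1        # seconds to read the data from the database's disk
network_io = 0.05    # seconds to move it through the switch
memory_io = 10e-6    # seconds to read the same data from local memory

per_request = disk_io + network_io
print(f"disk + network: {per_request:.2f} s/request -> ~{1 / per_request:.0f} requests/s")
print(f"local memory:   {memory_io * 1e6:.0f} us/request -> ~{1 / memory_io:,.0f} requests/s")
```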

Caching improves processing efficiency by keeping frequently used data closer to where it is consumed, which cuts down data-transfer time. That is the whole point of caching.

  


2. Where to use cache?

Everywhere. For example:

· When we read data from the hard drive, the operating system also reads the neighboring data into memory ahead of time (read-ahead)

· When the CPU reads data from main memory, it likewise pulls in the surrounding data and keeps it in its cache hierarchy

· Input and output are buffered, so data is sent and received in batches rather than processed byte by byte

That is the system level. At the software design level, caching is also used in many places:

· Browsers cache page resources (such as large images) so they don't have to download them from the Internet again on repeat visits

· Web services pre-deploy static assets on a CDN, which is also a form of caching

· Databases cache query results, so the second identical query is faster than the first

· In-memory databases (such as Redis) keep a large amount of data in memory instead of on disk; you can think of this as one big cache, except that what is cached is the entire database

· The application stores the results of its most recent computations in local memory; if the next incoming request is the same, it skips the computation and returns the cached result directly
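That last item is the easiest one to try yourself. Here is a minimal sketch in pure standard-library Python: repeated requests for the same input are answered from process memory and the expensive computation is skipped (the function name and workload are made up for illustration).

```python
from functools import lru_cache

@lru_cache(maxsize=1024)          # keep the 1024 most recently used results in process memory
def render_article(article_id: int) -> str:
    # Stand-in for an expensive computation (database queries, template rendering, ...)
    return f"rendered article {article_id}"

render_article(42)   # computed once
render_article(42)   # served straight from the in-process cache, no recomputation
```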

  


3. Analysis of the accident

Coming back to the incident at the start of this article: how was the system designed? At the bottom sits the database, with a layer of Redis in between. The data the front-end business system needs is read directly from Redis, the results are computed, and then returned to the app. The program also keeps the database and Redis in sync, which guards against cache penetration: without it, a flood of requests that can't find their data in Redis would rush straight to the database and crush it. Viewed from this angle, that part of the design was actually fine.
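For reference, that "read Redis first, fall back to the database on a miss, and write the result back" flow looks roughly like the sketch below. It uses redis-py; `load_from_db`, the key layout, and the TTL are illustrative assumptions, not the actual system.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_from_db(article_id: int) -> dict:
    # Placeholder for the real database query
    return {"id": article_id, "title": "...", "content": "..."}

def get_article(article_id: int, ttl: int = 300) -> dict:
    key = f"article:{article_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: no database access at all
    data = load_from_db(article_id)          # cache miss: fall back to the database
    r.set(key, json.dumps(data), ex=ttl)     # write back so the next request hits the cache
    return data
```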

But there are two problems with this system:

1. Although all the data the business system needs is stored in Redis, it is stored piecemeal. What does that mean? When a request comes in, the backend first goes to Redis for the title, then for the author, then the content, then the comments, then the retweet count, and so on… As a result, one request from the front end turns into dozens of requests to Redis from the backend. Under high concurrency that amplifies the pressure more than tenfold, and Redis response times and the network inevitably slow down (see the sketch after this list).

2. The service owners had in fact anticipated that this could happen, so they set up a circuit-breaker mechanism with a fallback cache pool holding spare data: if the primary service timed out, data would be returned straight from the pool. But the design wasn't thought through. The expiration time of that fallback pool was far too long, and it still contained data last updated three days earlier, which is how a large number of users ended up watching that three-day-old video…
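To make problem 1 concrete, here is the difference between fetching every field with its own round trip and batching the reads into one request (a sketch with made-up keys; in redis-py an MGET or a pipeline turns dozens of round trips into a single one).

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
FIELDS = ["title", "author", "content", "comments", "retweets"]

# Anti-pattern: one network round trip to Redis per field, dozens per user request
def get_post_slow(post_id: int) -> dict:
    return {f: r.get(f"post:{post_id}:{f}") for f in FIELDS}

# Better: batch all the reads into a single round trip with MGET (or a pipeline)
def get_post_batched(post_id: int) -> dict:
    values = r.mget([f"post:{post_id}:{f}" for f in FIELDS])
    return dict(zip(FIELDS, values))
```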

At this point, I wonder whether readers have noticed the most fatal problem: the business system made no use of a local cache (that is, caching in the business server's own memory). In an app like ours, when a huge number of users flood in at the same moment, they inevitably pile onto the same handful of items. Access that is this concentrated, this frequent, and to this little data, and that doesn't need to be personalized per user, practically has "please cache me" written on its face.

If the business side had kept a local cache and stored the computed results directly in its own memory, the pressure on the network and on Redis would have dropped dramatically, and the circuit breaker would never have been tripped in the first place.
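A minimal sketch of such a local cache, assuming a single-process service and a TTL of a few seconds; a real implementation would want a size bound (an LRU), but the core idea is just to hold the computed result in the business server's own memory.

```python
import time

_local_cache: dict = {}   # key -> (expiry timestamp, cached value)

def get_with_local_cache(key: str, compute, ttl: float = 5.0):
    """Serve hot keys from process memory; only fall through to Redis/DB when the entry expires."""
    now = time.time()
    hit = _local_cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]                        # local hit: no network I/O at all
    value = compute(key)                     # falls through to Redis / the database
    _local_cache[key] = (now + ttl, value)
    return value
```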

  


4. Let's talk about caching pitfalls

Caching is useful, but it also digs plenty of holes to fall into:

  


Cache penetration

Cache penetration happens when a request arrives for data that isn't in the cache, so it has to be looked up in the database and then written into the cache. There are two risks here. One is that many requests ask for the same missing data at the same time, and the business system sends every one of them to the database. The other is that someone maliciously constructs keys that logically cannot exist and sends them in huge volumes, so every request goes through to the database, which may well bring the database down.

How do you deal with that? For malicious access, one idea is to validate up front and filter out the bogus keys before they ever reach the database layer. A second idea is to cache empty results: when a query finds nothing, record that fact in the cache anyway, which effectively cuts down the number of database queries.
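Caching empty results only takes a couple of extra lines. A sketch, with a made-up sentinel value and TTLs: the point is to give the known-missing key a short expiration so it can't be used to hammer the database.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
NULL_SENTINEL = "__NULL__"   # marker meaning "we already checked, this key does not exist"

def query_db(key: str):
    return None              # stand-in for a real lookup that found nothing

def get_with_null_caching(key: str, ttl: int = 300, null_ttl: int = 60):
    cached = r.get(key)
    if cached == NULL_SENTINEL:
        return None                              # known-missing: don't touch the database
    if cached is not None:
        return cached
    value = query_db(key)
    if value is None:
        r.set(key, NULL_SENTINEL, ex=null_ttl)   # remember the miss, but only briefly
    else:
        r.set(key, value, ex=ttl)
    return value
```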

What about non-malicious access? That is handled together with cache breakdown, below.

  


Cache breakdown

The first risk above, where one piece of data expires and a pile of requests is sent to the database, can actually be classified as cache breakdown: for hot data, the moment the cached entry expires, every request goes to the database to rebuild the cache, and the database is overwhelmed.

How do you prevent this? One idea is a global lock: all requests for a given piece of data share one lock, and only the holder of the lock is allowed to query the database while the other threads wait. But the business is distributed these days, and a local lock can't make threads on other servers wait as well, so a global lock is needed, for example one implemented with Redis SETNX.
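A sketch of that global lock built on the SETNX idea (redis-py exposes it as the `nx=True` flag on SET). The lock key, token, and timeouts are illustrative, and a production version would need a safer lock release, since the check-then-delete below has a small race.

```python
import time
import uuid
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def query_db(key: str) -> str:
    return f"fresh value for {key}"           # stand-in for the real database query

def rebuild_hot_key(key: str, lock_ttl: int = 10, ttl: int = 300):
    lock_key = f"lock:{key}"
    token = str(uuid.uuid4())
    # SET lock_key token NX EX lock_ttl: only one process in the whole cluster wins
    if r.set(lock_key, token, nx=True, ex=lock_ttl):
        try:
            value = query_db(key)             # only the lock holder touches the database
            r.set(key, value, ex=ttl)
            return value
        finally:
            if r.get(lock_key) == token:      # best-effort release of our own lock
                r.delete(lock_key)
    # Everyone else backs off briefly and re-reads the cache instead of stampeding
    time.sleep(0.05)
    return r.get(key)
```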

Another idea is to proactively refresh data that is about to expire. There are many ways to do this: start a thread that polls the data, split the data into different cache partitions and refresh each partition on a schedule, and so on. This second idea also ties into the cache avalanche we'll talk about next.
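One way to do the proactive refresh is a small daemon thread that re-computes a known list of hot keys on a period well below their TTL, so readers never see an expired entry. A rough sketch; the hot-key list, period, and TTL are assumptions.

```python
import threading
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
HOT_KEYS = ["post:1:rendered", "post:2:rendered"]   # hypothetical list of hot keys

def recompute(key: str) -> str:
    return f"fresh value for {key}"                 # stand-in for the real recomputation

def refresh_loop(period: int = 60, ttl: int = 300):
    while True:
        for key in HOT_KEYS:
            r.set(key, recompute(key), ex=ttl)      # TTL is reset long before it can run out
        time.sleep(period)                          # period << ttl, so readers never see a miss

threading.Thread(target=refresh_loop, daemon=True).start()
```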

  


Cache avalanche

A cache avalanche happens when all the data is given the same expiration time, so at some moment the entire cache expires at once, every request suddenly hits the database, and the database collapses.

The solution is either divide and conquer, splitting the data into smaller cache partitions with staggered expiration windows, or adding a random value to each key's expiration time so that keys don't all expire and need refreshing at the same moment.
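The jitter fix is tiny: wherever the cache is written, add a random offset to the TTL (the base TTL and jitter window below are illustrative).

```python
import random
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def set_with_jitter(key: str, value: str, base_ttl: int = 300, jitter: int = 60):
    # Spread expirations over a window so keys written together don't all expire together
    r.set(key, value, ex=base_ttl + random.randint(0, jitter))
```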

  


Cache refresh

Speaking of refreshing the cache, there are pitfalls there too. At a previous job of mine there was a big promotional event, and right in the middle of it every advertising slot suddenly went blank. When we traced the cause: all the ad materials lived in a cache, and a dedicated program refreshed that cache by replacing its full contents with the current materials each time.

Here is the bad part. Traffic during the big event was enormous, the pressure on ad updates was also heavy, and the upstream program responsible for supplying the updated materials crashed. When the cache-refresh program made its request, it got back a null. And then, delightfully, it used that null to overwrite the entire cache, invalidating every piece of ad material.
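The moral translates into one small guard in the refresh program: never let an empty or failed upstream response overwrite the cache. A sketch, with `fetch_materials` and the key made up for illustration:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def fetch_materials():
    return None          # stand-in for the upstream call that crashed / returned nothing

def refresh_ad_cache():
    materials = fetch_materials()
    if not materials:                 # upstream failed or came back empty:
        return                        # keep serving the old (stale but intact) cache
    r.set("ad:materials", json.dumps(materials), ex=600)
```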

In short, to cache well in a high-concurrency system you have to think the design through from every angle; any small oversight can bring the whole system down.