Preface

If you are engaged in back-end development, caching has probably become an essential part of your projects.

Yes, caching gives our system a significant performance boost. But it can also cause a lot of unexpected problems if you don’t use it well or if you don’t have experience with it.

Today we are going to talk about three major issues that can arise when caching is introduced into a project. See whether you have run into any of them.

1. Cache penetration problem

In most cases, the purpose of caching is to reduce the pressure on the database and improve system performance.

1.1 How do we use caching?

In general, when a user request comes in, we check the cache first; if the data is in the cache, we return it directly. If it is not in the cache, we look it up in the database: if it exists there, we put it in the cache and return it; if it does not exist in the database either, we return a failure.

The flow chart is as follows. It is a picture you are all familiar with, because this is how most caches are used.
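In code, that read path looks roughly like this. A minimal sketch, assuming a Jedis client; the queryUserFromDb helper and the one-hour TTL are assumptions for illustration:

public String getUser(Jedis jedis, String userId) {
    String cached = jedis.get("user:" + userId);
    if (cached != null) {
        return cached; // cache hit: return directly
    }
    // Cache miss: fall back to the database.
    String fromDb = queryUserFromDb(userId); // hypothetical DAO call
    if (fromDb != null) {
        // Found in the database: put it in the cache, then return it.
        jedis.setex("user:" + userId, 3600, fromDb);
        return fromDb;
    }
    return null; // not in the database either: return failure
}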

1.2 What is cache penetration?

But what happens if one of the following two special cases occurs:

  1. The ID requested by the user exists in neither the cache nor the database.
  2. A malicious user forges nonexistent IDs to initiate requests.

In both cases the data is never found in the cache, so the database has to be queried every time; and since the data is not in the database either, nothing can be placed in the cache. In other words, every such request ends up hitting the database.

The red arrow on the chart shows the route taken each time.

Obviously, the cache is not helping at all; it is as if requests penetrate straight through it, and the database is accessed every time.

This is what we call the cache penetration problem.

If the cache is penetrated like this and the number of requests hitting the database directly is very high, the database may go down under the strain.

So the question is, how to solve this problem?

1.3 Parameter validation

We can verify the user ID.

For example, suppose legal IDs have the form 15XXxxxx, starting with 15. If a user passes an ID starting with 16, such as 16232323, parameter validation fails and the request is intercepted directly. In this way, some maliciously forged user IDs can be filtered out.
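A minimal sketch of such a check, assuming (purely for illustration) that legal IDs are eight digits starting with 15:

import java.util.regex.Pattern;

// Matches an 8-digit ID starting with "15"; the exact format is an assumption.
private static final Pattern LEGAL_ID = Pattern.compile("^15\\d{6}$");

public boolean isLegalId(String id) {
    return id != null && LEGAL_ID.matcher(id).matches();
}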

1.4 Bloom filter

If we have a small amount of data, we can put all the data in the database into a map in memory.

That makes it very quick to tell whether a requested key exists at all. If it does, let the request go on to the cache; if it does not, reject the request.

But if you have too much data, tens of millions or hundreds of millions of data, and you put it all in memory, it’s going to take up too much memory.

So, is there a way to reduce memory space?

A: That’s where the Bloom filter comes in.

The underlying bloom filter stores data using an array of bits, whose elements default to 0.

When the Bloom filter is first initialized, every existing key in the database is run through a series of hash functions (say, three of them). Each key thus maps to several positions, and the bits at those positions are set to 1. Later, when a request for a key arrives, the same hash functions are used to compute its positions:

  • If the bits at all of the computed positions are 1, the key (probably) exists in the database, and the request is allowed to continue.
  • If the bit at any of the positions is 0, the key definitely does not exist in the database, and the request can be rejected and returned directly.
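You do not have to build the bit array by hand; Guava, for example, ships a ready-made implementation. A minimal sketch, assuming the keys are user IDs loaded once at startup:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class UserIdFilter {
    // Sized for ten million keys with a 1% false-positive rate.
    private final BloomFilter<String> filter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    // Load every existing key from the database once, at startup.
    public void init(List<String> allUserIds) {
        allUserIds.forEach(filter::put);
    }

    // false: definitely not in the database; true: probably in the database.
    public boolean mightExist(String userId) {
        return filter.mightContain(userId);
    }
}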

Using bloom filters does solve the cache penetration problem, but it also introduces two problems:

  1. False positives can occur.
  2. Keeping the filter's data up to date is a problem.

Why do false positives occur?

As I mentioned above, when initializing data, we hash out positions for each key multiple times, and then set the value of the element at those positions to 1.

But we all know that hash functions have collisions, which means that different keys may map to the same position. In the figure above, a collision occurs at index 2, where key1 and key2 hash to the same position.

If you have tens or hundreds of millions of records, hash collisions in the Bloom filter become very noticeable.

If every position a key hashes to has already been set to 1 by other keys, the Bloom filter will report that the key exists even though it is not in the database.

In other words, when the Bloom filter says a key exists, it may be wrong; but when it says a key does not exist, the key is guaranteed not to be in the database.

In general, the false-positive rate of a Bloom filter is fairly low. Even if a small number of misjudged requests reach the database directly, as long as their volume is small the impact on the database is limited.

In addition, if you want to reduce the false-positive rate, you can increase the number of hash functions; the three hashes in the figure could be increased to five, for example.
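For reference (this is standard Bloom-filter math, not from the original article): with m bits, n keys, and k hash functions, the false-positive probability and the k that minimizes it are approximately

p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\text{opt}} = \frac{m}{n}\ln 2

So adding hash functions only helps up to that optimum; beyond it, the bit array fills with 1s and the error rate climbs again.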

In fact, the most fatal problem with Bloom filters is that whenever the data in the database is updated, the Bloom filter needs to be updated in step. But since the filter and the database are two separate data sources, they can become inconsistent.

For example, a new user is added to the database, and that user's ID needs to be synchronized to the Bloom filter in real time. If the synchronization fails because of a network exception, the user's requests are rejected, because the Bloom filter has no data for the key. A perfectly normal user ends up intercepted.

Obviously, some businesses cannot tolerate normal users being blocked. So whether a Bloom filter fits depends on the actual business scenario: it helps us solve the cache penetration problem, but it also introduces new problems.

1.5 Cache null values

Although a Bloom filter can filter out many requests for nonexistent user IDs, besides increasing the complexity of the system it also causes two problems:

  1. The Bloom filter may filter a small number of normal users’ requests by mistake.
  2. If the user information changes, it needs to be synchronized to the Bloom filter in real time, otherwise there will be problems.

So, in general, we rarely use bloom filters to solve cache penetration problems. There is an even simpler solution: cache null values.

When a user ID is not found in the cache or in the database, the user ID needs to be cached, but the value is empty. In this way, subsequent requests with the same user ID can fetch empty data from the cache and return it directly, without having to check the database again.

The optimized flow chart is as follows. The key point is that the result is placed in the cache regardless of whether the data is retrieved from the database; if no data is retrieved, an empty value is cached.
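A minimal sketch of this variant, again assuming a Jedis client; the empty-string sentinel and the short 60-second TTL for misses are assumptions (a short TTL keeps nonexistent IDs from occupying the cache for long):

public String getUser(Jedis jedis, String userId) {
    String cached = jedis.get("user:" + userId);
    if (cached != null) {
        // An empty string is our sentinel for "known not to exist".
        return cached.isEmpty() ? null : cached;
    }
    String fromDb = queryUserFromDb(userId); // hypothetical DAO call
    if (fromDb != null) {
        jedis.setex("user:" + userId, 3600, fromDb);
        return fromDb;
    }
    // Cache the miss too, so repeated requests for the same
    // nonexistent ID stop hitting the database.
    jedis.setex("user:" + userId, 60, "");
    return null;
}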

2. Cache breakdown problem

2.1 What is cache breakdown?

Sometimes we access hot data, for example, a popular product in an online mall.

To keep access fast, the mall system normally puts the product information in the cache. But at some point the cached product may reach its expiration time and become invalid.

At this point, if a large number of users request the same item while it is missing from the cache, all of those requests suddenly go straight to the database, which may become overloaded and go down.

The flow chart is as follows. So, how do we solve this problem?

2.2 Locking

The root cause of database stress is that too many requests are accessing the database at the same time.

Wouldn't the problem be solved if, for a given productId, only one request at a time were allowed to query the product information from the database?

A: Yes, we can use locking to achieve this.

The pseudocode is as follows:

// Take a distributed lock: set the key only if it does not exist (NX),
// with a millisecond expiry (PX) so a crashed holder cannot block others forever.
String result = jedis.set(productId, requestId, "NX", "PX", expireTime);
if ("OK".equals(result)) {
    try {
        return queryProductFromDbById(productId);
    } finally {
        // Release the lock; unlock should verify requestId so that
        // we only ever delete a lock we actually own.
        unlock(productId, requestId);
    }
}
return null;

Lock database access to prevent multiple requests for the same productId from accessing the database at the same time.

Then you need a piece of code to put the result from the database back into the cache. There are many ways to write it; one possible shape is sketched below.
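A sketch only, not the author's code: after acquiring the lock, it is worth re-checking the cache first, because another request may have rebuilt it while we waited. The 30-minute TTL is an assumption.

String cached = jedis.get(productId);
if (cached != null) {
    return cached; // someone else already refilled the cache
}
String product = queryProductFromDbById(productId);
if (product != null) {
    jedis.setex(productId, 30 * 60, product); // write back with a fresh TTL
}
return product;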

2.3 Automatic renewal

Cache breakdown is caused by the expiration of a hot key. So let's change our way of thinking: what if we automatically renew the key just before it expires?

A: Yes, we can renew the specified key automatically with a job.

For example, suppose a sorted-list feature sets its cache expiration time to 30 minutes, while a job runs every 20 minutes to rebuild the cache and reset the expiration time to 30 minutes. This ensures that the sorted-list cache never expires.
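A minimal sketch of such a job, assuming a Jedis client in scope and a hypothetical rebuildSortedListFromDb query:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
// Every 20 minutes, rebuild the cached value and reset its TTL to 30 minutes,
// so the key is always renewed before it has a chance to expire.
scheduler.scheduleAtFixedRate(() -> {
    String data = rebuildSortedListFromDb(); // hypothetical query
    jedis.setex("sorted:list", 30 * 60, data);
}, 0, 20, TimeUnit.MINUTES);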

In addition, when calling third-party platform interfaces, we often need to first call an interface that returns a token, and then pass that token as a parameter when requesting the real business interface. The token generally has a validity period, for example 24 hours, after which it expires.

If we had to call the token interface before every request to the other party's business interface, it would obviously be cumbersome and would hurt performance.

In this case, we can cache the token obtained for the first time and obtain the token from the cache when requesting the service interface of the other party.

At the same time, a job refreshes the token and resets the expiration time of the token every 12 hours.

2.4 Never let the cache expire

In addition, for many popular keys, it is possible to make them permanent without setting an expiration date.

For example, since there are not many product IDs participating in a flash-sale (seckill) activity, we can cache them without any expiration time.

Before the flash sale begins, a program queries the product data from the database in advance and synchronizes it to the cache, preheating it.

Some time after the flash sale ends, we can manually delete these now-useless cache entries.
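A sketch of the preheating step; the key names and loader are assumptions. Note that jedis.set without a TTL leaves the key permanent, which is why the cleanup afterwards has to be explicit:

// Preheat: load every flash-sale product into the cache with no expiration.
for (String productId : seckillProductIds) {
    jedis.set("product:" + productId, queryProductFromDbById(productId));
}
// ...after the activity is over, delete the keys by hand:
// seckillProductIds.forEach(id -> jedis.del("product:" + id));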

3. Cache avalanche problem

3.1 What is cache avalanche?

We’ve already talked about cache breakdown.

Cache avalanche is the upgraded version of cache breakdown: in cache breakdown a single hot key expires, while in cache avalanche many hot keys expire at the same time. Clearly, an avalanche is the more serious problem.

There are currently two types of cache avalanche:

  1. A large number of hot cache entries expire at the same time, sending a flood of requests to the database, which may go down under the pressure.
  2. The cache server goes down, perhaaps because of a hardware problem or a network problem in the machine room; in short, the entire cache becomes unavailable.

Either way, the bottom line is the same: a flood of requests bypasses the cache and hits the database directly. So how do we solve this problem?

3.2 Expiration time plus random number

To solve the cache avalanche problem, we need to avoid simultaneous cache failures in the first place.

This requires us not to set the same expiration times.

You can add a random number of 1 to 60 seconds to the expiration time.

actual expiration time = base expiration time + random number of 1~60 seconds

In this way, even if many keys are cached at the same moment under high concurrency, the random component ensures that not too many of them share the exact same expiration time.
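A one-line sketch of the idea, assuming a Jedis client and a base TTL in seconds:

import java.util.concurrent.ThreadLocalRandom;

// Base TTL plus a random 1-60 extra seconds, so keys written together
// do not all expire at the same instant.
int ttlSeconds = baseTtlSeconds + ThreadLocalRandom.current().nextInt(1, 61);
jedis.setex(key, ttlSeconds, value);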

3.3 High availability

In view of the situation that the cache server is down, some high availability architectures can be made in the early stage of system design.

For example, if Redis is used, you can deploy it in sentinel mode or cluster mode, so that the failure of a single node does not make the entire Redis service unavailable.

In sentinel mode, when a master node goes offline, one of its slave nodes is automatically promoted to master and continues to handle requests in place of the offline node.
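For reference, a minimal sentinel.conf along these lines; the address, quorum, and timeouts are placeholders:

# Watch the master at 127.0.0.1:6379; failover requires agreement from 2 sentinels.
sentinel monitor mymaster 127.0.0.1 6379 2
# Treat the master as down after 5 seconds without a valid reply.
sentinel down-after-milliseconds mymaster 5000
# Give up on a failover attempt that takes longer than 60 seconds.
sentinel failover-timeout mymaster 60000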

3.4 Service Degradation

What if the Redis service still fails after making a high availability architecture?

This is where service downgrades come in.

We need to configure some default fallback data in advance.

The program keeps a global switch. For example, if 10 requests fail to obtain data from Redis within the last minute, the switch is turned on, and subsequent new requests fetch the default data directly from the configuration center. A job then tries to read from Redis at regular intervals; if it succeeds, say, twice within the last minute (this threshold is up to you), the switch is turned off, and subsequent requests read from Redis normally again.
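A rough sketch of that switch; the failure threshold, the probe, and the config-center fallback are all assumptions, and the per-minute windowing is elided:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import redis.clients.jedis.Jedis;

public class DegradeSwitch {
    private final Jedis jedis = new Jedis(); // defaults to localhost:6379
    private final AtomicBoolean degraded = new AtomicBoolean(false);
    private final AtomicInteger failures = new AtomicInteger();

    public String getData(String key) {
        if (degraded.get()) {
            return defaultFromConfigCenter(key); // hypothetical fallback source
        }
        try {
            String value = jedis.get(key);
            failures.set(0);
            return value;
        } catch (Exception e) {
            if (failures.incrementAndGet() >= 10) {
                degraded.set(true); // open the global switch
            }
            return defaultFromConfigCenter(key);
        }
    }

    // Run periodically by a job: close the switch once Redis answers again.
    void probe() {
        try {
            jedis.ping();
            failures.set(0);
            degraded.set(false);
        } catch (Exception ignored) {
            // still down; keep serving fallback data
        }
    }
}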

It should be noted that this solution does not fit every scenario; whether to use it must be judged against the actual business.