This post was first published by Yanglbme in GitHub's Doocs project (over 30K stars). Project address: github.com/doocs/advan…

The interview questions

What are Redis cache avalanche, cache penetration, and cache breakdown? What happens when Redis goes down? How does the system cope with that? And how do you handle Redis cache penetration?

The interviewer's psychology

This is actually a very important caching question. Cache avalanche and cache penetration are two of the biggest caching problems: either they never come up, or when they do come up they are fatal to the system. So the interviewer will definitely ask you about them.

Analysis of interview questions

Cache avalanche

Take system A, with a daily peak of 5,000 requests per second. At peak, the cache could handle 4,000 of those requests per second, but the cache machine unexpectedly goes down completely. With the cache gone, all 5,000 requests per second hit the database. The database can't handle the load, raises alarms, and goes down. At this point, if no special measures are in place, the DBA frantically restarts the database, only to see it immediately killed again by the incoming traffic.

This is cache avalanche.

About 3 years ago, a well-known Internet company in China lost tens of millions of dollars when a cache accident triggered an avalanche and brought down all of its backend systems.

The solution to cache avalanche has three parts: before, during, and after the event:

  • Beforehand: Redis high availability (master-slave + Sentinel, or Redis Cluster) to avoid a total crash.
  • During the event: local EhCache caching + Hystrix rate limiting & degradation, so MySQL isn't killed.
  • Afterwards: Redis persistence; once Redis restarts, it automatically loads data from disk and quickly restores the cache.

The user sends a request. After receiving it, system A checks the local EhCache first. On a miss, it checks Redis. If the value is in neither EhCache nor Redis, it queries the database and writes the result back to both EhCache and Redis.
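A minimal read-through sketch of that lookup order, assuming Jedis as the Redis client; the LocalCache and ProductDao types and the TTL values are illustrative placeholders, not part of any specific library:

```java
import redis.clients.jedis.Jedis;

interface LocalCache {            // hypothetical wrapper around EhCache
    String get(String key);
    void put(String key, String value);
}

interface ProductDao {            // hypothetical DAO for the MySQL lookup
    String queryByKey(String key);
}

public class MultiLevelCacheReader {
    private final LocalCache ehCache;
    private final Jedis redis;
    private final ProductDao dao;

    public MultiLevelCacheReader(LocalCache ehCache, Jedis redis, ProductDao dao) {
        this.ehCache = ehCache;
        this.redis = redis;
        this.dao = dao;
    }

    public String get(String key) {
        // 1. Local EhCache first: fastest, and it survives a Redis outage.
        String value = ehCache.get(key);
        if (value != null) return value;

        // 2. Local miss: try Redis.
        value = redis.get(key);
        if (value != null) {
            ehCache.put(key, value);      // warm the local cache
            return value;
        }

        // 3. Miss in both caches: query MySQL and write back to both layers.
        value = dao.queryByKey(key);
        if (value != null) {
            redis.setex(key, 300, value); // 5-minute TTL, an arbitrary choice
            ehCache.put(key, value);
        }
        return value;
    }
}
```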

The rate-limiting component lets you configure how many requests per second are allowed through. What about the rest, the requests that don't get through? They are degraded: you can return some default value, a friendly reminder, or a blank value.
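Here is a hedged sketch of that limit-and-degrade behavior. The post names Hystrix; Guava's RateLimiter is used below only to keep the example small, and the 2,000-per-second quota is an assumed figure:

```java
import com.google.common.util.concurrent.RateLimiter;

public class DegradingQueryService {
    // Allow at most 2000 database-bound requests per second (assumed quota).
    private final RateLimiter limiter = RateLimiter.create(2000.0);

    public String query(String key) {
        // Non-blocking check: did this request win one of the permits?
        if (limiter.tryAcquire()) {
            return queryDatabase(key);
        }
        // Degraded path: never touch MySQL, return a default instead.
        return "UNKNOWN"; // or a "please try again" message, or a blank value
    }

    private String queryDatabase(String key) {
        return "value-for-" + key; // placeholder for the real MySQL lookup
    }
}
```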

Benefits:

  • The database never dies: the rate-limiting component ensures that only a fixed number of requests per second get through to it.
  • As long as the database stays alive, 2 out of every 5 user requests can be served.
  • As long as 2/5 of requests are served, the system is not dead. For users it may mean a few failed clicks, but after a few more clicks the page loads.

Cache penetration

Take system A again: say 5,000 requests per second, 4,000 of which turn out to be malicious requests from an attacker.

Those 4,000 malicious requests hit keys that are never in the cache, so every one of them goes to the database, where they are not found either.

Here's an example. Database IDs start at 1, but the attacker sends requests whose IDs are all negative numbers. None of them will ever be cached, so every request is queried directly against the database. Cache penetration in this malicious-attack scenario can kill the database.

The simple fix: every time system A fails to find a value in the database, it writes a null placeholder for that key into the cache, e.g. SET -999 UNKNOWN, and sets an expiration time. The next time the same key is accessed, the data is fetched directly from the cache until the placeholder expires.
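A minimal sketch of caching null results, again assuming Jedis; the placeholder string and the 60-second TTL are illustrative choices:

```java
import redis.clients.jedis.Jedis;

public class NullCachingReader {
    private static final String NULL_PLACEHOLDER = "UNKNOWN";
    private final Jedis redis;

    public NullCachingReader(Jedis redis) {
        this.redis = redis;
    }

    public String get(String key) {
        String cached = redis.get(key);
        if (cached != null) {
            // The placeholder means "we already know this key doesn't exist",
            // so the bogus request never reaches the database.
            return NULL_PLACEHOLDER.equals(cached) ? null : cached;
        }
        String value = queryDatabase(key);
        if (value == null) {
            // e.g. SET -999 UNKNOWN EX 60: block repeated lookups of a bad id.
            redis.setex(key, 60, NULL_PLACEHOLDER);
            return null;
        }
        redis.setex(key, 300, value);
        return value;
    }

    private String queryDatabase(String key) {
        return null; // placeholder for the real lookup; bogus ids return null
    }
}
```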

Cache breakdown

Cache breakdown refers to the situation where a single key is extremely hot and under concentrated, highly concurrent access. The moment that key expires, the flood of requests breaks through the cache and hits the database directly, like punching a hole through a barrier.

The solutions in different scenarios are as follows:

  • If the cached data is almost never updated, you can simply set the hotspot key to never expire.
  • If the cached data is updated infrequently and refreshing the cache is quick, you can use a distributed mutex based on distributed middleware such as Redis, or a local mutex, so that only a small number of requests query the database and rebuild the cache; the other threads read the new cache once the lock is released (see the sketch after this list).
  • If the cached data is updated frequently, or refreshing the cache takes a long time, a scheduled thread can proactively rebuild the cache before it expires, or extend the key's expiration time, so that every request always finds the corresponding cache entry.
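A sketch of the mutex approach, assuming Jedis. SET key value NX EX serves as a simple distributed lock; in production you would also want a unique lock value and a safe release (e.g. via a Lua script), which are omitted here:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class HotKeyReader {
    private final Jedis redis;

    public HotKeyReader(Jedis redis) {
        this.redis = redis;
    }

    public String get(String key) throws InterruptedException {
        String value = redis.get(key);
        while (value == null) {
            // Only one caller wins the lock and rebuilds the cache.
            String lockKey = "lock:" + key;
            String ok = redis.set(lockKey, "1", SetParams.setParams().nx().ex(10));
            if ("OK".equals(ok)) {
                try {
                    value = rebuildFromDatabase(key);
                    redis.setex(key, 300, value); // illustrative 5-minute TTL
                } finally {
                    redis.del(lockKey);
                }
            } else {
                // The losers wait briefly, then re-read the rebuilt cache
                // instead of stampeding the database.
                Thread.sleep(50);
                value = redis.get(key);
            }
        }
        return value;
    }

    private String rebuildFromDatabase(String key) {
        return "value-for-" + key; // placeholder for the expensive query
    }
}
```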

Welcome to follow my WeChat official account, "Doocs Open Source Community", where original technical articles are pushed as soon as they are published.