Preface

This article covers several crash-related Redis problems: what are cache avalanche, cache penetration, and cache breakdown? What happens when Redis goes down, and how should the system cope? How do you handle cache penetration? Let's walk through each one.

Cache avalanche

Take system A, with a daily peak of 5,000 requests per second. At peak times the cache handles about 4,000 of those, but then the cache machine unexpectedly goes down completely. With the cache gone, all 5,000 requests per second hit the database. The database can't handle the load, raises alarms, and goes down too. If no special measures are in place, the DBA anxiously restarts the database, only to have it killed again immediately by the incoming traffic.

This is cache avalanche.

About three years ago, a well-known Internet company in China lost tens of millions of dollars due to a cache accident: it triggered an avalanche that brought down all of its backend systems.

The solutions for cache avalanche, covering before, during, and after the incident, are as follows:

  • Beforehand: make Redis highly available — master-slave + Sentinel, or Redis Cluster — to avoid a total crash.
  • During the incident: local EhCache caching plus Hystrix rate limiting and degradation, so MySQL doesn't get killed.
  • Afterwards: Redis persistence, so that once Redis restarts it automatically loads data from disk and quickly recovers the cached data.

The user sends a request. After receiving it, system A checks the local EhCache first; if the value is not found, it checks Redis. If the value is in neither EhCache nor Redis, system A queries the database and writes the result back to both EhCache and Redis.
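That two-tier read path can be sketched as follows. This is an illustrative Python sketch only: plain dicts stand in for EhCache, Redis, and MySQL, and none of the names (`local_cache`, `redis_cache`, `db_query`, `get`) are real client APIs.

```python
local_cache = {}                 # stands in for EhCache (in-process tier)
redis_cache = {}                 # stands in for Redis (shared tier)
database = {"user:1": "alice"}   # stands in for MySQL

def db_query(key):
    return database.get(key)

def get(key):
    # 1. Check the local in-process cache first.
    if key in local_cache:
        return local_cache[key]
    # 2. Fall back to the shared Redis tier, warming the local tier on a hit.
    if key in redis_cache:
        value = redis_cache[key]
        local_cache[key] = value
        return value
    # 3. Finally hit the database and write the result back to both tiers.
    value = db_query(key)
    if value is not None:
        redis_cache[key] = value
        local_cache[key] = value
    return value
```

After the first lookup populates both tiers, subsequent reads of the same key never reach the database — which is exactly what keeps it alive when Redis is down and only the local tier remains.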

The rate limiting component can be configured so that only a set number of requests pass through per second. What about the requests that don't get through? Degrade: return a default value, a friendly notice, or simply null.
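A minimal sketch of that idea: a toy fixed-window limiter with a degraded default response. This is not Hystrix's actual API — `RateLimiter`, `handle`, and `query_backend` are all hypothetical names, and the injectable `clock` is just there to make the sketch testable.

```python
import time

class RateLimiter:
    """Toy fixed-window limiter: at most `limit` requests per one-second window."""
    def __init__(self, limit, clock=time.time):
        self.limit = limit
        self.clock = clock
        self.window = int(clock())
        self.count = 0

    def allow(self):
        now = int(self.clock())
        if now != self.window:          # new second: reset the window
            self.window, self.count = now, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

def query_backend(request):             # hypothetical stand-in for the real backend
    return f"result for {request}"

def handle(request, limiter):
    if limiter.allow():
        return query_backend(request)           # normal path
    return "service busy, try again later"      # degraded default value
```

Real deployments would use something sturdier (token buckets, Hystrix/Sentinel-style circuit breaking), but the shape is the same: excess requests get a cheap canned answer instead of a database query.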

Benefits:

  • The database never dies: the rate limiting component ensures that only a fixed number of requests per second reach it.
  • As long as the database is alive, a fraction of requests — say 2 out of every 5 — can still be processed for users.
  • As long as 2/5 of requests get through, the system isn't dead. For the user it may mean a few extra clicks, but after a few retries the page will come back.

Cache penetration

Take system A again: 5,000 requests per second, of which 4,000 turn out to be malicious requests from an attacker.

Those 4,000 malicious requests target keys that aren't in the cache, so every one of them goes to the database — and finds nothing there either.

Here's an example. The database IDs start at 1, but the attacker sends requests whose IDs are all negative. Those keys will never be cached, so every request queries the database directly. Cache penetration under this kind of malicious attack can kill the database.

The fix: each time system A fails to find a value in the database, it writes a placeholder null value to the cache, such as set -999 UNKNOWN, and sets an expiration time on it. Until that entry expires, subsequent requests for the same key are answered directly from the cache.

Cache breakdown

Cache breakdown refers to a single key that is extremely hot and accessed with concentrated, high concurrency. The moment that key expires, the flood of requests breaks through the cache and hits the database directly, like punching a hole through a barrier.

The solutions in different scenarios are as follows:

  • If the cached data is almost never updated, simply set the hotspot data to never expire.
  • If the cached data is updated infrequently and rebuilding the cache entry is fast, use a mutex — either a distributed lock built on middleware such as Redis or ZooKeeper, or a local in-process lock — so that only one (or a small number of) requests hit the database and rebuild the cache; other threads read the fresh cache entry once the lock is released.
  • If the cached data is updated frequently, or the cache rebuild takes a long time, use a scheduled thread to proactively rebuild the cache before it expires, or extend the entry's expiration time, so that requests always find a valid cache entry.