• How A Cache Stampede Caused One Of Facebook’s Biggest Outages
  • By Sun-Li Beatteay
  • Translated as part of the Gold Translation Project
  • Permalink to this translation: github.com/xitu/gold-m…
  • Translator: Hoarfroster
  • Proofreaders: kamly and JalanJiang

On September 23, 2010, Facebook experienced one of its worst outages to date: the site was down for four hours. The situation was so dire that engineers had to take Facebook offline entirely in order to restore it.

Facebook wasn’t as big then as it is today, but it still had hundreds of millions of users, and the outage didn’t go unnoticed: people complained and joked about it on Twitter.

[Image source: www.businessinsider.com/how-we-weat…]

So what caused Facebook’s outage? According to the official post-mortem:

Today we made a change to a configuration value that was interpreted as invalid. This meant that every client saw the invalid value and attempted to fix it. Because the fix involved querying a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second.


The incorrect configuration change resulted in a huge volume of requests being sent to their databases. A request stampede like this is appropriately called a cache stampede. It is a common problem in the tech industry and has caused outages at many companies, such as the Internet Archive in 2016. Many large applications, such as Instagram and DoorDash, grapple with it every day.

What is a cache stampede?

A cache stampede occurs when multiple threads attempt to access a cached value in parallel. If the value is missing from the cache, the threads all simultaneously try to fetch the data from the origin. The origin is usually a database, but it can also be a web server, a third-party API, or anything else that returns data.

One of the main reasons a cache stampede is so destructive is that it can lead to a vicious failure cycle:

  1. A large number of concurrent threads get a cache miss, causing them all to query the database.
  2. The database crashes from the enormous CPU spike, which leads to timeout errors.
  3. On receiving the timeouts, all the threads retry their requests, causing another stampede.
  4. And the cycle repeats.

You don’t need a Facebook-sized audience to suffer from this problem. Cache stampedes have nothing to do with user count; they plague startups and tech giants alike.


How do you prevent a cache stampede?

It’s a good question, and one I asked myself after learning about the Facebook outage. Not surprisingly, a great deal of research has gone into preventing cache stampedes since 2010, and I read through those studies.


In this article, we’ll look at different strategies for preventing and mitigating cache stampedes. After all, we don’t want to be left scrambling after the damage is already done.

Adding more caches

A simple solution is to add more caches. While this may seem counterintuitive, it’s similar to how an operating system works.

The operating system uses a cache hierarchy in which each component caches its own data for faster access.

We can follow a similar pattern in our applications by incorporating an in-memory cache as a Layer 1 (L1) cache, while any remote cache is treated as Layer 2 (L2).

[Image credit: https://medium.com/@DoorDash/avoiding-cache-stampede-at-doordash-55bbf596d94b]

This is particularly useful for preventing stampedes on frequently accessed data. Even when a key in the Layer 2 cache expires, some of the Layer 1 caches may still hold the value, which limits the number of threads that need to recompute it.
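To make the read path concrete, here is a minimal sketch in Java. The RemoteCache and Database interfaces are hypothetical stand-ins for whatever L2 store and origin you actually use, not any particular library’s API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of an L1 (in-memory) / L2 (remote) read path.
public class TwoTierCache {
    private final Map<String, String> localCache = new ConcurrentHashMap<>(); // L1
    private final RemoteCache remoteCache;   // L2, e.g. Redis or Memcached (hypothetical interface)
    private final Database database;         // the origin / source of truth (hypothetical interface)

    public TwoTierCache(RemoteCache remoteCache, Database database) {
        this.remoteCache = remoteCache;
        this.database = database;
    }

    public String get(String key) {
        // 1. Check the in-memory L1 cache first.
        String value = localCache.get(key);
        if (value != null) {
            return value;
        }
        // 2. Fall back to the remote L2 cache.
        value = remoteCache.get(key);
        if (value == null) {
            // 3. Only now hit the origin, then backfill the L2 cache.
            value = database.load(key);
            remoteCache.set(key, value);
        }
        // 4. Backfill the L1 cache so later reads on this server stay local.
        localCache.put(key, value);
        return value;
    }

    public interface RemoteCache {
        String get(String key);
        void set(String key, String value);
    }

    public interface Database {
        String load(String key);
    }
}
```

In practice the L1 map would also need a size bound and its own short TTL, which leads directly to the trade-off below.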

However, this approach comes with trade-offs. Caching data in memory on the application servers can lead to out-of-memory issues if you are not careful, especially if large amounts of data are cached.

Moreover, this caching strategy is still susceptible to what I call a follower stampede.

A follower stampede can happen when a celebrity uploads a new photo or video to their social media account. All of their followers are notified of the new content at once and scramble to view it. Because the content is too new to have been cached yet, it triggers the dreaded cache stampede.


So what can be done about the follower stampede?

Locks and Promises

In essence, a cache stampede is a race condition: multiple threads competing for a shared resource. In this case, the shared resource is the cache.

Race conditions on shared resources are common in highly concurrent systems, and one way to prevent them is to use locks. While locks are typically used between threads on the same machine, there are ways to use distributed locks for remote caches.

By placing a lock on a cache key, only one caller at a time can access that key. If the key is missing or expired, the caller holding the lock can generate the data and cache it. Any other process that tries to read the same cache key has to wait until the lock is released.
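Here is a minimal sketch of the single-machine version of this idea in Java, using one lock per key. The cache map and loadFromSource helper are hypothetical stand-ins; a distributed lock (for example, one held in Redis) would follow the same shape:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: one lock per cache key, so only one thread recomputes a missing value.
public class LockingCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public String get(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;
        }
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            // Re-check: another thread may have filled the cache while we waited.
            value = cache.get(key);
            if (value == null) {
                value = loadFromSource(key); // expensive call to the database/origin
                cache.put(key, value);
            }
            return value;
        } finally {
            lock.unlock();
        }
    }

    // Hypothetical stand-in for the expensive origin query.
    private String loadFromSource(String key) {
        return "value-for-" + key;
    }
}
```

Note that the waiting callers here simply block on the lock, which is exactly the question the next paragraph raises.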

Using locks solves the contention problem, but it creates another problem. How do you handle all the threads waiting for the lock to be released?

Do you use a spinlock and have the threads continuously poll for the lock? That creates a busy-wait scenario.

Do you have the threads sleep for an arbitrary amount of time before checking whether the lock is free? That brings the stampede problem right back.

Do you introduce backoff and jitter to prevent the stampede? That may work, but there is a more fundamental problem: the thread that holds the lock has to recompute the value and update the cache key before releasing the lock.

This process can take a while, especially if the value is expensive to compute or there are network issues. If the available connection pool is exhausted and user requests get dropped in the meantime, this can still cause an outage.

Fortunately, some top companies are using a simpler solution: Promise.

How Promises prevent spin-locking

To quote Thundering Herds & Promises from Instagram’s engineering blog:

In the case of Instagram, when we set up a new cluster we would run into a cache stampede because the cluster’s cache was empty. We used Promises to help solve this: instead of caching the actual value, we cache a Promise that will eventually provide the value. When we use our cache atomically and get a miss, instead of going immediately to the backend we create a Promise and insert it into the cache. This new Promise then starts the work against the backend. The benefit is that other concurrent requests will not miss, because they will find the existing Promise, and all of those simultaneous workers will wait on the single backend request.

[Image source: instagram-engineering.com/thundering-…]

By caching the Promise rather than the actual value, no spin-locking is required. The first thread to get a cache miss creates and caches the Promise with an atomic operation (for example, Java’s computeIfAbsent). All subsequent fetch requests immediately receive that Promise.

We still need a lock to prevent multiple threads from touching the cache key at once. But assuming that creating a Promise is a near-instant operation, the amount of time a thread spends in the spinlock is negligible.
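A minimal Java sketch of the pattern, using a CompletableFuture as the Promise and computeIfAbsent as the atomic insert. The cache map and loadFromSource helper are hypothetical and not Instagram’s or DoorDash’s actual code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: cache a promise (CompletableFuture) instead of the value itself.
public class PromiseCache {
    private final ConcurrentMap<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    public CompletableFuture<String> get(String key) {
        // computeIfAbsent is atomic: only the first thread to miss creates the
        // promise and kicks off the backend call. Everyone else gets the same
        // in-flight future and simply waits on it.
        return cache.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> loadFromSource(k)));
    }

    // Hypothetical stand-in for the expensive backend query.
    private String loadFromSource(String key) {
        return "value-for-" + key;
    }
}
```

Callers can do promiseCache.get("user:42").join() or chain a callback; however many threads miss at once, only one backend request is in flight. A production version would also evict failed futures and apply a TTL to completed ones.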

This is DoorDash’s way of avoiding the cache stampede.

But what if recalculating the cache value takes a relatively long time? Even if threads can get the cached Promise immediately, they still have to wait for the asynchronous process to complete before returning a value.


While this situation does not necessarily count as an outage, it does affect tail latency and the overall user experience. If keeping tail latency low is important to your application, there is another strategy to consider.

Early recomputation

The idea behind early recomputation (also known as early expiration) is simple: the cached value is recomputed and its expiration extended before the key officially expires. This keeps the cache perpetually warm so that a cache miss never occurs.

The simplest implementation of early recomputation is a background process or scheduled task. For example, suppose a cache key has a time-to-live of one hour and its value takes two minutes to compute. A scheduled task could run five minutes before the hour is up, recompute the value, and extend the expiration by another hour.
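A minimal sketch of such a scheduled refresh using Java’s ScheduledExecutorService; the key name, intervals, and cache call are made up for illustration:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: refresh a cache entry on a schedule so it never actually expires.
public class ScheduledRefresher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // The key has a one-hour TTL; refresh it every 55 minutes,
        // leaving a margin for the ~2-minute recomputation.
        scheduler.scheduleAtFixedRate(
                () -> refreshValue("popular:feed"),
                0, 55, TimeUnit.MINUTES);
    }

    // Hypothetical: recompute the value and re-set it with a fresh one-hour TTL.
    private void refreshValue(String key) {
        String value = computeExpensiveValue(key);
        // cache.set(key, value, 60 * 60);  // pseudo-call to your cache client of choice
        System.out.println("Refreshed " + key + " -> " + value);
    }

    private String computeExpensiveValue(String key) {
        return "value-for-" + key;
    }
}
```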

While the idea is simple in theory, it has an obvious downside. Unless we know exactly which cache keys will be used, we would need to recompute every key in the cache, which can be a laborious and costly process. It also means maintaining yet another moving part that is hard to trace when it fails.

For these reasons, I couldn’t find any examples of this kind of early recomputation being used in production, with one exception.

Probabilistic early recomputation

In 2015, a team of researchers published a paper called Optimal Probabilistic Cache Stampede Prevention. In it, they describe an algorithm for optimally predicting when to recompute a cache value before it expires.

There’s a lot of math in the research paper, but the algorithm boils down to this:

```
currentTime - (timeToCompute * beta * log(rand())) > expiry
```
  • currentTime is the current timestamp.
  • timeToCompute is the time required to recompute the cache value.
  • beta is a value greater than 0. It defaults to 1 but is configurable.
  • rand() is a function that returns a random number between 0 and 1.
  • expiry is the future timestamp at which the cached value is set to expire.

The idea is that each time a thread fetches a value from the cache, it runs this check. If it returns true, the thread volunteers to recompute the value. The closer the value gets to its expiration time, the more likely the check is to return true.

While this strategy is not the easiest to understand, it is fairly simple to implement and does not require any additional moving parts. It also does not need to recalculate every value in the cache.
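Here is a minimal sketch of that check in Java, with times in milliseconds; the method and parameter names simply mirror the formula above rather than any particular library:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the probabilistic early-expiration check:
// the closer we are to expiry, the more likely this returns true.
public final class EarlyRecompute {

    // timeToComputeMillis: how long recomputing the value takes
    // beta: greater than 0, defaults to 1; higher values favour earlier recomputation
    // expiryMillis: timestamp at which the cached value expires
    public static boolean shouldRecompute(long timeToComputeMillis, double beta, long expiryMillis) {
        double random = ThreadLocalRandom.current().nextDouble(); // in (0, 1)
        // log(random) is negative, so subtracting this term pushes "now" forward in time.
        double effectiveNow = System.currentTimeMillis()
                - (timeToComputeMillis * beta * Math.log(random));
        return effectiveNow > expiryMillis;
    }
}
```

On a cache hit, the caller runs shouldRecompute; if it returns true, that caller recomputes the value and pushes the expiry forward, otherwise it simply serves the cached value.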

The Internet Archive began using this method after an outage during the 2016 presidential debates. This RedisConf17 presentation tells the story in more depth and gives a good overview of how probabilistic early recomputation works. I highly recommend watching the video.


However, early recomputation assumes there is an existing value to recompute, so by itself it will not prevent a follower stampede. For that, we need to combine it with locks and Promises.

What can be done to stop an ongoing stampede?

One of the reasons Facebook’s cache stampede was so destructive was that even once the engineers had found a solution, they couldn’t deploy it because the stampede was still in progress.

From the post-mortem:

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted it as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that did not allow the databases to recover.

The reality is that prevention is never guaranteed to work; we also need mitigation. Defensive programming dictates having a plan in place for the case where a stampede slips past the safeguards we have set up.

Fortunately, there is a known pattern for dealing with this problem.

Circuit breakers

The idea of using circuit breakers in software is not new. It was popularized by Michael Nygard’s 2007 book Release It!. As Martin Fowler wrote in his article CircuitBreaker:

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.

Circuit breakers are reactive, which means they won’t prevent an outage, but they will prevent cascading failures. They provide a kill switch for when things get out of hand. If Facebook had used circuit breakers, it might have avoided taking the entire site offline.
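To make the pattern concrete, here is a minimal hand-rolled sketch of a circuit breaker in Java (closed, open, and a trial call after a cool-down). In production you would reach for one of the libraries mentioned below rather than rolling your own:

```java
import java.util.function.Supplier;

// Minimal circuit breaker sketch: after too many failures the circuit "opens"
// and calls fail fast, until a cool-down period allows a trial call through.
public class CircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final long cooldownMillis;

    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    public synchronized <T> T call(Supplier<T> protectedCall) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < cooldownMillis) {
                throw new IllegalStateException("Circuit is open; failing fast");
            }
            // Cool-down elapsed: let one trial call through (the "half-open" step).
        }
        try {
            T result = protectedCall.get();
            reset();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            throw e;
        }
    }

    private void reset() {
        failureCount = 0;
        state = State.CLOSED;
    }

    private void recordFailure() {
        failureCount++;
        if (failureCount >= failureThreshold) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }
}
```

Wrapping the database query in something like breaker.call(() -> database.load(key)) means that once the database starts drowning, callers fail fast instead of piling more load on top of it.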


Admittedly, circuit breakers weren’t all that popular in 2010. Today, there are several libraries and proxies that provide them, such as Resilience4j, Istio, and Envoy. Some organizations use them in production, such as Netflix and [Lyft](www.getambassador.io/resources/mechanism-effect-envoy-lyft-mate-klein/).

What lessons has Facebook learned?

I’ve talked a lot in this article about different strategies for tackling cache stampedes and how other tech companies use them. But what about Facebook itself?

What lessons have they learned from the failure, and what safeguards have they put in place to prevent it from happening again?

Their engineering post, Under the hood: Broadcasting live video to millions, discusses the improvements they have made to their architecture. It covers things we’ve already discussed, such as cache hierarchies, but it also includes some novel approaches, such as HTTP request merging. The post is worth a read, but if you’re short on time, this video provides a comprehensive overview.

Facebook has arguably learned from their past mistakes.


Some thoughts

While I believe it is worth understanding how a cache stampede can wreak havoc on a system, I don’t think every engineering team needs to add these measures right away. How we choose to handle cache stampedes depends on the use case, architecture, and traffic load of our project.

However, if we ever find ourselves fighting a stampede, knowing what a cache stampede is and what the possible solutions are will serve us well.
