Author: ZacharyZF

Source: Cross-boundary architect



“Write the DB first, or the cache first?” As soon as you start using a cache, this is the first question you need to think about seriously; otherwise, disaster awaits…

Write DB first or cache first?

A program can live without a cache, but it must have a database. This is the common belief, so the importance of the database always sits at the front of your mind.


DB first, then cache

Without thinking too hard, you might say: if the database write fails, then naturally the cache is never touched; if the database write succeeds, then operate on the cache. No problem so far.

But how do you handle the case where the database operation succeeds and the cache operation fails?

This mostly happens with out-of-process caches such as Redis and Memcached, where network factors increase the probability of failure.

One option is to wrap the database write in a transaction and roll it back if the cache operation fails. The code looks roughly like this:
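In Python-flavored form, the transactional version might be sketched as follows (a hypothetical illustration: `CacheError` and the dict-based `db`/`cache` are stand-ins, not a real driver API):

```python
class CacheError(Exception):
    """Stands in for a network/timeout failure of the out-of-process cache."""
    pass

def write_with_rollback(db, cache, key, value):
    """Write the DB first, then set the cache; roll the DB back if the set fails."""
    old = db.get(key)                # remember the old value for the rollback
    db[key] = value                  # "write db" inside the transaction
    try:
        cache[key] = value           # "set cache": may raise on network failure
    except CacheError:
        if old is None:
            db.pop(key, None)        # key did not exist before: undo the insert
        else:
            db[key] = old            # rollback db (this rollback itself may fail!)
        return False
    return True
```

The rollback branch is exactly where the "extreme scenario" lives: nothing guarantees that the rollback write succeeds either.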

Is that bulletproof? It is not. Besides the extra load that the transaction puts on the database, in extreme cases the DB rollback itself can fail. A headache, isn't it?

The solution is to write the cache with a delete instead of a set. You trade the cost of one extra cache miss for never having to worry about a failed DB rollback.

As shown below:

As the figure shows, even if the rollback fails, the next cache miss simply reloads whatever value the DB actually holds.
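A minimal sketch of this delete-based flow, with plain dicts standing in for the DB and the cache:

```python
def write_with_delete(db, cache, key, value):
    """Update the DB, then *delete* the cache entry instead of setting it.

    Even if a DB rollback were to fail, the next read just cache-misses and
    reloads whatever the DB actually holds, so no stale value can stick."""
    db[key] = value
    cache.pop(key, None)       # delete instead of set: cost is one extra miss

def read(db, cache, key):
    """Read path: try the cache first; on a miss, load from DB and backfill."""
    if key in cache:
        return cache[key]
    value = db.get(key)        # cache miss: reload from the source of truth
    if value is not None:
        cache[key] = value
    return value
```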

Aside: there is actually a technical term for this: the Cache-Aside pattern.


For easy memorization, you can link it to the CAP theorem from distributed systems and call it the “CAP pattern for caching” (Cache-Aside Pattern happens to abbreviate to CAP as well).


Looks good, right? Can we ship it?


If your database is not highly available, this is fine. But once the database is made highly available, data synchronization between the master and slave databases comes into play, and that creates a new problem.

Side note: don't chase cool technology for its own sake; you may lose more than you gain and simply be asking for trouble.

What problem? If a change has not yet been synchronized to the slave, a cache miss will load the old value from the slave and write it back into the cache.


The first way to deal with it is simple and brutal: periodically read data from the slave and, whenever it differs from the cache, set the cache to the slave's value.
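One pass of that reconciliation job could look like this (a sketch; the dicts stand in for the slave and the cache, and in practice this would run on a timer):

```python
def reconcile(slave_db, cache):
    """One pass of the 'simple and brutal' fix: compare every cached entry
    with the slave and overwrite any entry that differs.
    Returns the number of entries corrected."""
    fixed = 0
    for key, cached in list(cache.items()):
        fresh = slave_db.get(key)
        if fresh != cached:
            cache[key] = fresh     # re-set the cache from the slave's value
            fixed += 1
    return fixed
```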


But this approach is a band-aid. Constantly polling the database consumes a lot of resources, and there is no good universal choice for the polling interval: too short, and the number of redundant reads explodes; too long, and the window during which cache and database disagree grows.

So this solution only suits projects where just two or three places need this treatment, and it is unsuitable for data that changes frequently. At a high modification frequency, the polling mechanism may even end up consuming more resources than the main program itself.

A more general approach is to treat the root cause instead of the symptom: perform an extra delete-cache (or set-cache) operation at the moment the slave finishes applying the synchronized change.


This does not eliminate transient inconsistency 100%, but it shrinks the window during which dirty data exists to the master/slave synchronization delay, and it avoids a lot of wasted resources.
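A sketch of that hook, assuming some callback is fired when the slave applies a replicated change (the event shape here is made up for illustration; in real systems this would typically come from a binlog subscription):

```python
def on_replicated(event, cache):
    """Hook invoked when the slave finishes applying one replicated change.
    Deleting the cache entry here limits the dirty-data window to the
    master/slave replication delay."""
    cache.pop(event["key"], None)

def apply_replication(master, slave, cache, key):
    """Simulate one master -> slave sync step, then fire the hook."""
    slave[key] = master[key]
    on_replicated({"key": key}, cache)
```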

You might say, “No, I can't tolerate even that window.” There is a way, but it increases the pressure on the master: for a short period after a database write, force cache loads to read from the master.

How? You need a piece of shared storage: either the database itself or a distributed cache like the one we have been talking about.

After the transaction commits, you temporarily store a marker like this in shared storage:

{ key = dbname + tablename + id,value = null,expire = 3s }

and then delete the cache as usual.

```
begin trans
    var isDbSuccess = write db;
    if (isDbSuccess) {
        var isCacheSuccess = delete cache;
        if (isCacheSuccess) {
            return success;
        } else {
            rollback db;
            return fail;
        }
    } else {
        return fail;
    }
catch (Exception ex) {
    rollback db;
}
end trans

// Do the temporary storage here: { key, value, expire }.
delete cache;
```

This way, when a cache miss occurs on a read, the code checks whether the marker exists and, within those 3 seconds, forces the load to go to the master.
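A sketch of the marker scheme (the `force:` key prefix and the `time.monotonic`-based expiry are illustrative assumptions):

```python
import time

def write_db(master, cache, shared, key, value, marker_ttl=3.0):
    """After the transaction commits, drop a short-lived marker into shared
    storage, then delete the cache entry."""
    master[key] = value                                      # transaction commits
    shared["force:" + key] = time.monotonic() + marker_ttl   # temp marker, ~3s
    cache.pop(key, None)                                     # then delete cache

def read(master, slave, cache, shared, key):
    """On a cache miss, read the master while the marker is alive, else the slave."""
    if key in cache:
        return cache[key]
    deadline = shared.get("force:" + key)
    source = master if (deadline and time.monotonic() < deadline) else slave
    value = source.get(key)
    cache[key] = value
    return value
```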

As you can see, different solutions have their own advantages and disadvantages, and need to be carefully weighed according to the specific scenario.


Cache first, then DB

Most scenarios in real work have a low tolerance for inaccurate data, so “cache first, then DB” is generally not recommended, because memory is volatile. The problem arises when the cache operation succeeds but the DB operation fails.


At that point the latest data exists only in the cache. What do you do? Keep retrying the database write on a separate thread?

This scheme works up to a point, but it does not suit scenarios that demand high data accuracy: if the cache goes down before the retry succeeds, the data is simply lost!

Off-topic: even if you choose this option, make sure there is only one retry thread; otherwise two concurrent retries for the same key can land in “ABBA” order and the database ends up holding the older value.
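A sketch of the single retry thread (queue-based; the `db_ok` flag is only there to simulate a DB failure for illustration):

```python
import queue
import threading

def start_retry_writer(db, retry_q):
    """Exactly one background thread replays failed DB writes in order.
    With two or more such threads, retries for the same key could
    interleave (the ABBA problem) and the older value could win."""
    def worker():
        while True:
            item = retry_q.get()
            if item is None:          # shutdown sentinel
                break
            key, value = item
            db[key] = value           # retry the DB write
            retry_q.task_done()
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

def write_cache_first(cache, db, retry_q, key, value, db_ok=True):
    """Cache first, then DB; on DB failure, hand the write to the retry thread."""
    cache[key] = value
    if db_ok:
        db[key] = value
    else:
        retry_q.put((key, value))
```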

What if you use delete cache instead of set? Does that solve it?

Yes, but only if access to the cache never happens concurrently.

As soon as your program runs multi-threaded, a race appears: a reading thread can cache-miss, load the old value from the database before the writing thread's DB write lands, and then backfill that stale value into the cache.

As shown below:


The upshot: even with delete cache, you either take a lock (a distributed lock, if there are multiple clients) or accept that data inconsistency is unavoidable.

It is important to note that if the database is also highly available, then even with the lock you still have to consider the master/slave synchronization time gap discussed above.
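A sketch of what the in-process lock version might look like (for multiple client processes, `threading.Lock` would have to be replaced by a distributed lock):

```python
import threading

class LockedCacheAside:
    """Serialize the 'miss -> load from DB -> backfill' read path against
    writes, so a reader cannot backfill a stale value it loaded before a
    concurrent write committed."""
    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.lock = threading.Lock()

    def write(self, key, value):
        with self.lock:
            self.db[key] = value
            self.cache.pop(key, None)   # delete cache under the same lock

    def read(self, key):
        if key in self.cache:           # fast path: no lock on a cache hit
            return self.cache[key]
        with self.lock:
            value = self.db.get(key)
            self.cache[key] = value     # backfill cannot interleave with write
            return value
```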

Of course, “cache first, then DB” is not worthless. It works well in scenarios where write speed is critical but data accuracy is not.

To summarize: database high availability is usually introduced later in a system's life than caching, so before the database is highly available, I recommend “DB first, then cache”, with a delete-cache operation instead of a set. That way you can basically rest easy.

However, once the database is highly available and the team has grown to a certain size, do the binlog subscription properly.

If you already use a distributed cache, do you still need a local cache? Let's look at that question next.


Do you want local cache?

Before we answer that question, what is the most important value of a distributed system?

It's the ability to “scale out without limit”: as long as you can keep adding hardware, you can keep up with business growth. One prerequisite for that is that the program is “stateless.”

So if you want to introduce caching for speed while staying “stateless,” the answer is a distributed cache.

Therefore, any problem that a distributed cache can solve should not be solved with a local cache; otherwise, much of the point of introducing the distributed cache is lost.

There are a few scenarios where a local cache is genuinely useful, but they need to be identified carefully. There are three main ones:

  1. Data that changes infrequently. (For example, updated once a day or even less often.)

  2. Scenarios that must support very high concurrency. (For example, flash sales.)

  3. Scenarios that tolerate some data inaccuracy. (Page views, comment counts, and so on.)

However, I recommend that you avoid introducing local caching as much as possible, except for the second scenario. The reason for this is explained below.

The fundamental problem is this: once a local cache is introduced, how do you keep the local (in-process) cache, the distributed (out-of-process) cache, and the database consistent?


Data consistency between local cache, distributed cache and DB

In a single-node application this is easy: just update the local cache last.

You might ask: what if the local cache update fails, say with a duplicate-key exception? Then you should reflect on why data like that was allowed into the database in the first place…

However, the big problem with local caching is not a single node. It is: how do multiple nodes keep their local caches in sync?

There are two ways to solve this problem: either the receiving node notifies other nodes of the change (either via RPC or MQ), or a consistent hash allows requests from the same source to be pinned to the same node.

The latter prevents local cache data from being duplicated on different nodes, avoiding this problem in the first place.

But both schemes are extremes. The former is too costly: notifying thousands of nodes about every change is unacceptable. The latter is resource-hungry and prone to uneven load distribution.

Therefore, the former can be considered when the system is small, while the latter will be chosen when the system is large.
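The consistent-hash option can be sketched like this (a hypothetical Python illustration; the node names, virtual-node count, and MD5 as the hash function are all assumptions):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Pin requests for the same key to the same node, so each datum's local
    cache lives on exactly one node. Virtual nodes smooth out the load."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Because routing is deterministic, a given key's local cache is never duplicated across nodes, which is exactly why the synchronization problem disappears.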

There is also a middle-ground scheme that trades data accuracy for cost: give local cache entries a periodic expiry, or periodically pull the latest data from the downstream distributed cache.

This is the same logic as the polling mechanism mentioned under “DB first, then cache”; the drawback is a longer window of data inconsistency.
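A sketch of the TTL-based local cache (the explicit `now` parameter exists only to make the expiry behavior easy to demonstrate):

```python
import time

class TtlLocalCache:
    """Local cache entries carry an expiry; on expiry, the value is re-pulled
    from the downstream distributed cache, trading accuracy for cost."""
    def __init__(self, distributed, ttl=5.0):
        self.distributed = distributed
        self.ttl = ttl
        self.store = {}                     # key -> (value, expires_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(key)
        if entry and now < entry[1]:
            return entry[0]                 # still fresh locally
        value = self.distributed.get(key)   # expired/missing: pull downstream
        self.store[key] = (value, now + self.ttl)
        return value
```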


To summarize: local-cache data consistency can be solved completely with the help of consistent hashing, but the cost is relatively high. So think twice before adopting a local cache unless you really have to.


conclusion

All right, let’s sum it up.

This time, I spent a lot of time discussing with you the question of “write DB first or cache first” and taking you through the layers and explaining the different solutions bit by bit.

Then I discussed with you the meaning of “local cache” and how to do a good job of data consistency on the basis of “distributed cache” and “database”, which is mainly the data synchronization between multiple local cache nodes.

I hope it inspires you.

This caching exercise is a good example of how refining one part of a system invites further refinement, and brings new complexity with it.

So as a technical person, keep weighing trade-offs at every step instead of simply echoing what others do.

END
