Caching is a powerful weapon in high-concurrency system architecture: with a cache, a system can easily handle tens of thousands of concurrent requests. But while we enjoy the benefits of caching, guaranteeing data consistency between the database and the cache has always been a hard problem. This article shares how to safeguard cache consistency in system architecture.

Overview

Before introducing how to solve the database and cache consistency problem, let's look at two questions: what the data consistency problem between the database and the cache is (What), and why it occurs (Why).

What are the data consistency issues between databases and caches

Let's start by understanding what the data consistency problem we keep talking about actually is. Anyone who works on distributed systems should be familiar with the CAP theorem: C stands for Consistency, A for Availability, and P for Partition tolerance, and the theorem states that at most two of the three can be achieved at the same time, never all three. Consistency here is defined as whether all replicas of a piece of data in a distributed system have the same value at the same moment.

Accordingly, we can treat the data in the database and in the cache as two replicas of the same data, and data consistency between the database and the cache is equivalent to keeping those two replicas consistent.

Why do database and cache data consistency problems occur

In business development, we generally rely on the ACID properties of database transactions to ensure data consistency. In a distributed environment there is no comparable transaction guarantee, so partial failures happen easily: the database update succeeds but the cache update fails, or the cache update succeeds but the database update fails, and so on. The causes of database and cache inconsistency can be summarized as follows.

Network

In distributed systems, the network is assumed to be unreliable. Under the CAP theorem, failures caused by the network are considered inevitable, which is why system designs generally choose CP or AP. Operating on the database and on the cache both involve network I/O, and network instability can easily fail some of the requests, leaving the data inconsistent.

Concurrency

In a distributed environment, requests are processed concurrently by multiple server nodes without explicit synchronization. Consider the following example: two concurrent requests both update field A. Process 1 updates field A to 1 in the database and then sets the cache to 1; process 2 updates field A to 2 in the database and then sets the cache to 2. Because ordering cannot be guaranteed under concurrency, the interleaving below can occur, and the end result is that field A has the value 2 in the database but 1 in the cache: the database and the cache are inconsistent.

| Time | Process 1 | Process 2 |
| --- | --- | --- |
| T1 | Update database field A = 1 | |
| T2 | | Update database field A = 2 |
| T3 | | Update cache KEY A = 2 |
| T4 | Update cache KEY A = 1 | |

Cache read/write patterns

In engineering practice, there are several common patterns for reading and writing a cache.

Cache Aside

Cache Aside is probably the most common pattern for updating the database and the cache in business code. Its main logic is as follows.

Determine the request type and perform different processing for read and write requests:

Write request: update the database and invalidate the cache after success.

Read request: first query the cache. On a cache hit, return the cached value directly; on a cache miss, query the database, write the result back into the cache, and return it (as sketched below).
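
To make the pattern concrete, here is a minimal Java sketch of Cache Aside. The redis and db clients, the key scheme, and the User type are hypothetical placeholders rather than any specific library's API:

// A minimal Cache Aside sketch; redis, db and User are hypothetical.
public User getUser(long id) {
    String key = "user:" + id;
    User cached = redis.get(key);          // 1. query the cache first
    if (cached != null) {
        return cached;                     // cache hit: return directly
    }
    User user = db.queryUser(id);          // 2. cache miss: query the database
    if (user != null) {
        redis.setEx(key, user, 60);        // 3. write back with a 60s expiration
    }
    return user;
}

public void updateUser(User user) {
    db.updateUser(user);                   // 1. update the database first
    redis.del("user:" + user.getId());     // 2. then invalidate the cache
}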

This pattern is relatively simple to implement and seems logically sound. In Java, the read-request logic is typically implemented with AOP to avoid code duplication. However, Cache Aside can still cause data consistency problems in concurrent environments, as in the table below.

| Time | Read request | Write request |
| --- | --- | --- |
| T1 | Query cache for field A; miss | |
| T2 | Query database; get A = 1 | |
| T3 | | Update database field A = 2 |
| T4 | | Invalidate the cache |
| T5 | Set the value of cache A to 1 | |

The read request queries the cache for field A and misses, then queries the database and gets A = 1. Meanwhile, a concurrent write request updates field A to 2 and invalidates the cache, but that invalidation runs before the read request writes its value into the cache, so the read request ends up setting the cache to the now-stale value 1. The result is dirty data in the cache, and if no expiration time is set, the cached value stays wrong forever.

Read Through

Read Through mode is very similar to Cache Aside, except that in Cache Aside, when a read request misses the cache, you implement the query-the-database-and-update-the-cache logic yourself, whereas in Read Through you don't need to worry about that logic: the code only talks to the cache service, and the cache service itself implements the loading. A common Java example is Guava's loading cache; see the following code.

// Build a cache whose entries are loaded by the CacheLoader on a miss
LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
       .maximumSize(1000)
       .build(
           new CacheLoader<Key, Graph>() {
             public Graph load(Key key) throws AnyException {
               // invoked automatically when get() misses
               return createExpensiveGraph(key);
             }
           });
...
try {
  // get() returns the cached value, loading it via the CacheLoader if absent
  return graphs.get(key);
} catch (ExecutionException e) {
  throw new OtherException(e.getCause());
}

In this code, Guava's CacheLoader loads the cache for us. In a read request, if a Cache Miss occurs when the get method is invoked, the CacheLoader loads the value into the cache; our code simply receives the Graph object and does not care about the underlying cache-loading details.

There is no logical difference between the Read Through mode and the Cache Aside mode; the Read Through code is just much cleaner. Accordingly, Read Through has the same concurrent database and cache inconsistency problem as Cache Aside.

Write Through

The Write Through mode is the write-side counterpart of the Read Through mode: all write operations go through the cache first, and the subsequent handling depends on whether the write hits the cache.

Write-through: write is done synchronously both to the cache and to the backing store.

Wikipedia's definition of the Write Through mode emphasizes that in this mode a write request is written synchronously to both the cache and the database, and a write counts as successful only when both the cache and the database writes succeed. The main logic is as follows.

In Write Through mode, the cache is only populated by read requests on a miss: when a write request misses the cache, the data is written directly to the database without updating the cache; when a write request hits the cache, the cache is updated first, and the cache itself writes the data through to the underlying database. Here is an example with Ehcache, whose CacheLoaderWriter interface supports the Write Through mode. This interface defines a series of cache lifecycle hooks, including the following two methods:

public interface CacheLoaderWriter<K, V> {

    // called when a single mapping is written to the cache
    void write(K key, V value) throws Exception;

    // called when multiple mappings are written to the cache in bulk
    void writeAll(Iterable<? extends Entry<? extends K, ? extends V>> entries) throws BulkCacheWritingException, Exception;
}

Only these two write-related methods need to be implemented for data to be written to the underlying database whenever the cache is updated. In other words, the code only interacts with the CacheLoaderWriter and does not need to implement both the update-the-cache and write-to-the-database logic itself.
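
As an illustration, here is a minimal CacheLoaderWriter sketch, assuming Ehcache 3 (where the bulk loadAll/writeAll/deleteAll methods have default implementations) and a hypothetical UserDao:

// A minimal sketch; UserDao and User are hypothetical placeholders.
public class UserLoaderWriter implements CacheLoaderWriter<Long, User> {

    private final UserDao userDao;

    public UserLoaderWriter(UserDao userDao) {
        this.userDao = userDao;
    }

    @Override
    public User load(Long key) throws Exception {
        // Read Through: invoked by the cache on a miss
        return userDao.findById(key);
    }

    @Override
    public void write(Long key, User value) throws Exception {
        // Write Through: invoked by the cache when an entry is written
        userDao.save(value);
    }

    @Override
    public void delete(Long key) throws Exception {
        // invoked by the cache when an entry is removed
        userDao.deleteById(key);
    }
}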

The logic of the Write Through mode mirrors that of the Read Through mode, so the two modes are naturally used together.

Does the Write Through mode have the same consistency issues as the Read Through mode in concurrent scenarios? Obviously it does, and for a similar reason: in concurrent scenarios, the ordering of updating the database and updating the cache cannot be guaranteed.

Write Back

Write-back (also called write-behind): initially, writing is done only to the cache. The write to the backing store is postponed until the modified content is about to be replaced by another cache block.

Wikipedia's definition of the Write Back mode says that this mode only writes to the cache during a write request, and writes the data to the underlying database only when the cached data is about to be evicted from memory. There are two main differences between the Write Back mode and the Write Through mode:

  1. In Write Through mode, data is written to the cache and the database synchronously; in Write Back mode, only the cache is written synchronously, and the data is written from the cache to the underlying database asynchronously and in batches.
  2. In Write Back mode, when a write request misses the cache, the data is still written into the cache (write allocate), which differs from the Write Through mode; as a result, write misses in Write Back mode are handled much like read misses.

The implementation of the Write Back mode is complicated, mainly because the mode needs to track which data is "dirty" and write it to the underlying store when necessary, and if there are multiple updates it also needs to merge them into batched writes. The implementation diagram is not reproduced here; if you are interested, check out the diagram on Wikipedia. Because writes are asynchronous, the benefit of the Write Back pattern is high performance, and the disadvantage is that data consistency between the cache and the database is not guaranteed.
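
To show the idea only (this is a toy sketch, not Ehcache's or any real implementation; imports from java.util and java.util.concurrent are omitted), here is what write-behind looks like in Java: writes touch only the cache and mark the key dirty, and a background thread flushes dirty entries to a hypothetical Db client in batches, so repeated updates to one key collapse into a single write:

// Toy write-behind sketch; Db is a hypothetical client interface.
class WriteBehindCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final Set<String> dirty = ConcurrentHashMap.newKeySet();
    private final Db db;

    WriteBehindCache(Db db) {
        this.db = db;
        ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
        // flush dirty entries once per second
        flusher.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    void put(String key, String value) {
        cache.put(key, value); // write only to the cache
        dirty.add(key);        // defer the database write
    }

    private void flush() {
        List<String> batch = new ArrayList<>(dirty);
        dirty.removeAll(batch);
        for (String key : batch) {
            db.write(key, cache.get(key)); // batched write-back of latest values
        }
    }
}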

Some thoughts

Looking at the implementations of these three patterns, you can see the main implementation differences: whether to delete or update the cache, and whether to operate on the cache first or the database first. The table below lists all the combinations under partial failure, where 1 means the cache remains consistent with the database and 0 means it does not.

| | Cache operation fails | Database operation fails |
| --- | --- | --- |
| Update the cache first, then the database | 1 | 0 |
| Update the database first, then the cache | 0 | 1 |
| Delete the cache first, then update the database | 1 | 1 |
| Update the database first, then delete the cache | 0 | 1 |

The cases above assume the cache operation is not placed inside the database transaction. (In general it is not recommended to put operations such as RPC calls or Redis commands inside a database transaction, because such operations depend on unreliable external factors like the network; once they go wrong, they can easily prevent the transaction from committing or create "long transaction" problems.)

It can be seen that only the "delete the cache first, then update the database" pattern guarantees data consistency under partial failure, which gives us conclusion 1: "delete the cache first, then update the database" is the best choice. However, because this pattern deletes the cache rather than updating it, it can lead to cache breakdown, which is discussed later.

In addition, we can observe that the Cache Aside / Read Through / Write Through modes all become inconsistent with the database in concurrent scenarios, and the root cause is the same: under concurrency, the ordering of updating the database and writing the cache cannot be guaranteed, so a cache write computed before a database update can land after it, leaving old data in the cache. This gives us conclusion 2: any pattern that solves this ordering problem can guarantee data consistency between the cache and the database in a concurrent environment.

Finally, looking at the three patterns above, once the cache and the database become inconsistent, if the data is never updated again, the value in the cache stays wrong forever, with no remedy mechanism. This gives us conclusion 3: there needs to be some kind of automatic cache refresh mechanism. The simplest is to set an expiration time on the cache, which acts as a backstop in case the data stays wrong after an inconsistency. Based on these three conclusions, the following two patterns are introduced.

Delayed double delete

The delayed double-delete pattern can be regarded as an optimized version of the Cache Aside mode. The main implementation logic is as follows.

For a write request, following conclusion 1 above, first delete the cache, then update the database, and then send an MQ message. The MQ message can be sent by the service itself, or produced by message middleware that listens to the DB binlog for changes. After the message is received, it must be delayed for a period of time; the delay can be implemented with the message queue's delayed-message feature, or the consumer can sleep for a while after receiving the message. The cache is then deleted a second time. The pseudocode is shown below.

// Delete the cache
redis.delKey(key);
// Update the database
db.update(x);
// Send a delayed message, delayed by 1s
mq.sendDelayMessage(msg, 1s);
...
// Consume the delayed message
mq.consumeMessage(msg);
// Delete the cache again
redis.delKey(key);

The read-request logic is the same as in the Cache Aside mode: on a cache miss, the read request reloads the cache, and a reasonable expiration time should still be set for the cache.

Compared with Cache Aside, this reduces the likelihood of inconsistency between the cache and the database somewhat, but only somewhat. The problem is still there; it just requires a more specific interleaving.

| Time | Read request | Write request |
| --- | --- | --- |
| T1 | Query cache for field A; miss | |
| T2 | Query database; get A = 1 | |
| T3 | | Update database field A = 2 |
| T4 | | Invalidate the cache |
| T5 | | Send the delayed message |
| T6 | | Consume the delayed message and invalidate the cache |
| T7 | Set the value of cache A to 1 | |

Since message consumption and read requests happen concurrently, the ordering between the cache invalidation triggered by the delayed message and the cache write in the read request is still not guaranteed, so data inconsistency is still possible, albeit less likely.

Synchronous invalidation and update

Combining the advantages and disadvantages of the modes above, I adopted another pattern in actual project practice, which I call the "synchronous invalidation and update" mode. The main implementation logic is as follows.

The idea of this mode is that read requests only read the cache, while all operations on the cache and the database are placed in the write request and executed synchronously. To prevent concurrent write requests from interleaving, a distributed lock must be acquired before the write operations, and the subsequent steps may proceed only after the lock is obtained. This eliminates all possibility of data inconsistency caused by concurrency.

The distributed lock can be scoped to the dimension of the cache; there is no need for a global lock. For example, if the cache is at the order dimension, the lock can be at the order dimension too, and if the cache is at the user dimension, the distributed lock can be at the user dimension. Taking orders as an example, the write-request pseudocode is as follows:

// Acquire the distributed lock at the order dimension
lock(orderID) {
    // Delete the cache first
    redis.delKey(key);
    // Then update the database
    db.update(x);
    // Finally set the cache again, with an expiration time
    redis.setEx(key, value, 60s);
}

This mode basically ensures data consistency between the cache and the database. In terms of performance, read requests are essentially unaffected; write requests take some hit because they write data to the database and the cache synchronously. Of course, this mode also has disadvantages, mainly the following two:

Write requests rely heavily on distributed locks

In this mode, write requests depend strongly on the distributed lock: if acquiring the lock fails in the first step, the whole request fails. In normal business flows, database transactions are generally used to ensure consistency, and in some key business scenarios distributed locks are used in addition to transactions. Distributed locks are therefore already present in many business scenarios and should not be counted as an extra dependency. Moreover, large companies generally have mature distributed-lock services or components; even without one, the cost of building a simple distributed lock on Redis or ZooKeeper is low, and its stability is basically assured. In my own practice with projects using this pattern, the distributed lock has rarely caused problems.
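
For illustration, here is the write path guarded by a simple Redis-based lock, in the same pseudocode style as the snippet above. The lock key, the token, and the exception type are hypothetical, and note that in production the check-and-delete on unlock must be made atomic (typically with a Lua script):

// Try to acquire the order-dimension lock: SET key token NX EX
ok = redis.set("lock:order:" + orderID, token, NX, EX, 10s);
if (!ok) {
    // Lock not acquired: fail fast, or retry with a backoff
    throw new ConcurrentUpdateException();
}
try {
    redis.delKey(key);             // delete the cache first
    db.update(x);                  // then update the database
    redis.setEx(key, value, 60s);  // finally set the cache again
} finally {
    // Release only our own lock; make this check-and-delete atomic in production
    if (token.equals(redis.get("lock:order:" + orderID))) {
        redis.delKey("lock:order:" + orderID);
    }
}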

Write requests fail to update the cache, causing cache breakdown

Because the synchronous invalidation and update mode pursues cache-database consistency, it puts all cache and database writes in the write request, which avoids the inconsistency caused by multiple actors operating on the cache and the database concurrently. Read requests therefore treat the cache as read-only and do not reload it even on a Cache Miss.

However, because of this design, if a write request fails to update the cache, the data will not be loaded into the cache until a later write request succeeds, and in the meantime all read requests go directly to the DB, causing a cache breakdown problem. Internet services are typically read-heavy and write-light, so the chance of such a breakdown is relatively high.

The solution to this problem is compensation, such as scheduled-task compensation or MQ-message compensation, and the compensation can be either incremental or full; in my experience, this kind of compensation works well as a remedy.
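
As one possible shape for such compensation, here is a minimal sketch of a scheduled compensation task, assuming a hypothetical orderDao that can list recently updated orders and the same pseudocode-style redis client used above:

// Scheduled compensation sketch; orderDao and Order are hypothetical.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    // Re-sync the cache for orders updated in the last 5 minutes
    for (Order order : orderDao.listUpdatedSince(Duration.ofMinutes(5))) {
        // Overwrite the cache with the authoritative database value
        redis.setEx("order:" + order.getId(), order, 60); // 60s TTL
    }
}, 0, 1, TimeUnit.MINUTES);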

Some other concerns

With a reasonable cache read/write pattern in place, let’s look at some other issues that need to be addressed to ensure data consistency between the cache and the database.

Avoid problems that can crash the cache server and lead to data inconsistency

  1. Cache penetration, cache breakdown, and cache avalanche

As mentioned earlier, the "delete the cache first, then update the database" mode can suffer from cache breakdown. Besides cache breakdown, the related problems are cache penetration and cache avalanche. All of these can crash the cache server and result in inconsistent data. Let's look at the definitions of these problems and some conventional solutions.

| Problem | Description | Solutions |
| --- | --- | --- |
| Cache penetration | Queries for a nonexistent key can never hit the cache, so every request goes to the DB, which can crash the database | 1. Cache empty objects. 2. Bloom filter |
| Cache breakdown | When a hot key expires at a certain moment, a large number of concurrent requests for that key can overwhelm the database in an instant | 1. Mutex (distributed lock): only one request at a time grabs the lock and reloads the cache. 2. "Never expire": not physically, but logically (e.g. a background task refreshes the value periodically) |
| Cache avalanche | Many keys are given the same expiration time and expire at once, sending a large number of requests to the database. The difference from cache breakdown: breakdown concerns a single key, avalanche concerns many keys | 1. Stagger cache expiration times, e.g. add a random offset. 2. Mutex (distributed lock): only one request at a time grabs the lock and reloads the cache |
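
As an example of the cache-penetration defenses above, here is a minimal Bloom filter sketch using Guava (already used earlier in this article); the key set and the sizing numbers are illustrative only:

// Build a Bloom filter over all valid keys, e.g. populated at startup
BloomFilter<String> validKeys = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
validKeys.put("user:1");
...
// On a read, reject keys that definitely do not exist before
// touching the cache or the database
if (!validKeys.mightContain(key)) {
    return null; // definitely nonexistent: skip both cache and DB
}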

In actual project practice, a 100% cache hit ratio is generally not pursued. Moreover, with the "delete the cache first, then update the database" mode, the interval between the two operations is normally very short, so there will not be a flood of requests penetrating to the database, and a small amount of cache breakdown is acceptable. However, a system with extreme concurrency, such as a flash sale (seckill) system, may be completely unable to accept cache breakdown; in that case, a mutex can be used to update the cache, or the cache operation can be placed inside the database transaction, so that the "update the database first, then update the cache" mode can be used and the breakdown problem avoided.

  2. Big key / hot key

Both the big-key and hot-key problems are service design problems and need to be solved from the perspective of service design. Big keys hurt performance; the solution is to split a big key into multiple keys, which effectively reduces the amount of data transferred per operation and improves performance.

Hot keys tend to concentrate load on a single cache server, which can crash it. Hot keys can be handled by adding replicas or by splitting one hot key into multiple keys.
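
For example, a common way to split a hot key is to keep N copies under suffixed keys so that reads spread across them; a sketch in the same pseudocode style as above, assuming writes refresh all copies:

// Write: refresh all N copies of the hot key
for (int i = 0; i < N; i++) {
    redis.setEx(key + ":" + i, value, 60s);
}
// Read: pick one copy at random to spread the load
int i = ThreadLocalRandom.current().nextInt(N);
value = redis.get(key + ":" + i);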

Conclusion

It should be noted that none of the modes introduced above provides complete data consistency; they can only be said to achieve eventual consistency in the business sense. If strong consistency must be guaranteed, distributed consistency algorithms such as 2PC, 3PC, Paxos, and Raft should be used. Finally, a summary of the patterns described above.

  • Systems without heavy concurrency, or that can accept data inconsistency between the cache and the database for a period of time: the Cache Aside / Read Through / Write Through modes.
  • Systems with a certain amount of concurrency, or moderate requirements on cache and database data consistency: the delayed double-delete mode.
  • Systems with high concurrency, or high requirements on cache and database data consistency: the synchronous invalidation and update mode.
  • Systems that require strong consistency of database data: distributed consistency algorithms such as 2PC, 3PC, Paxos, and Raft.

So, as the saying goes, there is no silver bullet in architecture. Architecture design requires making all kinds of trade-offs, so when choosing and designing a cache read/write mode, you need to consider the specific business scenario, such as how large the concurrency is and how strict the data consistency requirements are, apply these patterns flexibly, and adapt them where necessary. Set the big direction first, then fill in the details, and you can arrive at a good architecture design.

References

  • Cache update routines
  • Cache (computing) (Wikipedia)
  • Read-Through, Write-Through, Write-Behind, and Refresh-Ahead Caching
  • Cache Aside, Read Through, Write Through
  • Understand data consistency issues between MySQL and Redis