Persist in what others cannot persist in, and you will have what others cannot have.

 

This article was first published on the public account Programming Avenue; it appears on Juejin with a delay of one or two days.

Follow the Programming Avenue public account, stick to what we believe in, and let's grow together!

Analysis and Solution Design for Cache + Database Double-Write Inconsistency in High-Concurrency Scenarios

Redis is a very important part of the high-concurrency, high-availability architecture of enterprise systems. Redis mainly compensates for the limited concurrency of relational databases: it helps relieve the pressure on relational databases in high-concurrency scenarios and improves system throughput (exactly how Redis improves system performance and throughput will be discussed in detail later).

When actually using Redis, we will inevitably run into data inconsistency between the cache and the database under double writes, and it is a problem we must consider. If you are not familiar with it yet, pull up a small bench and listen.

I. How database and cache double-write inconsistency arises

Before solving the database + cache double-write inconsistency problem, we need to explain how it occurs. We will use the inventory service of an e-commerce system, which has strict real-time requirements on its data, to illustrate it.

Inventory may be modified, and every time the database is modified the cached data must be updated accordingly. Whenever the inventory data in the cache expires or is cleared, front-end requests for inventory data go to the inventory service, which fetches the corresponding data again.

So do we simply update the Redis cache whenever we write to the database? Actually no, because it is not that simple. This is exactly where the database and cache double-write inconsistency problem comes in. Below, using the highly real-time inventory service as the example, I will share the double-write inconsistency problem and its solutions.

II. Inconsistency problems at various levels and their solutions

1. The most elementary cache inconsistency problem and its solution

The problem

If you modify the database first and then delete the cache, there is a problem: if deleting the cache fails, the database ends up with new data while the cache still holds old data, and the two become inconsistent.

Solution

Reverse the order: delete the cache first, then modify the database. When a read misses the cache, it queries the database and updates the cache with the latest inventory data. If deleting the cache succeeds but modifying the database fails, the database still holds the old value and the cache is simply empty, so the data is not inconsistent: the read finds nothing in the cache, loads the old value from the database, and writes it back to the cache.
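As a rough illustration, the sketch below shows this write and read flow in Java, assuming a hypothetical Cache facade in front of Redis and an InventoryDao in front of the relational database; the interface names and the "stock:<id>" key format are made up for this example, not taken from any particular codebase.

```java
// Minimal sketch of "delete cache first, then update the database",
// plus the read path that repopulates the cache on a miss.
public class InventoryService {

    // Hypothetical facade over Redis.
    interface Cache {
        String get(String key);
        void set(String key, String value);
        void delete(String key);
    }

    // Hypothetical data access object over the relational database.
    interface InventoryDao {
        long selectStock(long productId);
        void updateStock(long productId, long stock);
    }

    private final Cache cache;
    private final InventoryDao dao;

    public InventoryService(Cache cache, InventoryDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    // Write path: delete the cache first, then modify the database.
    // If the database update fails, the cache is simply empty and the next
    // read reloads the old (still correct) value from the database.
    public void updateStock(long productId, long newStock) {
        cache.delete("stock:" + productId);
        dao.updateStock(productId, newStock);
    }

    // Read path: on a cache miss, load from the database and repopulate the cache.
    public long getStock(long productId) {
        String key = "stock:" + productId;
        String cached = cache.get(key);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long stock = dao.selectStock(productId);
        cache.set(key, Long.toString(stock));
        return stock;
    }
}
```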

2. Analysis of complex data inconsistency problems

When the inventory data changes, we delete the cache first and then modify the database.

Imagine that the database modification has not yet completed when a read request suddenly arrives: it reads the cache, finds the cache empty, queries the database, gets the old value from before the modification, and puts it into the cache.

After the data change operation completes, the database inventory holds the new value, but the cache once again holds the old data. Haven't the cache and the database become inconsistent again?

3. Why does this problem show up under high-concurrency traffic of hundreds of millions of requests?

The above problem can only occur if the data is read and written concurrently.

In fact, if the concurrency is very low, especially if the read concurrency is very low, say 10,000 visits per day, then the inconsistent scenario just described will very rarely occur.

But the problem is that under high concurrency things get messy. If daily traffic is in the hundreds of millions, with tens of thousands of concurrent reads per second, then as long as there are data update requests every second, the database + cache inconsistency described above can occur.

How to solve it?

4. Asynchronously serialize the update and read operations

Here’s a solution.

Isn't the problem simply that a read queries the database and gets the old data before the database update finishes? Isn't it because the read ran ahead of the update? Then let's make them queue up.

4.1 Asynchronous serialization

We maintain n in-memory queues inside the JVM. When data is updated, the operation is routed to one of the queues based on the data's unique identifier, so requests for the same data always go to the same queue. When data is read and found missing from the cache, the operation that re-reads the data from the database and updates the cache is routed to the same queue, again based on the unique identifier. Each queue has a dedicated worker thread that takes the operations off the queue and executes them one by one, in order.

With this in place, consider a data change: the cache is deleted first, and the database update has not yet finished. If a read request arrives at this moment and finds the cache empty, it can send a cache-refresh request into the queue, where it is queued up behind the still-pending database update. The read request then waits synchronously until the cache refresh completes, and then reads the value from the cache.
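A minimal sketch of the queue-and-worker structure described above might look like the following; the queue count, the queue capacity, and the use of plain Runnable operations are illustrative assumptions, not a fixed design.

```java
// One worker thread per in-memory queue; operations on the same product id
// are hashed to the same queue and therefore executed strictly in order.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SerializedExecutor {

    private final BlockingQueue<Runnable>[] queues;

    @SuppressWarnings("unchecked")
    public SerializedExecutor(int queueCount) {
        queues = new BlockingQueue[queueCount];
        for (int i = 0; i < queueCount; i++) {
            queues[i] = new ArrayBlockingQueue<>(1024);
            final BlockingQueue<Runnable> queue = queues[i];
            // Worker: take operations one by one and run them, so
            // "delete cache + update DB" and "read DB + refresh cache"
            // for the same product can never interleave.
            Thread worker = new Thread(() -> {
                while (true) {
                    try {
                        queue.take().run();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Route by the unique identifier of the data, so all operations on the
    // same product land in the same queue.
    public void submit(long productId, Runnable operation) {
        int index = (int) Math.abs(productId % queues.length);
        queues[index].offer(operation);
    }
}
```

With this structure, the write path submits "delete cache + update database" as one operation, and the read path submits "reload from database + refresh cache" for the same product id, and the two can never interleave.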

4.2 Deduplicating Read Operations

It is pointless to string multiple "reload from database and refresh cache" requests for the same data into the same queue, so they can be filtered: if the queue is found to already contain a cache-refresh request for that data, there is no need to enqueue another one; the later read simply waits for the pending one. Once the worker thread for that queue finishes the preceding operation (the database update), it executes the next one (reload from the database and refresh the cache), which reads the latest value from the database and writes it into the cache.

While waiting, if a read request polls and finds within its time limit that the value can now be fetched from the cache, it returns that value directly. If a read request waits longer than a certain amount of time, it reads the current (old) value directly from the database and returns it. Returning the old value does make the cache and the database inconsistent again, but it at least reduces how often that happens: it is not every time, only in rare cases. If the wait times out, just read the old value and return it.
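Putting the deduplication and the bounded wait together, a minimal sketch of the read path might look like this; the 200ms deadline, the 10ms polling interval, and the helper types reused from the earlier sketches (SerializedExecutor, Cache, InventoryDao) are assumptions made for illustration.

```java
// Read path: deduplicate cache-refresh operations per product, then poll the
// cache for a bounded time before falling back to the database old value.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ReadWithWait {

    private final Set<Long> pendingRefresh = ConcurrentHashMap.newKeySet();
    private final SerializedExecutor executor;      // from the previous sketch
    private final InventoryService.Cache cache;     // hypothetical cache facade
    private final InventoryService.InventoryDao dao;

    public ReadWithWait(SerializedExecutor executor,
                        InventoryService.Cache cache,
                        InventoryService.InventoryDao dao) {
        this.executor = executor;
        this.cache = cache;
        this.dao = dao;
    }

    public long getStock(long productId) throws InterruptedException {
        String key = "stock:" + productId;
        String cached = cache.get(key);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        // Deduplication: only the first read that sees the miss enqueues the
        // "read DB + refresh cache" operation; later reads just wait for it.
        if (pendingRefresh.add(productId)) {
            executor.submit(productId, () -> {
                long stock = dao.selectStock(productId);
                cache.set(key, Long.toString(stock));
                pendingRefresh.remove(productId);
            });
        }
        // Poll the cache for up to ~200ms; if the refresh has not completed
        // in time, fall back to reading the (possibly old) value from the DB.
        long deadline = System.currentTimeMillis() + 200;
        while (System.currentTimeMillis() < deadline) {
            cached = cache.get(key);
            if (cached != null) {
                return Long.parseLong(cached);
            }
            Thread.sleep(10);
        }
        return dao.selectStock(productId);
    }
}
```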

5. Issues to watch out for in high-concurrency scenarios

In high concurrency scenarios, there are some issues with this solution that need special attention.

5.1 Read Requests Are Blocked for a Long Time

Because read requests are only very lightly asynchronous here, pay close attention to the read timeout: every read request must return within that time window.

In this solution, the biggest risk is that when data is updated frequently, a large number of update operations pile up in the queues, many read requests then time out, and a large number of requests go straight to the database to fetch the old value. So be sure to run some realistic tests to see what happens when data is updated frequently.

On the other hand, because update operations for multiple data items may pile up in the same queue, you need to test against your own business profile to decide how many memory queues to create per instance, and you may need to deploy multiple service instances, each handling a share of the data update operations.

If 100 inventory modification operations are backlogged in a memory queue and each modification takes 10ms to complete, then the read request for the last item may only get its data after 10 * 100 = 1000ms = 1s.

This results in a long block of read requests.

Be sure to run stress tests and simulate the production environment based on the actual performance of your business system, to see how many update operations a memory queue may accumulate at peak hours and how long the read request behind the last update may hang. If read requests must return within 200ms, and your calculation shows that even at peak time a queue backlogs at most 10 updates that finish within 200ms, that is fine.

If a memory queue is likely to accumulate a particularly large backlog of updates, add more machines, so that the service instances deployed on each machine handle less data and each memory queue has a smaller backlog of update operations.

Tips: In our experience from previous projects, data is generally written very infrequently, so the backlog of updates in the queues should in fact be very small. For projects built around high read concurrency and a read-through cache architecture, write requests are very few relative to reads; a few hundred write QPS is already a lot. Take 500 writes per second: you can split that into 5 slices, i.e. 100 writes every 200ms. On a single machine with 20 memory queues, each queue may backlog about 5 write operations per slice. If performance tests show each write operation completes in roughly 20ms, then the read request on any queue hangs only briefly and will certainly return within 200ms. If write QPS grows tenfold, then, given that the calculation above shows a single machine can handle a few hundred write QPS, scale the machines out tenfold: 10 machines, 20 queues each, 200 queues in total. Most of the time, a large number of read requests come in and get the data straight from the cache. In rare cases a read collides with a data update; as described above, if the update is queued first, the many read requests for that data that follow are collapsed by the deduplication into a single cache refresh. Once the data update completes, the cache refresh triggered by the read is executed, and all the briefly waiting read requests can then read the data from the cache.

5.2 The Number of Concurrent Read Requests Is Too High

There is another risk: a sudden flood of read requests may all hang on the service with a delay of tens of milliseconds at the same time. You need to check whether the service can withstand that, and how many machines are needed to absorb the peak of the worst case.

However, not all data is updated at the same moment, and the cache does not expire all at once. Each time, only the caches of a small number of items are likely to be invalidated, and only the read requests for those items then arrive at the service, so the concurrency should not be very large.

Tips: If the write-to-read ratio is 1:99 and there are 50,000 read QPS, there may be only about 500 update operations per second. With 500 write QPS, you need to estimate how many cache-refreshing read requests those 500 affected items may trigger: when the caches of those 500 items are invalidated, their read requests go to the inventory service to refresh the cache. If roughly 1,000 read requests per second target those 500 just-updated items, then 1,000 requests hang on the inventory service; if each request is required to return within 200ms, at most about 200 read requests are hung at the same time, even in the worst case. But with a 1:20 ratio, the 500 data updates per second would correspond to 20 * 500 = 10,000 read requests per second all hanging on the inventory service, which would kill it.

5.3 Request Routing for Multi-Service Instance Deployment

Multiple instances of the inventory service may be deployed, so you must ensure that the requests performing data updates and the requests performing cache refreshes for the same item are routed to the same machine. You can do hash routing between services based on a request parameter for the item, or use Nginx's hash routing capability to route requests for the same item to the same service instance.
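For illustration, a minimal client-side hash routing sketch could look like the one below; the list of instance addresses is a hypothetical placeholder for whatever registry or configuration is actually used. The same effect can also be achieved at the Nginx layer by hashing on an item identifier in the request.

```java
// Hash routing: the same product id always maps to the same service instance,
// so that instance's JVM-local queues see all reads and writes for that item.
import java.util.List;

public class InstanceRouter {

    private final List<String> instances; // e.g. "10.0.0.1:8080", "10.0.0.2:8080"

    public InstanceRouter(List<String> instances) {
        this.instances = instances;
    }

    // Deterministic mapping from product id to instance address.
    public String route(long productId) {
        int index = (Long.hashCode(productId) & Integer.MAX_VALUE) % instances.size();
        return instances.get(index);
    }
}
```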

5.4 Routing of Hot Items Causes Request Skew

If the read and write requests for one item are extremely high and all go into the same queue on the same machine, they may put too much pressure on that machine.

However, since the cache is only cleared when the item's data is updated, and concurrent reads and writes matter only at that moment, this problem is not particularly significant as long as the update frequency is not too high.

But it is possible that the load on some machines will be higher.

III. Summary

In general, if your system does not strictly require the cache and the database to be consistent, and can tolerate occasional slight inconsistency between them, it is best not to use this serialization scheme. Serializing read and write requests into an in-memory queue does guarantee that no inconsistency appears, but it also reduces the system's throughput significantly: you would need several times more machines than normal to support the same production traffic.

Also, this is not to say that whoever lectures or writes articles is a superman who can do everything. Just like writing a book, mistakes can slip in, and some solutions only fit certain scenarios; in other cases you may need to optimize and adjust the solution to fit your own project.

If you have any questions or opinions about these solutions, feel free to get in touch. If you really think something explained here is wrong, or some aspect has not been considered, we can discuss it together.

Welcome to follow the public account to learn and exchange ideas together~~