Hello, I’m Leo

Following part five of this Redis technology series, we continue talking about Redis-related topics!

In the last post we covered:

  1. Buffer overflow tuning schemes.
  2. Cache types, and the synchronous direct-write and asynchronous write-back policies.
  3. Cache eviction strategies, and the implementation of the LRU and LFU algorithms.
  4. The criteria for dirty versus clean cache data.

This article mainly covers cache/database inconsistency, cache pollution, how Redis copes with heavy concurrency, and how Redis implements distributed locking.

Recommended reading

What is MySQL

What is Redis?


Cache and data consistency issues

Cache and database consistency is one of the most frequent Redis interview topics. Today we will sort it out and summarize it.

Redis caches are of two types: read/write cache and read-only cache. Different types produce different problems and different solutions, as we will see below.

Read and write cache

Let’s start by explaining what a read/write cache is. A read/write cache acts like a master library: it provides both write and read services.

When data is modified while a read/write cache is in use, we need to consider data consistency not only in the database but also in the cache. As we mentioned earlier, Redis itself is a storage medium for write requests, so a write-back policy must be chosen when it is configured:

  • The synchronous direct-write policy ensures data consistency between the cache and the database.
  • The asynchronous write-back policy cannot guarantee data consistency, because the commands are executed asynchronously.

For a read/write cache, the synchronous direct-write strategy should be adopted to ensure data consistency. Since the two writes are executed synchronously, the program must not forget a transaction mechanism so that the database update and the cache update succeed or fail together.

Synchronous direct write can be used for data with strict consistency requirements, while the asynchronous write-back strategy can be used for less demanding attributes (creation time, place of origin, hometown, and so on).
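Here is a minimal Python sketch of the synchronous direct-write idea, assuming a redis-py client `r` and a hypothetical `update_db` helper; the atomicity is only simulated by propagating the cache failure so the surrounding transaction or compensation logic can react, not a real distributed transaction.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def update_db(key, value):
    # Hypothetical database write, e.g. an UPDATE executed inside a SQL transaction.
    ...

def write_through(key, value, ttl=3600):
    """Synchronous direct write: update the database and the cache together."""
    update_db(key, value)              # 1. write the source of truth first
    try:
        r.set(key, value, ex=ttl)      # 2. write the cache synchronously
    except redis.RedisError:
        # If the cache write fails, the caller's transaction / compensation logic
        # must undo or retry the database write so the two stores do not diverge.
        raise
```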

Read-only cache

After the read/write cache, I’m sure many of you are already familiar with read-only caches, so let me just summarize briefly. A read-only cache is like a slave library: it only serves reads and does not accept writes from users.

With a read-only cache, new data is written directly into the database, bypassing Redis. The next time a user asks for that data, the cache is checked first and comes up empty (a cache miss), so the request falls through to the database; the value read from the database is then inserted into the cache. After that, we can read it straight from the cache and no longer have to query the database.

It is well known that Redis tolerates far higher concurrency than MySQL, which is exactly why we cache the data in Redis. A minimal sketch of this read path follows.
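This is the cache-aside pattern; the sketch below assumes a redis-py client `r` and a hypothetical `query_db` helper.

```python
import redis

r = redis.Redis()

def query_db(key):
    # Hypothetical database read, e.g. a SELECT by primary key.
    ...

def read_with_cache(key, ttl=600):
    """Read-only cache: check Redis first, fall back to the database on a miss."""
    value = r.get(key)
    if value is not None:
        return value                   # cache hit: no database query needed
    value = query_db(key)              # cache miss: go to the database
    if value is not None:
        r.set(key, value, ex=ttl)      # populate the cache for the next read
    return value
```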

For a delete or modify operation, we have to update the cache as well as the database. If the cache is not deleted, a user who has already been removed from the database, for example, can still log in and be served from the cache, which is a data inconsistency. Serious cases may even produce system-level errors (for example, a join query failing because the associated row no longer exists)!

So the Redis and MySQL operations have to be atomic. Otherwise:

  • Delete the cache first, update the database later: if the cache is deleted successfully but the database update fails, the next read misses the cache, falls through to the database, and reads the old value (which is then written back into the cache).
  • Update the database first, delete the cache later: if the database update succeeds but the cache deletion fails, the database holds the latest value while the key still exists in Redis, so reads hit the cache and return the old value directly.

How to solve

1. Retry mechanism

We can borrow the idea of a message queue to improve this. If one of the two operations succeeds and the other fails, the failed operation is written to a message queue for a second round of consumption. Once that retry succeeds, the message is removed from the queue so the operation is not repeated.

This ensures database and cache consistency!
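As a rough illustration of the retry idea, here is a Python sketch that uses an in-process queue.Queue as a stand-in for a real message queue (in production this would be something like Kafka or RabbitMQ); `r` is a redis-py client and `update_db` is a hypothetical database helper.

```python
import queue
import redis

r = redis.Redis()
retry_queue = queue.Queue()   # stand-in for a real message queue

def update_db(key, value):
    # Hypothetical database update.
    ...

def update_with_retry(key, value):
    update_db(key, value)             # update the database first
    try:
        r.delete(key)                 # then invalidate the cache
    except redis.RedisError:
        retry_queue.put(key)          # deletion failed: enqueue it for a retry

def retry_consumer():
    """Consume failed deletions; a message is dropped only once the retry succeeds."""
    while not retry_queue.empty():
        key = retry_queue.get()
        try:
            r.delete(key)             # second consumption
        except redis.RedisError:
            retry_queue.put(key)      # still failing: keep it in the queue
        finally:
            retry_queue.task_done()
```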

This does solve the problem of a failed step, but if the data is accessed concurrently, stale values can still be read in the window between the two operations. Let’s analyze it again.

  • Delete the cache first and update the database later

Suppose the cache has been deleted but the database has not yet been updated, and a user request comes in. It misses the cache, falls through to the database, and fetches the old value, which is then written back into Redis.

Only then does the database update run…

Now the cache holds the old value while the database holds the new one…

These two are not the same!

The usual fix is to sleep for a short period and then delete the cache again. The sleep duration is estimated from the time a reading thread needs to fetch the data and write it back into the cache.

  • Update the database first, then delete the cache

If a read arrives in the window after the database value has been changed but before the cache entry is deleted, the user reads the old value straight from the cache and it is returned directly.

Straight for the jugular!

If data inconsistency arises because deleting the cached value or updating the database fails, a retry mechanism can be used to make sure the deletion or update eventually succeeds.

If other threads read concurrently in between the two steps (deleting the cached value and updating the database), they can read old values. The solution there is delayed double deletion, sketched below.
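Here is a minimal Python sketch of delayed double deletion, assuming a redis-py client `r` and a hypothetical `update_db` helper; the sleep length is just the estimate described above (read plus cache-refill time), and in practice the second deletion is often done asynchronously rather than blocking the writer.

```python
import time
import redis

r = redis.Redis()

def update_db(key, value):
    # Hypothetical database update.
    ...

def update_with_double_delete(key, value, sleep_seconds=0.5):
    """Delayed double deletion: delete, update, wait, delete again."""
    r.delete(key)                # 1. first deletion
    update_db(key, value)        # 2. update the database
    time.sleep(sleep_seconds)    # 3. wait roughly one read + cache-refill cycle
    r.delete(key)                # 4. second deletion clears any stale value a
                                 #    concurrent reader wrote back in the meantime
```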

Cache avalanche, breakdown, penetration

Cache avalanche

An avalanche is easy to picture: a huge mass of snow comes rushing down, the wall can’t withstand the onslaught, and it crashes straight through into your bedroom!

There are several causes of cache avalanches:

  • The flood is simply too big: a system design problem, e.g. there is clearly 1,000,000 requests’ worth of traffic but the system was only designed for 100,000.
  • A flood out of nowhere: a large batch of cache keys happens to expire at the same time because the expiration times were designed badly.
  • The bricklayer botched the wall: the Redis instance itself crashes.

The usual fix for the expiration problem is to avoid writing a large batch of keys with the same expiration time: append a small random offset to each TTL so the expirations spread out over a wider window, as in the sketch below.
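A minimal sketch of TTL jitter with a redis-py client `r`; the base TTL and jitter range are arbitrary example values.

```python
import random
import redis

r = redis.Redis()

BASE_TTL = 3600   # one-hour base expiration (example value)
JITTER = 300      # up to five extra minutes, chosen at random

def cache_with_jitter(key, value):
    """Spread expirations out so a batch of keys does not all expire at the same moment."""
    ttl = BASE_TTL + random.randint(0, JITTER)
    r.set(key, value, ex=ttl)
```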

Cache avalanches can also be addressed by service degradation. For example, during the COVID-19 pandemic last year, everyone passing a checkpoint was screened: anyone with a cough or fever was not allowed through, and those without symptoms but lacking a physical examination report were sent home to self-isolate.

The same is true in Redis: requests for core data are let through, while requests for secondary attributes can be answered directly with predefined fallback data or a “network fluctuation” style error. This filters out some of the requests for secondary attributes.

In other cases, the wall collapses before the flood even arrives. That is just asking for it; you are practically inviting the flood in.

To keep the system running normally, we should take the following measures in advance:

  1. Implement a service circuit breaker.
  2. Implement a request rate-limiting mechanism.
  3. Deploy a high-availability cluster.

To prevent a chain reaction, we can temporarily shut down the points service for our customers and restart it once the repair succeeds. This keeps other services from being affected.

While the business system is running, we can monitor load metrics on both the Redis cache host and the database host, such as requests per second, CPU utilization, and memory utilization. If we find that the Redis cache instance is down and the load on the database host suddenly spikes (for example, requests per second shoot up), a cache avalanche is under way and a large number of requests are being sent straight to the database. The service circuit breaker can then be enabled to suspend the business application’s access to the cache service, which in turn reduces the pressure on the database.

Request limiting is like the morning traffic police out on the roads: the incoming flow is what it is, and if the system is to keep operating properly we have to slow down the rate at which it is let in.

Say the system receives 10,000 requests per second but we limit it to 1,000 requests per second; anything above that is turned away or made to wait in line.
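One common way to implement this with Redis itself is a fixed-window counter, sketched here with a redis-py client `r`; the limit and window size are just example values.

```python
import time
import redis

r = redis.Redis()

LIMIT = 1000   # allowed requests per second (the 1,000 from the example above)

def allow_request(resource: str) -> bool:
    """Fixed-window counter: at most LIMIT requests per resource per one-second window."""
    window = int(time.time())                  # current one-second window
    key = f"ratelimit:{resource}:{window}"
    count = r.incr(key)                        # atomic increment
    if count == 1:
        r.expire(key, 2)                       # let the window key clean itself up
    return count <= LIMIT
```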

A high-availability cluster is also a pre-emptive measure. It’s like Singles’ Day or the Spring Festival travel rush: the traffic is enormous, so we provision the machines before the event. Once the monitoring dashboard detects a surge of incoming traffic, we bring the standby machines online to meet the concurrency requirements. This also helps keep hardware costs under control.

Cache breakdown

Cache breakdown is mainly about hot data expiring. Imagine the tragedy if the cache entry for a top-selling product expires in the middle of Singles’ Day: suddenly every request for it hits the database. The root cause is a badly designed expiration time for hot data.

We usually preheat this kind of data in advance. For example, we can preload the top-100 items into the cache 30 minutes before the event, so the flash sale does not send a flood of requests to the database in a short window. A rough sketch follows.
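A rough preheating sketch with a redis-py client `r`; `fetch_top_items` is a hypothetical helper that returns the hottest items (for example, the top 100 ranked by expected traffic), and the TTL should be long enough to cover the peak.

```python
import redis

r = redis.Redis()

def preheat_top_items(fetch_top_items, ttl=3600):
    """Load the hottest items into the cache before the event starts."""
    for key, value in fetch_top_items(100):   # hypothetical: returns [(key, value), ...]
        r.set(key, value, ex=ttl)             # TTL long enough to outlast the peak
```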

Cache penetration

Cache penetration, as the name suggests, is like a sword thrust: all the force is concentrated on one point and punches straight through.

Back in our systems, the same thing happens when attackers deliberately request data that is neither in the cache nor in the database: a large number of such requests all fall through to the database, which is cache penetration. And because the database returns nothing, nothing is written into the cache, so the next identical request misses the cache all over again.

One solution is to cache a key for the query regardless of whether the database actually has the data, writing a null placeholder as the key’s value.
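A minimal sketch of null-value caching, assuming a redis-py client `r` and a hypothetical `query_db` helper; the sentinel value and the short TTL for cached misses are arbitrary choices.

```python
import redis

r = redis.Redis()

NULL_PLACEHOLDER = b"__NULL__"   # sentinel meaning "the database has no such row"

def query_db(key):
    # Hypothetical database read.
    ...

def read_with_null_cache(key, ttl=600, null_ttl=60):
    value = r.get(key)
    if value == NULL_PLACEHOLDER:
        return None                                   # known miss: skip the database
    if value is not None:
        return value                                  # normal cache hit
    value = query_db(key)                             # cache miss: go to the database
    if value is None:
        r.set(key, NULL_PLACEHOLDER, ex=null_ttl)     # cache the miss with a short TTL
        return None
    r.set(key, value, ex=ttl)
    return value
```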

You can also use a Bloom filter with Redis to solve this problem. Now let’s talk about what a Bloom filter is.

Bloom filter

A Bloom filter consists of a bit array whose bits are initialized to 0 and N hash functions, and it can quickly determine whether a piece of data exists. When we want to mark that some data exists (for example, that it has been written to the database), the Bloom filter does three things:

  • First, the N hash functions are applied to the data, producing N hash values.
  • Then each of these N hash values is taken modulo the length of the bit array to find its position in the array.
  • Finally, the bits at those positions are set to 1, which completes marking the data in the Bloom filter.

If the data does not exist (for example, it has never been written to the database) and we have never marked it in the Bloom filter, then the corresponding bits in the bit array remain 0.

When we need to query some data, we perform the same calculation to obtain its N positions in the bit array and then check those N bits. If any of them is not 1, the Bloom filter has never marked the data, so the queried data is definitely not stored in the database.
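To make the three steps concrete, here is a toy pure-Python Bloom filter; the sizes and the salted-SHA-256 hashing are arbitrary illustrative choices, and in production you would more likely use an existing implementation such as the RedisBloom module.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: an m-bit array plus n salted hash functions."""

    def __init__(self, m=1024, n=3):
        self.m = m
        self.n = n
        self.bits = bytearray(m // 8 + 1)   # bit array, initially all 0

    def _positions(self, data: str):
        # n hash values, each taken modulo the bit-array length
        for i in range(self.n):
            digest = hashlib.sha256(f"{i}:{data}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, data: str):
        for pos in self._positions(data):
            self.bits[pos // 8] |= 1 << (pos % 8)        # set the bit to 1

    def might_contain(self, data: str) -> bool:
        # if any bit is still 0, the data was definitely never added
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(data))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:43"))   # almost certainly False: Bloom filters can give
                                     # false positives but never false negatives
```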

Cache pollution

What is cache pollution?

In some scenarios, certain data is accessed very infrequently, or even only once. If that data stays in the cache after its request has been served, it only takes up cache space. That is cache pollution.

How to solve it?

I had planned to walk through all eight eviction strategies again, but then I looked at the clock. Oh dear.

In the past few articles we’ve already covered the eight strategies in detail, so I’ll only run through them briefly here.

  • The noeviction policy does not evict any data at all, so it cannot be used against cache pollution.
  • volatile-random and allkeys-random both evict at random. They can evict data, but they are not selective, and if hot data happens to be evicted they have the opposite effect.
  • volatile-ttl evicts based on expiration time. Like random eviction, it can be used against cache pollution but is not very effective and can be counterproductive.

The remaining four policies are volatile-lru, volatile-lfu, allkeys-lru, and allkeys-lfu. We can really view them as two strategies, one scoped to keys with an expiration set and one applied to all keys. To understand them we need the LRU and LFU algorithms.

LRU

Without rambling on (the details are in part five of this series), the LRU algorithm works well, but its one weakness is that it only looks at access time. Because of that, the LRU strategy cannot solve the cache pollution caused by scan-style single queries. A scan-style single query is one where an application reads a large amount of data in one pass and each piece of data is read exactly once. Since that data has just been accessed, its LRU field value is large, i.e. it looks recent.

LRU will therefore keep this data in the cache for a long time, polluting it. When genuinely new data needs to be cached, this stale scan data has to be evicted first, which hurts cache performance.

LFU

The LFU algorithm was born as an improvement on LRU.

The LFU cache policy builds on LRU by adding a counter to each piece of data that records how many times it has been accessed. When data is filtered out under the LFU policy, the data with the lowest access count is evicted from the cache. If two pieces of data have the same access count, LFU then compares how recently each was accessed and evicts the one that has gone longer without being accessed.

Compared with frequently accessed data, data read once by a scan query never accumulates further accesses, so the LFU policy preferentially evicts this low-frequency data from the cache. In this way, the LFU policy keeps it from polluting the cache.

To avoid the overhead of maintaining linked lists, Redis uses two approximations:

  1. Every redisObject that holds a value carries an lru field inside the structure, which records the data’s access timestamp.
  2. Eviction uses random sampling: a certain number of keys are sampled into a candidate set, and eviction is then decided by comparing the lru field values within that set.

When Redis implements the LFU strategy, it simply splits the original 24-bit lru field into two parts:

  • ldt value: the high 16 bits of the lru field, recording the data’s access timestamp.
  • counter value: the low 8 bits of the lru field, recording how many times the data has been accessed.

LFU uses an interesting design here: an 8-bit field stores the access count, and 8 bits can only count up to 255. What does Redis do when the real number of accesses exceeds 255?

When Redis is full, the LFU policy is used to evict data, and if two counters are both saturated at 255 only the timestamps can be compared; yet 255 is nowhere near enough for the number of access requests Redis actually receives. So in its LFU implementation Redis does not simply increment the counter on every access; it uses a more optimized counting rule.

Each time the data is accessed, the current counter value is first multiplied by the configuration item lfu_log_factor and incremented by 1, and the reciprocal of that is taken to get a value p. This p is then compared with a random value r in the range (0,1), and the counter is incremented only when p is greater than r.
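To make the rule concrete, here is a small Python sketch of that probabilistic increment (Redis’s real implementation is in C and also subtracts an initial counter value before applying the factor; this sketch keeps just the core idea described above).

```python
import random

LFU_LOG_FACTOR = 10   # corresponds to the lfu-log-factor configuration item

def lfu_log_incr(counter: int) -> int:
    """Probabilistic counter increment: the larger the counter, the rarer the bump."""
    if counter >= 255:
        return 255                               # the 8-bit counter saturates at 255
    p = 1.0 / (counter * LFU_LOG_FACTOR + 1)     # reciprocal of counter * factor + 1
    r = random.random()                          # random value in (0, 1)
    if p > r:
        counter += 1                             # increment only when p > r
    return counter
```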

With lfu_log_factor set to 1, the counter already reaches 255 at around 100K actual accesses, so data that is accessed even more often can no longer be distinguished.

With lfu_log_factor set to 10, the counter values for access counts on the order of 100, 1,000, and 100,000 are clearly distinguishable (for typical applications, 10 is a reasonable setting).

With lfu_log_factor set to 100, the counter only reaches 255 at around 10M actual accesses, so data with fewer than 10M actual accesses can still be distinguished by its counter value.

To wrap up

So that more or less sums it up:

  1. Cache and database inconsistency problems.
  2. The different cache types and how to handle each.
  3. Common Redis production issues: cache avalanche, breakdown, and penetration.
  4. From penetration to Bloom filters.
  5. Cache pollution and how to deal with it, plus a chat about the classic parts of LFU and LRU.

After studying all of this, I can only marvel at how brilliant the author of Redis is. These design ideas are well worth borrowing in the systems we build ourselves.

This article may not be as good as the previous ones. Our company’s director was tied up this week, so I carried the flag on my own and my study time was scattered across the evenings. I spent Saturday building new features for the e-commerce system, then spent the evening piecing this week’s knowledge fragments together, staying up until 3 a.m. to finish it.

Every point here is my own distilled summary. If any part is hard to follow, please point it out right away, and we’ll make progress together!

Because I was honestly exhausted, part of the material draws on teacher Jiang Dejun’s course, Redis Design and Implementation, and Redis Deep Adventure. Respect to the originals!

Feel free to add me on WeChat to discuss backend topics together in the group! See you next time!

We make progress together every week