preface

When Redis is used as a cache, its purpose is to reduce the frequency of database access and relieve pressure on the database. But if some data does not exist in Redis, the request goes straight to the database. If a large number of cached entries expire at the same time, or a malicious attacker deliberately requests keys that were never cached, database load can spike suddenly. How can this be prevented?

Cache avalanche

A cache avalanche occurs when a large number of keys in Redis expire at the same time. If many requests arrive at that moment, they all hit the database directly, which may overwhelm it.

A cache avalanche generally involves data that exists in the database but is no longer in the cache: requests are redirected to the database because the cached entries have expired.

The solution

There are many ways to mitigate a cache avalanche; the following are commonly used:

  • Lock so that only a single thread rebuilds a cache entry. This prevents many requests from hitting the database at the same time.
  • Do not give all keys the same expiration time. Typically, when preheating the cache, store data with randomized expiry times so that a large number of entries do not expire at the same moment.
  • If memory allows, set the cache to never expire.
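The randomized-expiry bullet can be sketched in plain Java. The base TTL and jitter window below are illustrative values, not prescribed by the article:

```java
import java.util.concurrent.ThreadLocalRandom;

public class TtlJitter {
    // Base TTL of 30 minutes plus up to 10 minutes of random jitter,
    // so preheated keys do not all expire at the same instant.
    static final int BASE_TTL_SECONDS = 30 * 60;
    static final int MAX_JITTER_SECONDS = 10 * 60;

    static int randomTtlSeconds() {
        return BASE_TTL_SECONDS + ThreadLocalRandom.current().nextInt(MAX_JITTER_SECONDS);
    }

    public static void main(String[] args) {
        // Each key gets a slightly different TTL, spreading out expirations.
        for (int i = 0; i < 5; i++) {
            System.out.println("key" + i + " expires in " + randomTtlSeconds() + "s");
        }
    }
}
```

In a real system the returned value would be passed as the expire argument when writing the key to Redis.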

Cache breakdown

A cache breakdown is similar to a cache avalanche, except that a breakdown is typically the expiration of a single hot key: a large number of concurrent requests need that key at the same moment, putting sudden strain on the database.

The solution

The solution to cache breakdown is similar to the solution to cache avalanche:

  • Lock so that only a single thread rebuilds the entry. The first request loads the value from the database and writes it back to the cache; subsequent requests read directly from the cache.
  • If memory allows, you can set the cache to never expire.
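The single-thread-rebuild idea can be sketched with ConcurrentHashMap.computeIfAbsent, whose contract guarantees the mapping function runs at most once per key even under concurrent access. This is a simplified sketch (no TTL, and loadFromDatabase is a stand-in for a real query):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SingleFlightCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Simulated slow database lookup; in a real system this would be a SQL query.
    private String loadFromDatabase(String key) {
        return "value-for-" + key;
    }

    public String get(String key) {
        // computeIfAbsent runs the loader at most once per key, even under
        // concurrency: later threads block briefly and reuse the cached result.
        return cache.computeIfAbsent(key, this::loadFromDatabase);
    }

    public static void main(String[] args) {
        SingleFlightCache c = new SingleFlightCache();
        System.out.println(c.get("hot-key"));
    }
}
```

With Redis in the picture, the loader would also write the value back to Redis so other application instances can read it from the cache.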

Cache penetration

The essential difference between cache penetration and the two phenomena above is that the requested data exists neither in Redis nor in the database. Under high concurrency, such requests keep reaching the database, putting great pressure on it.

The solution

Locking does not work well for cache penetration: since the key does not exist at all, requests keep reaching the database even if the number of concurrent threads is controlled.

Generally, the following solutions can be used to solve the cache penetration problem:

  • Validate keys at the interface layer and return immediately for obviously invalid ones. For example, if the database uses an auto-increment id, a non-integer or negative id can be rejected directly; if keys are 32-character uuids, any id whose length is not 32 can also be rejected directly.
  • Cache nonexistent data as well: store an empty value (or another agreed-upon invalid value) for the missing key. With this scheme it is best to give such keys a short expiration time; otherwise a large number of nonexistent keys stored in Redis will waste considerable memory.
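The interface-layer checks can be sketched as a small validator. The two rules (positive integer id, 32-character uuid) follow the examples in the text and would need to be adapted to the real key format:

```java
public class KeyValidator {
    // Reject obviously invalid ids before they ever reach the cache or database.
    static boolean isValidNumericId(String id) {
        if (id == null || id.isEmpty()) return false;
        for (char c : id.toCharArray()) {
            // Any non-digit character means the id is non-integer or negative.
            if (c < '0' || c > '9') return false;
        }
        return true;
    }

    static boolean isValidUuidKey(String id) {
        // A 32-character uuid key must have exactly length 32.
        return id != null && id.length() == 32;
    }

    public static void main(String[] args) {
        System.out.println(isValidNumericId("12345")); // true
        System.out.println(isValidNumericId("-1"));    // false
    }
}
```

Requests failing these checks can be answered immediately with an error or empty result, without touching Redis or the database.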

Bloom Filter

For the cache penetration solution above, let’s think about it: if a key can bypass the verification of the first method, and a large number of non-existent keys are accessed (say 100 million or 1 billion), then it is not practical to store all of them in memory.

So is there a better solution? That is the subject of this section: Bloom filters, which can store as much data as possible in as little space as possible.

What is a Bloom filter

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a long binary vector (a bitmap) plus a series of random mapping functions (hash functions).

A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false-positive rate and the difficulty of deletion.

Bitmap

In Redis there is a data structure called a bitmap, and the core of a Bloom filter implementation is exactly such a bit array. Each position in the array holds only 0 or 1 and occupies a single bit, where 0 means no element maps there and 1 means at least one does. A simple Bloom filter hashes a key and takes the result modulo the array size to decide which position the key falls on.
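The bit-array mechanics can be sketched with Java's BitSet. This is a toy illustration, not Guava's or Redis's implementation; the two hash functions and the array size are arbitrary choices for the example:

```java
import java.util.BitSet;

public class TinyBloom {
    private static final int SIZE = 1 << 16;
    private final BitSet bits = new BitSet(SIZE);

    // Two cheap illustrative hash functions mapping a key to bit positions.
    // The mask keeps the hash non-negative before the modulo.
    private int h1(String key) { return (key.hashCode() & 0x7fffffff) % SIZE; }
    private int h2(String key) { return ((key + "#salt").hashCode() & 0x7fffffff) % SIZE; }

    public void put(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    public boolean mightContain(String key) {
        // Every position must be 1; a single 0 proves the key was never added.
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    public static void main(String[] args) {
        TinyBloom f = new TinyBloom();
        f.put("Lonely");
        System.out.println(f.mightContain("Lonely")); // true
        // A key that was never added is usually reported absent,
        // but a hash collision could make mightContain return true.
    }
}
```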

Hash collision

In the example above, the keys Lonely and Wolf land in the same position. When different keys produce the same value after hashing, we call it a hash collision; after the modulo operation, colliding keys end up in the same slot.

If too many hash collisions occur, the accuracy of the filter suffers. To reduce collisions, two factors are generally weighed together:

  • Increase the size of the bitmap (the larger the array, the more memory it consumes).
  • Increase the number of hash functions (the probability that two different keys agree on all positions under two or more hash functions is naturally lower than under a single one).

These two methods have to be balanced: a larger array consumes more space, while more hash computations consume CPU and increase lookup time. Exactly how large the array should be and how many hash functions to use must be calculated case by case.

Two main characteristics of the Bloom filter

Suppose the key Redis was never added to the filter, yet after its two hash functions run, both resulting positions are already 1, for example because one of them was set when Nosql was hashed through f1. This is how hash collisions arise and why a Bloom filter may misjudge.

From this phenomenon we can conclude that, seen from the filter's side, the Bloom filter has two main characteristics:

  1. If the Bloom filter determines that an element exists, it probably does.
  2. If the Bloom filter determines that an element does not exist, then it must not exist.

From the perspective of elements, two characteristics can also be obtained:

  1. If the element actually exists, the Bloom filter must determine that it does.
  2. If the element does not exist, the Bloom filter may determine that it does.

PS: Note that after the N hash functions run, an element is judged to exist only if all N resulting positions are 1. If even one of them is 0, the element definitely does not exist in the Bloom filter.

fpp

A Bloom filter always has a misjudgment rate, because hash collisions can never be avoided entirely. This rate is called the False Positive Probability, or FPP for short.

When using a Bloom filter in practice, you can choose your own FPP and then compute, from Bloom filter theory, how many hash functions and how much bit-array space you need. Note that the FPP cannot be set to 0, because there is no way to guarantee that hash collisions never occur.

Implementation of Bloom Filter (Guava)

The Guava package provides a Bloom filter implementation; the following uses Guava to demonstrate it:

  1. Add the pom dependency:
<dependency>
   <groupId>com.google.guava</groupId>
   <artifactId>guava</artifactId>
   <version>29.0-jre</version>
</dependency>
  2. Create a new test class GuavaBloomFilter:
package com.lonely.wolf.note.redis;

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class GuavaBloomFilter {
    private static final int expectedInsertions = 1000000;

    public static void main(String[] args) {
        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8),expectedInsertions);

        List<String> list = new ArrayList<>(expectedInsertions);

        for (int i = 0; i < expectedInsertions; i++) {
            String uuid = UUID.randomUUID().toString();
            bloomFilter.put(uuid);
            list.add(uuid);
        }

        int mightContainNum1 = 0;

        NumberFormat percentFormat = NumberFormat.getPercentInstance();
        percentFormat.setMaximumFractionDigits(2); // maximum number of decimal places

        for (int i = 0; i < 500; i++) {
            String key = list.get(i);
            if (bloomFilter.mightContain(key)) {
                mightContainNum1++;
            }
        }
        System.out.println("Number of key values that the Bloom filter considers to exist: " + mightContainNum1);
        System.out.println("---------------------------------- line ----------------------------------");

        int mightContainNum2 = 0;

        for (int i = 0; i < expectedInsertions; i++) {
            String key = UUID.randomUUID().toString();
            if (bloomFilter.mightContain(key)) {
                mightContainNum2++;
            }
        }

        System.out.println("Number of key values that the Bloom filter considers to exist: " + mightContainNum2);
        System.out.println("[Key does not exist] The misjudgment rate of the Bloom filter is: " + percentFormat.format((float) mightContainNum2 / expectedInsertions));
    }
}

The result after running is:

The mightContainNum1 printed in the first part is always equal to the loop count (500), a 100% match. This satisfies characteristic 1: if the element actually exists, the Bloom filter must judge that it exists. The misjudgment rate (FPP) printed in the second part hovers around 3%, and approaches 3% as the loop count grows. This satisfies characteristic 2: if the element does not exist, the Bloom filter may still judge that it exists.

Where does this 3% misjudgment rate come from? Stepping into the create method used above reveals that the default FPP is 0.03:

How much bit-array space and how many hash functions does this default 3% FPP require? BloomFilter has two methods that compute them:

  • optimalNumOfHashFunctions: computes the number of hash functions
  • optimalNumOfBits: computes the size of the bit array

Debugging shows:

The result is 7,298,440 bits ≈ 0.87 MB, with 5 hash functions. The space usage is tiny: 1 million keys occupy only about 0.87 MB.
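These figures follow from the standard Bloom filter sizing formulas, which are essentially what Guava's optimalNumOfBits and optimalNumOfHashFunctions compute: m = -n·ln(p)/(ln 2)² bits and k = round((m/n)·ln 2) hash functions. A quick check in plain Java reproduces the numbers:

```java
public class BloomMath {
    // Optimal bit-array size m for n insertions at false-positive probability p.
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions k for n insertions and m bits.
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long m = optimalNumOfBits(1_000_000, 0.03);
        int k = optimalNumOfHashFunctions(1_000_000, m);
        // For 1 million keys at 3% FPP: 7,298,440 bits (~0.87 MB) and 5 hashes.
        System.out.println(m + " bits, " + k + " hash functions");
    }
}
```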

PS: Online calculators are available for computing the bit-array size and the number of hash functions.

Deleting from a Bloom filter

A Bloom filter decides that an element exists by checking whether its positions are all 1. To delete an element, however, you cannot simply flip those 1s back to 0, because other elements may map to the same positions. So how can deletion be supported? The simplest approach is to add a counter: each position of the array stores not a single bit but a count of how many elements map there (0 if none). Storing a count such as 2 requires more than one bit, so a Bloom filter with counters takes up more space.
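Before turning to a library, the counter idea can be sketched in a few lines: each position stores a small integer count instead of a bit, so remove can decrement without destroying other keys' evidence. This toy version uses plain int counters and two illustrative hash functions (real implementations pack counters into a few bits each):

```java
public class CountingBloom {
    private final int[] counters = new int[1 << 16];

    // Two illustrative hash functions; the mask keeps values non-negative.
    private int h1(String key) { return (key.hashCode() & 0x7fffffff) % counters.length; }
    private int h2(String key) { return ((key + "#salt").hashCode() & 0x7fffffff) % counters.length; }

    public void add(String key) {
        counters[h1(key)]++;
        counters[h2(key)]++;
    }

    public void remove(String key) {
        // Decrement instead of clearing, so other keys sharing a slot survive.
        if (contains(key)) {
            counters[h1(key)]--;
            counters[h2(key)]--;
        }
    }

    public boolean contains(String key) {
        return counters[h1(key)] > 0 && counters[h2(key)] > 0;
    }

    public static void main(String[] args) {
        CountingBloom cbf = new CountingBloom();
        cbf.add("wangwu");
        System.out.println(cbf.contains("wangwu")); // true
        cbf.remove("wangwu");
        System.out.println(cbf.contains("wangwu")); // false
    }
}
```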

Bloom filter with counter

Here is an example of a Bloom filter with a counter:

  1. Import the dependency in the pom file:
<dependency>
    <groupId>com.baqend</groupId>
    <artifactId>bloom-filter</artifactId>
    <version>1.0.7</version>
</dependency>
  2. Create a new counting Bloom filter class CountingBloomFilter:
package com.lonelyWolf.redis.bloom;

import orestes.bloomfilter.FilterBuilder;

public class CountingBloomFilter {
    public static void main(String[] args) {
        orestes.bloomfilter.CountingBloomFilter<String> cbf = new FilterBuilder(10000, 0.01).countingBits(8).buildCountingBloomFilter();

        cbf.add("zhangsan");
        cbf.add("lisi");
        cbf.add("wangwu");
        System.out.println("Does wangwu exist? " + cbf.contains("wangwu")); // true
        cbf.remove("wangwu");
        System.out.println("Does wangwu exist? " + cbf.contains("wangwu")); // false
    }
}

The first two FilterBuilder parameters are the expected number of elements and the FPP; the countingBits parameter is the number of bits allocated to each counter. With countingBits(8), each position can count up to 255 duplicate insertions (the figure of 65535 corresponds to 16-bit counters).

conclusion

This article covered three Redis caching problems: cache avalanche, cache breakdown, and cache penetration, along with solutions for each. Finally it introduced the Bloom filter as a solution to cache penetration. A native Bloom filter does not support deletion, but introducing counters yields a counting Bloom filter that does, at the cost, as noted above, of extra space.