What you know about Bloom filters 😜

What is cache penetration?

If the Key is not in the cache, the database will be queried. If there are many such queries, this scenario is called cache penetration, cache breakdown, cache avalanche, etc. If you are interested, you can learn more about it.

Let’s take a look at this request flow chart first. Is there any problem? There are only USER:1, USER:2 and USER:3 keys in the cache, but the keys in the request are not in the cache, so these requests still need to query the database, so the cache is meaningless.

How to solve the existing problems?

In fact, we can save the existing data, in the user query first to query whether the existing data, if not in the results of the existing data we save, then directly returned, no need for query cache and database, if there is so may need to continue to check.

A data structure that holds data

Another question is where do we keep the data we already have? Is it in cache? So what’s the data structure? The collection? The string? (Let’s say we have 10 million pieces of data, but to make it easier, we just store ids.)

Let’s take a look at the set first, let’s assume that we have a Key and each element is only one byte, then ten million elements take up about 34M of space, for memory, such a large space is very precious, so we can not store data in the set for the time being.

So if we look at strings, this is a little bit harder to calculate, and it’s going to take a lot of storage space, because some developers think about concatenating ids into strings, and then ten million ids together, it’s not going to be very short, so we don’t think about that either.

We can solve this problem, but not optimally. Collections, for example, can take up a lot of memory, as can strings, but strings also have the problem of being inefficient.

Bloom filter

Install (Docker)

docker pull redislabs/rebloom

docker run -p 6379:6379 --name redis-bloom redislabs/rebloom

docker exec -it redis-bloom bash
Copy the code

TESTUSER = 1; TESTUSER = 0; TESTUSER = 1

root@3f20c7a7e3ac:/data# redis-cli 127.0.0.1:6379> bf.add TESTUSER 1 (integer) 1 127.0.0.1:6379> bf.add TESTUSER 2 (integer) 1 127.0.0.1:6379> bf.exists TESTUSER 1 (integer) 1 127.0.0.1:6379> Bf.exists TESTUSER 3 (integer) 0 127.0.0.1:6379> bf.exists TESTUSER 2 (integer) 1 127.0.0.1:6379> bf.madd TESTUSER 3 4 5 6 7 1) (integer) 12) (integer) 1 3) (integer) 1 4) (integer) 1 5) (integer) 1 127.0.0.1:6379> bf. Mexists TESTUSER 3 4 5 6 7 8 1) (integer) 12) (integer) 1 3) (integer) 1 4) (integer) 1 5) (integer) 1 6) (integer) 0Copy the code

Bloom Filter concept

It exists in order to retrieve whether an element exists in a set (not the set in Redis). Its spatial and temporal efficiency is far higher than the general algorithm, but there will be a certain misjudgment rate in the case of a large amount of data, and it cannot be deleted.

Calculation of space occupied by Bloom filter

The principle of

When an element is added to the collection of bloom filter, by K a hash function to map the element as the subscript of an array, and the value of the array subscript is set to 1, when to retrieve the value exists, just need to find the corresponding coordinates through K times hash function, and to assess the value of these elements is to 1, If all of them are 1, they might exist, and if all of them are 0, they definitely don’t exist.

The mathematical formula of Bloom filter is not enumerated, personally feel to understand the principle is not particularly helpful.

Set the error rate and estimate the amount of storage

Bloom filters have a certain error rate, so if we estimate the number of stored values, try to set the number of stored as large as possible to avoid false judgments.

127.0.0.1:6379> bf.reserve TESTRESERVE 0.03 10000 OK 127.0.0.1:6379> bf.add TESTRESERVE 1 (integer) 1 127.0.0.1:6379> bf.add TESTRESERVE 1 (integer) 1 127.0.0.1:6379> Bf. reserve TESTRESERVE 0.03 10000 (error) ERR item existsCopy the code

conclusion

Usage scenarios about bloom filter is actually a lot of, also can solve or avoid many problems, such as the cache penetration, etc., we are using it, we really know its advantages and disadvantages, then we can rest assured to use, and through optimizing to avoid its shortcomings, it will play its maximum value.

extension

One of the disadvantages of the Bloom filter is that it cannot delete the value that has been added. There is a filter that can solve this problem, that is, the cuckoo filter. The function is almost the same as the Bloom filter, but the cuckoo filter can delete the element that has been added. If you’re interested, take a look.