background

In the case of high concurrency requests, service data is generally cached to increase system concurrency, because disk I/O and network I/O are hundreds of times inferior to memory I/O performance. Do a simple calculation, if we need some data, the data read from the database disk needs 0.1 s, from the switch to 0.05 s, then each request to complete a minimum of 0.15 s (of course, in fact, disk and network IO nor so slowly, and here are just examples), the database server can only respond to 67 requests per second. If the data is in native memory and only takes 10us to read, it can respond to 100,000 requests per second. Caching improves processing efficiency by storing frequently used data closer to the CPU to reduce data transfer time. This is the point of caching.

However, as mentioned above, cache will have problems such as breakdown and penetration. For this, we can introduce Bloom Filter to prevent system abnormalities caused by excessive database pressure.

Bloom Filter

If you want to determine whether an element is in a set, the general idea is to store all the elements and compare them. Linked lists, trees, and so on. But with the increase of elements in the set, we need more and more storage space and slower and slower retrieval speed (O(n),O(logn)).

But there is also a data structure called a Hash table, which maps an element to a point in an array of bits using a Hash function, so that we can see if the point is a 1 to see if it is in the set. That’s the basic idea of a Bloom filter.

Bloom Filter was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions, and you can actually think of it simply as an imprecise set structure that might misjudge the existence of an object when you use its contains method. But the Bloom filter is not particularly inaccurate, as long as the parameters are set reasonably, its accuracy can be controlled relatively accurate enough, there will only be a small probability of misjudgment.

When a Bloom filter says a value exists, it may not exist; When it says it doesn’t exist, it certainly doesn’t exist. For example, when it says it doesn’t know you, it really doesn’t know you, but when it says it knows you, it may misjudge you because you look like another friend it knows (with a similar face).

The principle of

When an element is added to the Bloom filter, the following operations are performed:

  1. The element value is computed using the hash function in the Bloom filter to obtain the hash value (there are several hash functions that yield several hash values).
  2. Based on the resulting hash value, the value of the corresponding subscript in the bit array is set to 1.

When we need to determine whether an element exists in the Bloom filter, we do the following:

  1. Perform the same hash on the given element again;
  2. If each element in the bit array is 1, if it is 1, then the value is in the Bloom filter,

If there is a value that is not 1, the element is not in the Bloom filter.

As shown above,

  1. There are eight hash bits and the default value is 0,
  2. When thetencentAfter hashing, bits 2, 5, and 7 are set to 1,
  3. Continue tocloudAfter hashing, bits 3, 4 and 6 are set to 1.
  4. Now let’s look at other wordsotherThe word does not exist in the set if bits 1, 2, and 3 are assumed to be 0.

Summarize the process

How to choose the number of hash functions and bloom filter length

If the length of the Bloom filter is too small, all the bits will be used up quickly, at which point any query returns “may exist”; If the length of the Bloom filter is too large, the probability of miscalculation is small, but the memory space is wasted. Similarly, the more hash functions there are, the faster the bit of Bloom filter will be occupied. The lower the number of hash functions, the higher the probability of misjudgment. Therefore, the length of the Bloom filter and the number of hash functions need to be weighed according to the business scenario.

Three parameters

  • The number of hash functions k;
  • The capacity of the Bloem filter bit array m;
  • The number of data inserted by the Bloom filter n;

As for the setting of the three values, a big man has made a summary in his article ten years ago, which requires profound mathematical foundation, so I will not repeat it in detail here. The article link is http://blog.csdn.net/jiaomeng/article/details/1495500

The main mathematical conclusions are:

  • In order to obtain the optimal accuracy, when k = ln2 * (m/n), Bloom filter can obtain the optimal accuracy.

The advantages and disadvantages

Advantages: The advantages are obvious, binary array, very little memory, and inserts and queries are fast enough.

Disadvantages: With the increase of data, the misjudgment rate will increase; There is also the inability to determine that the data must exist; Another important disadvantage is that you cannot delete data

scenario

  • Big data check: This allows for the above de-duplication, using HashMap if your server has enough memory

Can be a good solution, theoretically can be O(1) time complexity, but when the data volume, still only consider bloom filter.

  • Solve cache penetration (mentioned in the background) : With bloom filters we can pre-query data by primary key, such as user ID or article ID

Cache to a filter. When querying data according to ID, we first determine whether the ID exists. If so, we proceed to the next step. If not, return it directly so that subsequent database queries are not triggered. It is important to note that cache penetration cannot be completely solved and can only be controlled to a tolerable extent.

  • Crawler/mailbox system filtering: I don’t know if you have noticed that some normal mail is also put in the spam directory, this is due to the use of bloom filter misjudgment.
  • Google Chrome uses bloom filters to identify malicious urls.

Basic operation

The bloom filter has two basic directives, bf.add to add an element and bf.exists to check if an element exists. This is used in the same way as sadd and sismember of set sets. Note that bf.add can only add one element at a time. If you want to add more than one element at a time, you need to use the bf.madd directive. Similarly, if you need to query the existence of multiple elements at once, you need to use the bf. Mexists directive.

127.0.0.1:6379> bf.reserve users 0.01 1000 OK 127.0.0.1:6379> bf.add users U1 (integer) 1 127.0.0.1:6379> bf.add users U2 (integer) 1 127.0.0.1:6379> bf.add Users u3 (integer) 1 127.0.0.1:6379> bf.exists Users U3 (integer) 1 127.0.0.1:6379> bf.madd users user4 user5 user6 1) (integer) 12) (integer) 1 3) (integer) 1 127.0.0.1:6379> bf.mexists users user4 user5 user6 user7 1) (integer) 1 2) (integer) 1 3) (integer) 1 4) (integer) 0Copy the code

Bf. reserve has three parameters, which are key, error_rate and initial_size:

The lower the error_rate is, the more space is needed. For occasions that do not need to be too precise, it is ok to set it a little larger. For example, the push system mentioned above will only filter out a small part of the content, and the overall viewing experience will not be greatly affected.

Initial_size means the number of elements expected to be put in. When the actual number exceeds this value, the misjudgment rate will increase. Therefore, it is necessary to set a large value in advance to avoid the increase of misjudgment rate caused by exceeding this value.

If bf.reserve is not applicable, the default error_rate is 0.01 and the default initial_size is 100.