This is the 15th day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021

Bloom filter

What is a Bloom filter

Essentially, the Bloom filter is a kind of data structure, an ingenious probabilistic data structure that inserts and queries efficiently and tells you “something definitely doesn’t exist or might exist.”

It is more efficient and takes up less space than traditional List, Set, Map and other data structures, but the disadvantage is that the results it returns are probabilistic rather than exact

Realize the principle of

A HashMap problem

Usually you determine the existence of an element using a HashMap, and indeed you can map the value to the Key of the HashMap, and then you can return the result in O(1) time, very efficiently. However, the implementation of HashMap also has disadvantages, such as high memory ratio. Considering the load factor, usually the space cannot be used up, and once you have a lot of values, such as hundreds of millions, the amount of memory occupied by HashMap becomes quite significant.

For example, if your data set is stored on a remote server, the local service accepts input, and the data set is too large to be read into memory at once to build a HashMap, there are also problems

Bloom filter data structure

A Bloom filter is a bit vector or array of bits that looks like this:

If we want to map a value to a Bloom filter, we need to generate multiple hashes using multiple different hash functions and point to the bit position 1 for each generated hash value. For example, hashes 1, 4, 7 are generated for the value “baidu” and three different hash functions respectively. Then the figure above changes to:

We now store another value “Tencent”. If the hash function returns 3, 4, 8, the graph continues to change to:

Note that the 4 bit is overwritten because it is returned by the hash functions of both values. Now if we want to query whether the value “dianping” exists, the hash function returns 3 values, 1, 5 and 8, and we find that the value of the 5 bit is 0, indicating that no value is mapped to this bit. So we can safely say that the value “dianping” does not exist. When we need to query whether the value of “baidu” exists, the hash function will inevitably return 1, 4, 7, and then we check that the value of these three bits are all 1, so we can say that “baidu” exists? The answer is no, only that the value baidu may exist.

As more and more values are added, more and more bits will be set to 1. In this way, even if a value “Taobao” has not been stored, if all three bits returned by the hash function are set to 1 by other values, the program will still judge that the value “Taobao” exists

Does deletion support?

Traditional Bloom filters do not support deletion. However, a variant called Counting Bloom Filter can be used to test whether the number of elements counted is definitely below a certain threshold, and it allows element deletion

How to choose the number of hash functions and bloom filter length

Obviously, too small a Bloom filter will soon have all bits 1, and any query for a value will return “may exist”, defeating the purpose of filtering. The length of Bloom filter directly affects the false positive rate, the longer the bloom filter, the lower the false positive rate.

In addition, the number of hash functions also needs to be weighed. The more the number of hash functions, the faster the Bloom filter bit 1 is, and the lower the efficiency of the Bloom filter is. But if there are too few, our false positives will be higher.

How to choose k and m values suitable for business? Here is a direct formula:

K is the number of hash functions, m is the length of Bloom filter, n is the number of inserted elements, and p is the false positive rate.