The original

Introduction to the

Bloom Filter was proposed by Bloom in 1970. It is a fast search algorithm (storage structure) based on the hash function mapping. It can realize the record of the existence or not of massive data (black and white list judgment) with a very small space and operation cost. It is characterized by efficient insertion and query, which can determine the must not exist and may exist

Bloom filter

A Bloom filter is either a bit vector or an array of bits

“Add” element

To shoot an “add” into a Bloom filter, you need to use multiple hash functions to generate multiple hash values, each corresponding to a point on the bit array, and then mark the corresponding position of the bit array as 1.

As shown below, the string ‘hello’ generates the hash values 1,3,9 from the three hash functions

The string ‘word’ generates 1,5,7

Note: Since both Hello and Word return bit 1, the previous 1 is overwritten

Query element

To check whether an element is in the collection, the same method hashes the element to three points on the array.

If one of the three points is not 1, then we can judge that the element must not exist in the set.

If all three points are 1, then that element might be in the set. Note: It is not possible to determine whether the element is always present here, there may be a certain misjudgment rate. As more and more elements are added, more and more bits are set to 1, so even if an element is not stored, the Bloom Filter will determine that it exists if all three bits calculated by the hash function are set to 1 by other elements.

For example, if the three hash values of the word flink are 1,3, and 7, does that mean that flink must exist in the set?

The advantages and disadvantages

Compared with traditional data structures such as List, Set, Map, etc., Bloom Filter is more efficient, occupies less space, and can accurately determine that elements are not in the Set

disadvantages

  1. It can only judge that the element has a probability to be in the set, so it is unreliable to use Bloom Filter to judge the result of the element in the set
  2. When an element that has been added to the collection is deleted, the Bloom Filter cannot “refresh” the hash position of the element. Because these positions have a probability of also being hash positions for other elements. So collections that use the Bloom Filter cannot delete elements.

How to reduce the error

  • Increase the length of the Bloom filter, otherwise all bits will easily be 1

  • The number of hash functions should be considered. The more the number of hash functions is, the faster the speed of bit position 1 of the Bloom filter is, and the lower the efficiency of the Bloom filter is. But if it’s too little, then our error will be higher.