Bloom Filter

thinking

If you were constantly trying to determine the presence of an element, what would you do?

  • It’s easy to imagine that you could use a HashSet (HashMap) to look up elements as keys
    • The time complexity of this method is O(1), but the disadvantage is that the space utilization is not high, and more memory resources need to be occupied

But if you’re writing a web crawler to crawl a billion web sites, how do you tell if a site has crawled before, in order to avoid crawling to duplicate sites?

  • Obviously, a HashSet is not a very good alternative to a HashMap because it consumes a lot of memory

Is there a low time complexity, low memory footprint solution? Bloom filters do just that.

Introduction to Bloom filter

Bloom filter was proposed by Bloom in 1970. It is a probabilistic data structure with high space utilization, which can be used to tell you that an element must not exist or may exist. Based on this conclusion, bloom filter has the following advantages and disadvantages

  • Advantages: space efficiency and query time are far more than the general algorithm
  • Disadvantages: there is a certain misjudgment rate, difficult to delete

Although the Bloom filter has a certain misjudgment rate, the misjudgment rate can still be controlled by code, so it can be adjusted according to business requirements. In general, bloom filters can be considered in the following situations

  1. It is often necessary to determine whether an element exists
  2. The number of elements is large and you want to have less memory space
  3. A certain percentage of misjudgment is allowed

Essence: The essence of a Bloom filter is a long binary vector and a series of random mapping functions (Hash functions). As described above, a Bloom filter consists of two parts, one is a Hash function and the other is a binary vector

Binary vector: Can be understood as a binary array

Common application
  • Web page blacklist system, spam filtering system, crawler url judge system, to solve the problem of cache penetration
The principle of bloom filter

Assume that the Bloem filter consists of a 20-bit binary (with an initial value of 0) and three hash functions, each of which generates an index

  • Add elements and set the index generated by each hash function to 1. For example: Now suppose you want to add an element A, it will use three hash function respectively, used to generate the corresponding value (assuming that the first hash function generated by the index of 4, the second hash function generated by the index of 7, the third hash function to generate index to 18), and then in the array corresponds to the index value is set to 1 if you want to continue to add elements of B, It also takes three hash functions, generates the corresponding index value (assuming the first hash function generates the index value of 2, the second hash function generates the index value of 7, and the third hash function generates the index value of 15), and sets the corresponding index value in the array to 1
  • Query whether the element exists: use hash function to calculate the index of the element under the corresponding function. If all the values in the corresponding array index are 1, it means that the element may exist. If at least one of the values in the corresponding index is 0, it means that the element must not exist
    • If a hash function generates an index at a position other than 1, it does not exist (100% accurate)
    • If each hash function generates an index position of 1, it is present, but there is a certain misjudgment rate

So, according to the principle of bloom filter, we know that

The time complexity of add/query is O(k), where K is the number of hash functions

The space complexity is O(m), where M is the number of binary bits

Error rate of Bloom filter

Generally speaking, the misjudgment rate P is affected by three factors, respectively

  1. The number of bits m
  2. The number of hash functions k
  3. Data size n

According to the formula shown in the figure below, the current miscarriage rate p can be calculated

Since the value of n is very large when the data size is very large, the constant coefficient of 0,5 can be ignored, and the number of binary bits is also very large, so the constant 1 can also be ignored. Therefore, the simplified formula is as follows

Therefore, in actual development, the misjudgment rate is determined in combination with the business, so the misjudgment rate can be considered as a known value. And the data size is also known, so the number of binary bits M and the number of hash table K can be obtained by using the misjudgment rate P and the data size N

The scientists came up with the following formula:

Count the number of bits

Count the number of hash tables

The implementation of Bloom filter

Bloom filters provide two apis, one for adding elements and the other for checking whether elements exist

The implementation of the two apis is as follows

/* * n is the data size * p is the misjudgment rate (0,1) * */
public BloomFilter(int n,double p) {
    if (n <= 0 || p <= 0 || p >= 1) {throw new IllegalArgumentException("wrong n or p");
    }
    double ln2 = Math.log(2);
    // Compute the length of the binary vector
    bitSize = (int)(- (n * Math.log(p)) / (ln2 * ln2));
    // Count the number of hash functions
    hashSize = (int)(bitSize * ln2 / n);
    // The length of the bits array
    bits = new long[(int)((bitSize + Long.SIZE - 1)) / Long.SIZE];
}
/* * Add element * */
public void put(T value){
    nullCheck(value);
    int hash1 = value.hashCode();
    int hash2 = hash1 >>> 16;
    for (int i = 1; i <= hashSize; i++) {
        int combinedHash = hash1 + (i * hash2);
        if (combinedHash < 0) {
            combinedHash = ~combinedHash;
        }
        // Generate a binary index
        int index = combinedHash % bitSize;
        // Set the binary bit of index to 1set(index); }}Copy the code

The above is the main implementation logic of the two apis. Specific implementation can be referred to demo.

Demo Download Address

Finished!