Bloom filter

A Bloom filter is a probabilistic data structure consisting of a bit array and multiple hash functions that returns two possible and certain non-existent results.

An element in a Bloom filter is determined by multiple state values. Bit arrays store state values, and hash functions calculate the position of state values.

According to its algorithm structure, it has the following characteristics:

  • A finite bit array is used to represent the number of elements greater than its length, because a bit’s state value can simultaneously identify multiple elements.
  • Elements cannot be deleted. This is because a bit state value may simultaneously identify multiple elements.
  • Adding elements never fails. However, as more elements are added, the misjudgment rate increases.
  • If the judgment element does not exist, it must not exist.

For example, X,Y and Z are determined by three state values to determine whether the element exists, and the position of the state value is calculated by three hash functions.

Mathematical relationship

The probability of misjudgment

As for the probability of misjudgment, the state value of each bit may identify multiple elements at the same time, so it has a certain probability of misjudgment. If the bit array is full, it will always return true when determining whether an element exists, with a 100% error rate for non-existent elements.

So, the probability of miscalculation is related to the number of elements added, the length of the Bloom filter (the size of the bit array), the number of hash functions.

According to Wikipedia inference, the probability of misjudgment PfpP_{fp}Pfp has the following relationship:


P f p = ( 1 [ 1 1 m ] k n ) k material ( 1 e k n m ) k { P_{fp} =\left(1-\left[1-{\frac {1}{m}}\right]^{kn}\right)^{k}\approx \left(1-e^{{-\frac {kn}{m}}}\right)^{k}}
  • MMM is the size of the bit array;
  • NNN is the number of added elements;
  • KKK is the number of hash functions;
  • Eee mathematical constant, approximately 2.718281828.

It can be concluded that when the number of added elements is 0, the false positive rate is 0. When the bit array is all 1, the false positive rate is 100%.

The relation between Pfp P_{fp}Pfp and N nn under different number of hash functions is shown as follows:

There are a couple of things you can do based on the probability of misjudgment formula

  • Estimate the optimum bloom filter length.
  • Estimate the optimal number of hash functions.

Optimum bloom filter length

When NNN adds elements and PfpP_{fp}Pfp false positive probability is determined, MMM equals:


m = n ln P f p ( ln 2 ) 2 material 1.44 n log 2 P f p {\ displaystyle m = – {\ frac {n \ ln P_ {fp}} {\ ln (2) ^ {2}}} \ \ cdot approx 1.44 n \ log _ {2} P_ {fp}}

Optimal number of hash functions

When NNN and PfpP_{fp}Pfp are determined, KKK is equal to:


k = ln P f p ln 2 = log 2 P f p {\displaystyle k=-{\frac {\ln P_{fp} }{\ln 2}}=-\log _{2}P_{fp} }

When NNN and MMM are determined, KKK equals:


k = m n ln 2 {\displaystyle k={\frac {m}{n}}\ln 2}

Implement bloom filter

Before using a Bloom filter, we typically evaluate two factors.

  • The maximum number of elements expected to be added.
  • The business’s tolerance for error. For example, if one error is allowed in 1,000 cases, the probability of misjudgment should be less than one in 1,000.

Many Bloom filtering tools provide the expected number of additions and probability of miscalculation configuration parameters, and they calculate the optimal length and number of hash functions based on the configured parameters.

There are some good Bloom filtering toolkits in Java.

  • GuavaBloomFilter.
  • redissonRedissonBloomFilterIt can be used in Redis.

Take a look at Guava’s simple implementation of BloomFilter, which calculates the length of the bit array and the number of hash functions before creating it.

 static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    /** * expectedInsertions: Expect number of additions * FPP: Misjudgment probability */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e); }}Copy the code

According to the optimal Bloom filter length formula, the optimal bit array length is calculated.


static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }
Copy the code

According to the optimal number of hash functions formula, calculate the optimal number of hash functions.

static int optimalNumOfHashFunctions(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }
Copy the code

The calculation method of RedissonBloomFilter in Redisson is also consistent.

    private int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
      }

    private long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }
Copy the code

Memory footprint

Imagine a scenario where a mobile phone number is deleted. Each mobile phone number occupies 22 bytes. The estimated logical memory is as follows.

expected HashSet FPP = 0.0001 FPP = 0.0000001
1 million 18.28 MB 2.29 MB 4MB
10 million 182.82 MB 22.85 MB 40MB
100 million 1.78 G 228.53 MB 400MB

Note: The actual physical memory is larger than the logical memory.

Misjudgment probability PPP and added elements NNN, bit array length MMM, number of hash functions KKK relationship is as follows:

Application scenarios

  1. Weak password detection;
  2. Spam address filtering.
  3. Browser detects phishing sites;
  4. Cache penetration.

Weak cipher detection

Maintain a list of hashed weak passwords. When a user registers or updates a password, a bloom filter is used to check for the new password and the prompt user is detected.

Spam address filtering

Maintains a hashed list of spam addresses. When a user receives a message, bloom filter is used to detect it and identifies it as spam.

Browser detects phishing sites

Use the Bloom filter to find if a URL for a site exists in the phishing site database.

The cache to penetrate

Cache penetration is a query for data that does not exist and is not hit by either the cache layer or the database. When the cache is not hit, the database is queried

  1. If the database is not hit, empty results are not written back to the cache and returned.
  2. The database hits, the query result is written back to the cache and the result is returned.

A typical attack simulates a large number of requests for nonexistent data, all of which fall on the database, causing the database to go down.

One solution is to put the existing cache into a Bloom filter and validate the filter before the request is made.

summary

For quadrillion levels of data, the use of Bloom filters has certain advantages, and it is critical to properly evaluate the expected number of additions and the probability of misjudgment based on the business scenario.

reference

En.wikipedia.org/wiki/Bloom_…

hur.st/bloomfilter