Preface

  • This article was first published on the WeChat public account [Code ape Technology Column].

  • Recently, I used a Bloom filter to filter out already-read content in a recommendation system, so I wrote this article to share what I learned.
  • The article is not long and mainly covers the core idea of the Bloom filter. The table of contents is as follows: ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c651ef13a0~tplv-t2oaga2asx-image.image)

What is a Bloom filter?

  • A Bloom filter is a data structure consisting of **a bit array of length `m`** and **`k` hash functions**. All bits are initialized to `0`, and each hash function should map its input to array positions as uniformly as possible.
  • To **insert** an element, the `k` hash functions convert it into `k` hash values. These values are used as indexes into the bit array, and the bit at each of those indexes is set to `1`.
  • To **query** an element, the same `k` hash functions convert it into `k` hash values (indexes), and the corresponding bits are checked. If any of them is `0`, the element is definitely not in the set. If all of them are `1`, the element is **possibly** in the set. **Why only possibly?** Because one or more of those bits may have been set to `1` by other elements. This is known as a false positive; more on this later.
  • **Elements cannot be deleted.** Why? A bit set by the element you want to delete may also be shared with other elements in the set. Clearing it would make those other elements look as if they had been deleted too.
  • The following diagram shows an example of a Bloom filter with `m=18` and `k=3`. The elements x, y, and z of the set are each hashed into the array by three different hash functions. When you query for element w, one of its bits is `0`, so w is not in the set. ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c652f1b758~tplv-t2oaga2asx-image.image)
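  • The insert and query steps described above can be sketched in a few lines of Java. This is a minimal illustration, not a production implementation: the class name `SimpleBloomFilter` and its methods are hypothetical, and the `k` hash functions are simulated by combining two FNV-1a hashes (double hashing), which is an assumption of this sketch rather than something the article prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal illustrative Bloom filter; class and method names are hypothetical. */
class SimpleBloomFilter {
    private final BitSet bits; // the bit array, all 0 initially
    private final int m;       // number of bits
    private final int k;       // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** Insert: set the k bits selected by the k hash functions to 1. */
    void put(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    /** Query: false means "definitely absent", true means "possibly present". */
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    /** The i-th hash function, simulated as h1 + i * h2 (an assumption of this sketch). */
    private int index(String element, int i) {
        byte[] data = element.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // odd, hence never zero
        return Math.floorMod(h1 + i * h2, m);
    }

    /** 32-bit FNV-1a hash with a configurable starting value. */
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }
}
```

  • With this sketch, `mightContain` always returns `true` for anything that was `put`, and an empty filter returns `false` for every query. There is deliberately no `remove` method: clearing a bit could also erase evidence of other elements that share it.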

Calculation of false positive probability

  • False positives are the main pain point of a Bloom filter, so we want to reduce their probability as much as possible. To do that, we first need to calculate it.
  • Assume each hash function selects every position in the bit array with equal probability. (When designing hash functions, you should likewise aim for a uniform distribution.)
  • When an element is inserted into a Bloom filter whose bit array has length `m`, each of its hash functions sets one bit to `1`. The probability that a given bit is still `0` after one hash function has been applied is:
  • ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c653174c29~tplv-t2oaga2asx-image.image)
  • With `k` hash functions and `n` inserted elements, the probability that the bit is still `0` follows naturally: ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c656179cf1~tplv-t2oaga2asx-image.image)
  • Conversely, the probability that it has been set to `1` is:
  • ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c654b8bccb~tplv-t2oaga2asx-image.image)
  • In other words, after `n` elements have been inserted, if we test an element that is not in the set, the probability that it is falsely reported as being in the set (that is, that the bits at all `k` of its hash positions are `1`) is: ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c656f4bf65~tplv-t2oaga2asx-image.image)
  • When `m` is large, (1 - 1/m)^m is approximately 1/e, so the false positive rate can be approximated using the limit formula:
  • ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c67c946b9c~tplv-t2oaga2asx-image.image)
  • Therefore, for a fixed number of hash functions `k`, we can conclude:

    1. The larger the bit array length m is, the lower the false positive rate is.
    2. The greater the number of inserted elements n, the higher the false positive rate.
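
  • For readers who cannot load the formula images, the derivation can be written out as below. This is the standard form of the calculation, reconstructed from the surrounding text rather than copied from the images, with `m`, `k`, and `n` as defined in this section.

```latex
\begin{align*}
P(\text{bit not set by one hash function}) &= 1 - \frac{1}{m} \\
P(\text{bit still 0 after } n \text{ insertions}) &= \left(1 - \frac{1}{m}\right)^{kn} \\
P(\text{bit is 1}) &= 1 - \left(1 - \frac{1}{m}\right)^{kn} \\
P(\text{false positive}) &= \left[1 - \left(1 - \frac{1}{m}\right)^{kn}\right]^{k}
  \approx \left(1 - e^{-kn/m}\right)^{k}
\end{align*}
```

  • The last step uses the limit $\left(1 - \frac{1}{m}\right)^{m} \to e^{-1}$. Both conclusions can be read directly from the exponent $-kn/m$: increasing `m` pushes the rate down, and increasing `n` pushes it up.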

Advantages

  • Because it uses a bit array and does not store the data itself, it needs far less space than conventional structures that store every element.
  • It is very time-efficient: both insertion and query only require evaluating `k` hash functions, so the time complexity is `O(k)`.

Disadvantages

  • False positives are possible, so it is not suitable for scenarios that require exact answers.
  • It supports only insertion and query; elements cannot be deleted.

Application scenarios

  • Bloom filters have many uses, but the main one is deduplication. Here are a few usage scenarios.

Duplicate URL detection in crawlers

  • Think about it: Baidu is essentially a crawler that regularly collects pages and articles from major websites. How does it make sure it does not crawl the same article twice? The idea is to store crawled URLs in a Bloom filter and check the filter before each crawl to see whether the URL has already been seen, thus avoiding repeated crawling. Of course, false positives are possible, but as long as the bit array is large enough, their probability is very low. And would Baidu care about this error? At worst, one of your articles might not get indexed because of a false positive.
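
  • To make "large enough" concrete, the sketch below evaluates the approximate false positive rate `(1 - e^(-kn/m))^k` from the previous section for a few bit-array sizes. The class name `FalsePositiveRate`, the figure of 100 million URLs, and the choice `k = 7` are all hypothetical values picked for illustration, not numbers from any real crawler.

```java
/** Illustrative calculator for the approximate false positive rate (1 - e^(-kn/m))^k. */
class FalsePositiveRate {
    /** m = bits in the array, n = inserted elements, k = hash functions. */
    static double estimate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-((double) k * n) / m), k);
    }

    public static void main(String[] args) {
        long n = 100_000_000L; // hypothetical number of crawled URLs
        int k = 7;             // hypothetical number of hash functions
        for (long bitsPerUrl : new long[] {4, 8, 16}) {
            long m = n * bitsPerUrl;
            System.out.printf("m = %d bits (%d MiB): p ~ %.6f%n",
                    m, m / 8 / 1024 / 1024, estimate(m, n, k));
        }
    }
}
```

  • Under these assumptions, 16 bits per URL gives a false positive rate of roughly 0.0007, and halving the bit array raises the rate sharply. The memory cost stays in the hundreds of megabytes, far less than storing the URLs themselves.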

Douyin recommendations

  • Almost everyone has used Douyin. Does Douyin show you the same videos over and over? How does it make sure its recommendations are not repeated?
  • The most obvious approach is for Douyin to record each user's viewing history and exclude any video found in it. That is one solution, but what about performance? Checking the full history on every recommendation clearly does not scale.
  • A Bloom filter is well suited to this kind of deduplication and solves it easily.
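
  • As a rough sketch of how such deduplication could look, the class below keeps one small filter per user and drops candidate videos that might already have been watched. Everything here is hypothetical (the name `WatchedFilter`, numeric video IDs, the sizes `M` and `K`, and the hash mixing), and it says nothing about how Douyin actually works.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical per-user "already watched" filter keyed by numeric video ID. */
class WatchedFilter {
    private static final int M = 1 << 20; // bits per user, an arbitrary choice
    private static final int K = 5;       // hash functions, an arbitrary choice
    private final BitSet bits = new BitSet(M);

    void markWatched(long videoId) {
        for (int i = 0; i < K; i++) {
            bits.set(index(videoId, i));
        }
    }

    boolean maybeWatched(long videoId) {
        for (int i = 0; i < K; i++) {
            if (!bits.get(index(videoId, i))) {
                return false; // definitely never watched
            }
        }
        return true; // watched, or a false positive
    }

    /** Keep only the candidates that are definitely unwatched. */
    List<Long> dropWatched(List<Long> candidates) {
        List<Long> fresh = new ArrayList<>();
        for (long id : candidates) {
            if (!maybeWatched(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }

    /** i-th index via double hashing over two mixed values (illustrative only). */
    private static int index(long videoId, int i) {
        long h1 = mix(videoId);
        long h2 = mix(videoId ^ 0x9E3779B97F4A7C15L) | 1L;
        return (int) Math.floorMod(h1 + i * h2, (long) M);
    }

    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        return x;
    }
}
```

  • A false positive here only means that a video the user has never seen is occasionally skipped, which is harmless for recommendations. A watched video is never recommended again, because a Bloom filter has no false negatives.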

Preventing cache penetration

  • Cache penetration means repeatedly querying data that exists in neither the cache nor the database. Every such request misses the cache and falls through to the database, so the load on the database keeps growing.
  • Bloom filters are also very good at solving cache penetration. I won't go into it here, but will cover it in a future article.

How to implement a Bloom filter?

  • Once you understand the design idea, implementing a Bloom filter is actually quite simple. I won't reinvent the wheel here; instead, I'll introduce some ready-made implementations.

Redis implementation

  • Redis 4.0 introduced support for modules (plug-ins), and a Bloom filter is available as one of them. Install it with Docker as follows:

    docker pull redislabs/rebloom
    docker run -p6379:6379 redislabs/rebloom

  • After the installation is complete, connect to Redis and run the following command:

    redis-cli

  • I won't demonstrate the specific usage here. See the official documentation and tutorials; it is quite simple to use.

Guava implementation

  • Guava ships with a Bloom filter implementation, so using one from Java is easy.

1. Create a Bloom filter

  • Create a Bloom filter as follows:

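    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.util.stream.IntStream;

    BloomFilter<Integer> filter = BloomFilter.create(Funnels.integerFunnel(), 5000, 0.01);
    IntStream.range(0, 100_000).forEach(filter::put);
    boolean b = filter.mightContain(1);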

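  • `arg1`: a `Funnel` that converts input data of any type `T` into Java primitive data (ultimately bytes) to be hashed; here `Funnels.integerFunnel()` handles `Integer` input
  • `arg2`: the expected number of elements to be inserted (`n`), from which Guava derives the size of the bit array
  • `arg3`: the expected probability of false positives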

2. Estimate the optimal m and k values

  • Internally, Guava uses its own calculation for the bit array size `m` and the number of hash functions `k`. The source code is as follows:

    static long optimalNumOfBits(long n, double p) {
      if (p == 0) {
        p = Double.MIN_VALUE;
      }
      return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Calculation of the k value
    static int optimalNumOfHashFunctions(long n, long m) {
      // (m / n) * log(2), but avoid truncation due to division!
      return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
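
  • As a quick sanity check of the two methods above, the sketch below re-declares them in a hypothetical `OptimalParams` class (so that it runs without Guava on the classpath) and plugs in `n = 5000` and `p = 0.01`, the values passed to `create` earlier.

```java
/** Re-declares the two formulas so the example runs without Guava. */
class OptimalParams {
    static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 5000;   // expected insertions
        double p = 0.01; // target false positive probability
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.println("m = " + m + " bits, k = " + k);
        // prints: m = 47925 bits, k = 7
    }
}
```

  • So a 1% false positive rate for 5000 elements costs about 47,925 bits (under 6 KB) and 7 hash functions, or roughly 9.6 bits per element.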

  • To understand Guava's calculations, we need to continue from the derivation above.
  • According to the approximate formula for the false positive rate, the rate is minimized when `k` takes the value:
  • ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c67f82747e~tplv-t2oaga2asx-image.image)
  • Substituting this `k` into the formula from the previous section and simplifying gives the relationship between the expected false positive rate `p` and `m`, `n`: ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c6888facbb~tplv-t2oaga2asx-image.image)
  • Rearranging:
  • ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c6904f30b0~tplv-t2oaga2asx-image.image)
  • Based on the above analysis, the following conclusions can be drawn:

    1. For a given expected false positive rate `p`, the optimal `m` is linear in the expected number of elements `n`.
    2. Once `m` is chosen optimally, the optimal `k` depends only on `p` and no longer on `m` or `n`, namely: ![](https://p1-jj.byteimg.com/tos-cn-i-t2oaga2asx/gold-user-assets/2020/4/22/171a11c69b8dcc12~tplv-t2oaga2asx-image.image)
    3. In conclusion, the values that matter when creating a Bloom filter are `p` and `n`; `m` and `k` follow from them.
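
  • Written out in full, the steps above look like this. The formulas are reconstructed from the text and from Guava's two methods rather than copied from the images, so treat them as the standard form of the derivation.

```latex
\begin{align*}
k &= \frac{m}{n}\ln 2 \\
p &= \left(1 - e^{-kn/m}\right)^{k} = \left(\frac{1}{2}\right)^{k} = 2^{-\frac{m}{n}\ln 2}
  \quad\Longrightarrow\quad \ln p = -\frac{m}{n}\,(\ln 2)^{2} \\
m &= -\frac{n \ln p}{(\ln 2)^{2}} \\
k &= \frac{m}{n}\ln 2 = -\frac{\ln p}{\ln 2} = -\log_{2} p
\end{align*}
```

  • The third line is exactly what `optimalNumOfBits` computes, and the first line is what `optimalNumOfHashFunctions` computes before rounding.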

Conclusion

  • That wraps up this introduction to the Bloom filter. If you found the article helpful, please share it; every reader's support is a great encouragement to me.
