This is the 17th day of my participation in the August Challenge

1, what is bloom filter?

The Bloom filter is essentially a data structure, a clever probabilistic data structure that inserts and queries efficiently and tells you “something definitely doesn’t exist or might exist.”

It is more efficient and takes up less space than traditional List, Set, Map and other data structures, but the disadvantage is that the results it returns are probabilistic rather than exact.

2. Why is a Bloom filter needed?

2.1 Cache Penetration

We often cache some data in Redis, such as article details.

This is the simplest, most common, and most effective way to improve performance by fetching data directly from the cache based on the article Id.

The general query request process is like this: first check the cache (if there is a cache directly return) -> if there is not in the cache, then go to the database query (put the data from the database into the cache) -> return data.

Everything looks good. But what happens if a lot of requests come in now, all of them asking for a non-existent article Id?

Since the article Id does not exist, then there must be no corresponding cache information, no cache, so a large number of requests are piled up to the database, the pressure of the database suddenly came up, and even the database hit full run dead.

2.2 Extensive Query

Query the mobile phone number system of nearly 100 million users whether there is their transaction information, how to make accurate and fast judgment?

  • Solution 1: Query information in the database

The query pressure seems a bit high, it is not efficient and there is a risk that the database service will be overwhelmed.

  • Scheme 2: Through Redis cache (put the user’s mobile phone number with transaction slip into Set)

Large keys may cause slow query, and memory resources may be wasted.

The above examples have one common feature: ** How to determine whether an element exists in a set. ** These problems might be solved in other ways, but bloom filters are effective.

3, the realization principle of Bloom filter

3.1 Hash function

The concept of a hash function: a function that converts data of any size into data of a specific size. The converted data is called a hash or hash code. The schematic diagram is as follows:

It is obvious that the original data becomes a hash code after being mapped by the hash function, and the data is compressed.

Hash functions are the basis for implementing hash tables and Bloom filters.

3.2 Data Structure

A Bloom filter is a bit vector or an array of bits, as shown below:

The core implementation of the Bloom Filter is a large bit array and several hash functions.

Suppose we want to put the elements X, y and z into the Bloom filter and then retrieve w, as shown in the following diagram:

The above picture shows the specific operation process:

Suppose the set has three elements {x, y, z} and the number of hash functions is 3. Initialize the bit array by setting each bit to 0.

  • Settings: set to the inside of the {x, y, z} each element, the element in turn through the three hash function mapping, each mapping produces a hash value, the value corresponds to a point above the array, and then will be marked as an array of the corresponding position 1 (this position 2, 4, 5, 6, 12, 14, 17-1).

  • Query: The same method hash maps W to three points in the array (position 5, 14, 16) to determine whether the element W exists in the set. If one of the three points is not 1, then the element must not be in the set. Conversely, if all three points are 1, the element may exist in the set.

Note: It is not possible to judge whether the element exists in the set, and there may be a certain misjudgment rate.

Let’s say that some element is mapped to 4, 5, and 6. Although these three points are all 1, it is obvious that these three points are the hashed positions of different elements. Therefore, this situation indicates that even though elements are not in the set, they may also correspond to 1, which is the reason for the existence of misjudgment rate.

4, Bloom filter operation

4.1 Adding Elements

Syntax: [bf.add key options] 127.0.0.1:6379> bf.add users 15012345678 (integer) 1Copy the code
  • I’m going to add elements to k hash functions

  • We get k positions that correspond to the bit array

  • Let’s set these k positions to be 1

4.2 Querying Elements

Grammar: [bf.exists key options] 127.0.0.1:6379> bf.exists users 15012345678 (integer) 1 127.0.0.1:6379> bf.exists users 15012345679 (integer) 0Copy the code
  • The element to be queried is given k hash functions

  • We get k positions that correspond to the bit array

  • If one of k positions is 0, you’re definitely not in the set

  • If all k positions are 1, then it might be in the set

4.3 Deleting Elements

Traditional Bloom filters do not support deletion.

However, a variant called Counting Bloom Filter can be used to test whether the number of elements counted is definitely below a certain threshold, and it allows element deletion. For details, see the principle and implementation of Counting Bloom Filter

5. Best practices

5.1 Application Scenarios

Common uses include using bloom filters to reduce disk I/O or network requests, because once a value is definitely not there, we don’t have to make subsequent, more expensive query requests.

Since you use bloom filters to speed up finding and determining presence, low-performance hash functions are not a good choice. Recommend MurmurHash, Fnv, etc.

5.2 Setting the Number of Hash Functions and the Length of bloom Filter

If the bloom filter is too small, soon all the bits will be set to 1, and any query will return “may exist”, defeating the purpose of filtering. The length of Bloom filter directly affects the false positive rate, the longer the bloom filter, the lower the false positive rate.

The number of hash functions also needs to be weighed. The more the number, the faster the Bit position of The Bloom filter is 1, and the lower the efficiency of the Bloom filter is. But if there are too few, our false positives will be higher.

– END –

Author: The road to architecture Improvement, ten years of research and development road, Dachang architect, CSDN blog expert, focus on architecture technology precipitation learning and sharing, career and cognitive upgrade, adhere to share practical articles, looking forward to growing with you. Attention and private message I reply “01”, send you a programmer growth advanced gift package, welcome to hook up.

Thanks for reading!