A full explanation of the Bloom filter

preface

We talked earlier about Redis cache avalanche, penetration, and breakdown. In this article we mentioned that one of the solutions to cache penetration is the Bloom filter, but we didn’t talk about how to use the Bloom filter last time.

As a warm man’s elder brother, to make up for you, please call me IT old warm man.

What is a Bloom filter

The Bloom Filter, which was introduced in 1970 by a guy named Bloom, is now 50 years old, as old as my brother.

It’s actually a long binary vector and a bunch of random mapping functions, and binary, as you probably all know, stores data that’s either 0 or 1, and the default is 0.

It is used to determine whether an element is in a set. 0 indicates that certain data does not exist, and 1 indicates that certain data exists.

Do you understand? As aSunshine boyMy brother is drawing you a picture to help you understand:

Use of Bloom filter

Resolving Redis cache penetration (today’s focus)
In crawler, the crawler url is filtered, and the url that already exists in Braun is not crawled.
Spam filtering, to determine whether each address that sends mail is in bloom’s blacklist, if so, it is judged as spam.

The above is just a simple example of use, we can draw inferences from one another, flexible use in the work.

Bloom filter principle

In the process

A Bloom filter, as it says, is a collection of binary data. When a piece of data is added to the set, it undergoes the following (there are drawbacks, which will be discussed below) :

The data is computed by K hash functions and returns K computed hash values
These K hash values map to the corresponding K binary array subscripts
Change the binary data corresponding to K subscripts to 1.

For example, if the first hash function returns x and the second and third hash functions return y and z, then the binary of x, y, and z is changed to 1.

As shown in the figure:

The query process

The main function of bloom filter is to query a data, in this binary set, the query process is as follows:

This data is computed by K hash functions, corresponding to K hash values computed
Find the corresponding binary array subscript using the hash value
Judgment: If there is a location where the binary data is 0, then the data does not exist. If both are 1’s, the data is in the set. (There are drawbacks, which will be covered below)

Removal process

You generally can’t delete data in bloom filters, which is one of the drawbacks we’ll examine below.

Advantages and disadvantages of bloom filters

advantages

Since you store binary data, it takes up very little space
Its insert and query speed is very fast and the time complexity is O (K), reminiscent of the process of HashMap
Confidentiality is good because it doesn’t store any raw data itself, only binary data

disadvantages

Which brings us back to the shortcomings we mentioned above.

Adding data is done by calculating the hash value of the data, so it is quite possible that two different data will compute the same hash value.

For example, if “hello” and “hello” in the figure end up with the same hash value, they will change the binary data with the same subscript to 1.

At this point, you don’t know whether binary with subscript 2 means “hello” or “hello.”

The following disadvantages can be obtained:

First, miscarriage of justice

If the graph above does not contain “hello” but only “hello”, then a query using “hello” will determine that “hello” exists in the collection.

Because the hash value for “hello” and “hello” is the same, the binary data found by using the same hash value is also the same, which is 1.

Two, delete difficulty

I’m going to stop here and I think you can understand why, but as your warm brother, I’ll tell you.

Again, use the above example, because “hello” and “hello” have the same hash value and the corresponding array index is the same.

At this time, the elder brother wanted to delete “hello”, and changed the binary data in the subscript 2 from 1 to 0.

Do we delete “hello” at the same time? (0 means there is such data, 1 means there is no such data)

By this point, you’ve already understood the Bloom filter. I’m the warm guy.

Implement bloom filter

There are many ways to implement this, one of which is the one provided by Guava.

Introduction of Guava POM configuration

<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>29.0-jre</version>
</dependency>
Copy the code

Two, code implementation

And let’s measure the error rate by the way.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterCase {

  /** * how much data is expected to be inserted */
  private static int size = 1000000;

  /** * Expected misjudgment rate */
  private static double fpp = 0.01;

  /** * Bloom filter */
  private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, fpp);


  public static void main(String[] args) {
    // Insert 100,000 sample data
    for (int i = 0; i < size; i++) {
      bloomFilter.put(i);
    }

    // Test the error rate with another hundred thousand test data
    int count = 0;
    for (int i = size; i < size + 100000; i++) {
      if (bloomFilter.mightContain(i)) {
        count++;
        System.out.println(i + "Misjudged.");
      }
    }
    System.out.println("Total number of miscarriages of justice :"+ count); }}Copy the code

Running results:

There are 947 misjudgments in 100,000 data, which is about 0.01%, which is the misjudgment rate set in our code: FPP = 0.01.

Dive deep into code

The core bloomfilter.create method

@VisibleForTesting static <T> BloomFilter<T> create( Funnel<? Super T> Funnel, Long expectedInsertions, double FPP, Strategy Strategy) {... }Copy the code

There are four parameters:

funnel: data type (typically in the Funnels tool class)
expectedInsertions: The number of values expected to be inserted
fpp: Misjudgment rate (default 0.03)
strategy: Hash algorithm

Let’s focus on the FPP parameter

FPP misjudgment rate

Scene 1:`FPP = 0.01`

Number of misjudgments: 947
Memory size: 9585058 bits

Scene two:`FPP = 0.03`(Default parameters)

Number of misjudgments: 3,033
Memory size: 7298440 bits

Scene summary

The miscarriage rate can passfppParameter adjustment
The smaller the FPP, the more memory space is required: 0.01 requires more than 9 million bits and 0.03 requires more than 7 million bits.
The smaller the FPP, the more hash functions needed to calculate the more hash values to store in the corresponding array subscripts when adding data to the collection. (Forgot to look at the process of filtering the data stored in bloom above)

NumBits = 1 million int digits Theoretically save a million numbers, an int is 4 bytes 32 bits, need 481000000=32 million bits. If you use HashMap to store, you need 64 million bits for 50 percent HashMap storage efficiency. As you can see, BloomFilter has a very small storage space, about 1/10 of that of HashMap

The numHashFunctions above indicate that several hash functions are required to map different subscripts to whether the numbers exist (0 or 1).

Resolve Redis cache penetration

The bloem filter implemented above using Guava puts the data in local memory. This is not appropriate in distributed scenarios, where shared memory is not possible.

We can also implement bloom filters with Redis, using the Redis packaged client tool Redisson.

At the bottom is the use of the data structure bitMap, which is understood as the binary structure mentioned above. Due to space reasons, bitMap is not covered in this article, and we will write an article about it later.

Code implementation

Pom configuration:

<dependency>
  <groupId>org.redisson</groupId>
  <artifactId>redisson-spring-boot-starter</artifactId>
  <version>3.134.</version>
</dependency>
Copy the code

Java code:

public class RedissonBloomFilter {

  public static void main(String[] args) {
    Config config = new Config();
    config.useSingleServer().setAddress("Redis: / / 127.0.0.1:6379");
    config.useSingleServer().setPassword("1234");
    / / Redisson construction
    RedissonClient redisson = Redisson.create(config);

    RBloomFilter<String> bloomFilter = redisson.getBloomFilter("phoneList");
    // Initialize the Bloom filter: the expected element is 100000000L, with an error rate of 3%
    bloomFilter.tryInit(100000000L,0.03);
    // Insert the number 10086 into the Bloom filter
    bloomFilter.add("10086");

    // Check whether the following numbers are in the Bloom filter
    / / output is false
    System.out.println(bloomFilter.contains("123456"));
    / / output true
    System.out.println(bloomFilter.contains("10086")); }}Copy the code

Because of the Guava version, we have covered the parameters of the Bloom filter in great detail, so we will not repeat them here.

Give a [look], is the biggest support for IT elder brotherCopy the code