In the previous article, we introduced the concepts and some applications of hash tables and bitmaps. In this article, we will look at BitMap extensions: Bloom Filters.

concept

The Hash table actually provides a one-to-one mapping for every possible number, with each element having its own exclusive space, provided by the Hash function. The Hash table can even keep track of the number of occurrences of each element, which allows for more complex functionality. Our requirement is that each element in the set has its own space and that we can find a way to map to that space. For our problem, a Boolean is enough, or 1 bit is enough, and we just want to know if an element has ever been present. If you assign a bit to every possible value, that’s what a BitMap would do. However, when the amount of data reaches a certain level, the required storage space will be beyond the affordable range. For example, writing 64-bit data requires about 2 EXabytes of storage.

Bloom Filter was proposed by Bloom in 1970. Bloom filters can be used to retrieve whether an element is in a collection. Bloom filter is a probabilistic algorithm and data structure with high spatial efficiency. It is actually a long binary vector and a series of random mapping functions. BitMap maps each possible integer value by direct addressing, which is equivalent to using a hash function. The Bloen filter introduces K (k>1) mutually independent hash functions to ensure the completion of element weight determination in a given space and misjudgment rate.

Algorithm description

Collection representation queries with elements

Specifically, look at how the Bloom Filter represents collections as bits arrays. Initially, the Bloom Filter is an array of m bits, each set to 0.

Bloom Filter uses k independent Hash functions that map each element in the collection to {1,… ,m}. For any element x, the position of the ith hash function mapping hash_i(x) is set to 1 (1≤ I ≤k).

When an element is added to the set, the k hash function maps the element to k points in an array of bits, and sets all k points to 1. The following figure shows the Bloom filter at k=3.

The three positions of X, y and z in Bitmap are set to 1 by hash function mapping. When W appears, it means that W is in the set only when all three flag bits are 1. In the case shown in the figure, the Bloom filter determines that W is not in the set.

Error rate

Bloom Filter has a certain misjudgment rate. When judging whether an element belongs to a certain set, it is possible to misjudge the element that does not belong to this set as belonging to this set. Therefore, it is not suitable for those “zero misjudgment” applications. In application scenarios that can tolerate low misjudgments, bloom filter can greatly save storage space by changing zone with few misjudgments.

So what is the error of the Bloom filter? We assume that all hash functions hash evenly enough that the probability of falling into each Bitmap location is equal. The size of Bitmap is M, the size of original number set is N, and the number of hash functions is K:

  1. With k independent hash functions, the probability of a certain position being 0 in the Bitmap when receiving an element is:

  1. Assuming that all elements in the original set are not equal (the strictest case) and all elements are input into the Bloom filter, the probability that a position will still be 0 is:

The probability of a position being 1 is:


  1. When we judge the weight of an element, misjudgment means that the k flag bits corresponding to this element are not all 1, but all k flag bits are set to 1, and the misjudgment rate ε is about:

scenario

The greatest use of a Bloom filter is that it can quickly determine whether an element is in a set. So he has the following three usage scenarios:

  • Web crawler to the URL to avoid crawling the same URL address

  • Anti-spam, judging whether an email address is spam from a list of billions of spam messages (same with spam messages)

  • Cache breakdown, the existing cache into the Bloom filter, when the hacker access the cache does not exist to quickly return to avoid cache and DB down.

    In the cache system, when the VALUE corresponding to the KEY does not exist and a large number of concurrent requests are made to the KEY, it will cause great pressure to the back end. If the cache set is invalidated for a period of time, a large number of cache penetrations occur and all queries fall on the database, causing a cache avalanche.

The persistence layer is queried each time due to a cache miss. So you lose the point of caching. Here we only need to add a service of bloom algorithm. When the server inserts a key, set it in this service once. When you need to query the server, check whether the key exists on the back end to avoid pressure on the server.

Implementation and Application

Let’s take a look at BloomFilter implemented using Google.

Introduction of depend on

        <dependency>
            <groupId>com.google.guava</groupId>  
            <artifactId>guava</artifactId>  
        </dependency>
Copy the code

Finding an element

  private static int size = 1000000;

  private static BloomFilter<Integer> bloomFilter =
      BloomFilter.create(Funnels.integerFunnel(), size);

  @Test
  public void consumeTest(a) {
    for (int i = 0; i < size; i++) {
      bloomFilter.put(i);
    }
    long startTime = System.nanoTime(); // Get the start time

    // Check whether one of the million numbers contains 29999

    if (bloomFilter.mightContain(29999)) {
      System.out.println("Hit it.");
    }
    long endTime = System.nanoTime(); // Get the end time
    System.out.println("Program runtime:" + (endTime - startTime) + "纳秒");
  }
Copy the code

Use BloomFilter to find an element 29999, very fast.

Misjudgment rate

  private static int size = 1000000;

  private static BloomFilter<Integer> bloomFilter =
      BloomFilter.create(Funnels.integerFunnel(), size);

  @Test
  public void errorTest(a) {

    for (int i = 0; i < size; i++) {
      bloomFilter.put(i);
    }

    List<Integer> list = new ArrayList<>(1000);
    // Take 10000 values that are not in the filter and see how many are considered in the filter
    for (int i = size + 10000; i < size + 20000; i++) {
      if (bloomFilter.mightContain(i)) {
        list.add(i);
      }
    }
    System.out.println("The number of miscarriages of justice:" + list.size());
  }
Copy the code

As shown in the above code, we take 10,000 values that are not in the filter, but there are still 330 values that are thought to be in the filter, indicating a misjudgment rate of 0.03. That is, without setting anything, the default misjudgment rate is 0.03. The default constructor for BloomFilter is as follows:

    @CheckReturnValue
    public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
        return create(funnel, expectedInsertions, 0.03D);
    }

Copy the code

Of course, we can manually set the error rate by using the following constructor.

private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size,0.01);
Copy the code

The practical application

  static int sizeOfNumberSet = Integer.MAX_VALUE >> 4;

  static Random generator = new Random();

  @Test
  public void actualTest(a) {
    int error = 0;
    HashSet<Integer> hashSet = new HashSet<Integer>();
    BloomFilter<Integer> filter = BloomFilter.create(Funnels.integerFunnel(), sizeOfNumberSet);
    System.out.println(sizeOfNumberSet);
    for (int i = 0; i < sizeOfNumberSet; i++) {
      int number = generator.nextInt();
      if(filter.mightContain(number) ! = hashSet.contains(number)) { error++; } filter.put(number); hashSet.add(number); } System.out.println("Error count: "
            + error
            + ", error rate = "
            + String.format("%f", (float) error / (float) sizeOfNumberSet));
  }
Copy the code

The actual application of BloomFilter is similar to the above, which can be called by the Redis client for scenarios such as redis cache breakdown.

conclusion

This paper mainly introduces the concept of Bloom filter, algorithm description, error rate statistics and the implementation and application of Bloom filter. Bloom filter is an industrial implementation of BitMap, which solves the problem that when using BitMap, when the amount of data is large enough, the required storage space will be beyond the acceptable range.

Bloom filter is to introduce k(k>1) mutually independent hash functions to ensure the completion of element weight determination in a given space and misjudgment rate. Bloom filter has a concept of misjudgment rate, the lower the misjudgment rate, the longer the array, the larger the space. The higher the misjudgment rate, the smaller the array and the smaller the space occupied. Finally, we introduce how to use the BloomFilter and customize the misjudgment rate through the BloomFilter implemented by Google.

Bloom filters have huge advantages in both space and time over other data structures. Bloem filter storage space and insert/query time are constant (O(k)). A hash table can also be used to determine whether an element is in a set, but a Bloom filter can do the same thing with 1/8 or 1/4 of the spatial complexity of the hash table.

The disadvantages of bloom filters are other than the error rate, which increases as the number of elements stored increases. But if the number of elements is too small, the hash table is sufficient), and the element cannot be removed from the Bloom filter. It’s easy to think of turning an array of bits into an array of integers, incrementing the counter for each element that is inserted, and then subtracting the counter when an element is deleted. However, ensuring that elements are safely removed is not so simple. First we must make sure that the deleted element is actually inside the Bloom filter. That’s not guaranteed by the filter alone.

Finally, welcome to buy the author’s new book “Spring Cloud Advanced Microservices Architecture”.

Recommended reading

The Concept and Application from Hash table to BitMap (PART 1)

reference

  1. Massive Data Deduplication: Bitmaps and Bloom Filters
  2. Bloom Filter in detail