What is a Bloom filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It is essentially a long binary vector together with a series of random mapping functions, and it is used to determine whether an element is in a set.

We often run into business scenarios where we need to determine whether an element is in a collection. The common approach is to save all the elements of the collection and then compare against them; linked lists, trees, hash tables, and other data structures all work this way. However, as the number of elements in the collection increases, the storage space required grows linearly and eventually hits a bottleneck, and retrieval becomes slower and slower: the lookup time complexity of these three structures is O(n) for linked lists, O(log n) for trees, and O(1) for hash tables.

That’s where the Bloom Filter comes in.

Bloom filter principle

Before we look at the principles of Bloom filters, let’s review the principles of Hash functions.

The hash function

A hash function is a function that converts input data of arbitrary size into output data of a fixed size. The converted data is called a hash value or hash code. Here’s a schematic:

All hash functions have the following basic properties:

  • If two hash values are different (under the same function), then the original inputs of those two hashes are also different. This property is a consequence of the hash function being deterministic; a hash function with this property is called a one-way hash function.

  • The input and output of a hash function are not uniquely corresponding: if two hash values are the same, the two inputs may be the same, but they may also be different. This is called a “hash collision”.

However, when a hash table is used to store large amounts of data, its space efficiency is still low, and with only one hash function, hash collisions are also likely.
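
To get a concrete feel for a collision, consider Java’s own String.hashCode(): the two different strings "Aa" and "BB" happen to hash to the same value. This small demo is my own illustration, not from the original article:

public class HashCollisionDemo {
    public static void main(String[] args) {
        // Two different inputs produce the same hash code: a hash collision
        System.out.println("Aa".hashCode());  // 2112
        System.out.println("BB".hashCode());  // 2112
    }
}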

Bloom filter data structure

A Bloom filter is composed of a fixed-size binary vector (a bitmap) and a series of mapping functions.

In the initial state, for a bit array of length m, all its bits are set to 0, as shown below:

When a variable is added to the set, K mapping functions map it to K points in the bitmap and set them to 1 (the figure assumes two variables, each passed through three mapping functions).

So when we look up a variable, we only need to check whether all of these points are 1 to know, with high probability, whether it is in the set:

  • If any of these points is 0, the queried variable definitely does not exist;
  • If all of these points are 1, the queried variable is likely to exist.

Why is it possible, not certain? That’s because the mapping function itself is a hash function, and hash functions are subject to collisions.

Misjudgment (false positive) rate

The misjudgment of a Bloom filter arises because multiple inputs may hash to the same bit positions set to 1, so it is impossible to tell which input set them. In other words, errors are caused by the same bit being mapped and set to 1 multiple times.
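
For reference (this formula is standard Bloom filter theory, not part of the original text): with an m-bit array, k hash functions, and n inserted elements, the false positive probability is approximately p ≈ (1 − e^(−kn/m))^k, and it is minimized by choosing roughly k = (m/n) · ln 2 hash functions.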

This also creates a deletion problem for the Bloom filter: the bits of a Bloom filter are not exclusive to one element, and multiple elements may share a bit. If we simply clear that bit, other elements are affected. (Like No. 3 in the image above.)

features

  • An element judged to exist may not actually exist, but an element judged not to exist definitely does not exist.
  • A Bloom filter can add elements but cannot remove them, because removing elements would increase the misjudgment rate.

Steps to add and query elements

Add elements

  1. Pass the element to be added through the k hash functions
  2. Obtain the k corresponding positions in the bit array
  3. Set these k positions to 1

Query element

  1. Pass the element to be queried through the k hash functions
  2. Obtain the k corresponding positions in the bit array
  3. If any of the k positions is 0, the element is definitely not in the set
  4. If all k positions are 1, the element may be in the set

advantages

Bloom filters have huge advantages in both space and time over other data structures. The storage space and insert/query time of a Bloom filter are constant, O(k). In addition, the hash functions are independent of one another, which makes parallel hardware implementation convenient. Bloom filters also do not store the elements themselves, which is an advantage where confidentiality matters.

The Bloom filter can represent the whole set, and no other data structure can;

disadvantages

But the disadvantages of Bloom filters are as obvious as their advantages; misjudgment is one of them. As the number of stored elements increases, the misjudgment rate increases. (If the number of elements is small, an ordinary hash table is sufficient anyway.)

In addition, you generally cannot remove elements from a Bloom filter. It is easy to think of turning the bit array into an array of integer counters, incrementing a counter when an element is inserted and decrementing it when an element is deleted. However, safely removing elements is not that simple: we must first make sure the element really is in the Bloom filter, which the filter itself cannot guarantee, and counter wrap-around can also cause problems.
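
A minimal sketch of that counting idea, written here for illustration (it is not the article’s code): each bit becomes a small counter that is incremented on add and decremented on remove, with the safety caveats above still applying.

import java.util.List;
import java.util.function.ToIntFunction;

public class CountingBloomFilterSketch {

    private final int[] counters;
    private final List<ToIntFunction<String>> hashFunctions;

    public CountingBloomFilterSketch(int size, List<ToIntFunction<String>> hashFunctions) {
        this.counters = new int[size];
        this.hashFunctions = hashFunctions;
    }

    public void add(String value) {
        for (ToIntFunction<String> f : hashFunctions) {
            counters[index(f, value)]++;
        }
    }

    // Only safe if the caller is certain the value was actually added before
    public void remove(String value) {
        for (ToIntFunction<String> f : hashFunctions) {
            counters[index(f, value)]--;
        }
    }

    public boolean mightContain(String value) {
        for (ToIntFunction<String> f : hashFunctions) {
            if (counters[index(f, value)] == 0) {
                return false;
            }
        }
        return true;
    }

    private int index(ToIntFunction<String> f, String value) {
        // floorMod keeps the index non-negative even for negative hash values
        return Math.floorMod(f.applyAsInt(value), counters.length);
    }
}

Note that counting filters pay for deletability with several times the memory of a plain bit array, which is one reason the plain Bloom filter remains the default choice.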

Much work has been done to reduce the error rate, leading to the emergence of many variants of bloom filters.

Bloom filter usage scenarios and examples

In the world of programming, bloom filters are a great tool for programmers to quickly solve some of the more difficult problems in a project.

Examples include URL deduplication in web crawling, spam identification, detecting duplicate elements in large collections, and preventing cache penetration.

Typical applications of Bloom filters are:

  • Preventing database penetration. Google Bigtable, HBase, Cassandra, and PostgreSQL use a Bloom filter to reduce disk lookups for nonexistent rows or columns; avoiding these costly disk lookups greatly improves the performance of database query operations.

  • In business scenarios, determining whether a user has already seen a certain video or article, as in Douyin or Toutiao. This will of course produce some misjudgments, but users will not see repeated content.

  • Cache penetration and breakdown. Normally we check whether a key is in the cache: if it is, return the cached result directly; if not, query the database. When a wave of requests for cold (nonexistent) data arrives, it can punch through the cache and trigger an avalanche effect on the database. A Bloom filter can serve as an index in front of the cache: query the Bloom filter first, and only query the cache and database if the filter says the key may exist; if the Bloom filter says the key does not exist, return directly (see the sketch after this list).

  • A web interceptor blocking identical requests to prevent repeated attacks: on the first request, put the request parameters into the Bloom filter; on the second request, check whether the parameters are hit by the Bloom filter. This can also improve the cache hit ratio. The Squid web proxy cache server uses Bloom filters in cache digests, and Google Chrome uses a Bloom filter to speed up its Safe Browsing service.

  • The Venti document storage system also employs a Bloom filter to detect previously stored data.

  • The SPIN model detector also uses a Bloom filter to track the reachable state space when verifying problems on a large scale.
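
Here is the cache-penetration pattern from the list above as a minimal sketch. It uses Guava’s BloomFilter (introduced later in this article); the loadFromCache/loadFromDb methods and the sizing parameters are hypothetical placeholders.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class CachePenetrationGuard {

    private final BloomFilter<String> existingKeys =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    // Call this whenever a key is written to the cache/database
    public void recordKey(String key) {
        existingKeys.put(key);
    }

    public String get(String key) {
        // Definitely absent: return immediately, never touch the cache or the database
        if (!existingKeys.mightContain(key)) {
            return null;
        }
        String value = loadFromCache(key);   // hypothetical cache lookup
        if (value == null) {
            value = loadFromDb(key);         // hypothetical database lookup
        }
        return value;
    }

    private String loadFromCache(String key) { return null; } // placeholder
    private String loadFromDb(String key) { return null; }    // placeholder
}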

Coding~

Knowing the principle and usage scenarios of the Bloom filter, we can implement a simple Bloom filter ourselves.

Custom BloomFilter

import java.util.BitSet;

public class MyBloomFilter {

    /** Bit array size: about 1 billion bits (256 << 22 = 2^30) */
    private static final int DEFAULT_SIZE = 256 << 22;

    /** To reduce the error rate, use additive hashing with an array of 8 prime seeds */
    private static final int[] seeds = {3, 5, 7, 11, 13, 31, 37, 61};

    /** Equivalent to building 8 different hash functions */
    private static HashFunction[] functions = new HashFunction[seeds.length];

    /** Initialize the bitmap of the Bloom filter */
    private static BitSet bitset = new BitSet(DEFAULT_SIZE);

    /**
     * Add data
     * @param value the value to be added
     */
    public static void add(String value) {
        if (value != null) {
            for (HashFunction f : functions) {
                // Compute the hash value and set the corresponding bitmap position to true
                bitset.set(f.hash(value), true);
            }
        }
    }

    /**
     * Determine whether the corresponding element exists
     * @param value the element to check
     * @return the result
     */
    public static boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (HashFunction f : functions) {
            ret = bitset.get(f.hash(value));
            // If any hash function position is 0, break out of the loop
            if (!ret) {
                break;
            }
        }
        return ret;
    }

    /** Simulate checking whether a user is a member, whether a user is online, etc. */
    public static void main(String[] args) {

        for (int i = 0; i < seeds.length; i++) {
            functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
        }

        // Add 100 million entries
        for (int i = 0; i < 100000000; i++) {
            add(String.valueOf(i));
        }
        String id = "123456789";
        add(id);

        System.out.println(contains(id));                 // true
        System.out.println("" + contains("234567890"));   // false
    }
}

class HashFunction {

    private int size;
    private int seed;

    public HashFunction(int size, int seed) {
        this.size = size;
        this.seed = seed;
    }

    public int hash(String value) {
        int result = 0;
        int len = value.length();
        for (int i = 0; i < len; i++) {
            result = seed * result + value.charAt(i);
        }
        // size is a power of two, so (size - 1) & result maps the hash into [0, size)
        return (size - 1) & result;
    }
}

What? These have already been written for us, and reinventing the wheel is a waste of time? No, no, no. While learning, we can build the wheel ourselves: designing and implementing it concretely improves our programming ability, and in the process we will inevitably run into problems we had never thought about. That is how we grow.

When it came to an actual project, the leader told me the project had to run stably, so I gave up my own wheel without much confidence.

The Guava BloomFilter

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>23.0</version>
</dependency>
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class GuavaBloomFilterDemo {

    public static void main(String[] args) {
        // The last two parameters: the expected number of insertions and the allowed error rate
        BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), 100000, 0.01);
        for (int i = 0; i < 100000; i++) {
            bloomFilter.put(i);
        }
        System.out.println(bloomFilter.mightContain(1));
        System.out.println(bloomFilter.mightContain(2));
        System.out.println(bloomFilter.mightContain(3));
        System.out.println(bloomFilter.mightContain(100001));

        // bloomFilter.writeTo();
    }
}

In a distributed environment, the Bloom filter also needs to be a shareable resource, which is when we think of Redis. Yes, Redis also implements the Bloom filter.

Of course, we can also serialize the Bloom filter via bloomFilter.writeTo(), write it to a file, and upload it to object storage such as OSS or S3.
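
A minimal sketch of that serialization, assuming a local file; pushing the file to OSS/S3 afterwards would use the respective SDK, which is omitted here.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BloomFilterSerializationDemo {

    public static void main(String[] args) throws IOException {
        BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), 100000, 0.01);
        bloomFilter.put(1);

        // Serialize the filter to a local file
        try (FileOutputStream out = new FileOutputStream("bloom.bin")) {
            bloomFilter.writeTo(out);
        }

        // Later (or on another machine), read it back using the same funnel
        try (FileInputStream in = new FileInputStream("bloom.bin")) {
            BloomFilter<Integer> restored = BloomFilter.readFrom(in, Funnels.integerFunnel());
            System.out.println(restored.mightContain(1));  // true
        }
    }
}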

The BloomFilter in Redis

The bitmaps provided by Redis can be used to implement a Bloom filter, but you need to design the mapping functions and handle the details yourself, which is essentially no different from our custom implementation above.
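
For completeness, here is a rough sketch of that do-it-yourself approach, using the Jedis client (my choice for illustration; the article itself uses Redisson below) and the same additive hashing as MyBloomFilter. The key name and sizes are assumptions.

import redis.clients.jedis.Jedis;

public class RedisBitmapBloomSketch {

    private static final int SIZE = 1 << 28;               // bit array length (an assumption)
    private static final int[] SEEDS = {3, 5, 7, 11, 13};  // seeds for the hash functions

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);
    private final String key = "bloom:user";               // hypothetical key name

    public void add(String value) {
        for (int seed : SEEDS) {
            jedis.setbit(key, hash(value, seed), true);
        }
    }

    public boolean mightContain(String value) {
        for (int seed : SEEDS) {
            if (!jedis.getbit(key, hash(value, seed))) {
                return false;
            }
        }
        return true;
    }

    private long hash(String value, int seed) {
        int result = 0;
        for (int i = 0; i < value.length(); i++) {
            result = seed * result + value.charAt(i);
        }
        // SIZE is a power of two, so the mask keeps the offset in range and non-negative
        return (SIZE - 1) & result;
    }
}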

The Bloom filter was introduced as a Redis module starting with Redis 4.0: RedisBloom is loaded into Redis Server as a plug-in, giving Redis powerful Bloom-filter deduplication capabilities.

With Redis already installed, there are two ways to install RedisBloom:

Directly compile and install

git clone https://github.com/RedisBloom/RedisBloom.git
cd RedisBloom
make    # compile; generates a rebloom.so file
redis-server --loadmodule /path/to/rebloom.so

Use Docker to install

docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest
docker exec -it redis-redisbloom bash
redis-cli

use

Basic instructions for Bloom filter:

  • bf.add adds an element to the Bloom filter
  • bf.exists determines whether an element is in the Bloom filter
  • bf.madd adds multiple elements to the Bloom filter at once, while bf.add can only add one
  • bf.mexists determines whether multiple elements are in the Bloom filter
127.0.0.1:6379> bf.add user Tom
(integer) 1
127.0.0.1:6379> bf.add user John
(integer) 1
127.0.0.1:6379> bf.exists user Tom
(integer) 1
127.0.0.1:6379> bf.exists user Linda
(integer) 0
127.0.0.1:6379> bf.madd user Barry Jerry Mars
1) (integer) 1
2) (integer) 1
3) (integer) 1
127.0.0.1:6379> bf.mexists user Barry Linda
1) (integer) 1
2) (integer) 0

With only this little data there are no misjudgments yet; as the number of elements increases, some misjudgments will appear. I will not run that experiment here.

The Bloom filter used above has only the default parameters; it was created automatically the first time we called bf.add.

Redis also supports creating a Bloom filter with custom parameters: bf.reserve {key} {error_rate} {initial_size}

  • error_rate: the misjudgment rate allowed for the Bloom filter. The lower the value, the larger the filter’s bit array and the more space it occupies.
  • initial_size: the number of elements the Bloom filter is expected to store. When the actual number of stored elements exceeds this value, the accuracy of the filter decreases.

Note that bf.reserve must be called explicitly before the first add; if the corresponding key already exists, bf.reserve reports an error:

127.0.0.1:6379> bf.reserve user 0.01 100
(error) ERR item exists
127.0.0.1:6379> bf.reserve topic 0.01 1000
OK

As a Javaer, I of course want to use Java. There are many Java Redis clients, and some do not provide a command-extension mechanism; as far as I know, Redisson and Lettuce can use the Bloom filter, so let’s use Redisson.

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonBloomFilterDemo {

    public static void main(String[] args) {

        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        RBloomFilter<String> bloomFilter = redisson.getBloomFilter("user");
        // Initialize the Bloom filter: expected number of elements 55000000, expected error rate 0.03
        bloomFilter.tryInit(55000000L, 0.03);
        bloomFilter.add("Tom");
        bloomFilter.add("Jack");
        System.out.println(bloomFilter.count());            // 2
        System.out.println(bloomFilter.contains("Tom"));    // true
        System.out.println(bloomFilter.contains("Linda"));  // false
    }
}

extension

To solve the problem that the Bloom filter cannot delete elements, the cuckoo filter was born. The authors of the paper Cuckoo Filter: Practically Better Than Bloom make an in-depth comparison between the cuckoo filter and the Bloom filter. Compared with the cuckoo filter, the Bloom filter has the following disadvantages: poorer query performance, lower space utilization, no reverse operation (delete), and no counting support.

Since it is less commonly used, I will not go into it in depth here.

This article is continuously updated. You can search “JavaKeeper” on WeChat to read it as soon as it is published and receive 500+ e-books and 30+ video tutorials with source code, no strings attached. It has also been collected on GitHub at github.com/JavaKeeper, a map of the essential skills for Java development and interviews; there is something there you want.

References and thanks

www.cs.cmu.edu/~dga/papers…

www.justdojava.com/2019/10/22/…

www.cnblogs.com/cpselvis/p/…

juejin.cn/post/684490…