Bloom filter

The two scenarios of massive data processing and cache penetration introduced me to The Bloom filter. I looked up some information to understand it, but a lot of existing information did not meet my needs, so I decided to summarize an article on the Bloom filter myself. Hopefully through this article more people will know about bloom filter and actually use it!

Here we will be divided into several aspects to introduce bloom filter:

What is a Bloom filter?
The principle of Bloom filter is introduced.
Application scenario of bloom filter.
Implement bloom filter manually through Java programming.
Use the Bloom filter that comes with Google’s open source Guava.
Bloom filter in Redis.

1. What is a Bloom filter?

First, we need to understand the concept of a Bloom filter.

The Bloom Filter was invented in 1970 by a fellow named Bloom. We can think of it as a data structure consisting of binary vectors (or arrays of bits) and a series of random mapping functions (hash functions). Compared with the commonly used List, Map, Set and other data structures, it takes up less space and is more efficient, but the disadvantage is that the results it returns are probabilistic rather than very accurate. Theoretically, the more elements you add to the set, the greater the possibility of false positives. Also, data stored in the Bloom filter is not easy to delete.

Each element in the bit array takes up only one bit, and each element can only be 0 or 1. Thus, applying a 100W bit array takes up only 1000000Bit / 8 = 125000 Byte = 125000/1024 KB ≈ 122KB.

Conclusion: A man named Bloom proposed a data structure to retrieve whether an element is in a given large set. This data structure is efficient and has good performance, but has some disadvantages of false recognition rate and deletion difficulty. And, theoretically, the more elements you add to the set, the greater the possibility of false positives.

2. The principle of Bloom filter is introduced

When an element is added to the Bloom filter, the following operations are performed:

The element value is computed using the hash function in the Bloom filter to obtain the hash value (there are several hash functions that yield several hash values).
Based on the resulting hash value, the value of the corresponding subscript in the bit array is set to 1.

When we need to determine whether an element exists in the Bloom filter, we do the following:

Perform the same hash on the given element again;
If the value is 1, then the value is in the Bloom filter. If there is a value that is not 1, then the element is not in the Bloom filter.

Here’s a simple example:

As shown in the figure, when a string store is to be added to a Bloom filter, the string is first hashed differently by multiple hash functions and then set to 1 in the element below the corresponding bit array (all positions are 0 when the bit array is initialized). When you store the same string a second time, it’s easy to know that the value already exists because the previous corresponding position was set to 1 (de-duplicating is handy).

If we need to determine whether a string is in bloom filter, only need the same hash for the given string again is calculated, after get the value judgment whether each element of an array of 1, if the value is 1, it means that the value in bloom filter, if there is a value not to 1, indicate that the element is not in bloom filter.

Different strings may hash to the same position, in which case we can increase the size of the bit array or adjust our hash function.

To sum up, we can conclude that there is a small probability that a Bloom filter will misjudge the presence of an element. The Bloom filter says that an element is not there, so that element must not be there.

3. Application scenario of bloom filter

Determine the existence of a given piece of data: for example, determine whether a number exists in a number set containing a large number of numbers (the number set is large, over 500 million!). , prevent cache penetration (determine whether the requested data is valid to avoid directly bypassing the cache request database), etc., spam filtering of mailbox, blacklist function, etc.
De-weight: for example, when climbing a given URL, the URL has been climbed to de-weight.

4. Manually implement bloom filter through Java programming

We have described the principle of Bloom filter above, after knowing the principle of Bloom filter, you can implement a manual.

If you want to implement one manually, you need to:

An appropriately sized bit array holds the data
A couple of different hash functions
Add elements to the array (Bloom filter) method implementation
An implementation of a method that determines whether a given element exists in a bit array (Bloom filter).

Here is a code that I think is fairly well written (refer to the improved code available on the web for all types of objects) :

import java.util.BitSet;

public class MyBloomFilter {

    /** * the size of the bit array */
    private static final int DEFAULT_SIZE = 2 << 24;
    /** * from this array you can create 6 different hash functions */
    private static final int[] SEEDS = new int[] {3.13.46.71.91.134};

    /** * bit array. The elements in the array can only be 0 or 1 */
    private BitSet bits = new BitSet(DEFAULT_SIZE);

    /** * an array of classes containing hash functions */
    private SimpleHash[] func = new SimpleHash[SEEDS.length];

    /** * Initializes an array of classes containing hash functions, each with a different hash function */
    public MyBloomFilter(a) {
        // Initialize multiple different Hash functions
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = newSimpleHash(DEFAULT_SIZE, SEEDS[i]); }}/** * add element into array */
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true); }}/** * Determines whether the specified element exists in the bitarray */
    public boolean contains(Object value) {
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /** * static inner class. For hash operations! * /
    public static class SimpleHash {

        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        /** * Computes the hash value */
        public int hash(Object value) {
            int h;
            return (value == null)?0 : Math.abs(seed * (cap - 1) & ((h = value.hashCode()) ^ (h >>> 16))); }}}Copy the code

Testing:

        String value1 = "https://xxx/";
        String value2 = "https://xxxsss";
        MyBloomFilter filter = new MyBloomFilter();
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));
        filter.add(value1);
        filter.add(value2);
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));
Copy the code

Output:

false
false
true
trueCopy to clipboardErrorCopied
Copy the code

Testing:

        Integer value1 = 13423;
        Integer value2 = 22131;
        MyBloomFilter filter = new MyBloomFilter();
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));
        filter.add(value1);
        filter.add(value2);
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));
Copy the code

Output:

false
false
true
Copy the code

5. Use the Bloom filter that comes with Google’s open source Guava

The purpose of self-implementation is mainly to make myself understand the principle of Bloom filter. The realization of Bloom filter in Guava is relatively authoritative, so we do not need to manually implement a bloom filter in the actual project.

First we need to introduce Guava’s dependencies into the project:

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>28.0-jre</version>
        </dependency>Copy to clipboardErrorCopied
Copy the code

The actual usage is as follows:

We created a Bloom filter that holds up to 1500 integers, and we can tolerate a 1% (0.01) probability of misjudgment.

        // Create a Bloom filter object
        BloomFilter<Integer> filter = BloomFilter.create(
                Funnels.integerFunnel(),
                1500.0.01);
        // Checks whether the specified element exists
        System.out.println(filter.mightContain(1));
        System.out.println(filter.mightContain(2));
        // Add the element to the Bloom filter
        filter.put(1);
        filter.put(2);
        System.out.println(filter.mightContain(1));
        System.out.println(filter.mightContain(2));
Copy the code

In our example, when the mightContain () method returns true, we can be 99% sure that the element is in the filter, and when the filter returns false, we can be 100% sure that the element is not in the filter.

Guava provides a good implementation of bloom filters (see the source code for more details), but it has a major drawback that it can only be used in a single machine (and capacity expansion is not easy), and the Internet is generally distributed. To solve this problem, we need to use the Bloom filter in Redis.

6. Bloom filter in Redis

6.1 introduction

With Module functionality introduced after Redis V4.0, Redis Modules allows Redis to extend its functionality with external Modules. The Bloom filter is the Module. For details, see Redis Modules: Redis. IO/Modules

In addition, the Redis bloom filter Module is recommended at github.com/RedisBloom/… .Others include:

Redis-lua-scaling -bloom-filter (Lua script implementation) : github.com/erikdubbelb…
PyreBloom (fast Redis Bloom filter in Python) : github.com/seomoz/pyre…
.

RedisBloom provides client support for multiple languages, including Python, Java, JavaScript, and PHP.

6.2 Installation using Docker

If we need to experience the Bloom filter in Redis is very simple, just use Docker! We directly searched The Docker Redis Bloomfilter on Google and found the answer we were looking for in the first result that excluded ads (this is my usual way to solve the problem, please share). The specific address is: Hub.docker.com/r/redislabs… (This is very detailed).

Specific operations are as follows:

➜  ~ docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest
➜  ~ docker exec -it redis-redisbloom bash
root@21396d02c252:/data# redis-cli
127.0.0.1:6379> Copy to clipboardErrorCopied
Copy the code

6.3 Common Commands

Note: key: the name of the Bloom filter, item: the element added.

BF.ADD : Adds an element to a Bloom filter, or creates it if it does not already exist. Format:BF.ADD {key} {item}.
BF.MADD : Adds one or more elements to the Bloom Filter and creates a filter that does not yet exist. The operation mode of this commandBF.ADDSame as this, except that it allows multiple inputs and returns multiple values. Format:BF.MADD {key} {item} [item ...] 。
支那BF.EXISTS** : Determines whether the element exists in the Bloom filter. Format:BF.EXISTS {key} {item}.
BF.MEXISTSDetermines whether one or more elements are formatted in the Bloom filter:BF.MEXISTS {key} {item} [item ...].

Also, the bf. RESERVE command needs a separate introduction:

The format of this command is as follows:

BF.RESERVE {key} {error_rate} {capacity} [EXPANSION EXPANSION].

The following describes the specific meanings of each parameter:

Key: indicates the name of the Bloom filter
Error_rate: indicates the expected probability of false positives. This should be a decimal value between 0 and 1. For example, for a desired false positive rate of 0.1% (1 in 1000), error_rate should be set to 0.001. The closer this number gets to zero, the more memory is consumed per project and the higher CPU usage per operation.
Capacity: indicates the capacity of the filter. After the actual number of stored elements exceeds this value, performance begins to degrade. The actual downgrade will depend on the extent to which the limit is exceeded. As the number of filter elements increases exponentially, performance decreases linearly.

Optional parameters:

Expansion: If a new subfilter is created, its size will be the size of the current filter timesexpansion. The default extended value is 2. This means that each subsequent subfilter will be twice as large as the previous one.

6.4 Actual Use

127.0.0.1:6379> bf. ADD myFilter Java (integer) 1 127.0.0.1:6379> bf. ADD myFilter javaguide (integer) 1 127.0.0.1:6379> Bf.exists myFilter Java (integer) 1 127.0.0.1:6379> bf.exists myFilter javaguide (integer) 1 127.0.0.1:6379> bf.exists myFilter github (integer) 0Copy the code