preface

Recently, a friend asked me this interview question:

I have a very large amount of data, let's say it's all ints. I give you a number, and you need to tell me whether it's in there, as efficiently as possible.

The requirement is clear: just determine whether a given value exists.

But there’s an important premise here: a lot of data.

Conventional implementation

Setting that condition aside for a moment, what is the first solution that comes to mind?

I think most people would first think of a HashMap to store the data, because both writes and lookups are efficient.

It offers APIs both for writing and for checking whether an element exists, so it's fairly easy to implement.

To verify this, I wrote a unit test that stores the data in a HashSet (which is backed by a HashMap internally), and pinned the heap size so the results can be compared later:

-Xms64m -Xmx64m -XX:+PrintHeapAtGC -XX:+HeapDumpOnOutOfMemoryError 

GC logging and a heap dump on OutOfMemoryError are enabled to make debugging easier.

    @Test
    public void hashMapTest() {
        long start = System.currentTimeMillis();

        Set<Integer> hashset = new HashSet<>(100);
        for (int i = 0; i < 100; i++) {
            hashset.add(i);
        }
        Assert.assertTrue(hashset.contains(1));
        Assert.assertTrue(hashset.contains(2));
        Assert.assertTrue(hashset.contains(3));

        long end = System.currentTimeMillis();
        System.out.println("Execution time:" + (end - start));
    }

Everything is fine when writing only 100 entries.

On that basis, try writing 10 million (1000W) entries:
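
Only the loop bounds in the test need to change; a minimal sketch of that variant (my reconstruction, under the same 64 MB heap as above):

    // Same test as hashMapTest, but with 10 million entries. With -Xmx64m
    // this is expected to fail with an OutOfMemoryError, since every boxed
    // Integer is stored on the heap.
    Set<Integer> hashset = new HashSet<>(10000000);
    for (int i = 0; i < 10000000; i++) {
        hashset.add(i);
    }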

Memory ran out immediately after execution.

We can’t use this method when memory is limited.

And that is the crux of the problem: if you want to determine accurately and efficiently whether a value exists in a set, this approach requires loading all of the data into memory first.

Bloom Filter

Based on the analysis above, the biggest obstacle in implementing this requirement is how to fit a large amount of data into memory.

Can we change our way of thinking? Since we only need to determine whether the data exists, and never need to retrieve it, there is no need to store the actual values.

Fortunately, a scientist thought about this need for us long ago.

Burton Howard Bloom proposed an algorithm called Bloom Filter in 1970.

It is mainly used to determine whether an element is in a set; its advantages are that it needs only a small amount of memory and offers high query efficiency.

So it’s perfect for this situation.

Bloom Filter principle

Here’s how it works.

Formally, it is a long binary vector combined with a series of hash functions.

It sounds convoluted, but it's easier to understand with a diagram.

As shown in the figure:

  • We first need to initialize a binary array of length L (8 in the figure), with all values initialized to 0.
  • When writing a value A1=1000, it is run through H hash functions (2 here). Much like a HashMap, each hash code is taken modulo L, which locates positions 0 and 2; the values there are set to 1.
  • Writing A2=2000 works the same way: the computed positions 4 and 7 are set to 1.
  • When querying B1=1000, the same two hash operations locate positions 0 and 2. Since both values are 1, B1=1000 is judged to exist in the set.
  • When querying B2=3000, the same applies. The first hash locates index=4, whose value is 1, so the second hash is performed; it locates index=5, whose value is 0, so B2=3000 is judged not to exist in the set.

That is the entire write and query process; it can be summarized as follows:

Hash the written data H times to locate its positions in the array, and set those positions to 1. When querying, locate positions in the array the same way; as soon as one of those positions is 0, the data definitely does not exist in the set; otherwise, it may exist in the set.

So the Bloom filter has the following characteristics:

  1. If it reports that a value does not exist, the value certainly does not exist.
  2. If it reports that a value exists, the value only probably exists.
  3. Data cannot be deleted from the filter.

The first point should be easy to understand; points two and three deserve a closer look.

Why can the filter report that a value exists when it doesn't? The reason is similar to hash collisions in a HashMap.

When a large amount of data is stored in an array of limited length, even the most perfect hash algorithm will have collisions, so two completely different values A and B may end up in the same positions.

So a query for B then produces a false positive.

The same collisions explain why deletion is unsafe: clearing B's bits may actually clear A's bits as well, so subsequent queries for A would wrongly report that it does not exist.

Because of these hash collisions, a Bloom Filter has a certain false positive rate, which depends on the number of hash functions H and the array length L.
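
For reference, the standard approximation from the literature lets you estimate that rate. A small helper of my own, using the well-known formula, with n inserted elements, array length L, and H hash functions:

    // Standard approximation of the Bloom filter false positive rate:
    // p ≈ (1 - e^(-H*n/L))^H
    static double falsePositiveRate(long n, long L, int H) {
        return Math.pow(1 - Math.exp(-(double) H * n / L), H);
    }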

Implement a Bloom filter yourself

The algorithm itself is simple and not hard to understand, so let's implement a basic prototype in Java.

public class BloomFilters {

    /** Array length */
    private int arraySize;

    /** Backing array: one int per position, 1 means "set" */
    private int[] array;

    public BloomFilters(int arraySize) {
        this.arraySize = arraySize;
        array = new int[arraySize];
    }

    /**
     * Write data
     * @param key
     */
    public void add(String key) {
        int first = hashcode_1(key);
        int second = hashcode_2(key);
        int third = hashcode_3(key);

        array[first % arraySize] = 1;
        array[second % arraySize] = 1;
        array[third % arraySize] = 1;
    }

    /**
     * Determine whether data exists
     * @param key
     * @return
     */
    public boolean check(String key) {
        int first = hashcode_1(key);
        int second = hashcode_2(key);
        int third = hashcode_3(key);

        if (array[first % arraySize] == 0) {
            return false;
        }

        if (array[second % arraySize] == 0) {
            return false;
        }

        if (array[third % arraySize] == 0) {
            return false;
        }

        return true;
    }

    /**
     * Hash algorithm 1 (times-33)
     * @param key
     * @return
     */
    private int hashcode_1(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); ++i) {
            hash = 33 * hash + key.charAt(i);
        }
        return Math.abs(hash);
    }

    /**
     * Hash algorithm 2 (FNV-1a with extra mixing)
     * @param data
     * @return
     */
    private int hashcode_2(String data) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < data.length(); i++) {
            hash = (hash ^ data.charAt(i)) * p;
        }
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return Math.abs(hash);
    }

    /**
     * Hash algorithm 3 (one-at-a-time)
     * @param key
     * @return
     */
    private int hashcode_3(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); ++i) {
            hash += key.charAt(i);
            hash += (hash << 10);
            hash ^= (hash >> 6);
        }
        hash += (hash << 3);
        hash ^= (hash >> 11);
        hash += (hash << 15);
        return Math.abs(hash);
    }
}

  1. We first initialize an int array.
  2. When writing data, compute three hashes and set the corresponding positions to 1.
  3. When querying, compute the same three hashes; as soon as one of the values is 0, the data is judged not to exist.

The implementation logic is exactly the same as described above.

Let's test it with the same JVM parameters:

-Xms64m -Xmx64m -XX:+PrintHeapAtGC
    @Test
    public void bloomFilterTest() {
        long start = System.currentTimeMillis();
        BloomFilters bloomFilters = new BloomFilters(10000000);
        for (int i = 0; i < 10000000; i++) {
            bloomFilters.add(i + "");
        }
        Assert.assertTrue(bloomFilters.check(1 + ""));
        Assert.assertTrue(bloomFilters.check(2 + ""));
        Assert.assertTrue(bloomFilters.check(3 + ""));
        Assert.assertTrue(bloomFilters.check(999999 + ""));
        Assert.assertFalse(bloomFilters.check(400230340 + ""));
        long end = System.currentTimeMillis();
        System.out.println("Execution time:" + (end - start));
    }

The execution result is as follows:

Writing 10 million entries and making accurate judgments took only about three seconds.


When the array size is reduced to 1 million, a false positive appears: the number 400230340 is not in the set, but it is reported as existing.

This demonstrates the Bloom Filter's false positive rate.

Increasing the array length and the number of hash computations reduces the false positive rate, but increases CPU and memory consumption accordingly; this trade-off has to be weighed against business needs.

Guava implementation

This implementation gets the job done and handles a large amount of data, but the GC logs show very frequent collections, with old-generation usage at 90%, close to the limit.

In general, memory utilization is not good.

In fact, this algorithm is also implemented in the Google Guava library, so let's take a look at an industry-grade implementation.

-Xms64m -Xmx64m -XX:+PrintHeapAtGC 

    @Test
    public void guavaTest() {
        long start = System.currentTimeMillis();
        BloomFilter<Integer> filter = BloomFilter.create(
                Funnels.integerFunnel(),
                10000000,
                0.01);

        for (int i = 0; i < 10000000; i++) {
            filter.put(i);
        }

        Assert.assertTrue(filter.mightContain(1));
        Assert.assertTrue(filter.mightContain(2));
        Assert.assertTrue(filter.mightContain(3));
        Assert.assertFalse(filter.mightContain(10000000));
        long end = System.currentTimeMillis();
        System.out.println("Execution time:" + (end - start));
    }

Writing the same 10 million entries executes without problems.

Looking at the GC log, there is not a single full GC, and old-generation usage is very low. This is clearly much better than the previous attempt, and even more data could be written.

Source code analysis

Let’s take a look at Guava and see how it works.

The two important parameters in the constructor are the expected number of elements to store and the acceptable false positive rate. My test demo uses 10 million and 0.01 respectively.

Based on those two parameters, Guava calculates the bit array size numBits and the required number of hash computations numHashFunctions for you.

The rules for this algorithm can be found in Wikipedia.
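
As a sketch of those rules: given n expected insertions and an acceptable false positive rate p, the standard formulas (which Guava's internal helpers also follow) are m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. In Java:

    // Hypothetical helpers mirroring the standard sizing formulas.
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

For 10 million elements at p = 0.01 this works out to roughly 9.6 bits per element (about 11 MB in total) and 7 hash functions, which explains the low memory footprint seen above.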

Put write function

The put function that actually stores the data works like this:

  • The input is run through the murmur3_128 hash function, producing a 128-bit byte[].
  • The lower and upper 8 bytes are taken as two 64-bit hash values (hash1 and hash2).
  • Then the combined hash is computed as many times as the number of hash functions determined at initialization.

bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);

In other words, it takes the hash modulo the bit array size to get an index, and sets that bit to 1.
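
Putting those steps together, a sketch of the write loop (mirroring the double-hashing scheme described above; hash1 and hash2 are the two 64-bit halves of the murmur3 output, the other names come from the snippet above):

    // numHashFunctions rounds of double hashing: each round derives a new
    // index from the running combinedHash, masks off the sign bit so the
    // modulus stays non-negative, and sets that bit.
    long combinedHash = hash1;
    boolean bitsChanged = false;
    for (int i = 0; i < numHashFunctions; i++) {
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
    }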

The focus is on the bits.set() method.

The set method belongs to BitArray, the underlying data structure that actually holds the data.

A long[] is used to store the bits.

So set() operates on this long[].

  • In set(), get() is first called to check whether the bit is already set; if it is, set() returns to indicate that nothing was written.
  • Otherwise, the bit is set with a bitwise OR assignment.
  • get() uses the same index calculation as set() and returns whether the bit at that position is set; a simplified sketch of this structure follows the list.
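
Here is the promised sketch: a simplified long[]-backed bit array with that set()/get() behavior (a model of the idea, not Guava's exact BitArray API):

    // One long holds 64 bits, so index >>> 6 picks the word and the low
    // 6 bits of the index pick the bit within it (Java masks long shift
    // distances to 6 bits automatically).
    class SimpleBitArray {
        private final long[] data;

        SimpleBitArray(long bits) {
            data = new long[(int) ((bits + 63) / 64)]; // round word count up
        }

        /** Sets the bit at index; returns true only if it was previously 0. */
        boolean set(long index) {
            if (get(index)) {
                return false; // already set, nothing changed
            }
            data[(int) (index >>> 6)] |= (1L << index);
            return true;
        }

        /** Returns whether the bit at index is 1. */
        boolean get(long index) {
            return (data[(int) (index >>> 6)] & (1L << index)) != 0;
        }
    }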

MightContain query function

The hashing and indexing steps are the same as in put; the difference is that the get() method is called to determine whether each bit is set.
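
A sketch of that query loop, mirroring the write loop shown earlier (same assumed names):

    // Recompute the same indexes as put; any 0 bit means "definitely absent".
    long combinedHash = hash1;
    for (int i = 0; i < numHashFunctions; i++) {
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
            return false;
        }
        combinedHash += hash2;
    }
    return true;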

conclusion

Bloom filters have many applications: databases, web crawlers, preventing cache penetration, and so on.

They are especially useful when you need to know for certain that a piece of data does not exist.

Working through this algorithm turned out to be quite interesting, and I plan to share more content like this in the future.

Share it if it helps you.

The sample code for this question is referenced here:

Github.com/crossoverJi…

Your likes and shares are the biggest support for me