Preface:

In recent projects, requests need to be de-duplicated, and two solutions come to mind:

  • Database: the cost is higher, the amount of data, but also sub – library sub – table, query efficiency is low
  • Bloom filter: simple to use, low cost, only need to occupy a small memory space, and can set the expiration time; Suitable for reprocessing under large data volume

Bloom Filter concept

Bloom Filter (English: Bloom Filter) was proposed by Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions. Bloom filters can be used to retrieve whether an element is in a collection. Its advantage is that its space efficiency and query time are far more than the general algorithm, but its disadvantage is that it has a certain false identification rate and deletion difficulty

Principle:

When an element is added to the collection, the element is mapped by K hash functions to K points in an array of bits, setting them to 1. When we search, we know (roughly) if it is in the set by looking at whether all of the points are 1’s: if any of the points have 0’s, the element must not be there; If both are 1, the element being checked is most likely in. That’s the basic idea of a Bloom filter

From the above concepts, two basic conditions are required to implement a Bloom filter:

  1. Binary vectors (bits) : Bits introduced into Redis
  2. Suitable hash functions: Refer to Guava’s implementation

The bits in the Redis

Bit is the smallest unit in a computer and has a value of 0 or 1. In Redis, set the value of the specified offset to bit 1, indicating that the position has been occupied

Note: offset must be greater than or equal to 0 and less than 2^32 (the bit mapping is limited to 512 MB).

API:

  1. SETBIT key offset value: Sets the specified offset value for the key
  2. GETBIT key offset: obtains the corresponding value of the key offset

Hash function:

  1. The 64-bit hash mapping function implemented by Guava is used to hash object keys into binary byte arrays
  2. Then hash the lower and higher octets of the byte array to obtain two hash values
  3. Using hash value, modulo is twice taken to obtain offset, and the bit of Redis is set to 1

At this point, a simplified version of Bloom filtering is implemented, as shown in the following code.

Add elements

Byte [] bytes=Hashing. Murmur3_128 (). HashString (key, charset.forname ();"UTF-8")).asBytes();
        //hashFunction 1: low 8-bit longhash1 = lowereight(bytes);
        //hash2 function: high 8-bit longshash2 = uppereight(bytes);

        long combinedHash = hash1; / / 2 timeshash
        for(int i = 0; i < 2; I++) {/ / tohashLong offeset = (combinedHash & long.max_value) % integer.max_value; // Check whether the specified position already has a valueif (RedisClient.getbit(name,(int) offeset)) {
                return false; } // Set the value to 1 redisclient. setbit(name,(int) offeset); combinedHash +=hash2;
        }
        return true;
    }
Copy the code
Private Long lowereight(byte[] bytes) {private long lowereight(byte[] bytes) {returnLongs.fromByteArray(bytes); Private long uppereight(byte[] bytes) {private long uppereight(byte[] bytes) {return Longs.fromBytes(
                bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
Copy the code

Determine whether an element exists:

As with the set procedure, the element exists only twice in one position of the hash value (value 1)

public boolean contain(String key) {
        byte[] bytes=Hashing.murmur3_128().hashString(key,Charset.forName("UTF-8")).asBytes();
        long hash1 = lowereight(bytes);
        long hash2 = uppereight(bytes);
        long combinedHash = hash1;
        for (int i = 0; i < 2; i++) {
            long offeset = (combinedHash & Long.MAX_VALUE) % Integer.MAX_VALUE;
            if (RedisClient.getbit(name,(int) offeset)) {
                return true;
            }
            combinedHash += hash2;
        }
        return false;
    }
Copy the code

Test results:

100,000 records, 13 miscarriages of justice

@Test
    public void test3(){
        int count=0;
        SBloomFilter bloomFilter=SBloomFilter.getSBloomFilter("selrain");
        for(int i=0; i<100000; i++){ String key="test"+i;
            if(bloomFilter.contain(key)){
                count++;
            }else {
                bloomFilter.add(key);
            }
        }
        System.out.println(count);
    }
Copy the code

Think about:

  1. The general Blonding rate will have two input parameters: expected insert times (number) and false positive probability. The specific hash times are calculated by these two values. I fixed hash twice here. Redis has a function that initializes the size of the array of bits
  2. I found that the running data occupied 200M+ memory, which was still very high. Finally, I used the scheme implemented by Redisson (because the memory occupied was smaller, about 10M+). I wonder what optimization was made? If you’re interested, check it out

Reference:

Segmentfault.com/a/119000001…

Finally:

Welcome to follow my public account and share your thoughts on programming, investment and life from time to time 🙂