Moment For Technology

【 Practical questions 】-- Three practices of Bloom filter: handwriting, Redission and Guava (2)

Posted on Aug. 8, 2022, 6:39 p.m. by 崔淑華
Category: The back-end Tag: The back-end

Before we have talked about the principle of the Bloom filter [actual combat problem] - cache penetration of the Bloom filter (1), all understand that this is the operation, then generally we use the bloom filter, how to use it? If you do it yourself, how do you do it?

[TOC]

Bloom filter

Say the definition again:

The Bloom Filter, proposed by Burton Howard Bloom in 1970, is actually a long binary vector and a series of random hash mapping functions (in plain English, binary arrays store data features).

Take the following example: if there are three hash functions, the "old six" will be hashed separately by three hash functions, and mod the length of the bit array to three positions.

If you still don't understand how this works, check out my last article.

Handwritten Bloom filter

So when we write a Bloom filter, we first need a bit array, and in Java we have a bit array wrapped, BitSet.

A brief introduction to bitsets, or bitmaps, which use compact storage space to represent large bits of data. When used, we can specify the size directly, which is equivalent to creating an array of bits of the specified size.

BitSet bits = new BitSet(size);
Copy the code

BitSet also provides a number of apis, including basic operations:

  • Empties the bit array of data
  • Flip a bit of data
  • Sets the data for a certain bit
  • Get data for a certain bit
  • Get the currentbitSetNumber of bits

Here are the points to consider when writing a simple Bloom filter:

  • The size of the bit array needs to be specified. Otherwise, the larger the bit array is,hashThe less likely there is to be conflict.
  • multiplehashFunction, we need to usehashArray to store,hashHow do you set up the function? To avoid conflict, we should use several different primes as seeds.
  • Method: the main implementation of two methods, one to add elements to the Bloom filter, the other is to determine whether the bloom filter contains an element.

The following is a concrete implementation, which is only a simple simulation and cannot be used in production environment. The hash function is relatively simple, mainly using hash to perform xor for high and low bits, then multiply the seed, and then take the remainder of the size of the bit array:

import java.util.BitSet;

public class MyBloomFilter {

    // Default size
    private static final int DEFAULT_SIZE = Integer.MAX_VALUE;

    // The smallest size
    private static final int MIN_SIZE = 1000;

    // The size is the default
    private int SIZE = DEFAULT_SIZE;

    // Seed of the hash function
    private static final int[] HASH_SEEDS = new int[] {3.5.7.11.13.17.19.23.29.31};

    // An array of bits, 0/1, representing features
    private BitSet bitSet = null;

    / / the hash function
    private HashFunction[] hashFunctions = new HashFunction[HASH_SEEDS.length];

    // No parameter initialization
    public MyBloomFilter(a) {
        // Use the default size
        init();
    }

    // Initialize with parameters
    public MyBloomFilter(int size) {
        // The size is initialized to be less than the minimum size
        if (size = MIN_SIZE) {
            SIZE = size;
        }
        init();
    }

    private void init(a) {
        // Initialize bit size
        bitSet = new BitSet(SIZE);
        // Initialize the hash function
        for (int i = 0; i  HASH_SEEDS.length; i++) {
            hashFunctions[i] = newHashFunction(SIZE, HASH_SEEDS[i]); }}// Add an element to an array
    public void add(Object value) {
        for (HashFunction f : hashFunctions) {
            // Set the hash position to true
            bitSet.set(f.hash(value), true); }}// Determine if the element's characteristics exist in the bitarray
    public boolean contains(Object value) {
        boolean result = true;
        for (HashFunction f : hashFunctions) {
            result = result  bitSet.get(f.hash(value));
            // If one of the hash functions evaluates to false, it returns
            if(! result) {returnresult; }}return result;
    }

    / / the hash function
    public static class HashFunction {
        // Bit array size
        private int size;
        / / hash seeds
        private int seed;

        public HashFunction(int size, int seed) {
            this.size = size;
            this.seed = seed;
        }

        / / the hash function
        public int hash(Object value) {
            if (value == null) {
                return 0;
            } else {
                / / hash value
                int hash1 = value.hashCode();
                // High hash value
                int hash2 = hash1  16;
                // Merge hash values (equivalent to combining high and low features)
                int combine = hash1 ^ hash1;
                // Multiply the remainder
                returnMath.abs(combine * seed) % size; }}}public static void main(String[] args) {
        Integer num1 = new Integer(12321);
        Integer num2 = new Integer(12345);
        MyBloomFilter myBloomFilter =newMyBloomFilter(); System.out.println(myBloomFilter.contains(num1)); System.out.println(myBloomFilter.contains(num2)); myBloomFilter.add(num1); myBloomFilter.add(num2); System.out.println(myBloomFilter.contains(num1)); System.out.println(myBloomFilter.contains(num2)); }}Copy the code

Operating results, in line with expectations:

false
false
true
true
Copy the code

However, this approach does not support the expected error rate, but can specify the size of the bit array.

Of course, we can also provide the amount of data and the expected approximate error rate for initialization. The approximate initialization code is as follows:

    // Initialize with parameters
    public BloomFilter(int num,double rate) {
        // Calculate the size of the bit array
        this.size = (int) (-num * Math.log(rate) / Math.pow(Math.log(2), 2));
        // Number of hsah functions
        this.hashSize = (int) (this.size * Math.log(2) / num);
        // Initializes the bit array
        this.bitSet = new BitSet(size);
    }
Copy the code

Redis implementation

Usually we can choose to use Redis features in bloom filter, why? Because Redis has bitset-like instructions, such as setting the value of the bit array:

setbit key offset value
Copy the code

The key is the key, offset is the offset, and value is either 1 or 0. For example, the following example sets key1's position 7 to 1.

To obtain a digit, use the following command:

gitbit key offset
Copy the code

We can implement a good Bloom filter with the help of Redis, but we don't need to write it ourselves. The Redisson client already has a good implementation. To build a project using Maven, you first need to guide the package to POM.xml:

    dependencies
        dependency
            groupIdorg.redisson/groupId
            artifactIdredisson/artifactId
            version3.11.2/version
        /dependency
    /dependencies
Copy the code

The code is as follows, I use docker, remember to set the password when starting, change the password when running does not work:

docker run -d --name redis -p 6379:6379 redis --requirepass "password"
Copy the code

The code is as follows, first need to connect to Redis, and then create redission, use redission to create bloom filter, directly use it. (You can specify the expected number and expected miscarriage rate)

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class BloomFilterTest {
    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://localhost:6379");
        config.useSingleServer().setPassword("password");
        // create a redis connection
        RedissonClient redisson = Redisson.create(config);

        RBloomFilterString bloomFilter = redisson.getBloomFilter("myBloomFilter");
        // Initialize with an expected number of elements of 100000000 and an expected error rate of 4%
        bloomFilter.tryInit(100000000.0.04);
        // Insert the number 10086 into the Bloom filter
        bloomFilter.add("12345");

        System.out.println(bloomFilter.contains("123456"));//false
        System.out.println(bloomFilter.contains("12345"));//true}}Copy the code

The results are as follows: It is worth noting that this is a single machineredisIf it isredisClusters do things a little differently.

Google GUAVA implementation

Bloem filters are also available in the Guava package provided by Google, introducing POM files:

        dependency
            groupIdcom.google.guava/groupId
            artifactIdguava/artifactId
            version18.0/version
        /dependency
Copy the code

The specific code to implement the call is as follows, and you can also specify the specific amount of storage and the expected error rate:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class GuavaBloomFilter {
    public static void main(String[] args) {
        BloomFilterString bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(Charsets.UTF_8),1000000.0.04);

        bloomFilter.put("Sam");

        System.out.println(bloomFilter.mightContain("Jane"));
        System.out.println(bloomFilter.mightContain("Sam")); }}Copy the code

The results are as follows and are in line with expectations

The above three are handwriting, redis, Guava practice bloom filter, just a simple usage, in fact, the implementation of Redis and Guava can also see, interested can understand, I first set a Flag.

About the author

Qin Huai, author of public number [Qin Huai Grocery store], the road of technology is not at that time, the mountain is high and the water is long, even if slow, and not stop. Personal Writing Direction: Java source code analysis, JDBC, Mybatis, Spring, Redis, distributed, sword Offer, LeetCode, etc., carefully write each article, do not like the title party, do not like the flowery, mostly write a series of articles, can not guarantee that I write are completely correct, But I guarantee that what I write is done through practice or research. We hope to correct any omissions or mistakes.

What did I write about 2020?

Open Source Programming Notes

Search
About
mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.