Introduction to the

Bloom Filter is a probabilistic algorithm proposed by Burton Howard Bloom in 1970 to judge whether an element exists in a set. If Bloom Filter determines that the element is not in the set, it must not be in the set. If the judgment result is in the set, it may misjudge, that is, the element may not be in the set. In other words, misjudgments can be made, but they can never be missed. Elements can be added to a collection, but not removed (although this can be done with a “count” filter); The more elements you add to the set, the greater the chance of miscalculation. The Bloom Filter maps elements into arrays, so it saves a lot of space.

Algorithm description

Usually we judge whether an element in the collection, the set can be stored in a list data structure, such as by comparison to find way to determine whether contain the element, but when the number of elements is more and more big, list the memory space will be more and more big, it can be stored in the disk space, but it will significantly reduce the query time. The real data is stored in disk space, memory is stored in the hash table, and the element is determined by whether the bit of the element hash map is 1. But when hash element increases, the probability of conflict is also increasing, the conflict will reduce the efficiency of the query, in order to reduce conflict, can be solved by multiple hash function, when an element through multiple hash function mapping is to 1, then the judgment may elements in the collection, must not in the element or elements. The Bloom Filter uses this approach.

Initialize the

Bloom Filter is an array mapped by different hash functions. The initialization is a bit array, and all elements of the array are set to 0, indicating that there are no elements in the set at present. The initialization length of the array will be introduced below.

Insert elements

The inserted element needs k hash functions to get k positions in the corresponding bit array. Set all the bits in these positions to 1. If they are already set to 1, you can leave them unchanged.

Figure 1 sets the corresponding bits in the bit array of x, y and z in the set obtained by the three hash functions to 1.

Determines whether an element is in a collection

The positions of the elements to be judged on the corresponding bit array are also obtained through k hash functions, and the positions are determined whether all the bits are 1. If so, the elements may exist in the set; otherwise, they certainly do not exist. If one of the bits of the hash function of w in the figure above is not 1, then W must not exist in the set.

Parameter selection

Bloom Filter has four main parameters:

  • KKK: Number of hash functions
  • MMM: Size of the bit array
  • NNN: The number of elements in the set
  • PPP: false positive probability, that is, the probability of misjudgment that elements not in the set are judged to be in the set

The number of hash functions KKK must be a positive integer. If this constraint is not considered, then for the given MMM and NNN or PPP, the probability of misjudgment is minimal when KKK satisfies the following formula:


k = l n p l n 2 = m n l n 2 k = -\frac{lnp}{ln2} = \frac{m}{n}ln2

The best value of MMM can be obtained from NNN and PPP:


m = n l n p ( l n 2 ) 2 m = -\frac{nlnp}{(ln2)^2}

So the best bits for each element are:


m n = l n p ( l n 2 ) 2 material 1.44 l o g 2 p \ frac {m} {n} = – \ frac {LNP} {(ln2) ^ 2} \ approx 1.44 log_2p

implementation

Simple implementation of Bloom Filter, see: github.com/MagnusS/Jav…

import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

public class BloomFilter<E> implements Serializable {
    int k;          // Number of Hans functions
    int m;          / / bit of digits
    int n;          // Expected storage capacity
    int size;       // Actual storage capacity
    BitSet bitSet;
    static final Charset CHARSET = Charset.forName("utf-8");
    static final String hashName = "MD5";
    static final MessageDigest MESSAGE_DIGEST;
    static {
        MessageDigest tmp;
        try {
            tmp = MessageDigest.getInstance(hashName);
        }catch (NoSuchAlgorithmException e){
            tmp = null;
        }
        MESSAGE_DIGEST = tmp;
    }
    /** * Calculate the size of the bit array based on the probability and expected number of false positives *@paramFalsePositiveProbability falsePositiveProbability *@paramN Number of expected sets *@return* /
    private static int calculateM(double falsePositiveProbability, int n){
        return (int)Math.ceil(-1.44*Math.log(falsePositiveProbability)/Math.log(2.0) * n);
    }

    /** * Calculate the number of hash functions based on the probability of false positives and the expected number *@paramFalsePositiveProbability falsePositiveProbability *@paramN Number of expected sets *@return* /
    private static int calculateK(double falsePositiveProbability, int n){
        return (int) Math.round(calculateM(falsePositiveProbability, n)/ (double)n * Math.log(2.0));
    }

    /** * Count the number of hash functions based on the size and expected number of bits *@param m
     * @param n
     * @return* /
    private static int calculateK(int m, int n){
        return (int) Math.round(m/(double)n * Math.log(2.0));
    }

    public BloomFilter(double falsePositiveProbability, int n){
        this(calculateK(falsePositiveProbability, n), calculateM(falsePositiveProbability, n), n);
    }

    public BloomFilter(int m, int n){
        this(calculateK(m, n), m, n);
    }

    public BloomFilter(int k, int m, int n){
        this.k = k;
        this.m = m;
        this.n = n;
        bitSet = new BitSet(this.m);
        size = 0;
        System.out.println("bloom filter init success, K: " + k + " m: " + m + "N." + n);
    }

    /** * gets the corresponding hash *@param data
     * @param hashNum
     * @return* /
    public static int[] getHashs(byte[] data, int hashNum){
        int[] result = new int[hashNum];

        int k = 0;
        byte salt = 0;
        while (k < hashNum){
            byte[] digest;
            synchronized (MESSAGE_DIGEST){
                MESSAGE_DIGEST.update(salt);
                salt++;
                digest = MESSAGE_DIGEST.digest(data);
            }

            for(int i = 0; i < digest.length/4 && k < hashNum; i++){
                int h = 0;
                for(int j = i*4; j < (i*4) + 4; j++){ h <<=8;
                    h |= ((int) digest[j] & 0xFF); } result[k] = h; k++; }}return result;

    }

    /** * Add element to filter *@param element
     */
    public void add(E element){
        int[] hashCodes = getHashs(element.toString().getBytes(CHARSET), this.k);
        for(int hashCode: hashCodes){
            bitSet.set(Math.abs(hashCode % this.m), true); size ++; }}public void addAll(Collection<E> collection){
        for(E element: collection){ add(element); }}/** * determines whether the element is in the collection *@param element
     * @return* /
    public boolean contains(E element){
        int[] hashCodes = getHashs(element.toString().getBytes(CHARSET), this.k);
        for(int hashCode: hashCodes){
            if(! bitSet.get(Math.abs(hashCode % this.m))){
                return false; }}return true;
    }

    public boolean containsAll(Collection<E> collection){
        for(E element: collection){
            if(! contains(element)){
                return false; }}return true; }}Copy the code

application

  • Url deduplication in a crawler
  • Bloom Filter is used in Hbase to quickly determine whether the queried data is in the HFile, speeding up the query speed