Why do you need a Bloom filter

Imagine how you would handle the following scenario:

  1. Checking whether a mobile phone number has already been registered
  2. Checking whether a user has already taken part in a flash-sale (seckill) activity
  3. Forged requests query a large number of records whose IDs do not exist, so the cache never hits. How do you avoid cache penetration?

For the problems above, the conventional approach is to query the database directly and let the database shoulder the load. If the pressure is not high, this works fine: keep it simple.

An improvement: maintain the set of elements in a list/set/tree and check membership there, but the time or space cost will be relatively high. For microservices, the list/set data structures in Redis can be used, yet once the data scale gets very large, the memory requirements of this solution become steep.

These scenarios have one thing in common, which abstracts to the question: how do you efficiently determine whether an element is NOT in a set? Is there a better solution that is cheap in both time and space?

There is: the Bloom filter.

What is a Bloom filter

The Bloom filter was proposed by Bloom in 1970. It is essentially a long binary vector plus a series of random mapping functions. A Bloom filter can be used to test whether an element is in a set; its advantage is that both its space efficiency and query time far exceed those of ordinary algorithms.

How it works

The principle of a Bloom filter is this: when an element is added to the set, k hash functions map it to k points in a bit array, and those bits are set to 1. On lookup, we check whether all of those points are 1 to know (roughly) whether the element is in the set: if any of the points is 0, the element is definitely not there; if all of them are 1, the element is most likely there. That is the basic idea of a Bloom filter.

In simple terms: prepare a bit array of length m initialized to all zeros, hash the element with k hash functions, take each result modulo m to get k positions, and set those positions in the array to 1.
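The steps above can be sketched in a few lines of Go. This is a minimal in-memory toy, not go-zero's implementation (which, as covered below, uses MurmurHash3 and a Redis bitmap); here FNV-1a salted with the round index stands in for the k hash functions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a toy in-memory Bloom filter: m bits, k salted hashes.
type bloom struct {
	bits []byte
	m    uint64 // bit array length
	k    uint64 // number of hash rounds
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]byte, (m+7)/8), m: m, k: k}
}

// positions hashes data k times, salting FNV-1a with the round
// index, and maps each digest into [0, m) by taking it modulo m.
func (b *bloom) positions(data []byte) []uint64 {
	out := make([]uint64, 0, b.k)
	for i := uint64(0); i < b.k; i++ {
		h := fnv.New64a()
		h.Write(data)
		h.Write([]byte{byte(i)}) // per-round salt
		out = append(out, h.Sum64()%b.m)
	}
	return out
}

// Add sets the k positions for data to 1.
func (b *bloom) Add(data []byte) {
	for _, p := range b.positions(data) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// Exists reports false if data is definitely absent,
// true if it is probably present.
func (b *bloom) Exists(data []byte) bool {
	for _, p := range b.positions(data) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	f.Add([]byte("13800138000"))
	fmt.Println(f.Exists([]byte("13800138000"))) // true
	fmt.Println(f.Exists([]byte("13900000000"))) // false, with overwhelming probability
}
```

Note the asymmetry in `Exists`: a `false` answer is certain, a `true` answer is only probable, which is exactly the trade-off discussed next.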

Advantages and disadvantages of Bloom filters

Advantages:

  1. The space footprint is tiny, because the data itself is not stored, only indicated by bits, which also provides a degree of obfuscation.
  2. Insert and query both run in O(k) time, a constant, where k is the number of hash function executions.
  3. The hash functions can be independent of one another, so computation can be accelerated at the hardware instruction level.

Disadvantages:

  1. Error (false positive rate).
  2. Unable to delete.

Error (false positive rate)

A Bloom filter can determine with 100% certainty that an element is NOT in the set, but it can misjudge an element as present, because the k points produced by the hash functions can collide once there are too many elements. Wikipedia has a mathematical derivation of the false positive rate (see the link at the end of this article). Assume:

  • Bit array length m
  • Number of hash functions k
  • Expected number of elements n
  • Expected error rate ε

To choose suitable values when creating a Bloom filter, we can derive the optimal m and k from the expected number of elements n and the target error rate ε: m = -n·ln ε / (ln 2)², and k = (m / n)·ln 2.

In Java, Guava and Redisson use this algorithm to estimate the optimal m and k for a Bloom filter:

// Count the number of hashes
@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

// Calculate the length of the bit array
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
        p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
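For intuition, here are the same two formulas ported to Go (a standalone sketch, not go-zero or Guava code), sizing a filter for one million elements at a 0.01% error rate:

```go
package main

import (
	"fmt"
	"math"
)

// optimalBits: m = -n*ln(p) / (ln 2)^2
func optimalBits(n uint64, p float64) uint64 {
	if p == 0 {
		p = math.SmallestNonzeroFloat64
	}
	return uint64(-float64(n) * math.Log(p) / (math.Log(2) * math.Log(2)))
}

// optimalHashes: k = round(m/n * ln 2), at least 1
func optimalHashes(n, m uint64) uint64 {
	k := uint64(math.Round(float64(m) / float64(n) * math.Log(2)))
	if k < 1 {
		k = 1
	}
	return k
}

func main() {
	n := uint64(1_000_000) // expected number of elements
	p := 0.0001            // target false positive rate
	m := optimalBits(n, p)
	k := optimalHashes(n, m)
	// roughly 19.2 million bits (~2.3 MB) and 13 hash functions
	fmt.Println(m, k)
}
```

A million elements for a couple of megabytes is exactly the space win over a Redis set holding the raw values.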

Unable to delete

Some k points in the bit array are reused by multiple elements. If we set all k points of one element to zero, the other elements will be directly affected. This causes us to be unable to handle scenarios where elements are deleted when using bloom filters.

The dirty data can be cleared by rebuilding the filter periodically. If Redis is the store, do not delete the original key in place; instead, build the new filter under a fresh key, swap it in with RENAME, and then delete the old data.

Bloom filter source analysis in go-zero

core/bloom/bloom.go. A Bloom filter has two core properties:

  1. A bit array
  2. Hash functions

In the Bloom filter implemented by go-zero, the bit array is backed by a Redis bitmap. Since Redis is used, distributed scenarios are naturally supported; the hash function is MurmurHash3.

Why can a Redis bitmap be used as the bit array?

Redis has no separate bitmap data structure; the underlying implementation is the dynamic string (SDS), and strings in Redis are stored as binary. The ASCII code of 'a' is 97, binary 01100001; to turn it into 'b' (98, 01100010) we only need to flip two bits. This is done with SETBIT:

set foo a
OK
get foo
"a"
setbit foo 6 1
(integer) 0
setbit foo 7 0
(integer) 1
get foo
"b"
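To see what happens at the bit level, here is a small Go sketch mimicking SETBIT's semantics, where bit 0 is the most significant bit of the first byte (a toy illustration, not Redis code):

```go
package main

import "fmt"

// setbit mimics Redis SETBIT: bit 0 is the most significant bit
// of the first byte. It returns the previous value of the bit.
func setbit(buf []byte, offset uint, value byte) byte {
	byteIdx := offset / 8
	bitIdx := 7 - offset%8 // MSB-first, as in Redis bitmaps
	old := (buf[byteIdx] >> bitIdx) & 1
	if value == 1 {
		buf[byteIdx] |= 1 << bitIdx
	} else {
		buf[byteIdx] &^= 1 << bitIdx
	}
	return old
}

func main() {
	buf := []byte("a")       // 01100001
	setbit(buf, 6, 1)        // 01100011
	setbit(buf, 7, 0)        // 01100010
	fmt.Println(string(buf)) // b
}
```

The two calls reproduce the redis-cli transcript above, including the returned old bit values 0 and 1.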

The dynamic string underlying the bitmap can grow dynamically, and when the offset is beyond the current length, the intermediate bits are automatically zero-filled. A bitmap can be at most 2^32 - 1 bits long (occupying 512 MB of memory). From the algorithm described above, a Bloom filter implementation mainly has to do three things:

  1. Compute k points with the k hash functions.
  2. On insertion, set the value of the k points in the bit array to 1.
  3. On query, check the k points computed in step 1: if all are 1, the element may exist; if any is 0, the element does not exist.

Here’s how go-zero implements it:

Object definitions

// maps is how many hash computations are done
// fixed at 14
const maps = 14

type (
    // Filter defines the Bloom filter structure
    Filter struct {
        bits   uint
        bitSet bitSetProvider
    }
    // bitSetProvider is the bit array manipulation interface
    bitSetProvider interface {
        check([]uint) (bool, error)
        set([]uint) error
    }
)

Bit array operation interface implementation

First, you need to understand two Lua scripts:

// ARGV: offset array
// KEYS[1]: key of the setbit operation
// set all offsets to 1
setScript = `
    for _, offset in ipairs(ARGV) do
        redis.call("setbit", KEYS[1], offset, 1)
    end
`

// ARGV: offset array
// KEYS[1]: key of the getbit operation
// check whether all offsets are 1
testScript = `
    for _, offset in ipairs(ARGV) do
        if tonumber(redis.call("getbit", KEYS[1], offset)) == 0 then
            return false
        end
    end
    return true
    `

Why use Lua scripts? Because the whole check/set over all k offsets needs to be executed atomically.

// redisBitSet is the Redis-backed bit array
type redisBitSet struct {
    store *redis.Client
    key   string
    bits  uint
}

// check reports whether all offsets in the array are 1
// yes: the element may exist
// no: the element definitely does not exist
func (r *redisBitSet) check(offsets []uint) (bool, error) {
    args, err := r.buildOffsetArgs(offsets)
    if err != nil {
        return false, err
    }
    // execute the script
    resp, err := r.store.Eval(testScript, []string{r.key}, args)
    // note: go-redis is used underneath, and redis.Nil
    // (the key does not exist) needs special handling
    if err == redis.Nil {
        return false, nil
    } else if err != nil {
        return false, err
    }

    exists, ok := resp.(int64)
    if !ok {
        return false, nil
    }

    return exists == 1, nil
}

// set sets all k offsets to 1
func (r *redisBitSet) set(offsets []uint) error {
    args, err := r.buildOffsetArgs(offsets)
    if err != nil {
        return err
    }
    _, err = r.store.Eval(setScript, []string{r.key}, args)
    // redis.Nil means the key being operated on does not exist; treat it as success
    if err == redis.Nil {
        return nil
    } else if err != nil {
        return err
    }
    return nil
}

// buildOffsetArgs converts the offsets to strings, because go-redis
// passes Lua script arguments as []string
func (r *redisBitSet) buildOffsetArgs(offsets []uint) ([]string, error) {
    var args []string
    for _, offset := range offsets {
        if offset >= r.bits {
            return nil, ErrTooLargeOffset
        }
        args = append(args, strconv.FormatUint(uint64(offset), 10))
    }
    return args, nil
}

// del deletes the underlying key
func (r *redisBitSet) del() error {
    _, err := r.store.Del(r.key)
    return err
}

// expire sets automatic expiration
func (r *redisBitSet) expire(seconds int) error {
    return r.store.Expire(r.key, seconds)
}

func newRedisBitSet(store *redis.Client, key string, bits uint) *redisBitSet {
    return &redisBitSet{
        store: store,
        key:   key,
        bits:  bits,
    }
}

Now that all the bit array operations are in place, how do we compute the k points with the k hash functions?

Computing k points with k hashes

// getLocations computes the k = maps (14) offsets for data
func (f *Filter) getLocations(data []byte) []uint {
    locations := make([]uint, maps)
    for i := uint(0); i < maps; i++ {
        // hash with MurmurHash3, salting with the loop index
        hashValue := hash.Hash(append(data, byte(i)))
        // map the hash onto the bit array to get the offset
        locations[i] = uint(hashValue % uint64(f.bits))
    }
    return locations
}

Insert and query

The Add and Exists implementations are very simple: they just combine the functions above.

// Add inserts data into the Bloom filter
func (f *Filter) Add(data []byte) error {
    locations := f.getLocations(data)
    return f.bitSet.set(locations)
}

// Exists checks whether data may be in the set
func (f *Filter) Exists(data []byte) (bool, error) {
    locations := f.getLocations(data)
    isSet, err := f.bitSet.check(locations)
    if err != nil {
        return false, err
    }
    if !isSet {
        return false, nil
    }

    return true, nil
}

Suggestions for improvement

The overall implementation is very clean and efficient, so is there room for improvement?

Personally, I think there is. As the math above shows, the optimal m and k can be computed automatically if the constructor parameters are changed to:

  • expectedInsertions: the expected total number of elements
  • falseProbability: the expected error rate

Many developers are not sensitive to bit-array length and cannot intuitively tell how many bits correspond to what error rate.

// New create a Filter, store is the backed redis, key is the key for the bloom filter,
// bits is how many bits will be used, maps is how many hashes for each addition.
// best practices:
// elements - means how many actual elements
// when maps = 14, formula: 0.7*(bits/maps), bits = 20*elements, the error rate is 0.000067 < 1e-4
// for detailed error rate table, see http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
func New(store *redis.Redis, key string, bits uint) *Filter {
    return &Filter{
        bits:   bits,
        bitSet: newRedisBitSet(store, key, bits),
    }
}

// expectedInsertions - expected total number of elements
// falseProbability - expected error rate
// this could also be exposed via the option pattern without breaking compatibility
func NewFilter(store *redis.Redis, key string, expectedInsertions uint, falseProbability float64) *Filter {
    bits := optimalNumOfBits(expectedInsertions, falseProbability)
    k := optimalNumOfHashFunctions(bits, expectedInsertions)
    return &Filter{
        bits:   bits,
        bitSet: newRedisBitSet(store, key, bits),
        k:      k,
    }
}

// optimalNumOfHashFunctions calculates the optimal number of hashes
func optimalNumOfHashFunctions(m, n uint) uint {
    return uint(math.Round(float64(m) / float64(n) * math.Log(2)))
}

// optimalNumOfBits calculates the optimal bit array length
func optimalNumOfBits(n uint, p float64) uint {
    return uint(float64(-n) * math.Log(p) / (math.Log(2) * math.Log(2)))
}

Back to the question

How do we prevent cache penetration caused by illegal IDs?

Because the ID does not exist, the request misses the cache and the traffic hits the database directly; meanwhile, since the record does not exist, nothing can be written back to the cache. Under high concurrency, this puts great pressure on the database. There are two solutions:

  1. Use a Bloom filter. The filter must be updated synchronously whenever data is written to the database, and it must be rebuilt periodically to clear dirty data (for example, after deletions). With Redis as storage, the Bloom key cannot be trimmed in place; update it by building a new key and swapping it in with RENAME.

  2. When both the cache and the database miss, write a null value with a short expiration time into the cache.
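The guard order can be sketched end to end. Everything below is a stand-in to keep the sketch runnable: plain maps play the roles of database, cache, and filter (a real deployment would use the Redis-backed filter above); the point is only the lookup order:

```go
package main

import "fmt"

// toy stand-ins for the real components (assumptions, not go-zero APIs)
var (
	db     = map[string]string{"42": "alice"}
	cache  = map[string]string{}
	filter = map[string]bool{"42": true} // ids known to exist, fed on DB writes
)

// getUser demonstrates the guard order: Bloom filter -> cache -> database
func getUser(id string) (string, bool) {
	// 1. Bloom filter: "definitely absent" short-circuits forged ids
	if !filter[id] {
		return "", false
	}
	// 2. cache hit
	if v, ok := cache[id]; ok {
		return v, true
	}
	// 3. database, then backfill the cache
	if v, ok := db[id]; ok {
		cache[id] = v
		return v, true
	}
	return "", false
}

func main() {
	fmt.Println(getUser("42"))    // alice true
	fmt.Println(getUser("99999")) // "" false: never touches cache or db
}
```

With a real Bloom filter, step 1 occasionally lets a false positive through to steps 2 and 3, which is harmless; what it can never do is reject an ID that actually exists.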

References

Bloom Filter principle and implementation in Guava

Bloom filters – Wikipedia

Redis SETBIT documentation

Project address

Github.com/zeromicro/g…

Welcome to use go-zero, and star it to support us!

WeChat communication group

Follow the official account "micro-service practice" and click "exchange group" to get the community group QR code.