In the near future, I plan to compile several posts on the advanced topics of Reids. This article is about how bloom filters are used in Redis.

1. Know the BloomFilter

1.1 the principle

BloomFilter, also known as BloomFilter, can be said to be a binary vector and a series of random mapping function implementations. Can be used to retrieve whether an element is in a collection.

Let’s see how the Bloom filter determines that elements are in a set, as shown below:

There are three hash functions and an array of bits. Oracle uses three hash functions to get the first, fourth, and fifth bits as 1, and database uses two, five, and tenth bits as 1, so if we need to determine whether Oracle is in this array, we can use the hash function to determine whether the first, fourth, and fifth bits of the array are all 1. If both values are 1, oracle and database are in this array. This is how a Bloom filter determines whether an element is in a set.

If Bloom runs through three hash algorithms and needs to determine whether bits 1, 5, and 10 are 1, bloom’s filter will determine that Bloom is in the set because oracle and Database are added to the array. However, it is guaranteed that if the Bloom filter determines that an element is not in a collection, that element will not be in the collection.

Yes, this is a disadvantage of The Bloom filter, which has a bit of false identification rate, but the bloom filter has two advantages that make this disadvantage acceptable in some application scenarios. The two advantages are that the spatial efficiency and query time are much higher than the general algorithm. The conventional data structure set is also used to determine whether an element is in the set, but if there are millions or more data, the space occupied by the set structure will be huge. Bloom filter only needs millions of bits, more than 10 Kb.

This disadvantage is caused by hash collisions, but bloom filters reduce the error rate caused by hash collisions through multiple hash functions, as shown in the following figure:

When there is only one hash function, the error rate is very high, but when there are four hash functions, the error rate is reduced by more than 10 times. You can dynamically adjust the number of hash functions based on the recognition rate allowed by service requirements. Of course, more hash functions reduce space efficiency and query efficiency.

The second disadvantage, compared to sets, is that they cannot be deleted because bloom filters do not store keys, but instead map them into arrays.

To sum up, knock on the blackboard:

  • Bloom filters are used to determine whether an element is in a collection. Implemented with an array of bits and N hash functions.
  • Advantages:
    • High space efficiency, small space.
    • The query time is short.
  • Disadvantages:
    • Elements added to a collection cannot be deleted.
    • There is a certain rate of misjudgment

1.2 Application Scenarios

  • The database prevents database penetration. Google Bigtable, HBase, Cassandra, and Postgresql use BloomFilter to reduce disk lookups for nonexistent rows or columns. Avoiding costly disk lookups can greatly improve the performance of database query operations.
  • In business scenarios, determining whether a user has read a certain video or article, such as Douyin or Toutiao, will of course lead to certain misjudgments, but users will not see repeated content. In addition, in a matching-like social scene I met before, it is necessary to determine whether the user is in the match. If so, it is necessary to update the match content. Bloom filter can also be used to reduce the number of db or cache queries of absent users.
  • Cache downtime, caching, breakdown, generally to determine whether a user in the cache, if in the return results directly, not in the query db, if the data to a wave of cold, can lead to a cache of breakdown, the avalanche effect, can filter cloth long when the cache’s index, at this time only in bloom filter to the query cache, if no query to penetrate to the db. If not in the bloom, return directly.
  • WEB interceptor intercepts the same request if it is used to prevent repeated attacks. For the first request, the user puts the request parameters into the Bloom filter. For the second request, the user determines whether the request parameters are matched by the Bloom filter. Can improve cache hit ratio.

1.3 Hands-on implementation of Bloom filter

Implement a Bloom filter manually in Python based on the principles of bloom filters. First you need to install MMH3, which is an implementation of the MurmurHash3 algorithm that is also used in Redis. Then you need to install Bitarray, the Python implementation of bit-arrays.

pip install mmh3
pip install bitarray
Copy the code

Once the environment is ready, start implementing bloom filters and go straight to the code

# python 3.6  simple_bloomfilter.py
import mmh3
from bitarray import bitarray

class BloomFilter(object):
    def __init__(self, bit_size=10000, hash_count=3, start_seed=41):
        self.bit_size = bit_size
        self.hash_count = hash_count
        self.start_seed = start_seed
        self.initialize()

    def initialize(self):
        self.bit_array = bitarray(self.bit_size)
        self.bit_array.setall(0)

    def add(self, data):
        bit_points = self.get_hash_points(data)
        for index in bit_points:
            self.bit_array[index] = 1

    def is_contain(self, data):
        bit_points = self.get_hash_points(data)
        result = [self.bit_array[index] for index in bit_points]
        return all(result)

    def get_hash_points(self, data):
        return [
            mmh3.hash(data, index) % self.bit_size
            for index in range(self.start_seed, self.start_seed +
                               self.hash_count)
        ]
Copy the code

The complete code is available at github.com/fuzctc/tc-r…

The BloomFilter class initializes three variables, the size of the bit_size array, the number of hash_count Hash functions, and the start_seed random number of Hash seeds.

Two methods are implemented. The add method is to obtain different bit bits according to data through multiple hash functions, and the relevant position in the bit array is 1. Is_contain is to determine whether data is in BloomFilter.

Test the code:

>>> from simple_bloomfilter import BloomFilter
>>> bloomFilter = BloomFilter(1000000.6)
>>> bloomFilter.add('databases')
>>> bloomFilter.add('bloomfilter')
>>> bloomFilter.is_contain('databases')
True
>>> bloomFilter.is_contain('bloomfilter')
True
>>> bloomFilter.is_contain('bloomfilte')
False
Copy the code

Test BloomFilter function is effective, but in the actual production process, the magnitude of storage is very large, usually use redis bitmap data structure to replace their implementation of the bit array, the following practice in redis how to use BloomFilter.

2. Redis – BloomFilter practice

In version 4.0, Redis comes in the form of module, which can be used as a plug-in for additional Redis functions. The website recommends a RedisBloom Module as a RedisBloom filter.

There are other ways to do this. Here are some of them:

  • RedisBloom – Bloom Filter Module for Redis
  • pyreBloom
  • Lua script implementation
  • Native language, call Redis bitmap related operations to achieve

Let’s do it one by one.

2.1 RedisBloom

RedisBloom needs to be installed first, and Docker is recommended for easy installation:

docker pull redislabs/rebloom:latest
docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest
docker exec -it redis-redisbloom bash
# redis-cli
# 127.0. 01.:6379> bf.add tiancheng hello
Copy the code

Of course, you can also compile directly to install:

git clone https://github.com/RedisBloom/RedisBloom.git
cd RedisBloom
make // The compilation produces a rebloom.so file
redis-server --loadmodule /path/to/rebloom.so
redis-cli -h 127.0. 01. -p 6379
Copy the code

This module not only implements bloom filter, but also CuckooFilter, as well as TopK function. CuckooFilter is based on BloomFilter to solve the disadvantage of BloomFilter can not be deleted. Let’s take a look at BloomFilter and then CuckooFilter.

2.1.1 BloomFilter Related operations

Let’s get familiar with the basic instructions of Bloom filter:

  • Bf. Add Adds elements to the Bloom filter
  • Bf. exists Determines whether an element is in a Bloom filter
  • Bf. madd adds multiple elements to the Bloom filter, while bf.add can only add one
  • Bf. Mexists determine if multiple elements are in a Bloom filter
127.0.0.1:6379> bf.add tiancheng tc01 (integer) 1 127.0.0.1:6379> bf.add Tiancheng tc02 (integer) 1 127.0.0.1:6379> Bf.add Tiancheng TC03 (integer) 1 127.0.0.1:6379> bf.exists Tiancheng TC01 (integer) 1 127.0.0.1:6379> bf.exists Tiancheng TC02 (integer) 1 127.0.0.1:6379> bf.exists tiancheng TC03 (integer) 1 127.0.0.1:6379> bf.exists tiancheng TC04 (INTEGER) 0 127.0.0.1:6379> bf.madd tiancheng TC05 TC06 TC07 1) (INTEGER) 12) (integer) 1 3) (integer) 1 127.0.1:6379 > bf.mexists tiancheng TC05 TC06 TC07 TC08 1) (INTEGER) 12) (integer) 1 3) (integer) 1 4) (integer) 0Copy the code

2.1.2 Test misjudgment rate

Let’s test the miscarriage rate:

import redis
client = redis.StrictRedis()

client.delete("tiancheng")
size = 100000
count = 0
for i in range(size):
    client.execute_command("bf.add", "tiancheng", "tc%d" % i)
    result = client.execute_command("bf.exists", "tiancheng", "tc%d" % (i + 1))
    if result == 1:
        # print(i)
        count += 1

print("size: {} , error rate: {}%".format(
    size, round(count / size * 100, 5)))
Copy the code

The test results are as follows:

➜ BloomFilter git:(Master) Qualify Python redisbloom_test.py size:1000 , error rate: 1.0% ➜ BloomFilter git:(Master) Qualify Python redisbloom_test.py size:10000 , error rate: 1.25% ➜ BloomFilter git:(Master) Qualify Python redisbloom_test.py size:100000 , error rate: 1.304%
Copy the code

Size =1000, there will be 1% misjudgment rate. The higher the size, the higher the misjudgment rate will be. Is there any way to control the misjudgment rate?

This module also provides a command called bf.reserve, which provides three parameters: key, error_rate, and initial_size. The lower the error rate, the more space is required. The initial_size parameter indicates the number of elements expected to be placed in the Bloom filter. When the actual number exceeds this value, the error rate increases. The default parameters are error_rate=0.01 and initial_size=100.

Here’s a test:

import redis client = redis.StrictRedis() client.delete("tiancheng") size = 10000 count = 0 Execute_command ("bf.reserve", "tiancheng", 0.001, size) # add for I in range(size): client.execute_command("bf.add", "tiancheng", "tc%d" % i) result = client.execute_command("bf.exists", "tiancheng", "tc%d" % (i + 1)) if result == 1: #print(i) count += 1 print("size: {} , error rate: {}%".format( size, round(count / size * 100, 5)))Copy the code

Add a line of code to simply test the effect:

➜ BloomFilter git:(Master) Qualify Python redisbloom_test.py size:10000 , error rate: 0.0% ➜ BloomFilter git:(Master) Qualify Python redisbloom_test.py size:100000 , error rate: 0.001%
Copy the code

The misjudgment rate dropped by 1,000 times.

However, the lower the misjudgment rate is required, the more space is needed. A formula can be used to calculate. Since the formula is more complex, it is directly similar to a calculator.

If 10 million data, misjudgment rate is allowed to 1%, about 11M is needed, as shown in the figure below:

If a misjudgment rate of 0.1% is required, about 17 M is required.

But that’s a lot less space than storing 10 million data directly with SET.

2.1.3 Extended Learning

The RedisBloom module also implements the cuckoo filter. For a brief overview, there is a paper interested in communication that can be read at www.cs.cmu.edu/~dga/papers…

Compared with the cuckoo filter, the bloon filter has the following disadvantages:

  • Weak query performance Weak query performance is because the Bloom filter needs to use N hash functions to calculate N different points in the bit array. These points have large memory span, resulting in low CPU cache hit ratio.
  • Low space utilization Efficiency At the same misjudgment rate, cuckoo space utilization is higher than bloom filter, saving about 40%.
  • This is the biggest optimization of the cuckoo filter relative to the Bloon filter. It supports the reverse operation and delete function.
  • Count not supported

Because we have not had a deeper understanding of the cuckoo filter, it is not clear whether it is as good as the article says. We will study it again sometime.

2.2 pyreBloom

PyreBloom is a Python Redis + BloomFilter module implemented in C language. If the Redis Module deployment is troublesome or the online Redis version is not 4.0 or above, you can use this, but it is based on Hiredis, requires hiredis installation, and does not support reconnection and retry. If it is used in the production environment, it requires simple encapsulation.

Installation:

git clone https://github.com/redis/hiredis.git src/hiredis && \
cd src/hiredis && make && make PREFIX=/usr install && ldconfig

// mac brew install hiredis
git clone https://github.com/seomoz/pyreBloom src/pyreBloom && \
cd src/pyreBloom && python setup.py install
Copy the code

Demo code:

from pyreBloom import pyreBloom

redis_conf = {'host': '127.0.0.1'.'password': ' '.'port': 6379.'db': 0}

for k, v in redis_conf.items():
    redis_conf = convert_utf8(redis_conf)

key = convert_utf8('tc')
value = convert_utf8('hello')

p = pyreBloom(key, 10000.0.001, **redis_conf)
p.add(value)
print(p.contains(value))
Copy the code

2.3 Native language + Redis implementation

Python native language is slow, if it is Go language, do not find suitable open source Redis BloomFilter, you can use native language + Redis bitmap related operations. In Java, the RedisBloom project implements JReBloom, which non-Java developers can learn for themselves.

The demo is in Python, and the Go version will be added later:

# python3.6 bloomfilter_py_test. Py
import mmh3
import redis


class BloomFilter(object):
    def __init__(self, bf_key, bit_size=10000, hash_count=3, start_seed=41):
        self.bit_size = bit_size
        self.hash_count = hash_count
        self.start_seed = start_seed
        self.client = redis.StrictRedis()
        self.bf_key = bf_key

    def add(self, data):
        bit_points = self._get_hash_points(data)
        for index in bit_points:
            self.client.setbit(self.bf_key, index, 1)

    def madd(self, m_data):
        if isinstance(m_data, list):
            for data in m_data:
                self.add(data)
        else:
            self.add(m_data)

    def exists(self, data):
        bit_points = self._get_hash_points(data)
        result = [
            self.client.getbit(self.bf_key, index) for index in bit_points
        ]
        return all(result)

    def mexists(self, m_data):
        result = {}
        if isinstance(m_data, list):
            for data in m_data:
                result[data] = self.exists(data)
        else:
            result[m_data] = self.exists[m_data]
        return result

    def _get_hash_points(self, data):
        return [
            mmh3.hash(data, index) % self.bit_size
            for index in range(self.start_seed, self.start_seed +
                               self.hash_count)
        ]
Copy the code

Based on simple_bloomfilter.py above, introduce redis setbit and getbit to replace bitarray. Both mADD and Mexists methods similar to BloomFilter are implemented, and the test results are as follows:

>>> from bloomfilter_py_test import BloomFilter
>>> bf_obj = BloomFilter('bf.tc')
>>> bf_obj.add('tc01')
>>> bf_obj.madd('tc02')

>>> bf_obj.madd(['tc03', 'tc04'])
>>> bf_obj.exists('tc01')
True
>>> bf_obj.mexists(['tc02', 'tc03', 'tc04', 'tc05'])
{'tc02': True, 'tc03': True, 'tc04': True, 'tc05': False}
Copy the code

So much for the bloem filter. The code is available at github.com/fuzctc/tc-r…

For more Redis related articles and discussions, please pay attention to the public account: “Tiancheng Technology Talk”