Author: Lu Wei, Zhangyue senior backend engineer

Why introduce a Bloom filter

We often run into cache-penetration problems in our business, which can usually be solved by caching. But if the data has many dimensions and the resulting data set is large, caching alone is not very effective. So, to solve cache penetration, we introduced the Bloom Filter.

Open source project address: github.com/luw2007/blo…

Let’s start by looking at the general business caching flow:

Query the cache first; on a cache miss, query the database. Even when the data doesn’t exist, a cache entry must still be created to keep requests from punching through to the database. You also need to distinguish whether the data exists: for nonexistent data, the cache TTL can be set relatively short, so that problems such as master/slave replication lag are not amplified.
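
A minimal sketch of this flow, using an in-memory dict as a stand-in for Redis (the TTL values, `load_from_db`, and the key names are illustrative assumptions, not production code):

```python
import time

MISS_TTL = 60          # short TTL for "not found" entries
HIT_TTL = 3600         # normal TTL for real data
_NOT_FOUND = object()  # sentinel marking a cached miss

cache = {}             # key -> (value, expires_at); stands in for Redis


def load_from_db(key):
    """Illustrative database lookup; returns None for missing rows."""
    db = {"user:1": {"name": "alice"}}
    return db.get(key)


def get(key):
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        value = entry[0]
        return None if value is _NOT_FOUND else value
    value = load_from_db(key)
    if value is None:
        # Cache the miss briefly so repeated lookups skip the DB and
        # replication-lag artifacts are not amplified
        cache[key] = (_NOT_FOUND, time.time() + MISS_TTL)
        return None
    cache[key] = (value, time.time() + HIT_TTL)
    return value
```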

The weak point in this flow is that when there are many users, we cache a lot of empty entries, and when a wave of cold users arrives it triggers an avalanche effect. So we produced a second version of the flow: the Redis cold-user filtering flow.

We put the users that exist in the database into a Redis set and never expire it. Redis then acts as an index of the database: querying Redis alone tells us whether the data exists. If it does not exist in Redis, the result can be returned directly; if it does, we follow the general caching flow described above.

You’re smart enough to come up with more questions:

  1. Redis can serve as a cache itself, so why not just return the data from it?
  2. With a large amount of data, won’t a single set have performance problems?
  3. For an unimportant business, storing the full data set in Redis occupies a large amount of server memory. Isn’t the cost disproportionate to the benefit?

Question 1 depends on the business scenario. When the result data is small, we can use Redis directly as the cache and return the data. When the result is unsuitable for Redis storage, we cannot: for example UGC content, where a single comment may contain tens of thousands of words and many business fields.

Question 2 is a classic Redis pitfall: bigkeys cause serious damage. Expanding or shrinking capacity triggers large memory requests and releases, and careless query commands can return huge payloads; either can affect Redis’s stability. I won’t go into the causes and harms here. The fix for bigkeys is simple: use a hash function to split the data into buckets, spreading it across multiple keys. This reduces the size of each individual key without affecting query efficiency.
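
The bucket-splitting trick can be sketched in a few lines: hash the member to pick one of N smaller keys, so no single set grows into a bigkey (the key naming scheme and bucket count here are illustrative assumptions):

```python
from zlib import crc32

BUCKETS = 64  # number of sub-keys; tune so each set stays small


def bucket_key(prefix: str, member: str) -> str:
    """Map a member to one of BUCKETS sub-keys under `prefix`."""
    return f"{prefix}:{crc32(member.encode()) % BUCKETS}"


# With a real Redis client the membership check would use the derived key:
#   r.sadd(bucket_key("users", uid), uid)
#   r.sismember(bucket_key("users", uid), uid)
```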

Question 3 is that Redis storage takes up too much memory, so we need to reduce memory usage. Rethink why Redis was introduced: Redis here is just a set, and the whole job is to verify whether a request parameter is in that set.

Most programming languages have filters built in. In Python, for example, the filter function is used to filter a sequence, filtering out elements that do not meet the criteria and returning a list of elements that do meet the criteria.

Let’s look at an example:

$ python2
Python 2.7.10 (default, Oct  6 2017, 22:29:07)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = {2, 4}
>>> filter(lambda x: x in s, [0, 1, 2])
[2]

Set s contains the numbers 2 and 4, and we want to know which of 0, 1, and 2 are in s. `lambda x: x in s` constructs an anonymous function that checks whether its argument x is in set s. filter then applies this function to each number in the list and finally returns the list [2].
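
One caveat worth noting: under Python 3, filter returns a lazy iterator rather than a list, so the result must be materialized explicitly:

```python
s = {2, 4}
result = filter(lambda x: x in s, [0, 1, 2])
print(list(result))  # [2]; wrapping in list() is required under Python 3
```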

Redis implements sets with two encodings: intset and hash table. A set with non-numeric members, or with too many numeric ones, degrades into a hash table. Could a good algorithm save the space a hash table uses?

In fact, the Bloom filter was proposed by Burton Howard Bloom as early as 1970. It is essentially a long binary vector plus a series of random mapping functions, and it can be used to test whether an element is in a set. Its advantage is space and query-time efficiency far beyond ordinary algorithms; its disadvantages are a certain false-positive rate and difficulty of deletion.

BloomFilter principle

We commonly concatenate business fields, hash them with MD5, and put the digest in a collection. MD5 produces a fixed-length 128-bit digest. If we used a bitmap indexed directly by the digest, we would need

2**128 = 340282366920938463463374607431768211456 bits

Checking whether a value exists would then just mean checking whether its bit in the bitmap is 1. But all the machines in the world combined don’t have enough storage for such a bitmap, so we can only allocate a limited amount of space. For example:

from zlib import crc32

def BloomFilter(sample, size, hash_size=1):
    # Hash function mapping the input to a position in [0, size)
    hash = lambda x: crc32(str(x).encode()) % size
    collision, s = 0, set()
    for i in range(sample):
        k = set()
        for j in range(hash_size):
            k.add(hash(i + j * size // hash_size))
        # i counts as a collision only if all of its hash positions
        # are already set in s
        if not k - s:
            collision += 1
            continue
        # Merge i's hash positions into s
        s |= k
    return collision

With only one hash function, collisions happen easily.

In our run, for example, 1 and 2 both hashed to position 7. What happens if we add a hash function?

We tested with more hash functions and larger data sets (the resulting tables are not reproduced here).

The results show that adding hash functions effectively reduces the collision probability. However, it also reduces space efficiency: once the set occupies about 25% of the total space, adding more hashes brings little benefit.
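
This trade-off can be quantified with the standard Bloom filter formulas (textbook results, not numbers from our benchmark): with m bits, n elements, and k hash functions, the false-positive probability is roughly (1 - e^(-kn/m))^k, minimized at k = (m/n)·ln 2.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false-positive rate for m bits, n elements, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> int:
    """Hash-function count that minimizes the false-positive rate."""
    return max(1, round(m / n * math.log(2)))


m, n = 10_000, 1_000        # 10 bits per element
k = optimal_k(m, n)         # 7 hash functions
print(k, false_positive_rate(m, n, k))  # roughly a 0.8% false-positive rate
```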

The above use of multiple hash methods to reduce collisions is the core idea of BloomFilter.

Suitable scenarios

  • Database protection against cache penetration. Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use BloomFilter to reduce disk lookups for nonexistent rows or columns. Avoiding costly disk lookups considerably improves the performance of database queries. The same applies to the business scenario at the beginning: when the data set is too large to cache conveniently, requests for nonexistent keys must be intercepted before they penetrate to the database.

  • Cache outage recovery. After a cache outage, a Bloom filter may misjudge: besides the filter’s own false-positive rate, the cache before the outage may not have covered all the data in the DB, so a post-outage request for never-before-seen data can be misjudged. As a contingency for cache outages, though, this should be tolerable.

  • Web interceptor. Intercept identical requests to prevent attacks: on the first request, put the request parameters into the BloomFilter; on a second identical request, check the parameters against the BloomFilter first. This can also improve the cache hit ratio.

  • Malicious address detection. Chrome uses a BloomFilter to check for malicious URLs: any URL is first checked against a local BloomFilter, and a full check of the URL is performed only if the BloomFilter returns a positive result (and the user is warned only if that full check is also positive).

  • Bitcoin speedup. Bitcoin uses BloomFilter to speed up wallet synchronization.

Advantages of the algorithm:

  • Small space footprint: the data itself is not stored.

Disadvantages of the algorithm itself:

  • Elements can be added to the set, but not removed.
  • A match only guarantees “definitely not in the set”; it cannot guarantee that a matched value really is in the set.
  • The false-positive probability rises as the set approaches its estimated maximum capacity.
  • Space grows with the data: as a rule of thumb, a 1% false-positive rate costs just under 10 bits per element, regardless of the elements’ size or count.
  • Queries slow down as the number of hash functions grows, since each match must check several bit positions.

For all these drawbacks, they are negligible next to BloomFilter’s benefits: after all, storing N elements takes only about k·N bits of space. The space efficiency is excellent.
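
The “fewer than 10 bits per element at 1%” figure follows directly from the optimal sizing formula m/n = -ln p / (ln 2)^2, a standard result (not a measurement from this article):

```python
import math


def bits_per_element(p: float) -> float:
    """Bits per element for target false-positive rate p,
    assuming the optimal number of hash functions."""
    return -math.log(p) / (math.log(2) ** 2)


print(bits_per_element(0.01))  # about 9.6 bits per element
```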

How do I use BloomFilter

BloomFilter requires a large bitmap for storage. Given the current state of our infrastructure, the best storage container is Redis. A quick survey via the GitHub topic bloom-filter turns up the following.

Redis integrated BloomFilter solution:

  • Native Python calling SETBIT to construct the BloomFilter
  • Lua scripts
  • ReBloom – Bloom Filter Module for Redis (note: Redis Modules were introduced in Redis 4.0)
  • pyreBloom, which calls Redis through hiredis

The native Python approach is too slow, and Lua scripts and Modules are cumbersome to deploy. So we recommend pyreBloom, which uses hiredis underneath.

pyreBloom:master λ ls
Makefile   bloom.h   bloom.pxd   mutation.c   pyrebloom.pyx
bloom.c    bloom.o   main.c      pyreBloom

As the file names suggest, bloom is written in C and pyreBloom is written in Cython.

bloom.h declares the core BloomFilter logic: interaction with the Redis server, the hash functions, and the add, check, and delete implementations.

int init_pyrebloom(pyrebloomctxt * ctxt, char * key, uint32_t capacity, double error, char* host, uint32_t port, char* password, uint32_t db);
int free_pyrebloom(pyrebloomctxt * ctxt);

int add(pyrebloomctxt * ctxt, const char * data, uint32_t len);
int add_complete(pyrebloomctxt * ctxt, uint32_t count);

int check(pyrebloomctxt * ctxt, const char * data, uint32_t len);
int check_next(pyrebloomctxt * ctxt);

int delete(pyrebloomctxt * ctxt);

pyreBloom.pyx

import math
import random

cimport bloom


class pyreBloomException(Exception):
	'''Some sort of exception has happened internally'''
	pass


cdef class pyreBloom(object):
	cdef bloom.pyrebloomctxt context
	cdef bytes               key

	property bits:
		def __get__(self):
			return self.context.bits

	property hashes:
		def __get__(self):
			return self.context.hashes

	def __cinit__(self, key, capacity, error, host='127.0.0.1', port=6379,
		password='', db=0):
		self.key = key
		if bloom.init_pyrebloom(&self.context, self.key, capacity,
			error, host, port, password, db):
			raise pyreBloomException(self.context.ctxt.errstr)

	def __dealloc__(self):
		bloom.free_pyrebloom(&self.context)

	def delete(self):
		bloom.delete(&self.context)

	def put(self, value):
		if getattr(value, '__iter__', False):
			r = [bloom.add(&self.context, v, len(v)) for v in value]
			r = bloom.add_complete(&self.context, len(value))
		else:
			bloom.add(&self.context, value, len(value))
			r = bloom.add_complete(&self.context, 1)
		if r < 0:
			raise pyreBloomException(self.context.ctxt.errstr)
		return r

	def add(self, value):
		return self.put(value)

	def extend(self, values):
		return self.put(values)

	def contains(self, value):
		# If the object is 'iterable'...
		if getattr(value, '__iter__', False):
			r = [bloom.check(&self.context, v, len(v)) for v in value]
			r = [bloom.check_next(&self.context) for i in range(len(value))]
			if min(r) < 0:
				raise pyreBloomException(self.context.ctxt.errstr)
			return [v for v, included in zip(value, r) if included]
		else:
			bloom.check(&self.context, value, len(value))
			r = bloom.check_next(&self.context)
			if r < 0:
				raise pyreBloomException(self.context.ctxt.errstr)
			return bool(r)

	def __contains__(self, value):
		return self.contains(value)

	def keys(self):
		'''Return a list of the keys used in this bloom filter'''
		return [self.context.keys[i] for i in range(self.context.num_keys)]
In summary, the pyreBloom class exposes:

  • bits / hashes – the filter’s bit count and number of hash functions
  • delete() – remove the filter from Redis
  • put(value) / add(value) / extend(values) – insert one or many values
  • contains(value) – returns the matched subset [value, ...] when value is iterable, otherwise a bool
  • keys() – the list of keys stored in Redis

Because pyreBloom uses the hiredis library and has no reconnection logic of its own, we added a simple wrapper around it.


    # coding=utf-8
    '''A thin wrapper around pyreBloom with reconnection and retries.

    Proxied methods: extend, keys, contains, add, put, hashes, bits, delete

    >>> class TestModel(BaseModel):
    ...     PREFIX = "bf:test"
    >>> t = TestModel()
    >>> t.add('hello')
    1
    >>> t.extend(['hi', 'world'])
    2
    >>> t.contains('hi')
    True
    >>> t.delete()
    '''
    import logging
    from six import PY3 as IS_PY3
    from pyreBloom import pyreBloom, pyreBloomException

    from BloomFilter.utils import force_utf8


    class BaseModel(object):
        '''
        PREFIX:   redis key prefix
        BF_SIZE:  filter capacity (maximum number of elements)
        BF_ERROR: allowed false-positive rate
        RETRIES:  number of connection retries
        host/port/db: redis server address
        _bf_conn: cached pyreBloom instance
        '''
        SLOT = {'add', 'contains', 'extend', 'keys', 'put', 'delete',
                'bits', 'hashes'}
        PREFIX = ""
        BF_SIZE = 100000
        BF_ERROR = 0.01
        RETRIES = 2

        def __init__(self, redis=None):
            '''Initialize the redis configuration.

            :param redis: redis connection settings
            '''
            # Set on the instance so that multiple subclasses do not
            # share state and pollute each other's data
            self._bf_conn = None

            self._conf = {
                'host': '127.0.0.1', 'password': '',
                'port': 6379, 'db': 0
            }

            if redis:
                for k, v in redis.items():
                    if k in self._conf:
                        self._conf[k] = v
            self._conf = force_utf8(self._conf)

        @property
        def bf_conn(self):
            '''Lazily initialize the pyreBloom connection.'''
            if not self._bf_conn:
                prefix = force_utf8(self.PREFIX)
                logging.debug(
                    'pyreBloom connect: redis://%s:%s/%s, (%s %s %s)',
                    self._conf['host'], self._conf['port'], self._conf['db'],
                    prefix, self.BF_SIZE, self.BF_ERROR,
                )
                self._bf_conn = pyreBloom(
                    prefix, self.BF_SIZE, self.BF_ERROR, **self._conf)
            return self._bf_conn

        def __getattr__(self, method):
            '''Proxy whitelisted methods to the pyreBloom instance.

            :param method: method name
            :return: wrapped pyreBloom.{method}
            '''
            # Only whitelisted methods are exposed
            if method not in self.SLOT:
                raise NotImplementedError()

            # Catch pyreBloom exceptions, log them, and retry
            def catch_error(*a, **kwargs):
                '''Call the target method with multiple retries.'''
                args = force_utf8(a)
                kwargs = force_utf8(kwargs)
                for _ in range(self.RETRIES):
                    try:
                        func = getattr(self.bf_conn, method)
                        res = func(*args, **kwargs)
                        # Under Python 3 pyreBloom returns bytes;
                        # decode the 'contains' result manually
                        if method == 'contains' and IS_PY3:
                            if isinstance(res, list):
                                return [i.decode('utf8') for i in res]
                        return res
                    except pyreBloomException as error:
                        logging.warn(
                            'pyreBloom Error: %s %s', method, str(error))
                        self.reconnect()
                        if _ == self.RETRIES - 1:
                            logging.error('pyreBloom Error')
                            raise error

            return catch_error

        def __contains__(self, item):
            '''Support the `in` operator.

            :param item: a single element or a list of elements
            :type item: list / basestring
            :return: [bool, ...] / bool
            '''
            return self.contains(item)

        def reconnect(self):
            '''Rebuild the pyreBloom connection.

            pyreBloom uses the C hiredis driver and provides no timeout
            option; it relies on the driver's built-in timeout:

                struct timeval timeout = { 1, 5000 };
                ctxt->ctxt = redisConnectWithTimeout(host, port, timeout);

            Combined with multiple retries this keeps the service reliable.
            `del self._bf_conn` invokes pyreBloom's built-in C dealloc,
            which closes the redis connection.
            '''
            if self._bf_conn:
                logging.debug('pyreBloom reconnect')
                del self._bf_conn
                self._bf_conn = None
                _ = self.bf_conn

Advanced: Counting Filter

Counting filters provide a way to implement deletion on a BloomFilter without recreating the filter. In a counting filter, each array position (bucket) is expanded from a single bit to an n-bit counter. In fact, a regular Bloom filter can be viewed as a counting filter with a bucket size of one bit.

Insertion is extended to incrementing the buckets’ values, a lookup checks that each required bucket is non-zero, and deletion decrements each bucket’s value.

Arithmetic overflow of a bucket is a problem: buckets should be large enough to make overflow rare. When it does occur, the increment and decrement operations must leave the bucket pinned at its maximum possible value to preserve the BloomFilter’s properties.
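
The mechanism above can be sketched as a small counting filter (the bucket width, hashing scheme, and sizing here are simplified illustrative choices, not any particular paper’s implementation):

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose buckets are small counters, enabling deletes."""

    def __init__(self, size: int, hash_count: int, max_count: int = 15):
        self.size = size              # number of buckets
        self.hash_count = hash_count  # hash positions per element
        self.max_count = max_count    # 4-bit counter ceiling
        self.buckets = [0] * size

    def _positions(self, item: str):
        for i in range(self.hash_count):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            # Saturate instead of overflowing, preserving correctness
            if self.buckets[pos] < self.max_count:
                self.buckets[pos] += 1

    def remove(self, item: str):
        # Only safe for items previously added (and never saturated)
        for pos in self._positions(item):
            if self.buckets[pos] > 0:
                self.buckets[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.buckets[pos] > 0 for pos in self._positions(item))
```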

Counters are usually 3 or 4 bits wide, so a counting Bloom filter uses 3 to 4 times the space of a static one. By contrast, the data structures of Pagh, Pagh and Rao (2005) and Fan et al. (2014) also allow deletion but use less space than a static BloomFilter.

Another problem with counting filters is their limited scalability: because the table cannot be extended, the maximum number of keys stored simultaneously must be known in advance. Once the table’s design capacity is exceeded, the false-positive rate rises rapidly as more keys are inserted.

Bonomi et al. (2006) introduced a functionally equivalent data structure based on d-left hashing that uses about half the space of a counting BloomFilter. It has no scalability problem: once the designed capacity is exceeded, the keys can be reinserted into a new hash table of double the size.

The space-efficient variant of Putze, Sanders and Singler (2007) can also be used to implement counting filters, since it supports inserts and deletes.

Rottenstreich, Kanizo and Keslassy (2012) introduced a general method based on variable increments that significantly reduces the false-positive probability of counting Bloom filters and their variants while still supporting deletion. Instead of incrementing a hashed counter by one on each element insertion, it is incremented by a hashed variable increment. To query an element, the exact counter values are considered, not just whether they are positive: if the sum represented by a counter value cannot be composed from the query element’s variable increment, a negative answer can be returned.

Open source project address: github.com/luw2007/blo…