Cc/Redis /2018/…

Author: JayChen

What is a Bloom filter?

A Bloom filter is a magical data structure that can be used to determine whether an element is in a set. A very common function is to de-weight. A common demand in the crawler: the target site URL thousands of, how to judge whether a URL crawler favored? Simple point can be crawler every collection of a URL, the URL is stored in the database, every time a new URL comes to the database to query whether it has been accessed.

select id from table where url = 'https://jaychen.cc'
Copy the code

However, as the crawler crawls more and more urls, it accesses the database once before each request, and the SQL query efficiency for this string is not high. In addition to the database, the use of Redis set structure can also meet this requirement, and performance is better than the database. But Redis has a problem: it consumes too much memory. This is where the Bloom filter comes in: let me do this.

Compared to database and Redis, using bloom filters can avoid performance and memory problems.

A Bloom filter is essentially an array of bits. A bit array means that each element of the array takes only 1 bit. Each element can only be a 0 or a 1. Thus applying for a bit array of 10,000 elements only takes up the space of 10000/8 = 1250 B. A Bloom filter has K hash functions in addition to an array of bits. When an element is added to the Bloom filter, it does the following:

  • K hash values are computed K times using K hash functions.
  • Based on the resulting hash, set the corresponding index in the bitarray to 1.

For example 🌰, suppose the Bloom filter has three hash functions: F1, F2, F3 and a bit array arR. Now insert https://jaychen.cc into the Bloom filter:

  • Hash the values three times to get n1, n2, and n3.
  • Set arr[n1], arr[n2], arr[3] to 1 in the bit array.

To determine whether a value is in the Bloom filter, the element is hashed again, and after the value is obtained, it is judged whether each element in the bit array is 1. If the value is 1, it means that the value is in the Bloom filter. If there is a value not 1, it means that the element is not in the Bloom filter.

Can not understand the text to see the following soul painting hand diagram explanation 👇👇👇

The above explanation inevitably raises a question: The more elements that are inserted, the more positions in the bitarray are set to 1. When an element that is not in the Bloom filter is hashed, and the resulting value is queried in the bitarray, it is possible that these positions are set to 1 as well. Such a non – bloom filter can also be misjudged as a Bloom filter. But if the Bloom filter determines that an element is not in the Bloom filter, then the value must not be in the Bloom filter. In short:

  • The Bloom filter says an element is in, it could be misjudged.
  • If the Bloom filter says an element is absent, it must be absent.

The defect of the Bloom filter is put into the requirements of the crawler above. There may be some unvisited URLS that may be misjudged as accessed, but if the URL is accessed, it will definitely not be misjudged as not accessed.

The Bloom filter in Redis

Redis has added the Module function in version 4.0, and the Bloom filter can be added to Redis in the form of module, so the bloom filter in Redis can be used by loading the Module in redis version 4.0 or higher. But this isn’t the easiest way to use Docker to experience the Bloom filter directly in Redis.

> docker run -d -p 6379:6379 --name bloomfilter redislabs/rebloom
> docker exec -it bloomfilter redis-cli
Copy the code

The Redis Bloom filter has two main commands:

  • bf.addTo add elements to the Bloom filter:bf.add urls https://jaychen.cc.
  • bf.existsTo determine whether an element is in a filter:bf.exists urls https://jaychen.cc.

As mentioned above, there is misjudgment in the Bloom filter. In Redis, two values determine the accuracy of the Bloom filter:

  • error_rate: Allows the error rate of the Bloom filter. The lower this value, the larger the size of the filter bit array and the larger the space it takes.
  • initial_size: Number of elements that the Bloom filter can store. When the actual number of elements stored exceeds this value, the accuracy of the filter will decrease.

Redis has a command to set these two values:

Bf. Reserve 0.01 100 urlsCopy the code

Meanings of the three parameters:

  • The first value is the name of the filter.
  • The second value iserror_rateThe value of the.
  • The third value is zeroinitial_sizeThe value of the.

The filter name should not exist before the command is executed. If the filter name does exist before the command is executed, an error message is displayed: (Error) ERR Item exists