Topic describes

A website has 10 billion urls in a blacklist, with each URL averaging 64 bytes. How do I save this blacklist? If you enter a URL randomly at this time, how can you quickly determine whether the URL is in the blacklist?

title

This is an algorithm that often comes up in job interviews. By virtue of the subject is extremely easy to describe, electrical surface also appeared.

Without considering the details, this problem is a simple search problem. Hash tables are often a more efficient solution for finding problems.

However, if you use a hash table in your interview answer, the next question you’ll be asked is: What then? If you can’t come up with an answer, the interviewer will flatly inform you that the interview is over for today and we’ll get back to you within a week.

Why not hash tables

10 billion is a large order of magnitude, where each URL averages 64 bytes and the entire storage takes 640 gigabytes of memory. Because of the data structure used in hash tables, hash conflicts can occur. To keep the hash table small and avoid too many hash collisions, you need to use the linked list method, which stores the linked list pointer. So you end up with more than 1000 gigabytes of memory.

Just storing a URL requires 1000GB of space, and the boss can’t stand it!

BitMap

This is when you need to expand your thinking. First, consider a similar but simpler problem: now you have a very large number, say 10 million integers, and the integers range from 100 million to 100 million. So how do you quickly find out if an integer is in 10 million integers?

We need to determine whether the number exists, that is, the number exists in two states: exists (True) or does not exist (False).

So this can be handled using an array that stores the state. This array features a size of 100 million and a data type of Boolean (True or False). Set the value of the array to True. For example, array[233] = True for the integer 233.

This operation is called bitmap: each bit is used to store some state. It is suitable for large-scale data, but there are not many data states.

In addition, bitmap has the advantage that the space does not increase with the number of elements in the set. Its storage space is calculated by finding the largest element among all elements (let’s say N), so the space occupied is:

Therefore, when N is 100 million, 12MB of storage space is required. When N is 1 billion you need 120MB of storage. In other words, the bitmap takes up more space as the largest element in the set increases. This creates a problem. If you are looking for a small number of elements but one of them has a large value, such as a range of numbers from 100 billion to 100 billion, it consumes a lot of space.

Therefore, for performance and memory footprint, a Bloom filter is the best solution here: a Bloom filter is an improvement over bitmaps.

Bloom filter

Bloom Filter (English: Bloom Filter) was proposed by Burton Bloom in 1970.

It's actually a long binary vector and a series of random mapping functions.Copy the code

It can be used to determine whether an element is in a set. Its advantage is that it only needs to occupy a small memory space and has high query efficiency.

The essence of a Bloom filter is a bit array: a bit array is an array in which each element takes up only one bit, and each element can only be 0 or 1.

The Bloom filter has K hash functions in addition to a bit array. When an element is added to the Bloom filter, the following operations are performed:

  • K hash functions are used to evaluate the element value K times, resulting in K hashes.
  • Based on the resulting hash value, the value of the corresponding subscript in the bit array is set to 1.

For example, suppose the Bloom filter has three hash functions: F1, F2, F3 and a bit array arr. Now to insert 2333 into the Bloom filter:

  • Three hashes of the values yield three values, n1, n2, and n3.
  • Set arr[n1], arr[n2], and arr[3] to 1.

To determine whether a value is in the Bloom filter, the element is hashed three times to determine whether each element in the bit array is 1. If the value is 1, the value is in the Bloom filter. If there is a value that is not 1, the element is not in the Bloom filter.

Obviously, the size of an array, however large, is finite. So with the increase of the element and insert it into the elements of the will, the more the position of the bit is set to 1 in the array is, the more so, it will cause a situation: when an element, not in the bloom filter after the same rules of the hash computation, get the value in the array of query, could these positions before because the operation of the other elements is set to 1.

So, it’s possible that a nonexistent Bloom filter will be misidentified as being in a bloom filter. This is one of the drawbacks of bloom’s filter. However, if the Bloom filter determines that an element is not in the Bloom filter, then the value must not be in the bloom filter. The summary is:

  • Bloom filter says that an element is in, and it can be misjudged
  • The Bloom filter says an element is not there, so it must be not there

False is always False. True is maybe True.

Usage scenarios

The greatest use of a Bloom filter is that it can quickly determine whether an element is in a set. So it has the following three usage scenarios:

  • Web crawler to the URL to avoid crawling the same URL address
  • Spam filtering: Anti-spam, judging an email address from a list of billions of spam messages (same with spam messages)
  • In order to make the service down, some hackers will build a large number of keys that do not exist in the cache to initiate requests to the server. In the case of a large enough amount of data, frequent database queries may lead to DB failure. Bloom filter solves the problem of cache breakdown well.

More content

More algorithms interview content please pay attention to the public account “five minutes to learn algorithms” ~