A common requirement in everyday work is to determine whether an element is in a collection.

For example:

  • Given an IP address blacklist library, check whether the specified IP address is in the blacklist.
  • When receiving emails, check whether an email address is spam.
  • Check if an English word is spelled correctly in word processing software.

Faced with this kind of problem, the intuitive choice is a set data structure. For example, store all IP addresses from the blacklist library in a set, then check whether a given IP address exists in that set. If it does, the address matches the blacklist.

The following Java code simulates storing and checking an IP blacklist with a set.

import java.util.HashSet;
import java.util.Set;

public class IPBlackList {

	public static void main(String[] args) {
		Set<String> set = new HashSet<>();
		set.add("192.168.1.1");
		set.add("192.168.1.2");
		set.add("192.168.1.4");
		System.out.println(set.contains("192.168.1.1"));
		System.out.println(set.contains("192.168.1.2"));
		System.out.println(set.contains("192.168.1.3"));
		System.out.println(set.contains("192.168.1.4"));
	}
}

Execution Result:

true
true
false
true

Sets are usually implemented internally with hash tables. The advantage is that queries are very efficient; the disadvantage is that they consume more storage space.

In general, when the amount of data is relatively small, we store it in a set, trading space for time: query efficiency improves, and the extra space is affordable.

However, when storing large amounts of data, consuming large amounts of space can become a problem. This data is usually stored in process memory to speed up queries. Memory on a machine is usually limited and should be used as efficiently as possible.

A hash table must also balance space against efficiency. For the same number of elements, the smaller the hash table, the higher the probability of collisions, and the more time is spent resolving them, which hurts performance.

The Bloom filter solves this problem well: it stores data in less memory on the one hand, and offers very efficient query performance on the other.

Bloom Filter

Bloom Filter is a data structure proposed by Burton Howard Bloom in 1970. It’s actually a long binary vector and a series of random mapping functions.

A Bloom filter efficiently tests whether an element is in a set. Its advantages are space efficiency and query time far better than general algorithms; its disadvantages are a certain false positive rate and difficulty deleting elements (deletion is generally not supported and requires an additional mechanism).

The Bloom filter is efficient because it is a probabilistic data structure: it can confirm that an element is definitely not in the set, or that it might be. The "might" comes from the false positive rate, which makes it impossible to be 100% sure that an element is in the set.

Basic principle

The basic working principle of a Bloom filter is not complicated. Roughly:

First, build a binary vector with all bits set to 0.

Then, choose K hash functions. Each element is hashed K times to produce K bit indices into the vector.

Add elements

When an element is added to the set, the K hash functions are applied to it to produce K indices, and the corresponding bits of the vector are set to 1.

Check the element

To check whether an element is in the set, generate K indices with the same hash functions and examine the corresponding bits of the vector. If they are all 1, the element is most likely in the set; otherwise (if one or more bits are 0), the element is definitely not in the set.

That’s the basic idea of a Bloom filter.
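The add and check steps above can be sketched in Java. This is a minimal, illustrative implementation; the class name and the double-hashing scheme are my own choices, and a real Bloom filter would use stronger, independent hash functions and carefully tuned sizes.

```java
import java.util.BitSet;

// A minimal Bloom filter sketch (illustrative, not production-ready).
// A BitSet serves as the binary vector; K indices per element are
// derived from its hash code via a simple double-hashing scheme.
public class SimpleBloomFilter {
	private final BitSet bits;
	private final int size; // m: number of bits in the vector
	private final int k;    // K: number of hash functions

	public SimpleBloomFilter(int size, int k) {
		this.bits = new BitSet(size);
		this.size = size;
		this.k = k;
	}

	// Compute the i-th bit index for an element.
	private int index(String element, int i) {
		int h1 = element.hashCode();
		int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
		return Math.floorMod(h1 + i * h2, size);
	}

	// Add: set the K bits computed from the element to 1.
	public void add(String element) {
		for (int i = 0; i < k; i++) {
			bits.set(index(element, i));
		}
	}

	// Check: if any of the K bits is 0, the element is definitely
	// absent; if all are 1, it is only *possibly* present.
	public boolean mightContain(String element) {
		for (int i = 0; i < k; i++) {
			if (!bits.get(index(element, i))) {
				return false;
			}
		}
		return true;
	}

	public static void main(String[] args) {
		SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
		filter.add("192.168.1.1");
		System.out.println(filter.mightContain("192.168.1.1")); // true
		System.out.println(filter.mightContain("192.168.1.3")); // almost certainly false
	}
}
```

Note that an added element is always reported as present; only elements that were never added can trigger a (rare) false positive.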

A simple example

Suppose you have a Bloom filter with a capacity of 15 bits and two hash functions.

Add a string a: hashing it twice gives indices 4 and 10, so set bits 4 and 10 from 0 to 1.

Then add a string b: hashing it twice gives indices 11 and 11, so bit 11 is set from 0 to 1.

Add another string c: hashing it twice gives indices 11 and 12. Bit 11 is already 1, so only bit 12 changes from 0 to 1.

Finally, add a string sam: hashing it twice gives indices 0 and 7, so set bits 0 and 7 from 0 to 1.

Above, we added four strings, each hashed twice with its two bits marked as 1. Yet in the end only six bits are set to 1, not eight.

This shows that different elements can hash to the same positions. The more elements there are, the higher the probability of overlap. Once bits overlap, we can no longer tell with certainty whether a given element is in the set.

To check whether an element exists in the set, hash it twice in the same way and look up the two resulting indices in the Bloom filter. If the corresponding 2 bits are not both 1, the element is definitely not in the set. If both bits are 1, the element may or may not be in the set.

For example, to check whether the string b exists in the set, hashing yields two indices that are both 11, and bit 11 is indeed 1. But that doesn't necessarily mean b is in the set: the hashed indices of string c also include 11, so it is possible that c is in the set while b is not. This misidentification is known as a false positive.

Similarly, the string foo hashes to indices 8 and 13, and both corresponding bits are 0. Therefore, foo is definitely not in the set.

Mathematical principles

The math behind the Bloom filter is this:

The probability that two completely random numbers collide is very small, so a large amount of information can be represented in a very small amount of space with a very small false positive rate.
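This can be made precise with the standard analysis of Bloom filters. For a vector of $m$ bits, $k$ hash functions, and $n$ inserted elements, the false positive probability is approximately

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

and it is minimized by choosing $k \approx (m/n)\ln 2$, which gives $p \approx 0.6185^{\,m/n}$. For example, about 10 bits per element with $k = 7$ yields a false positive rate of roughly 1%.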

Two ways to deal with false positives

Whitelist

A common way to handle false positives is to maintain a small whitelist of data that could be misidentified.

Take spam filtering as an example. Suppose we have a spam library used to filter junk mail as it arrives.

In this case, the spam library is stored in a Bloom filter, which on receipt can efficiently confirm that most mail is not in the library and is therefore normal.

For the small number of messages flagged as spam, some may actually be normal, that is, false positives.

So we build a whitelist library: when the Bloom filter flags a message as spam, we query the whitelist to check whether the message is actually normal.

Messages not in the whitelist are moved to the trash by default. When manual inspection finds a normal message in the trash, its sender is added to the whitelist.
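The whitelist workflow above can be sketched as follows. The class and method names are illustrative, and the Bloom filter is modeled as a plain `Predicate<String>`; in practice it would be a real Bloom filter built from the spam library.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Sketch of the whitelist pattern: a Bloom filter flags possible spam,
// and a small whitelist corrects its false positives.
public class SpamFilter {
	private final Predicate<String> spamBloomFilter; // "might be spam" check
	private final Set<String> whitelist = new HashSet<>(); // known-good senders

	public SpamFilter(Predicate<String> spamBloomFilter) {
		this.spamBloomFilter = spamBloomFilter;
	}

	public boolean isSpam(String sender) {
		// Definitely not in the spam library: deliver normally.
		if (!spamBloomFilter.test(sender)) {
			return false;
		}
		// Possible false positive: consult the whitelist.
		return !whitelist.contains(sender);
	}

	// Manual correction: a normal mail found in the trash gets whitelisted.
	public void addToWhitelist(String sender) {
		whitelist.add(sender);
	}

	public static void main(String[] args) {
		Set<String> spamLibrary = new HashSet<>();
		spamLibrary.add("spam@example.com");
		// The set's contains() stands in for the Bloom filter here.
		SpamFilter filter = new SpamFilter(spamLibrary::contains);
		System.out.println(filter.isSpam("spam@example.com"));   // true
		System.out.println(filter.isSpam("friend@example.com")); // false
	}
}
```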

Going back to the source to confirm

In many cases, Bloom filters are used to cheaply and efficiently reject the bulk of requests for data that is not in the collection.

Such as:

  • Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use Bloom filters to reduce disk lookups for nonexistent rows or columns. Avoiding expensive disk lookups considerably improves the performance of database queries.
  • Google Chrome used a Bloom filter to identify malicious URLs: every URL was first checked against a local Bloom filter, and a full check was performed only when the Bloom filter returned a positive result (the user was warned if the full check was also positive).
  • Blocking the large number of requests from IP addresses that are not in a blacklist, and querying the blacklist database again only for the few addresses that may be in it.

These are very typical Bloom filter scenarios: most requests are filtered out cheaply, and only the few ambiguous ones need further processing.

Unlike the whitelist approach, this one does not require building another library; it reuses data and logic that already exist.

In Google Bigtable, for example, the row would have to be looked up anyway; the Bloom filter simply blocks most unnecessary requests. The same goes for IP blacklist checks: whether an address is blacklisted must ultimately be queried, and the Bloom filter just rejects most addresses first.

The spam handling above is different: possible spam is not confirmed by checking the complete spam library again, but by consulting a whitelist, because a whitelist library is generally smaller and easier to cache.

Going back to the source means that a request which may have been misidentified is ultimately confirmed against the original data source or logic.
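The back-to-the-source pattern can be sketched like this. The names are illustrative, and both checks are modeled as `Predicate<String>`: a cheap Bloom-filter-style check that may produce false positives, and a slow authoritative store that resolves them.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Sketch of the back-to-the-source pattern: the Bloom filter cheaply
// rejects most IPs; only possible hits fall through to the
// authoritative blacklist store for confirmation.
public class BlacklistChecker {
	private final Predicate<String> bloomFilter;    // fast, may false-positive
	private final Predicate<String> blacklistStore; // slow, authoritative

	public BlacklistChecker(Predicate<String> bloomFilter,
	                        Predicate<String> blacklistStore) {
		this.bloomFilter = bloomFilter;
		this.blacklistStore = blacklistStore;
	}

	public boolean isBlacklisted(String ip) {
		// Definitely not blacklisted: no expensive lookup needed.
		if (!bloomFilter.test(ip)) {
			return false;
		}
		// Possible hit: confirm against the source of truth.
		return blacklistStore.test(ip);
	}

	public static void main(String[] args) {
		Set<String> blacklistDb = new HashSet<>();
		blacklistDb.add("1.2.3.4");
		// A stand-in "Bloom filter" that over-approximates the
		// blacklist, including one deliberate false positive.
		Predicate<String> bloom =
			ip -> ip.equals("1.2.3.4") || ip.equals("5.6.7.8");
		BlacklistChecker checker =
			new BlacklistChecker(bloom, blacklistDb::contains);
		System.out.println(checker.isBlacklisted("1.2.3.4")); // true
		System.out.println(checker.isBlacklisted("5.6.7.8")); // false: resolved at the source
		System.out.println(checker.isBlacklisted("9.9.9.9")); // false: rejected by the filter
	}
}
```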


About me

WeChat public account: Binary Road

Blog: binarylife.icu