Hash hash

The principle of

The Hash (or Hash) function is widely used in computing, especially in the field of fast data lookups, and in the field of encryption.

Its function is to map a large data set onto a small data set (which is called a hash, or hash).

One application is a Hash table, a data structure that is accessed directly based on Key values. That is, it accesses records by mapping the hash value to a location in the table to speed up the lookup. Here is a typical hash function/representation intent:

The hash function has the following two characteristics:

  • If two hash values are not the same (according to the same function), then the original input for the two hash values is also not the same.
  • The input and output of a hash function are not uniquely corresponding. If two hash values are the same, the two input values are likely to be the same. But it can also be different, and this is called a “hash collision” (or “hash collision”).

Disadvantages: According to Dr. Wu Jun’s The Beauty of Mathematics, the spatial efficiency of the hash table is still not high enough. If you use a hash table to store 100 million spam addresses, each email address will have 8bytes. The hash table is only 50% efficient, so one email address will take up 16bytes. So 100 million email addresses take up 1.6GB, and storing billions of email addresses requires hundreds of GB of memory. Unless it’s a supercomputer, the average server can’t store it.

So introduce the following Bloom Filter.

Bloom Filter

The principle of

If you want to determine whether an element is in a set, the common idea is to store all the elements in the set and then compare them. Linked lists, trees, Hash tables (also known as Hash tables), and other data structures are all in this way. But as the number of elements in the collection increases, we need more and more storage space. At the same time, the retrieval speed is getting slower and slower.

Bloom Filter is a random data structure with high spatial efficiency. Bloom Filter can be regarded as an extension of bit-map, and its principle is as follows:

When an element is added to the set, it is mapped to K points in a Bit array using K Hash functions, setting them to 1. To retrieve it, we can (approximately) know if it is in the set by looking at whether all the points are 1:

  • If any of these points have zeros, then the retrieved element must be absent;
  • If both are 1, then the retrieved element is most likely in.

advantages

It tells us that the element either definitely is not in the set or may be in the set.

Its advantage is that the space efficiency and query time are much more than the general algorithm, the Bloon filter storage space and the insert/query time are constant O(k). In addition, hash functions are independent of each other and can be implemented in hardware parallel. Bloon filters do not need to store the elements themselves, which is an advantage in some cases where confidentiality is very strict.

disadvantages

But the disadvantages of Blom filters are as obvious as their advantages. The miscalculation rate is one of them. As the number of elements stored increases, the miscalculation rate increases. But if the number of elements is too small, then a hash table is sufficient.

(The remedy: create a small white list of information that could have been misjudged.)

Also, elements cannot be removed from Bloon filters in general. It is easy to think of changing the array of bits to an array of integers, incrementing the counter for each element inserted, so that the counter is subtracted when the element is deleted. However, it is not so simple to ensure that elements are removed safely. First we must ensure that the deleted element is indeed in the bloon filter. This is not guaranteed by the filter alone. Counter winding can also cause problems.

Example

It can judge whether an element belongs to a set quickly and with high space efficiency. Used to implement data dictionary, or set for intersection.

Google’s Chrome browser, for example, uses Bloom filters to identify malicious links (which can represent larger data sets with less storage space, or simply map each URL to a bit) much more efficiently, and with a false detection rate of less than 1 in 10,000. Another example: spam detection

Suppose we store 100 million E-mail addresses. We build a vector of 1.6 billion binaries (bits), or 200 million bytes, and then set all of those 1.6 billion binaries to zero. For each email address X, we use eight different random number generators (F1,F2,... ,F8) generates eight information fingerprints (f1, f2... F8). Then use a random number generator G to map the eight information fingerprints to eight natural numbers from 1 to 1.6 billion, G1, G2... The g8. Now we set the binaries of all eight positions to one. When we did this for all 100 million email addresses. A Bloon filter is created for those email addresses.

Let’s do this again:

A,B two files, each hold 5 billion URLs, each URL occupies 64 bytes, the memory limit is 4 gigabytes, let you find A file A,B common URL. What about three or even n files? Analysis: If you allow a certain error rate, you can use Bloom Filter. 4G memory can represent about 34 billion bits. Map URLs in one of these files to these 34 billion bits using Bloom Filter, and then read each URL in turn to see if it matches Bloom Filter, and if so, it should be the same URL (with an error rate)."

Network applications of Bloom Filter: a survey

To support me further, please scan the QR code below, you understand ~