Sentenced to weight problems

Suppose we want to write a crawler. Because of the complexity of the links between the networks, spiders crawling across the network is likely to form a “loop”, the crawler will enter an infinite circle, unable to find a way out, the program will crash.

So in order to avoid forming a “ring”, you need to know which urls the spider has visited, that is, how to judge the weight.

The scheme of heavy sentencing

  1. Save the URL you visited to the database, and the database management system can erase it for you.
  2. Use a Set to store visited urls. So you only need to get close to O(1) to see if a URL has been visited.
  3. The URL is hashed one-way, such as MD5 or SHA-1, and then saved to a Set or database.
  4. Bit – Map method. Create a BitSet that maps each URL to a bit through a hash function.

Defects in the scheme

  1. Disadvantages of Method 1: When the data volume becomes very large, the efficiency of relational database queries becomes very low. And is it too much to start a database query every time a URL comes in?
  2. Disadvantages of Method 2: it consumes too much memory. As the number of urls increases, the memory footprint increases. Even with 100 million urls, each URL counts only 50 characters and requires at least 5GB of memory, not counting the memory waste in the Set data structure.
  3. Disadvantages of Method 3: Method 3 saves several times more memory than method 2 because the length of the string summary after MD5 processing is only 128 bits and that after SHA-1 processing is only 160 bits.
  4. Disadvantages of Method 4: the memory consumption is relatively small, but the disadvantage is that the probability of single hash function conflict is too high.

The hash function

The hash function maps arbitrary length of data to a field of finite length. To intuitively explain, it is to mix a string of data M and output another fixed-length data H as the feature of this data (fingerprint). In other words, no matter how big the data block M is, its output value H is of fixed length. How does it work? Divide M into fixed lengths (for example, 128 bits), hash them one by one, and then iterate with different methods (for example, xOR between the hash value of the previous block and the hash value of the next block). What if it’s not 128 bits? You can complete it with a 0 or you can complete it with a 1, as long as you have a convention in the algorithm.

The emergence of BloomFilte

Probability of hash conflict for a single hash function In the case of a large amount of data, it is easy to occur hash conflict (that is, different urls can obtain the same result through hash function calculation).

BloomFilte uses k functions to reduce the possibility of hash conflicts. First, different hash functions calculate different hash values. Suppose there are 1 to K hash functions, and the probability of hash conflicts is P1, P2, p3…. for each hash function Pk, so if I take the results of k hash functions, collectively as one hash value, then the probability of hash conflicts is P1 * P2 * p3 *…. * PK. Apparently, the probability of conflicts is greatly reduced.

BloomFilte uses an array of bits of m size and k different hash functions. BloomFilte uses an array of bits of M size and k different hash functions of k different hash functions.

(2) For the first URL, the hash function is used to calculate once, and the calculation result is mapped to a certain bit of M, so that a URL will change k bits in m array to 1(if the corresponding bit is 1, there is no need to change).

(3) For the second URL, repeat Step 2. If the values of k hash functions calculated are found to be 1 in the corresponding positions, the URL is considered to be repeated; otherwise, it is not repeated

(4) Obviously, misjudgment may occur in the third step, because with the increase of URLS, many positions in the M array will definitely become 1, so in the comparison, all the K bits calculated by myself are 1, but they are the result of the common contribution of multiple urls. This is what the algorithm allows.

BloomFilte accuracy

(1) It is related to hash function. A good hash function should be able to map string to each Bit with approximately equal probability.

(2) Related to parameters

M: The length of the array

N: Determine the number of elements

K: the number of hash functions

According to the experimental results, the larger m/n is, the more accurate k is

Refer to the article

The BloomFilter principle

algorithm

Only 1 bottle out of 1000 bottles of poison is poisonous. The poison lasts for 24 hours. How many mice would it take to test the poison after 24 hours?

1000 mice < 1024, then 10 binary digits are representable.

Number 1000 bottles in base 10 from 0 to 999

  • Decimal number – Binary number
  • 0-0000000000.
  • 1-0000000001.
  • 2-0000000010.
  • .
  • 999-1111100111.

Since 10 binary digits represent all the bottles, you need 10 mice, each numbered a, B, C, D, E, F, G, H, I, J

Each mouse will drink water from all the bottles with a 1, representing the mapping between the mouse and the bottle

Such as:

  • A mouse drank water in all bottles that met the condition XXXXXXXXX1
  • B The rat drank water from all bottles conforming to the condition XXXXXXXX1x
  • .
  • J Mice drink water in accordance with 1XXXXXXXXX condition, all bottles

Results analysis

After 24 hours

(1) If the mice didn’t die, the first bottle was poisoned

(2) If some of the mice died, according to the mapping relationship between the mice and the bottle, all the mappings of the dead mice were made into ones and, and 1 was added to the obtained result, which was the bottle number (since the bottle was 1-1000 and the bottle number was 0-999).

  • For example, mice A and B died
  • So the bits and are xxxxXXXXX1 and XXXXXXXX1x bits and
  • The result is xxxxxxxx11 = 3, indicating that the fourth bottle is poisonous