2021.05.12

Learning in the geek time [time.geekbang.org/column/arti…].

Bitmap A bitmap is a situation where elements in an array can be used to identify values by associating an array index with values in the application.

Bitmap code

‘public class BitMap {private char[] bytes; private int nbits;

public BitMap(int nbits) { this.nbits = nbits; this.bytes = new char[nbits/16+1]; }

public void set(int k) { if (k > nbits) return; int byteIndex = k / 16; int bitIndex = k % 16; bytes[byteIndex] |= (1 << bitIndex); }

public boolean get(int k) { if (k > nbits) return false; int byteIndex = k / 16; int bitIndex = k % 16; return (bytes[byteIndex] & (1 << bitIndex)) ! = 0; }} ‘(2) Bloom filter

Bloem filters work by mapping an element as it is added to a set to K points in a bit array by K hash functions, setting them to 1. If any of these points have a 0, the element must not be there. If both are 1, the element being checked is most likely in.

Bloom filter to reduce the probability of hash collisions, use K hash functions to hash the same number, that will result in K different hashes, which we will call X1, X2, X3… , XK. We take the K numbers as subscripts in the BitMap, and assign the corresponding BitMap[X1], BitMap[X2], BitMap[X3]… The BitMap[XK] is set to true, that is, we use K bits to indicate the existence of a number. When we want to query the existence of a certain number, we use the same K hash functions to hash this number, and get Y1, Y2, Y3… , YK. Let’s see if all K hashes in the bitmap are true, and if they’re all true, then that number exists, and if any of them are not true, then that number doesn’t exist. Bloom filters are well suited for large-scale judgment scenarios that do not require 100% accuracy and allow a small probability of misjudgment.

Web crawler is a very important system in search engines. The same web page link may be included in multiple pages, which will lead to crawler repeatedly crawling the same web page in the process of crawling, so how to avoid these repeated crawling?

We use a Bloom filter (CPU intensive) to keep track of links that have been crawled. Assuming a billion pages need to be reweighted, we can store it in a bitmap 10 times the size (to reduce misjudgments). That’s 10 billion binary bits, which translates to about 1.2GB in bytes. If you use hash tables to determine weights, you need at least 100GB of space. By comparison, bloom filters reduce storage space consumption by a lot.