Cache avalanche

Cache avalanche happens when a large number of hot keys in Redis expire (become invalid) at the same time. Because they were all given the same expiration time, and the volume of concurrent requests is very high at that moment, all of those requests fall through to the database at once.

How can we solve this problem?

  • Use a mutex or a queue so that only one thread queries the database for a given key
  • Refresh the cache periodically, ahead of expiration, so entries don't all become invalid at the same time
  • Add a random offset to each key's TTL so keys expire at different times (see the sketch below)
  • Let the cache never expire
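
For the random-TTL idea, here is a minimal sketch, assuming a Spring StringRedisTemplate; the base TTL and the jitter range are illustrative values:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    import org.springframework.data.redis.core.StringRedisTemplate;

    public class CacheWriter {

        private final StringRedisTemplate redisTemplate;

        public CacheWriter(StringRedisTemplate redisTemplate) {
            this.redisTemplate = redisTemplate;
        }

        // Base TTL plus a random offset, so hot keys written at the same
        // moment do not all expire at the same moment.
        public void setWithJitter(String key, String value, long baseTtlSeconds) {
            long jitterSeconds = ThreadLocalRandom.current().nextLong(0, 300); // up to 5 extra minutes
            redisTemplate.opsForValue().set(key, value, baseTtlSeconds + jitterSeconds, TimeUnit.SECONDS);
        }
    }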

Cache penetration

Look at this picture: user A performs a query with an invalid condition. At that point there is nothing in Redis, so by the normal flow we go to the database. But since the condition is invalid, the value doesn't exist in the database either, so of course nothing gets written back into Redis, and an empty result is returned to the user. Once or twice is fine, but if it keeps happening, Redis, which I put in place as a shield to take pressure off the database, becomes a dummy: every request still goes to the database. This is called cache penetration; it is as if Redis didn't exist and had been punched straight through. This situation is easy to solve: we can cache an empty string or a special string, such as &&, in Redis. The next time we query Redis and get back the empty string or &&, we know the value does not exist in the database, so we don't query the database. Note: when caching a key that does not exist, you must set an expiration time. Otherwise, if the key is later added to the database, the cache will be inconsistent with the database.
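
A minimal sketch of this placeholder approach, assuming a Spring StringRedisTemplate and a hypothetical UserDao; the key format, marker string, and TTLs are illustrative:

    import java.util.concurrent.TimeUnit;

    import org.springframework.data.redis.core.StringRedisTemplate;

    public class UserCache {

        private static final String MISSING_MARKER = "&&"; // marks "known not to exist"

        private final StringRedisTemplate redisTemplate;
        private final UserDao userDao;

        public UserCache(StringRedisTemplate redisTemplate, UserDao userDao) {
            this.redisTemplate = redisTemplate;
            this.userDao = userDao;
        }

        public String getUser(String id) {
            String cached = redisTemplate.opsForValue().get("user:" + id);
            if (cached != null) {
                return MISSING_MARKER.equals(cached) ? null : cached;
            }
            String fromDb = userDao.findById(id); // only real cache misses reach the database
            if (fromDb == null) {
                // Cache the miss with a short TTL, so a row inserted later becomes visible again
                redisTemplate.opsForValue().set("user:" + id, MISSING_MARKER, 60, TimeUnit.SECONDS);
                return null;
            }
            redisTemplate.opsForValue().set("user:" + id, fromDb, 1, TimeUnit.HOURS);
            return fromDb;
        }

        interface UserDao { String findById(String id); } // hypothetical DAO, returns null when absent
    }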

That covers repeated queries for the same nonexistent value. What if every query uses a different nonexistent value? Caching a special string no longer helps, because the value is different each time. For example, our database user IDs increment: 111, 112, 113, 114. But someone can attack you with random keys like -100, -936, -545. Neither Redis nor the database holds these values, and the key is different on every request, so caching placeholders is useless. The pressure on the database becomes enormous, worse than the previous case. What to do? This is where today's protagonist, the Bloom filter, takes the stage.

Let’s start with an interview question

Q: How do you quickly determine whether an element exists among a massive number of elements (say, 1 billion unordered, variable-length, non-repeating elements)? Well, the simplest idea is to put all the data into a data structure such as a List, Map, or Tree and look it up, e.g., map.get(). But suppose each element takes about 1 KB: a billion of them need roughly 900 GB of memory. An ordinary server simply doesn't have that, and it is certainly not the answer the interviewer wants to hear; there must be a cleverer approach. So let me introduce a space-saving data structure: the bitmap. It is an ordered array with only two possible values, 0 and 1: 0 means absent, 1 means present.
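
To make the bitmap concrete, here is a minimal sketch using the JDK's java.util.BitSet; the positions are arbitrary:

    import java.util.BitSet;

    public class BitmapDemo {
        public static void main(String[] args) {
            BitSet bitmap = new BitSet();
            bitmap.set(3); // mark position 3 as "exists"
            bitmap.set(7);

            System.out.println(bitmap.get(3)); // true  -> present
            System.out.println(bitmap.get(5)); // false -> absent

            // One billion presence flags need only 1e9 bits, about 120 MB,
            // instead of storing the billion elements themselves.
        }
    }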

So now we need a mapping: to know where an element sits in the bitmap and check whether that position is 0 or 1, we use a hash function. A hash function has two useful properties. First, whatever the length of the input, the length of the output is fixed. Second, its output is evenly distributed; if everything were squeezed together in one spot, how could you tell anything apart? MD5 and SHA-1, for example, are common hashing algorithms.
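
A tiny demonstration of the fixed-length property, using the JDK's built-in MessageDigest (MD5 here is purely an illustration):

    import java.security.MessageDigest;

    public class HashLengthDemo {
        public static void main(String[] args) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            // Whatever the input length, the digest is always 16 bytes (128 bits)
            System.out.println(md5.digest("a".getBytes()).length);                          // 16
            System.out.println(md5.digest("a much longer input string".getBytes()).length); // 16
        }
    }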

We can run an element through the hash function to see whether it is present. Following the red line, 24 and 147 produce the same value through the hash function; we call this case a hash collision. Hash collisions are unavoidable; what we can do is reduce their probability. The first way is to expand the bit array, i.e., the bitmap capacity: because the hash function distributes values evenly, the larger the bitmap, the less likely two elements collide at the same position. But a larger bitmap means more memory consumption, so let's see whether we can solve it another way. The second way is to compute with several hash functions. If 24 and 147 collide after one computation, what are the odds they still collide after five, ten, or a hundred computations? If they do, that really is fate. But more hash functions is not always better: they fill the bitmap up faster, and each computation takes time, so we need to find a balance between time and space.

For example: how much bitmap capacity do we need to store a million elements? How many hash functions do we need?

Bloom filter

Enter today's hero, the Bloom filter. This problem has already been studied: back in 1970, a predecessor named Bloom investigated how to determine whether an element exists among massive numbers of elements, i.e., exactly how much bitmap capacity and how many hash functions are needed. He published a paper proposing the container now called the Bloom filter.

Look at this picture: our set has three elements to store. Take a: we run it through three hash functions, f1(a), f2(a), f3(a), and store a 1 at each resulting position; b and c go through the same three functions into their positions. When we look an element up, say a, we compute f1(a) and find that position is 1, the second position is 1, and the third position is 1, so we say a exists in the Bloom filter. No problem so far. Now look at d below: all three computations come back 1, so can we say d exists in the Bloom filter? Obviously not. If we look closely at d, the three 1s it hits were actually set by f1(a), f1(b), and f2(c), not by storing d. This is caused by hash collisions. The case where an element not in the Bloom filter is misjudged to exist is called the False Positive Probability (FPP).

Let's look at one more element, e. To determine whether it exists in the container, we again evaluate all three functions. The first position is 1, the second position is 1, and the third position is 0. So can we tell whether e is in the Bloom filter? Yes: e definitely does not exist. If e had been put in, all three of these positions would be 1; since one of them is 0, it was never inserted. Two important conclusions can be drawn from the illustration above.

From the perspective of the container:

  • If the Bloom filter determines that an element exists in the set, it doesn't necessarily exist
  • If the Bloom filter says no, it must not exist

From the element point of view:

  • If the element actually exists, the Bloom filter must determine that it does
  • If the element does not actually exist, the Bloom filter may determine that it does

Remember, boys.

Guava implements the Bloom filter

One reason Java has so many users is that it is open source and embraces open source: many frameworks, many wheels, and usually more than one wheel per job. For serialization alone there are FastJSON, Jackson, and GSON to choose from. The Bloom filter wheel is Guava, provided by Google. Let's see how it works in code.

First, pull in the jar dependency:

      <dependency>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
          <version>21.0</version>
      </dependency>

Here we put a million elements into the Bloom filter, then run 10,000 lookups — 100 elements that exist and 9,900 that don't — to test the accuracy and the false positive rate:

    import com.google.common.base.Charsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    import java.math.BigDecimal;
    import java.math.RoundingMode;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.UUID;

    public class BloomFilterDemo {

        // How much data to insert
        private static final int insertions = 1000000;

        // The expected false positive rate
        private static double fpp = 0.02;

        public static void main(String[] args) {

            // Initialize a Bloom filter that stores string data; if fpp is not given, the default is 0.03
            BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), insertions, fpp);

            // Holds every key that actually exists, used to verify membership
            Set<String> sets = new HashSet<String>(insertions);

            // Holds every actual key in insertion order, for fetching by index
            List<String> lists = new ArrayList<String>(insertions);

            // Insert random strings
            for (int i = 0; i < insertions; i++) {
                String uuid = UUID.randomUUID().toString();
                bf.put(uuid);
                sets.add(uuid);
                lists.add(uuid);
            }

            int rightNum = 0;
            int wrongNum = 0;

            for (int i = 0; i < 10000; i++) {
                // The 100 values between 0 and 10000 divisible by 100 really exist; the rest are random
                String data = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();

                // mightContain doesn't sound very confident, so when the Bloom filter says
                // "present", we double-check against the set of real keys
                if (bf.mightContain(data)) {
                    if (sets.contains(data)) {
                        rightNum++;
                        continue;
                    }
                    wrongNum++;
                }
            }

            BigDecimal percent = new BigDecimal(wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
            BigDecimal bingo = new BigDecimal(9900 - wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
            System.out.println("Out of 1,000,000 elements, the 100 real elements the Bloom filter judged to exist: " + rightNum);
            System.out.println("Of the 9,900 elements that do not actually exist, judged to exist: " + wrongNum
                    + ", hit rate: " + bingo + ", false positive rate: " + percent);
        }
    }

Running the program gives the final result, which confirms the conclusions above: the 100 real elements are all judged to exist in the Bloom filter, while of the other 9,900 elements that do not exist, the Bloom filter still judged 216 to exist — these are false positives. The reason was covered above: the Bloom filter is not all-powerful. But it blocks most of the nonexistent data for us, which is excellent and takes a great deal of pressure off the database. By the way, the false positive rate of 0.02 is something we set ourselves when initializing the Bloom filter; if not specified, the default is 0.03. And never set it to 0!

The capacity of the bitmap is calculated from the number of elements and the false positive rate.

    long numBits = optimalNumOfBits(expectedInsertions, fpp);

Based on the size of the bit array, we further calculate the number of hash functions.

    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);

Using these formulas, we find that storing 1 million elements takes only about 0.87 MB of memory, with 5 hash functions.
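
As a sanity check, here is a small sketch that replays those two formulas outside of Guava, with n = 1,000,000 and the default p = 0.03; it reproduces the figures above:

    public class OptimalParams {
        public static void main(String[] args) {
            long n = 1000000L;   // expected insertions
            double p = 0.03;     // false positive probability

            // m = -n * ln(p) / (ln 2)^2
            long numBits = (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
            // k = round(m / n * ln 2)
            int numHashFunctions = Math.max(1, (int) Math.round((double) numBits / n * Math.log(2)));

            System.out.println(numBits / 8.0 / 1024 / 1024 + " MB"); // ~0.87 MB
            System.out.println(numHashFunctions + " hash functions"); // 5
        }
    }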

Online calculator: hur.st/bloomfilter…

Source: com/xhj/bloom/memory/BloomFilterDemo.java

Redis implements the Bloom filter

The Guava implementation keeps the Bloom filter's data in local memory, but our projects are often distributed, so we can keep the data in Redis instead and implement the Bloom filter with Redis. That means designing the mapping functions and working out the length of the binary vector ourselves. The code is below; you can use it directly, it has been tested.

    import com.google.common.hash.Funnel;
    import com.google.common.hash.Funnels;
    import com.google.common.hash.Hashing;

    import java.nio.charset.Charset;

    /**
     * Bloom filter core class
     *
     * @param <T> the type of element the filter accepts
     * @author jack xu
     */
    public class BloomFilterHelper<T> {
        private int numHashFunctions;
        private int bitSize;
        private Funnel<T> funnel;

        public BloomFilterHelper(int expectedInsertions) {
            this.funnel = (Funnel<T>) Funnels.stringFunnel(Charset.defaultCharset());
            bitSize = optimalNumOfBits(expectedInsertions, 0.03);
            numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, bitSize);
        }

        public BloomFilterHelper(Funnel<T> funnel, int expectedInsertions, double fpp) {
            this.funnel = funnel;
            bitSize = optimalNumOfBits(expectedInsertions, fpp);
            numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, bitSize);
        }

        /** Computes the bit positions for a value, deriving all hashes from one 128-bit murmur hash */
        public int[] murmurHashOffset(T value) {
            int[] offset = new int[numHashFunctions];

            long hash64 = Hashing.murmur3_128().hashObject(value, funnel).asLong();
            int hash1 = (int) hash64;
            int hash2 = (int) (hash64 >>> 32);
            for (int i = 1; i <= numHashFunctions; i++) {
                int nextHash = hash1 + i * hash2;
                if (nextHash < 0) {
                    nextHash = ~nextHash;
                }
                offset[i - 1] = nextHash % bitSize;
            }

            return offset;
        }

        /** Calculates the length of the bit array */
        private int optimalNumOfBits(long n, double p) {
            if (p == 0) {
                p = Double.MIN_VALUE;
            }
            return (int) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        }

        /** Calculates how many times the hash method should run */
        private int optimalNumOfHashFunctions(long n, long m) {
            return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
        }
    }
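
Before wiring the helper into Redis, a quick usage sketch (the key string is just an example):

    BloomFilterHelper<String> helper = new BloomFilterHelper<>(1000000);
    int[] offsets = helper.murmurHashOffset("user:123");
    // offsets holds numHashFunctions bit positions, each in [0, bitSize)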

You may not have heard of the Redis bitmap. You could call it a new data type, or not, because underneath it is still a string. I'll write a separate article later on Redis data types and how they are used in internet scenarios.

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.dao.DataAccessException;
    import org.springframework.data.redis.connection.RedisConnection;
    import org.springframework.data.redis.core.RedisCallback;
    import org.springframework.data.redis.core.RedisTemplate;

    import java.util.List;

    /**
     * Redis-backed Bloom filter operations
     *
     * @param <T> the type of element the filter accepts
     * @author xhj
     */
    public class RedisBloomFilter<T> {
        @Autowired
        private RedisTemplate redisTemplate;

        /**
         * Deletes the cached KEY
         *
         * @param key KEY
         */
        public void delete(String key) {
            redisTemplate.delete(key);
        }

        /**
         * Adds a value through the given Bloom filter; used when adding a single element
         *
         * @param bloomFilterHelper the Bloom filter object
         * @param key               KEY
         * @param value             the value
         * @param <T>               generic, any type of value can be passed
         */
        public <T> void add(BloomFilterHelper<T> bloomFilterHelper, String key, T value) {
            int[] offset = bloomFilterHelper.murmurHashOffset(value);
            for (int i : offset) {
                redisTemplate.opsForValue().setBit(key, i, true);
            }
        }

        /**
         * Adds values through the given Bloom filter; used when adding a batch of elements.
         * Batch adds perform well because they use pipeline mode (in a cluster, use an
         * optimized RedisPipeline operation instead).
         *
         * @param bloomFilterHelper the Bloom filter object
         * @param key               KEY
         * @param valueList         the list of values
         * @param <T>               generic, any type of value can be passed
         */
        public <T> void addList(BloomFilterHelper<T> bloomFilterHelper, String key, List<T> valueList) {
            redisTemplate.executePipelined(new RedisCallback<Long>() {
                @Override
                public Long doInRedis(RedisConnection connection) throws DataAccessException {
                    connection.openPipeline();
                    for (T value : valueList) {
                        int[] offset = bloomFilterHelper.murmurHashOffset(value);
                        for (int i : offset) {
                            connection.setBit(key.getBytes(), i, true);
                        }
                    }
                    return null;
                }
            });
        }

        /**
         * Checks whether a value exists according to the given Bloom filter
         *
         * @param bloomFilterHelper the Bloom filter object
         * @param key               KEY
         * @param value             the value
         * @param <T>               generic, any type of value can be passed
         * @return whether it exists
         */
        public <T> boolean contains(BloomFilterHelper<T> bloomFilterHelper, String key, T value) {
            int[] offset = bloomFilterHelper.murmurHashOffset(value);
            for (int i : offset) {
                if (!redisTemplate.opsForValue().getBit(key, i)) {
                    return false;
                }
            }
            return true;
        }
    }

Finally, there is the test class

    public static void main(String[] args) {
        // In a real application this bean should come from the Spring context,
        // so that redisTemplate is actually injected
        RedisBloomFilter redisBloomFilter = new RedisBloomFilter();
        int expectedInsertions = 1000;
        double fpp = 0.1;
        redisBloomFilter.delete("bloom");
        BloomFilterHelper<CharSequence> bloomFilterHelper = new BloomFilterHelper<>(Funnels.stringFunnel(Charset.defaultCharset()), expectedInsertions, fpp);
        int j = 0;
        // Add 100 elements
        List<String> valueList = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            valueList.add(i + "");
        }
        long beginTime = System.currentTimeMillis();
        redisBloomFilter.addList(bloomFilterHelper, "bloom", valueList);
        long costMs = System.currentTimeMillis() - beginTime;
        // log is assumed to be an Slf4j logger (e.g., Lombok @Slf4j) on the enclosing class
        log.info("Added {} values to the bloom filter, took {}ms", 100, costMs);

        // Query 1000 values; only the first 100 were inserted, so the
        // Bloom filter should reject most of the other 900
        for (int i = 0; i < 1000; i++) {
            boolean result = redisBloomFilter.contains(bloomFilterHelper, "bloom", i + "");
            if (!result) {
                j++;
            }
        }
        log.info("Missed {}, verification took {}ms", j, System.currentTimeMillis() - beginTime);
    }

Note that addList uses pipelining, while the basic add method calls setBit in a for loop. The loop is very slow, but each call returns a value telling you whether the bit was set successfully; with pipelining you don't get those results back. So which method to choose depends on your business scenario and how fast you need inserts to be.

Redisson implementation

Well, that was a hand-rolled distributed Bloom filter. Since Java embraces open source, someone kind has of course already implemented one for us; if we wrote one ourselves every time, it might be inefficient and miss many edge cases. That mature framework is the famous Redisson!

The reference document is as follows: github.com/redisson/re…

Here’s how I write a single-node implementation:

    import org.redisson.Redisson;
    import org.redisson.api.RBloomFilter;
    import org.redisson.api.RedissonClient;
    import org.redisson.config.Config;

    /**
     * @author jack xu
     */
    public class SingleTest {
        public static void main(String[] args) {

            Config config = new Config();
            config.setCodec(new org.redisson.client.codec.StringCodec());

            // Specify single-node deployment mode
            config.useSingleServer().setAddress("redis://127.0.0.1:6379");

            RedissonClient redisson = Redisson.create(config);

            RBloomFilter<SomeObject> bloomFilter = redisson.getBloomFilter("sample");
            // Initialize the Bloom filter with an expected 55,000,000 elements and an expected error rate of 0.03
            bloomFilter.tryInit(55000000L, 0.03);
            bloomFilter.add(new SomeObject("field1Value", "field2Value"));
            bloomFilter.add(new SomeObject("field5Value", "field8Value"));
            bloomFilter.contains(new SomeObject("field1Value", "field8Value"));
        }
    }

Redisson's Redis-based cluster distributed Bloom filter, exposed through the RClusteredBloomFilter interface, provides Bloom filter data sharding in a Redis cluster environment. It uses a more memory-efficient algorithm that frees cluster memory by compressing unused bits. The state of each object is distributed across the whole cluster, and it can hold up to 2^64 bits. Note: this feature is limited to the Redisson PRO edition.

    RClusteredBloomFilter<SomeObject> bloomFilter = redisson.getClusteredBloomFilter("sample");
    // Create a Bloom filter with the following parameters:
    // expectedInsertions = 255000000
    // falseProbability = 0.03
    bloomFilter.tryInit(255000000L, 0.03);
    bloomFilter.add(new SomeObject("field1Value", "field2Value"));
    bloomFilter.add(new SomeObject("field5Value", "field8Value"));
    bloomFilter.contains(new SomeObject("field1Value", "field8Value"));

Counting Bloom filter

When data is deleted from the database, it should be removed from the Bloom filter too. But the Bloom filter provides no delete operation. Why not?

Let's analyze it. Suppose we delete a: all three of its positions would have to be set back to 0. But then when we check whether element b exists, because one of its positions has become 0, we would say b does not exist, when in fact it does. Because of hash collisions, bits are shared between elements, so elements can only be stored, never deleted.

So how could we implement deletion? Similar to HashMap's separate chaining, we can keep a counter at each position instead of a single bit. If a position is hit twice, its counter is 2. When deleting a, we first decrement the counter to 1; only when b is also removed does the counter drop to 0, and the corresponding bit is cleared to 0.

A Bloom filter with this delete capability is called a Counting Bloom Filter. Source: com/xhj/bloom/countBloom/CountingBloomFilterTest.java
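
The author's version lives at the source path above; as a rough illustration of the counting idea (reusing the double-hashing trick from BloomFilterHelper, with hypothetical sizing parameters), a minimal sketch might look like this:

    import com.google.common.hash.Funnels;
    import com.google.common.hash.Hashing;

    import java.nio.charset.Charset;

    /** A minimal counting Bloom filter sketch: one int counter per slot instead of one bit */
    public class CountingBloomFilter {
        private final int[] counters;
        private final int numHashFunctions;

        public CountingBloomFilter(int size, int numHashFunctions) {
            this.counters = new int[size];
            this.numHashFunctions = numHashFunctions;
        }

        private int[] positions(String value) {
            long hash64 = Hashing.murmur3_128()
                    .hashObject(value, Funnels.stringFunnel(Charset.defaultCharset())).asLong();
            int hash1 = (int) hash64;
            int hash2 = (int) (hash64 >>> 32);
            int[] positions = new int[numHashFunctions];
            for (int i = 1; i <= numHashFunctions; i++) {
                int next = hash1 + i * hash2;
                if (next < 0) next = ~next;
                positions[i - 1] = next % counters.length;
            }
            return positions;
        }

        public void add(String value) {
            for (int pos : positions(value)) counters[pos]++; // increment instead of set-to-1
        }

        public void remove(String value) {
            if (!mightContain(value)) return;                 // nothing to decrement
            for (int pos : positions(value)) counters[pos]--; // slot reads as "0" once the counter hits 0
        }

        public boolean mightContain(String value) {
            for (int pos : positions(value)) {
                if (counters[pos] == 0) return false;
            }
            return true;
        }
    }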

Where the Bloom filter sits in the workflow

Step one: load all the keys from the database into the Bloom filter at startup. Step two: when a request comes in, query the Bloom filter first. If it says the key doesn't exist, step three: return directly. If it says the key might exist, continue down the normal path.
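
Putting the three steps together, the request path might look like the following sketch; the bloomFilter, redisTemplate, and dao collaborators are assumptions wired up elsewhere:

    import java.util.concurrent.TimeUnit;

    import com.google.common.hash.BloomFilter;
    import org.springframework.data.redis.core.StringRedisTemplate;

    public class QueryService {

        private final BloomFilter<String> bloomFilter;   // step 1: filled from the DB at startup
        private final StringRedisTemplate redisTemplate;
        private final UserDao dao;                       // hypothetical DAO

        public QueryService(BloomFilter<String> bloomFilter, StringRedisTemplate redisTemplate, UserDao dao) {
            this.bloomFilter = bloomFilter;
            this.redisTemplate = redisTemplate;
            this.dao = dao;
        }

        public String query(String key) {
            // Step 2: ask the Bloom filter first
            if (!bloomFilter.mightContain(key)) {
                return null; // step 3: definitely not in the DB, return directly
            }
            // "Might exist": continue down the normal path, cache first...
            String cached = redisTemplate.opsForValue().get(key);
            if (cached != null) {
                return cached;
            }
            // ...then the database, writing the result back into the cache
            String fromDb = dao.findById(key);
            if (fromDb != null) {
                redisTemplate.opsForValue().set(key, fromDb, 1, TimeUnit.HOURS);
            }
            return fromDb;
        }

        interface UserDao { String findById(String key); } // hypothetical
    }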

Other application scenarios of the Bloom filter

  • Web crawlers deduplicate URLs to avoid crawling the same address twice;
  • Anti-spam: judging whether a mailbox is in a blacklist of billions of spam addresses;
  • Google Chrome uses a Bloom filter to identify malicious URLs;
  • Medium uses a Bloom filter to avoid recommending articles a user has already read;
  • Google Bigtable, Apache HBase, and Apache Cassandra use Bloom filters to reduce searches for nonexistent rows and columns.

Good, that's it for the Bloom filter. The next time an interviewer asks how to handle cache penetration, I believe you'll be able to answer. Learn it well and you can apply it at work too: for example, in an authentication service, when a user logs in, check a Bloom filter first instead of going straight to Redis or the database. Source code for this article: github.com/xuhaoj/redi…