A 50-year-old technology - Bloom filters

preface

We talked earlier about Redis cache avalanche, penetration, and breakdown. In this article we mentioned that one of the solutions to cache penetration is the Bloom filter, but we didn’t talk about how to use the Bloom filter last time.

As a warm man’s elder brother, to make up for you, please call me IT old warm man.

What is a Bloom filter

The Bloom Filter, which was introduced in 1970 by a guy named Bloom, is now 50 years old, as old as my brother.

It’s actually a long binary vector and a bunch of random mapping functions, and binary, as you probably all know, stores data that’s either 0 or 1, and the default is 0.

It is used to determine whether an element is in a set. 0 indicates that certain data does not exist, and 1 indicates that certain data exists.

Do you understand? As aSunshine boyMy brother is drawing you a picture to help you understand:

Use of Bloom filter

Resolving Redis cache penetration (today’s focus)
In crawler, the crawler url is filtered, and the url that already exists in Braun is not crawled.
Spam filtering, to determine whether each address that sends mail is in bloom’s blacklist, if so, it is judged as spam.

The above is just a simple example of use, we can draw inferences from one another, flexible use in the work.

Bloom filter principle

In the process

A Bloom filter, as it says, is a collection of binary data. When a piece of data is added to the set, it undergoes the following (there are drawbacks, which will be discussed below) :

The data is computed by K hash functions and returns K computed hash values
These K hash values map to the corresponding K binary array subscripts
Change the binary data corresponding to K subscripts to 1.

For example, if the first hash function returns x and the second and third hash functions return y and z, then the binary of x, y, and z is changed to 1.

As shown in the figure:

The query process

The main function of bloom filter is to query a data, in this binary set, the query process is as follows:

This data is computed by K hash functions, corresponding to K hash values computed
Find the corresponding binary array subscript using the hash value
Judgment: If there is a location where the binary data is 0, then the data does not exist. If both are 1’s, the data is in the set. (There are drawbacks, which will be covered below)

Removal process

You generally can’t delete data in bloom filters, which is one of the drawbacks we’ll examine below.

Advantages and disadvantages of bloom filters

advantages

Since you store binary data, it takes up very little space
Its insert and query speed is very fast and the time complexity is O (K), reminiscent of the process of HashMap
Confidentiality is good because it doesn’t store any raw data itself, only binary data

disadvantages

Which brings us back to the shortcomings we mentioned above.

Adding data is done by calculating the hash value of the data, so it is quite possible that two different data will compute the same hash value.

For example, if “hello” and “hello” in the figure end up with the same hash value, they will change the binary data with the same subscript to 1.

At this point, you don’t know whether binary with subscript 2 means “hello” or “hello.”

The following disadvantages can be obtained:

First, miscarriage of justice

If the graph above does not contain “hello” but only “hello”, then a query using “hello” will determine that “hello” exists in the collection.

Because the hash value for “hello” and “hello” is the same, the binary data found by using the same hash value is also the same, which is 1.

Two, delete difficulty

I’m going to stop here and I think you can understand why, but as your warm brother, I’ll tell you.

Again, use the above example, because “hello” and “hello” have the same hash value and the corresponding array index is the same.

At this time, the elder brother wanted to delete “hello”, and changed the binary data in the subscript 2 from 1 to 0.

Do we delete “hello” at the same time? (0 means there is such data, 1 means there is no such data)

By this point, you’ve already understood the Bloom filter. I’m the warm guy.

Implement bloom filter

There are many ways to implement this, one of which is the one provided by Guava.

Introduction of Guava POM configuration

< the dependency > < groupId > com. Google. Guava < / groupId > < artifactId > guava < / artifactId > < version > 29.0 jre < / version > </dependency> Copy the codeCopy the code

Two, code implementation

And let’s measure the error rate by the way.

import com.google.common.hash.BloomFilter; import com.google.common.hash.Funnels; Public class BloomFilterCase {/** * how much data is expected to be inserted */ private static int size = 1000000; /** * private static double FPP = 0.01; /** * Private static BloomFilter<Integer> BloomFilter = bloomfilter. create(funnel.integerFunnel (), size, fpp); Public static void main(String[] args) {for (int I = 0; i < size; i++) { bloomFilter.put(i); } int count = 0; for (int i = size; i < size + 100000; i++) { if (bloomFilter.mightContain(i)) { count++; System.out.println(I + "error "); }} system.out. println(" count + count "); }} Copy the codeCopy the code

Running results:

There are 947 misjudgments in 100,000 data, which is about 0.01%, which is the misjudgment rate set in our code: FPP = 0.01.

Dive deep into code

The core bloomfilter.create method

@VisibleForTesting static <T> BloomFilter<T> create( Funnel<? Super T> Funnel, Long expectedInsertions, double FPP, Strategy Strategy) {... } Duplicate codeCopy the code

There are four parameters:

funnel: data type (typically in the Funnels tool class)
expectedInsertions: The number of values expected to be inserted
fpp: Misjudgment rate (default 0.03)
strategy: Hash algorithm

Let’s focus on the FPP parameter

FPP misjudgment rate

Scenario 1: FPP = 0.01

Number of misjudgments: 947
Memory size: 9585058 bits

Scenario 2: FPP = 0.03 (Default)

Number of misjudgments: 3,033
Memory size: 7298440 bits

Scene summary

The miscarriage rate can passfppParameter adjustment
The smaller the FPP, the more memory space is required: 0.01 requires more than 9 million bits and 0.03 requires more than 7 million bits.
The smaller the FPP, the more hash functions needed to calculate the more hash values to store in the corresponding array subscripts when adding data to the collection. (Forgot to look at the process of filtering the data stored in bloom above)

NumBits = 1 million int digits Theoretically save a million numbers, an int is 4 bytes 32 bits, need 481000000=32 million bits. If you use HashMap to store, you need 64 million bits for 50 percent HashMap storage efficiency. As you can see, BloomFilter has a very small storage space, about 1/10 of that of HashMap

The numHashFunctions above indicate that several hash functions are required to map different subscripts to whether the numbers exist (0 or 1).

The latest BAT interview questions for 2020 are compiled here for you, including various Java technology fields, very comprehensive, a total of more than 80 PDF documents.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

A 50-year-old technology — Bloom filters

preface

What is a Bloom filter

Use of Bloom filter

Bloom filter principle

In the process

The query process

Removal process

Advantages and disadvantages of bloom filters

advantages

disadvantages

Implement bloom filter

Introduction of Guava POM configuration

Two, code implementation

Dive deep into code

FPP misjudgment rate

A 50-year-old technology — Bloom filters

preface

What is a Bloom filter

Use of Bloom filter

Bloom filter principle

In the process

The query process

Removal process

Advantages and disadvantages of bloom filters

advantages

disadvantages

Implement bloom filter

Introduction of Guava POM configuration

Two, code implementation

Dive deep into code

FPP misjudgment rate

Related Posts

FileBeat + Pipeline parsing log save to ElasticSearch

Remember a Lua memory leak lookup

Use Go + WebSocket to quickly implement a chat service