Definition

The Bloom filter is a data structure proposed by Burton Howard Bloom in 1970. It consists of a long binary (bit) array and a set of hash functions.

Principle

To add an element, each hash function is applied to the element to produce a value; these values are used as indexes into the binary array, and the bits at those positions are flipped from 0 to 1. To check whether an element exists, the same hash functions are applied to compute the indexes and the corresponding bits are looked up: if every bit is 1, the element may exist; if any bit is 0, the element definitely does not exist. A minimal sketch of this idea follows below.
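To make the mechanics concrete, here is a minimal, self-contained sketch of that idea (this is not the Guava or Redis implementation used later; the salted-hashCode indexing is only a toy stand-in for real hash functions):

```java
import java.util.BitSet;

public class SimpleBloomFilter {

    private final BitSet bits;   // the binary array
    private final int size;      // number of bits
    private final int k;         // number of hash functions

    public SimpleBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th index for a value; a toy salted hash, for illustration only
    private int index(String value, int seed) {
        return Math.abs((value + "#" + seed).hashCode() % size);
    }

    // Adding an element: set the bit at every computed index to 1
    public void put(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    // Checking an element: any 0 bit means "definitely absent", all 1s means "may exist"
    public boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```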

Bloom filter in practice

Using a single-machine Bloom filter

  • First, we need to add the Google Guava dependency. Here I use Maven to manage the project's JAR dependencies:
```xml
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>20.0</version>
    <scope>compile</scope>
</dependency>
```
  • Using the BloomFilter provided by Guava, we initialize the Bloom filter with its parameters and expose a few static methods for use:
```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

/**
 * Local (single-machine) Bloom filter.
 *
 * @author ZRH
 */
public class LocalBloomFilter {

    /**
     * Create the Bloom filter with:
     * - the estimated number of data insertions
     * - the allowed false-positive rate
     */
    private final static BloomFilter<String> bloomFilter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 10000000, 0.00000001);

    /**
     * Check whether an element may be in the Bloom filter.
     *
     * @param id
     * @return
     */
    public static boolean match(String id) {
        return bloomFilter.mightContain(id);
    }

    /**
     * Add an element to the Bloom filter.
     *
     * @param id
     */
    public static void put(String id) {
        bloomFilter.put(id);
    }
}
```
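A quick usage sketch of these static methods (the demo class below is illustrative and not part of the original project):

```java
public class LocalBloomFilterDemo {
    public static void main(String[] args) {
        LocalBloomFilter.put("123");
        System.out.println(LocalBloomFilter.match("123")); // true: may exist
        System.out.println(LocalBloomFilter.match("999")); // almost certainly false: definitely absent
    }
}
```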
  • After initializing the Bloom filter, we can populate it with data.
```java
import com.google.common.collect.Lists;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.annotation.Order;
import org.springframework.util.CollectionUtils;

import java.util.List;

/**
 * Initialize data into the Bloom filter.
 *
 * @author ZRH
 */
@Order
@Slf4j
@Configuration
public class BloomFilterConfig implements InitializingBean {

    @Override
    public void afterPropertiesSet() {
        // Simulated data
        List<Integer> list = Lists.newArrayList(1, 2, 3, 4, 5, 6, 7, 8, 9);
        log.info("Loading data into the Bloom filter, size: {}", list.size());
        if (!CollectionUtils.isEmpty(list)) {
            list.stream().forEach(item -> LocalBloomFilter.put(item + ""));
        }
    }
}
```
  • Next, we write an interceptor to intercept requests and filter them accordingly:
```java
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.HttpMethod;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.util.AntPathMatcher;
import org.springframework.util.PathMatcher;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Map;

/**
 * Bloom filter interceptor.
 *
 * @author ZRH
 */
@Slf4j
@Component
public class BloomFilterInterceptor extends HandlerInterceptorAdapter {

    /**
     * Custom intercepting and filtering method.
     *
     * @param request
     * @param response
     * @param o
     * @return
     * @throws IOException
     */
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object o) throws IOException {
        // Let OPTIONS (preflight) requests through
        if (HttpMethod.OPTIONS.toString().equals(request.getMethod())) {
            response.setStatus(HttpStatus.NO_CONTENT.value());
            return true;
        }
        response.setHeader("Content-type", "application/json;charset=UTF-8");
        String id = request.getParameter("id");
        if (id == null) {
            String currentUrl = request.getRequestURI();
            PathMatcher matcher = new AntPathMatcher();
            Map<String, String> pathVariable = matcher.extractUriTemplateVariables("/filter/product/{id}", currentUrl);
            id = pathVariable.get("id");
        }
        log.info("Entering the Bloom filter interceptor...");
        // In the local Bloom filter: let the request pass
        if (LocalBloomFilter.match(id)) {
            return true;
        }
        // Not in the local Bloom filter: reject the request
        response.setHeader("Content-type", "application/json");
        response.setCharacterEncoding("UTF-8");
        String result = new ObjectMapper().writeValueAsString("Blocked by the Bloom filter, parameter does not exist: " + id);
        response.getWriter().print(result);
        return false;
    }
}
```
  • Write a test controller method so that requests to it can be intercepted and filtered (a minimal sketch follows after this list).
  • When an element that is in the Bloom filter is requested, the request is allowed through.

  • When an element that is not in the Bloom filter is requested, the request is intercepted.
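The article does not show the test controller itself; a minimal sketch matching the /filter/product/{id} path used by the interceptor might look like this (the class and method names are hypothetical):

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/filter")
public class ProductController {

    @GetMapping("/product/{id}")
    public String getProduct(@PathVariable("id") String id) {
        // Only reached when the Bloom filter interceptor lets the request through
        return "product " + id;
    }
}
```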

Using a Redis-based Bloom filter

  • The Bloom filter above was created with the Google Guava package and stores its data in local JVM memory, which obviously does not work for a distributed system.
  • So if you want to use a Bloom filter in a distributed system, you need to integrate Redis. The Redis SETBIT command is used to set individual bits of the corresponding key.
  • First we need to define a custom Bloom filter helper:
```java
import com.google.common.hash.Funnel;
import com.google.common.hash.Hashing;

/**
 * Bloom filter helper.
 * 1. We need k hash functions, each of which can hash a key to an integer.
 * 2. To initialize, we need an array of n bits, each bit initialized to 0.
 * 3. When a key is added to the set, the k hash functions compute k hash values and the corresponding bits in the array are set to 1.
 * 4. To determine whether a key is in the set, the k hash functions compute k hash values and the corresponding bits are queried; if all of them are 1, the key is considered to be in the set.
 */
public class BloomFilterHelper<T> {

    // Size of the bit array
    private long bitSize;
    // Number of hash functions
    private int numHashFunctions;
    // Funnel used to feed the element into the hash function
    private Funnel<T> funnel;

    /**
     * @param expectedInsertions expected number of data insertions
     * @param fpp                allowed false-positive rate
     * @param funnel
     */
    public BloomFilterHelper(long expectedInsertions, double fpp, Funnel<T> funnel) {
        this.funnel = funnel;
        bitSize = optimalNumOfBits(expectedInsertions, fpp);
        numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, bitSize);
    }

    /**
     * Calculate the size of the bit array.
     *
     * @param n expected number of data insertions
     * @param p allowed false-positive rate
     */
    private long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /**
     * Calculate the number of hash functions.
     *
     * @param n expected number of data insertions
     * @param m size of the bit array
     */
    private int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /**
     * Compute the bit indexes for an element.
     *
     * @param value element
     * @return
     */
    public long[] mightContain(T value) {
        long[] longs = new long[numHashFunctions];
        long hash64 = Hashing.murmur3_128().hashObject(value, funnel).asLong();
        int hash1 = (int) hash64;
        int hash2 = (int) (hash64 >>> 32);
        for (int i = 1; i <= numHashFunctions; ++i) {
            int combinedHash = hash1 + i * hash2;
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            longs[i - 1] = combinedHash % bitSize;
        }
        return longs;
    }
}
```
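As a quick sanity check on those two formulas (the numbers below are assumptions matching the BloomRedisService that follows: 10,000,000 expected insertions and a 1% false-positive rate):

```java
public class BloomFilterSizing {
    public static void main(String[] args) {
        long n = 10_000_000L; // expected number of insertions
        double p = 0.01;      // allowed false-positive rate

        // m = -n * ln(p) / (ln 2)^2  ->  size of the bit array
        long m = (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        // k = m / n * ln 2           ->  number of hash functions
        int k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));

        // Prints roughly 95.9 million bits (about 11 MB) and 7 hash functions
        System.out.println("bits: " + m + " (~" + (m / 8 / 1024 / 1024) + " MB)");
        System.out.println("hash functions: " + k);
    }
}
```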
  • Building on the Bloom filter helper, we integrate Redis, initialize a Bloom filter instance, and provide some operation methods for use:
```java
import com.google.common.base.Charsets;
import com.google.common.hash.Funnel;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.core.StringRedisTemplate;

import java.util.Arrays;

/**
 * Redis-backed Bloom filter service.
 *
 * @author ZRH
 */
@Configuration
public class BloomRedisService {

    @Autowired
    private StringRedisTemplate redisTemplate;

    // Bloom filter helper: expected insertions and allowed false-positive rate
    private static final BloomFilterHelper bloomFilterHelper =
            new BloomFilterHelper(10000000L, 0.01, (Funnel<String>) (from, into) -> into.putString(from, Charsets.UTF_8));

    /**
     * Add initial data to the Bloom filter.
     *
     * @param key
     * @param value
     * @param <T>
     */
    public <T> void addByBloomFilter(String key, T value) {
        // Compute the element's bit indexes with the hash functions
        long[] mightContain = bloomFilterHelper.mightContain(value);
        // Set the corresponding bits in Redis
        Arrays.stream(mightContain).forEach(o -> redisTemplate.opsForValue().setBit(key + value, o, true));
    }

    /**
     * Check whether data may exist.
     *
     * @param key
     * @param value
     * @param <T>
     * @return
     */
    public <T> boolean includeByBloomFilter(String key, T value) {
        // Compute the element's bit indexes with the hash functions
        long[] mightContain = bloomFilterHelper.mightContain(value);
        // The element may exist only if every corresponding bit is 1
        return !Arrays.stream(mightContain).anyMatch(o -> !redisTemplate.opsForValue().getBit(key + value, o));
    }
}
```
  • Add the initial data to the Bloom filter:
```java
import com.google.common.collect.Lists;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.annotation.Order;
import org.springframework.util.CollectionUtils;

import java.util.List;

/**
 * Initialize data into the Bloom filter.
 *
 * @author ZRH
 */
@Order
@Slf4j
@Configuration
public class BloomFilterConfig implements InitializingBean {

    /**
     * Redis key prefix
     */
    public final static String BLOOM_STRING_FILTER = "bloom:string:filter:";

    @Autowired
    private BloomRedisService service;

    @Override
    public void afterPropertiesSet() {
        // Simulated data
        List<Integer> list = Lists.newArrayList(1, 2, 3, 4, 5, 6, 7, 8, 9);
        log.info("Loading data into the Bloom filter, size: {}", list.size());
        if (!CollectionUtils.isEmpty(list)) {
            list.stream().forEach(item -> {
                // LocalBloomFilter.put(item + "");
                service.addByBloomFilter(BLOOM_STRING_FILTER, item + "");
            });
        }
    }
}
```
  • Perform the interception and filtering in the custom interceptor:
```java
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpMethod;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.util.AntPathMatcher;
import org.springframework.util.PathMatcher;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Map;

/**
 * Bloom filter interceptor.
 *
 * @author ZRH
 */
@Slf4j
@Component
public class BloomFilterInterceptor extends HandlerInterceptorAdapter {

    @Autowired
    private BloomRedisService bloomRedisService;

    /**
     * Custom intercepting and filtering method.
     *
     * @param request
     * @param response
     * @param o
     * @return
     * @throws IOException
     */
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object o) throws IOException {
        // Let OPTIONS (preflight) requests through
        if (HttpMethod.OPTIONS.toString().equals(request.getMethod())) {
            response.setStatus(HttpStatus.NO_CONTENT.value());
            return true;
        }
        response.setHeader("Content-type", "application/json;charset=UTF-8");
        String id = request.getParameter("id");
        if (id == null) {
            String currentUrl = request.getRequestURI();
            PathMatcher matcher = new AntPathMatcher();
            Map<String, String> pathVariable = matcher.extractUriTemplateVariables("/filter/product/{id}", currentUrl);
            id = pathVariable.get("id");
        }
        log.info("Entering the Bloom filter interceptor...");
        // Use the local Bloom filter:
        // if (LocalBloomFilter.match(id)) {
        //     return true;
        // }
        // Use the Redis-backed Bloom filter instead
        if (bloomRedisService.includeByBloomFilter(BloomFilterConfig.BLOOM_STRING_FILTER, id)) {
            return true;
        }
        // Not in the Bloom filter: reject the request
        response.setHeader("Content-type", "application/json");
        response.setCharacterEncoding("UTF-8");
        String result = new ObjectMapper().writeValueAsString("Blocked by the Bloom filter, parameter does not exist: " + id);
        response.getWriter().print(result);
        return false;
    }
}
```
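For completeness, the interceptor still has to be registered with Spring MVC. The article does not show that step; a typical registration might look roughly like this (the path pattern is an assumption based on the /filter/product/{id} route above):

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class WebMvcConfig implements WebMvcConfigurer {

    @Autowired
    private BloomFilterInterceptor bloomFilterInterceptor;

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Apply the Bloom filter interceptor to the filter endpoints
        registry.addInterceptor(bloomFilterInterceptor).addPathPatterns("/filter/**");
    }
}
```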
  • Finally, in the tests the Redis-based Bloom filter behaves consistently with the Guava-based one.

Advantages and disadvantages

Advantages

  • Because the elements themselves are not stored, only bits, the space footprint is small; there is also a data-security benefit, since the original elements cannot be recovered from the bits.
  • Lookups are fast: computing the hash values and checking the bits at those indexes takes O(k) time, where k is the number of hash functions.

Disadvantages

  • Different elements may produce the same hash values (hash collisions), so an element that was never added can still pass the Bloom filter. There is therefore a certain false-positive rate.
  • Because elements are stored as bits (0 means absent, 1 means present), deletion is problematic. For example, suppose element A's hash indexes are 1, 5, and 8 and element B's are 2, 6, and 8. To remove B we would have to reset the bits at indexes 2, 6, and 8 to 0, but element A also relies on index 8. Because bits are shared across elements whenever hashes collide, a standard Bloom filter is very unfriendly to deletion.

Usage scenarios

  • Preventing cache penetration and avalanche: normally, to serve a request we check whether the value is in the cache; if it is, we return it directly, otherwise we query the database. If a wave of requests arrives for cold data that does not exist at all, every request misses the cache and hits the database, causing cache penetration and an avalanche effect. With a Bloom filter in front of the cache, we only query the cache (and then the database) when the key is present in the Bloom filter; if it is not in the Bloom filter, we return a response directly. A sketch of this flow follows after this list.
  • Web interceptor: intercept identical repeated requests to prevent replay attacks. On the first request, put the request parameters into the Bloom filter; on subsequent requests, first check whether the parameters are matched by the Bloom filter. This also improves the cache hit ratio.
  • Determining whether a user has already watched a video or read an article. This will of course produce some false positives, but it will never show the user duplicate content.
  • Using a Bloom filter to reduce disk I/O or network requests: once a value definitely does not exist, we can skip the subsequent expensive query.
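A minimal sketch of the cache-penetration flow from the first bullet above (the cache map and database lookup here are hypothetical stand-ins for a real cache client and DAO):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ProductQueryService {

    // Hypothetical collaborators, for illustration only
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> database;

    public ProductQueryService(Function<String, String> database) {
        this.database = database;
    }

    public String getProduct(String id) {
        // 1. Definitely absent according to the Bloom filter: answer immediately,
        //    touching neither the cache nor the database.
        if (!LocalBloomFilter.match(id)) {
            return null;
        }
        // 2. Possibly present: try the cache first.
        String cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        // 3. Cache miss: query the database and refill the cache.
        String fromDb = database.apply(id);
        if (fromDb != null) {
            cache.put(id, fromDb);
        }
        return fromDb;
    }
}
```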

Final thoughts

  • Here is a real-world case to share: using a Bloom filter to optimize the short-video waterfall feed on the discover page of Tencent Video's NOW Live.
  • To address the disadvantage that a Bloom filter is unfriendly to deletion, there is a variant called the Counting Bloom Filter, which maintains an array of counters instead of single bits. It trades several times more space for the ability to delete elements, which the standard Bloom filter cannot do.
  • In real scenarios the bit array is very long, and the indexes produced by the different hash calculations can be spread far apart. These non-contiguous accesses lead to a low CPU cache hit ratio, which is the performance bottleneck of Bloom filters in practice.
  • A better filter that supports both insertion and deletion is the cuckoo filter. I will not go into it here; a later article will cover this better-performing cuckoo filter. There is also a link where you can explore the rules and principles of cuckoo filters and cuckoo hashing through a visualization.
  • Finally, keep learning with humility and let's make progress together.