The cuckoo filter was born to solve a problem the Bloom filter cannot: deleting elements. The authors of the paper Cuckoo Filter: Practically Better Than Bloom compare the two structures in depth. Relative to the cuckoo filter, the Bloom filter has the following disadvantages: poor query performance, low space efficiency, no support for the reverse operation (deletion), and no counting support.

Query performance is poor because a Bloom filter must use multiple hash functions to probe multiple different positions in the bitmap. These positions are scattered widely across memory, resulting in a low CPU cache-line hit rate.

Space efficiency is lower because, at the same false-positive rate, the cuckoo filter is significantly more space-efficient than the Bloom filter; the paper reports savings of about 40%. On the other hand, Bloom filters do not require the bitmap length to be a power of two, as cuckoo filters do, so from this angle Bloom filters scale their space more flexibly.

The lack of deletion really hits the Bloom filter's soft spot. In a dynamic system, elements are always coming and going. Bloom filters are like stains: elements leave marks when they arrive and cannot clean up after themselves when they leave. For example, if only 10 million elements remain in the system but hundreds of millions have flowed through it, the Bloom filter has no choice but to keep the traces of those departed elements forever. Over time the filter gets more and more crowded, until one day you realize its false-positive rate is so high that you have to rebuild it.

The cuckoo filter claims in the paper to solve this problem by supporting deletion efficiently, and uses that as an important selling point to lure you away from the Bloom filter.

But after some research, it turns out the cuckoo filter isn't all it's cracked up to be. Its deletion support is so weak that in practice you can't use it. Before I explain this problem, let me first explain the cuckoo filter in detail.

Cuckoo hash

The cuckoo filter comes from the cuckoo hashing algorithm, and the cuckoo hashing algorithm comes from life: the cuckoo that loves to occupy other birds' nests. Cuckoos are free spirits that never build their own nests. A cuckoo lays its eggs in other birds' nests and lets them do the hatching. When the cuckoo chick emerges, being relatively large, it squeezes its foster mother's own babies (or eggs) out of the nest, where they fall to their deaths.

The simplest cuckoo hash is a one-dimensional array, with two hash functions mapping each new element to two positions in the array. If either of the two positions is empty, the element goes straight in. But if both slots are occupied, it has to "occupy the nest": it kicks one occupant out at random and takes the slot for itself.

p1 = hash1(x) % l  // l is the array length
p2 = hash2(x) % l

Unlike the cuckoo, the cuckoo hashing algorithm helps its victims find new nests. Since every element can live in either of two places, the sad, evicted element checks whether its other slot is free; if so, it moves there and everyone is happy. But what if that seat is also taken? Then it does the same thing all over again, passing the victim role on to the next occupant. The new victim repeats the process until every element has found its own nest.
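The kick-out loop just described can be sketched in Go. This is a minimal illustration under stated assumptions: the table stores the elements themselves, zero is reserved to mean "empty seat", and hash1/hash2 are toy stand-ins rather than real hash functions.

```go
const tableSize = 16
const maxKicks = 50

// Toy position functions; a real table would use proper hashes.
func hash1(x uint64) int { return int(x % tableSize) }
func hash2(x uint64) int { return int((x / tableSize) % tableSize) }

// cuckooInsert places x into table (0 marks an empty slot), evicting
// occupants when both candidate slots are taken. Returns false when
// the kick limit is hit, i.e. the table needs to grow.
func cuckooInsert(table []uint64, x uint64) bool {
	p1, p2 := hash1(x), hash2(x)
	if table[p1] == 0 {
		table[p1] = x
		return true
	}
	if table[p2] == 0 {
		table[p2] = x
		return true
	}
	p := p1 // both taken: start kicking from the first slot
	for i := 0; i < maxKicks; i++ {
		// occupy the nest: swap x with the current occupant
		table[p], x = x, table[p]
		// the victim moves to its other candidate slot
		if p == hash1(x) {
			p = hash2(x)
		} else {
			p = hash1(x)
		}
		if table[p] == 0 {
			table[p] = x
			return true
		}
	}
	return false
}
```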

As Lu Xun famously said, “Take your own nest and let others go!”

The problem is that if the array gets crowded enough, the kicking can go on hundreds of times without settling, which severely hurts insertion efficiency. So the cuckoo hash sets a threshold: when a chain of evictions exceeds it, the array is considered nearly full and must be expanded, with everything rehashed into the new array.

The other problem is eviction cycles. For example, if two different elements hash to the same pair of positions, they can occupy those two positions. But if a third element then hashes to the same two positions, an eviction cycle is inevitable: the three elements chase each other around two slots forever. The probability that three different elements map to the same pair of positions under both hashes is very low, though, unless your hash functions are really bad.

The cuckoo hashing algorithm treats such a cycle the same way as an overcrowded array: it triggers an expansion (even though the array may not actually be full).

Optimizations

The average space utilization of the basic cuckoo hash above is not high, perhaps only 50%. Beyond that, eviction chains quickly exceed the threshold. A hash table with that kind of utilization is of limited value, so it needs improvement.

One improvement is to add hash functions, so that each element has not just two candidate nests but three or four. This greatly reduces the probability of collision and raises space utilization to about 95%.

Another improvement is to put multiple seats at each position of the array. Then, even if two elements hash to the same position, neither has to "occupy the nest" immediately: with multiple seats, an element can sit in any free one, and an eviction is needed only when all the seats are taken. This also significantly reduces the number of evictions. Space utilization with this scheme is only about 85%, but query efficiency is very high, because the seats at one position are contiguous in memory and make efficient use of the CPU cache.

An even better solution merges the two improvements above, for example using four hash functions with two seats per position. That way we get both time efficiency and space efficiency; this combination can push space utilization up to 99%, a remarkable figure.

Cuckoo filter

A cuckoo filter, like a cuckoo hash table, is built on a one-dimensional array. But where the cuckoo hash stores the entire element, the cuckoo filter stores only the element's fingerprint (a few bits, in the spirit of a Bloom filter). The filter sacrifices accuracy for space efficiency: because only fingerprints are stored, it has a false-positive rate, just like the Bloom filter.

The cuckoo filter uses only two hash functions, but each position can hold multiple seats. The two hash functions must be chosen specially, because the filter stores only fingerprint information: when the fingerprint at some position is evicted, its dual position must be computed, and with the formulas below that computation would require the element itself. Recall the position formulas:

fp = fingerprint(x)
p1 = hash1(x) % l
p2 = hash2(x) % l

Knowing only p1 and x's fingerprint, we cannot compute p2 directly, because p2 depends on x itself.

Special hash function

The neat thing about the cuckoo filter is that its hash function is designed so that p2 can be computed directly from p1 and the element's fingerprint, without the full element x.

fp = fingerprint(x)
p1 = hash(x)
p2 = p1 ^ hash(fp)  // xor

As the formula shows, once we know fp and p1, we can immediately compute p2. Symmetrically, if we know p2 and fp, we can compute p1. The two positions are duals:

p1 = p2 ^ hash(fp)

So we never need to know whether the current position is p1 or p2: xor-ing the current position with hash(fp) always yields the dual position. We just have to make sure that hash(fp) != 0, which guarantees p1 != p2, so an element can never kick itself out and loop forever.

You might ask why this hash function doesn't take the result modulo the array length. It effectively does: the cuckoo filter forces the array length to be a power of two, so taking a value modulo the length is the same as keeping its lowest n bits. After the xor, we simply mask off all but the lowest n bits of the computed position to get the final dual position.

// l = power(2, 8)
p_ = p & 0xff  // keep the low 8 bits
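Here is a small Go sketch of the dual-position trick. fpHash is a made-up stand-in for the fingerprint hash; the properties that matter are only that its low bits are nonzero (so p1 != p2) and that xor is its own inverse (so the mapping is an involution).

```go
const l = 256      // table length, must be a power of two
const mask = l - 1 // 0xff: "% l" becomes "& mask"

// fpHash is an illustrative stand-in hash over a one-byte fingerprint.
func fpHash(fp byte) int {
	h := int(fp)*0x9e37 + 0x79b9 // arbitrary mixing constants
	if h&mask == 0 {
		h = 1 // ensure hash(fp) != 0 so the dual differs from p
	}
	return h
}

// dual returns the other candidate bucket for fingerprint fp at position p.
// Applying it twice returns the original position.
func dual(p int, fp byte) int {
	return (p ^ fpHash(fp)) & mask
}
```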

The data structure

For simplicity, let’s assume that the fingerprint takes up one byte and each position has four seats.

type bucket [4]byte  // one bucket, 4 seats

type cuckoo_filter struct {
  buckets [size]bucket // one-dimensional array
  nums int             // number of elements currently held
  kick_max int         // maximum number of evictions
}
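The algorithms below call a few bucket operations (full, add, contains, delete). Here is one plausible way to flesh them out in Go, under the simplification that a zero byte marks an empty seat (i.e., fingerprints are assumed to be nonzero):

```go
// bucket holds 4 one-byte fingerprints; 0 marks an empty seat,
// which assumes real fingerprints are never zero (a simplification).
type bucket [4]byte

// full reports whether all 4 seats are taken.
func (b *bucket) full() bool {
	for _, fp := range b {
		if fp == 0 {
			return false
		}
	}
	return true
}

// add puts fp into the first free seat; false means no seat was free.
func (b *bucket) add(fp byte) bool {
	for i, v := range b {
		if v == 0 {
			b[i] = fp
			return true
		}
	}
	return false
}

// contains reports whether some seat holds fp.
func (b *bucket) contains(fp byte) bool {
	for _, v := range b {
		if v == fp {
			return true
		}
	}
	return false
}

// delete erases one copy of fp and reports whether it was present.
func (b *bucket) delete(fp byte) bool {
	for i, v := range b {
		if v == fp {
			b[i] = 0
			return true
		}
	}
	return false
}
```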

Insert algorithm

Insertion needs to handle the worst case, an eviction cycle, so there must be a cap on the number of evictions.

def insert(x):
  fp = fingerprint(x)
  p1 = hash(x)
  p2 = p1 ^ hash(fp)
  # try the first position
  if !buckets[p1].full():
    buckets[p1].add(fp)
    nums++
    return true
  # try the second position
  if !buckets[p2].full():
    buckets[p2].add(fp)
    nums++
    return true
  # both full: pick one at random and start kicking
  p = rand(p1, p2)
  c = 0
  while c < kick_max:
    # evict a resident fingerprint and take its seat
    old_fp = buckets[p].replace_with(fp)
    fp = old_fp
    # compute the evicted fingerprint's dual position
    p = p ^ hash(fp)
    # try to place it there
    if !buckets[p].full():
      buckets[p].add(fp)
      nums++
      return true
    c++
  return false
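For the curious, the pseudocode above can be rendered as runnable Go. This is a sketch, not the library's implementation: the fingerprint and position hashes are stand-ins built on FNV, fingerprints are forced nonzero so that zero can mark an empty seat, and the eviction victim is always seat 0 instead of a random seat.

```go
import "hash/fnv"

const nBuckets = 256 // must be a power of two
const kickMax = 500  // eviction cap

type cbucket [4]byte // 4 seats, 0 marks an empty seat

type filter struct {
	buckets [nBuckets]cbucket
	nums    int
}

// fingerprint derives a nonzero one-byte fingerprint from x.
func fingerprint(x string) byte {
	h := fnv.New32a()
	h.Write([]byte(x))
	return byte(h.Sum32()>>24) | 1 // |1 keeps it nonzero
}

// hashPos maps x to its first candidate bucket.
func hashPos(x string) int {
	h := fnv.New64a()
	h.Write([]byte(x))
	return int(h.Sum64() & (nBuckets - 1))
}

// fpHash hashes a fingerprint; nonzero so the dual position differs.
func fpHash(fp byte) int {
	h := (int(fp)*0x5bd1 + 0xe995) & (nBuckets - 1)
	if h == 0 {
		h = 1
	}
	return h
}

// add places fp in the first free seat of bucket p, if any.
func (f *filter) add(p int, fp byte) bool {
	b := &f.buckets[p]
	for i, v := range b {
		if v == 0 {
			b[i] = fp
			f.nums++
			return true
		}
	}
	return false
}

// insert follows the pseudocode: try both buckets, then kick.
func (f *filter) insert(x string) bool {
	fp := fingerprint(x)
	p1 := hashPos(x)
	p2 := (p1 ^ fpHash(fp)) & (nBuckets - 1)
	if f.add(p1, fp) || f.add(p2, fp) {
		return true
	}
	p := p1 // both full: start kicking (victim fixed at seat 0 here)
	for c := 0; c < kickMax; c++ {
		b := &f.buckets[p]
		fp, b[0] = b[0], fp                   // evict a resident fingerprint
		p = (p ^ fpHash(fp)) & (nBuckets - 1) // its dual position
		if f.add(p, fp) {
			return true
		}
	}
	return false // too crowded; the filter needs to expand
}
```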

Search algorithm

Search is very simple: look for the fingerprint in the buckets at the two hash positions.

def contains(x):
  fp = fingerprint(x)
  p1 = hash(x)
  p2 = p1 ^ hash(fp)
  return buckets[p1].contains(fp) || buckets[p2].contains(fp)

Delete algorithm

The delete algorithm is similar to search, and just as simple: erase the fingerprint from whichever of the two buckets holds it.

def delete(x):
  fp = fingerprint(x)
  p1 = hash(x)
  p2 = p1 ^ hash(fp)
  ok = buckets[p1].delete(fp) || buckets[p2].delete(fp)
  if ok:
    nums--
  return ok

A glaring weakness

So far so good! The cuckoo filter looks perfect: it supports deletion and can report how many elements it holds, both beyond the Bloom filter's powers, and the logic seems simple enough that a few lines of code cover it. When an insert returns false, it is time to expand; that too is obvious.

But! What if a cuckoo filter inserts the same element many times in a row?

By the logic above, that element's fingerprint will, without question, fill all the seats at both positions: eight seats, all holding the same fingerprint. If insertion continues, an eviction cycle starts immediately, bouncing from p1 to p2 and from p2 back to p1.

You might wonder: couldn't we check whether the element already exists before inserting? That would indeed avoid the cycle on duplicate inserts, but it creates a high risk of wrong deletions. Different elements can easily hash to the same position, and with a one-byte fingerprint there are only 256 possible values, so two distinct elements can plausibly share both a position and a fingerprint. The pre-insert check would then treat them as equal.

Say x is already stored and we try to insert y, which collides with it: the check reports that y is already present, so only one fingerprint (x's) is ever stored. Deleting y then actually deletes x. This drives the error rate up further.

The paper does not deceive us; it acknowledges and addresses this question as well.

In real-world applications it is almost impossible to guarantee that an element is never inserted more than a bounded number of times. If you think you can, think about how: would you maintain an external dictionary tracking each element's insertion count? And what about the storage cost of that dictionary?

Because deletion cannot be perfectly supported, the count of internal elements cannot be accurate either.

Proof

Let's use an open-source cuckoo filter library to verify the reasoning above:

go get github.com/seiflotfy/cuckoofilter

This cuckoo filter stores a one-byte fingerprint per element, with four seats at each position. We try to insert the same element 15 times.

package main

import (
	"fmt"
	"github.com/seiflotfy/cuckoofilter"
)

func main() {
	cf := cuckoo.NewFilter(100000)
	for i := 0; i < 15; i++ {
		var ok = cf.Insert([]byte("geeky ogre"))
		fmt.Println(ok)
	}
}

-------
true
true
true
true
true
true
true
true
false
false
false
false
false
false
false

We find that the same element can be inserted at most eight times. Each insert that returns false has churned through hundreds of evictions before hitting the eviction cap.

And if eight seats across two positions store the same element, the waste of space is serious: space efficiency drops to 1/8, no match for the Bloom filter.

If we give up deletion support, then purely in terms of space efficiency the cuckoo filter does beat the Bloom filter somewhat, but the requirement that its capacity be a power of two claws some of that advantage back.
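To put rough numbers on this comparison, the paper gives bits-per-item formulas: about 1.44·log2(1/ε) for a space-optimal Bloom filter, and (log2(1/ε) + 3)/α for a cuckoo filter, where α is the achievable load factor (around 0.955 with four-seat buckets). A tiny Go calculation, assuming those formulas:

```go
import "math"

// bitsPerItemBloom: space cost of a space-optimal Bloom filter at
// false-positive rate eps, per the classic 1.44*log2(1/eps) bound.
func bitsPerItemBloom(eps float64) float64 {
	return 1.44 * math.Log2(1/eps)
}

// bitsPerItemCuckoo: space cost of a cuckoo filter whose fingerprints
// have log2(1/eps)+3 bits, divided by the load factor alpha
// (~0.955 for four-seat buckets), per the paper's analysis.
func bitsPerItemCuckoo(eps, alpha float64) float64 {
	return (math.Log2(1/eps) + 3) / alpha
}
```

At small ε the cuckoo filter comes out ahead; at loose error rates the three fingerprint overhead bits and the load factor eat into the advantage.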

Related projects

The cuckoo filter paper: Cuckoo Filter: Practically Better Than Bloom

Redis cuckoo filter module: github.com/kristoff-it…

The most influential cuckoo filter C library: github.com/efficient/c…