This is why’s 83rd original article


The brain-burning LFU

A few days ago, I saw a discussion like this in an app:


I saw an interesting comment:


LFU is really hard. It makes my head hurt just thinking about it.

If LRU is Easy, then LFU, with the middle letter changed from R (Recently) to F (Frequently), is Hard.

If you don’t know the word Frequently, it doesn’t matter; it’s a common enough word.


The full name of LFU is Least Frequently Used: the least-frequently-used eviction strategy.

Obviously, the emphasis is on frequency of use.

The full name of LRU is Least Recently Used: the least-recently-used algorithm.

The emphasis is on time.

When the dimension we track changes from time to frequency, what changes in the implementation of the algorithm?

Let me compare it with the LRU algorithm I’ve written about before.

LRU vs LFU

The idea of the LRU algorithm is that if a piece of data has not been accessed in the recent past, it is highly unlikely that it will be accessed in the future. Therefore, when the specified space is full of data, the data that has not been accessed for the longest time should be eliminated.

When evicting, it only looks at how long it has been since each piece of data was last accessed.

When an LFU cache is full and data needs to be evicted, it looks at the access count: the more accesses, the less likely the data is to be evicted.

However, some data may be accessed the same number of times.

How do you deal with that?

If the number of accesses is the same, then consider the length of time the data is in the cache.

In other words, the LFU algorithm first looks at the access count; when counts are tied, it looks at how long each entry has been in the cache.

Let me give you a concrete example.

Assuming we have a cache capacity of 3, we access the data in the following order:


If eviction follows the LRU algorithm, the result after the ten accesses is as follows:


At the end of the ten accesses, all that’s left in the cache are elements B, C, and D.

Do you think there’s a little something wrong?


Element A accounts for 5 of the 10 accesses, yet element A is the one that gets evicted. Does that seem right?

If we follow the LFU algorithm, the last three elements left in the cache should be B, C, and A.

In a scenario like this, LFU feels more reasonable than LRU.

Suppose we want to implement an LFUCache:

class LFUCache {

    public LFUCache(int capacity) {

    }

    public int get(int key) {

    }

    public void put(int key, int value) {

    }
}


So how should we approach it?

Let me show you.

LFU scheme one: a doubly linked list

If I had never been exposed to the LFU algorithm at all, the only solution I could come up with would be the following:

Because you need both frequency and time order.

We could maintain a linked list ordered by frequency, and by time when frequencies are equal.

Since we need to remove nodes during this process, we use a doubly linked list for efficiency.

Again, let’s say we have a cache capacity of 3, and we replay the same access sequence.

We’ll call the frequency freq. After the first three accesses, that is, once these three requests complete:


The list should look like this:


All three elements have an access frequency of 1.

Among these three elements, the one with value=a has gone the longest without being accessed at the same frequency, so it sits right after the head node, ready to be evicted at any time.

Then there is a request for value=a:


When this request comes in, the frequency (freq) of the node with value=a becomes 2.

At this point it has the highest frequency, so it should be the last to be evicted.

As a result, the linked list looks like this:


Then there are three consecutive requests for value=a:


At this point, the only change in the list is the freq of the node with value=a:


Then the B request came in:


The freq of node B changes from 1 to 2, and its position changes as well:


Then a request for C comes in:


What do you think will happen at this point?

The current access frequency of C in the list is 1. When this request comes in, its frequency becomes 2.

Well, the frequency of value=b is also 2.

A collision. So what happens now?

As we said before: when frequencies are equal, look at the time.

The node with value=c is the one just accessed, so it should be placed further from eviction than the node with value=b.

The linked list should look something like this:


Then came the last request:


Element D has never been in the list before, and the list is full.

It’s time for elimination.

So the node with value=b, sitting right after the head, is removed, and a new node with value=d is inserted:


Finally, all requests are complete.

The three elements left in the cache are D, C, and A.

Here’s how it works:


Of course, this is just showing the list changes.

What we actually store are key-value pairs.

So there should also be a HashMap to store the mapping between keys and linked list nodes.

This is so simple that I can think of it with my toes, so I won’t go into it.

If you take your time, you should be able to write the code.
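If you do want to sketch it out, the list node for this scheme could look something like the following. This is just my own minimal sketch of the idea above (the name DLNode and the comments are mine, not from the original article):

// Sketch of a node for scheme one. The list is kept ordered by freq ascending,
// and by last-access time within the same freq, so the node right after the
// sentinel head is always the next eviction victim.
class DLNode {
    int key;
    int value;
    int freq = 1;   // access count, starts at 1 when the entry is inserted
    DLNode prev;    // toward the head (eviction end)
    DLNode next;    // toward the tail (most protected end)
}

Together with the HashMap from key to node mentioned above, lookups stay O(1); the O(N) cost comes from walking a node toward the tail past its equal-or-lower-frequency neighbors every time its freq is bumped.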

The doubly-linked-list scheme above is the one most people can come up with directly.

But the interviewer is looking for an O(1) solution.

The current solution is O(N) time.


The O(1) solution

If we’re going to come up with an O(1) solution, we’re going to have to analyze it carefully, not just think about it.

Let’s analyze the requirements.

First point: we need to query the value corresponding to the key.

Use HashMap to store key-value pairs.

The query time complexity is O(1).

Second point: whenever we operate on a key, whether a read or a write, we need to maintain that key’s frequency, denoted freq.

Because we operate on a key’s freq so often, we need to get the freq of a given key in O(1) time.

Now, tell me out loud: what data structure do you use?

Another HashMap, this one storing the mapping between keys and freqs?

Third point: when the cache has no more space and data needs to be evicted, delete the key with the smallest freq.

Delete the key with the smallest freq.

The smallest freq?

How do we know which key has the smallest freq?

As mentioned earlier, we have a HashMap that stores the mapping between keys and freqs.

Of course, we could iterate over that HashMap to find the key with the smallest freq.

But my friends, once we start traversing, is the time complexity still O(1)?

So what to do?

Pay attention, here comes the key trick.

We can simply keep a variable that records the smallest freq; let’s call it minFreq, okay?


Now we have the minimum frequency (minFreq), and we need to obtain the key corresponding to it, again in O(1) time.

Come on, friend, tell me out loud, what data structure do you think of again?

HashMap again?

OK, so we now have three HashMaps:

  • A HashMap that stores keys and values, i.e. HashMap<key,value>
  • A HashMap that stores keys and freqs, i.e. HashMap<key,freq>
  • A HashMap that stores freqs and keys, i.e. HashMap<freq,key>

Each one does its own job so that the time complexity stays O(1).
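Spelled out naively, before any merging, that would be something like the following. This is just an illustrative sketch; the class and field names are mine:

import java.util.HashMap;
import java.util.Map;

// Naive bookkeeping with three separate maps, before any merging.
class NaiveBookkeeping {
    Map<Integer, Integer> keyToValue = new HashMap<>(); // key -> value
    Map<Integer, Integer> keyToFreq  = new HashMap<>(); // key -> freq
    Map<Integer, Integer> freqToKey  = new HashMap<>(); // freq -> key (refined below, since one freq can map to many keys)
}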

But we can combine the first two HashMaps.

Let’s make an object that contains two properties, value and freq.

Let’s say this object is called Node, and it looks like this, defaulting to frequency 1:

class Node {
    int value;
    int freq = 1;
    // Constructor omitted
}


Now we can replace the first two hashmaps with a single HashMap<key,Node>.

Similarly, we can add a key attribute to Node:

class Node {
    int key;
    int value;
    int freq = 1;
    // Constructor omitted
}


Since Node contains a key, you can replace the third HashMap<freq,key> with HashMap<freq,Node>.

At this point we’re still missing a critical piece, which is the point below.

Fourth point: there may be multiple keys with the same minimum freq, so we remove the one that has been in the cache the longest.

To do this, we need to look up Nodes by freq, which is exactly what the HashMap<freq,Node> table is for.

That is, querying with minFreq may return multiple Nodes.

So the HashMap<freq,Node> hash table should look like this:

HashMap<freq, Collection<Node>>.

The question then becomes: what collection should we use to hold these Node objects?

Okay, so let’s figure out what this collection has to satisfy.

First, when you need to delete a Node.

Since this collection holds data with the same access frequency, we want it to preserve time order so that the oldest node can be found and deleted quickly.

What data structure can you think of?

Isn’t that a doubly linked list?

Next, when a Node is accessed.

When a Node is accessed, its frequency must increase by one.

Take this example:


Suppose the minimum access frequency is 5, and 5 corresponds to 3 Node objects.

Now I access the object with value=b, so that object has to be removed from the collection under key=5.

Then its frequency increases by one: 5 + 1 = 6.

We add it to the collection under key=6, and it looks like this:


This means we need to support fast deletion of any node.

We could build a custom doubly linked list to meet the requirements above.

But among the Java collection classes there is already one that is ordered and supports fast deletion.

That’s LinkedHashSet.

So HashMap<freq, Collection<Node>> becomes HashMap<freq, LinkedHashSet<Node>>.
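To make the earlier 5-to-6 example concrete, here is a small runnable sketch of that bucket move, assuming the structures just described (the class and variable names are mine, purely for illustration):

import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

public class BucketMoveDemo {

    // Minimal stand-in for the cache entry described above.
    static class Node {
        final int key;
        int freq;
        Node(int key, int freq) { this.key = key; this.freq = freq; }
    }

    public static void main(String[] args) {
        Map<Integer, LinkedHashSet<Node>> freqToNodes = new HashMap<>();
        int minFreq = 5;

        // Three nodes currently sit in the freq=5 bucket, oldest first.
        Node a = new Node(1, 5), b = new Node(2, 5), c = new Node(3, 5);
        LinkedHashSet<Node> bucket5 = new LinkedHashSet<>();
        bucket5.add(a); bucket5.add(b); bucket5.add(c);
        freqToNodes.put(5, bucket5);

        // Accessing b: remove it from the freq=5 bucket (O(1) in a LinkedHashSet) ...
        bucket5.remove(b);
        // ... bump its frequency, and append it to the freq=6 bucket.
        b.freq = 6;
        freqToNodes.computeIfAbsent(6, f -> new LinkedHashSet<>()).add(b);

        // If the old bucket had emptied and it was the minimum, minFreq would move up.
        if (bucket5.isEmpty() && minFreq == 5) {
            minFreq = 6;
        }
        System.out.println("minFreq is still " + minFreq + " because a and c remain in the freq=5 bucket");
    }
}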

To sum up.

We need two HashMaps: HashMap<key,Node> and HashMap<freq,LinkedHashSet<Node>>.

Then you need to maintain a minimum access frequency, minFreq.

And a capacity field for the maximum number of entries the cache can hold.

And that’s it.

Some friends will surely ask: why not show the code?


Once the analysis is done, the code practically writes itself.

Think it through before writing code; even if you don’t produce bug-free code in the interview, you’ll be pretty close.
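For completeness, here is one minimal sketch of how those pieces could fit together, in the shape of the LFUCache skeleton above. Treat it as an illustration of the analysis, not as the only (or the author’s) way to write it; the helper names are mine:

import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;

class LFUCache {

    // key -> node (value + freq): merges the first two HashMaps from the analysis
    private final Map<Integer, Node> keyToNode = new HashMap<>();
    // freq -> nodes with that freq, in insertion order (oldest first)
    private final Map<Integer, LinkedHashSet<Node>> freqToNodes = new HashMap<>();
    private final int capacity;
    private int minFreq = 0;

    private static class Node {
        final int key;
        int value;
        int freq = 1;

        Node(int key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    public LFUCache(int capacity) {
        this.capacity = capacity;
    }

    public int get(int key) {
        Node node = keyToNode.get(key);
        if (node == null) {
            return -1;
        }
        touch(node);
        return node.value;
    }

    public void put(int key, int value) {
        if (capacity <= 0) {
            return;
        }
        Node node = keyToNode.get(key);
        if (node != null) {
            node.value = value;
            touch(node);
            return;
        }
        if (keyToNode.size() == capacity) {
            // Evict the oldest node in the bucket with the minimum frequency.
            LinkedHashSet<Node> minBucket = freqToNodes.get(minFreq);
            Node victim = minBucket.iterator().next();
            minBucket.remove(victim);
            keyToNode.remove(victim.key);
        }
        Node fresh = new Node(key, value);
        keyToNode.put(key, fresh);
        freqToNodes.computeIfAbsent(1, f -> new LinkedHashSet<>()).add(fresh);
        minFreq = 1; // a brand-new key always starts at frequency 1
    }

    // Bump a node's frequency and move it from its current bucket to the next one.
    private void touch(Node node) {
        LinkedHashSet<Node> bucket = freqToNodes.get(node.freq);
        bucket.remove(node);
        if (bucket.isEmpty() && node.freq == minFreq) {
            minFreq++; // the minimum bucket just emptied, so the minimum moves up
        }
        node.freq++;
        freqToNodes.computeIfAbsent(node.freq, f -> new LinkedHashSet<>()).add(node);
    }
}

Every operation is a constant number of HashMap and LinkedHashSet operations, so both get and put run in O(1).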

LFU algorithm in Dubbo

Dubbo supports the LFU algorithm starting from version 2.7.7:


Its location in the source is: org.apache.dubbo.common.utils.LFUCache


The code is not long, a bit over 200 lines in total, and it differs a little from the LFU implementation described above.

You can see it doesn’t even maintain minFreq.

But that doesn’t matter much; set a breakpoint and debug it, and you’ll quickly work out the author’s thinking.

Importantly, I found a bug while looking at Dubbo’s LFU algorithm.

It’s not a bug in the LFU algorithm itself; I checked, and the algorithm’s implementation is fine.

The bug is that Dubbo implements the LFU caching algorithm, but as a user you can’t actually use it.

What’s the problem?

Let me show you.

The source code tells me that the lfu cache policy can be enabled with this configuration:
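Something along these lines, for example; the interface name below is made up for illustration, and the relevant part is the cache attribute on the reference or method:

<!-- Illustrative only: enable the lfu cache policy on a reference (interface name is made up) -->
<dubbo:reference id="demoService" interface="com.example.DemoService" cache="lfu" />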


However, when I configure it this way and make the call, it looks like this:


You can see that the cache policy for the current request is indeed LFU.

But an error is thrown:


No such extension org.apache.dubbo.cache.CacheFactory by name lfu

There is no LFU strategy.

Isn’t this just messing with me?


Let’s look at the specific reasons.

In org.apache.dubbo.common.extension.ExtensionLoader#getExtensionClasses, only four caching strategies are loaded, and the lfu we want is not among them:


So, an exception is thrown here:


Why didn’t we find the LFU we wanted?

The answer depends on whether you’re familiar with Dubbo’s SPI mechanism.

There is indeed no LFU configuration in the SPI file:


That’s a bug.

So what’s the solution?

It’s very simple. Add it and you’re done.
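The fix is to register the existing factory in the SPI file. From memory of Dubbo’s layout, the added line would look roughly like this; treat the exact path and class name as an assumption to verify against the source:

# META-INF/dubbo/internal/org.apache.dubbo.cache.CacheFactory
lfu=org.apache.dubbo.cache.support.lfu.LfuCacheFactory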


Well, I accidentally contributed a line of source code to Dubbo.


One last word (asking for a follow)

If you find anything wrong, you can point it out in a message and I’ll fix it.

Thank you for reading. I insist on writing original content, and I warmly welcome and appreciate your follow.

I’m Why, a programmer who mainly writes code, often writes articles, and occasionally shoots videos.