Find Top K using the heap

A very important and classic application of the heap is the “Top K” problem.

I’ve abstracted the problem of finding Top K into two categories. One is for a static data set, meaning the data set is determined in advance and never changes. The other is for a dynamic data set, meaning the data set is not fixed in advance and new data keeps being added to it.

For static data, how do we find the K largest elements in an array containing n elements?

We can maintain a min-heap (small-top heap) of size K and go through the array sequentially, comparing each element with the heap-top element. If it is larger than the heap-top element, we remove the heap top and insert the new element into the heap. If it is smaller than the heap top, we do nothing and continue iterating over the array. After all the data in the array has been traversed, the data in the heap is the K largest elements. Traversing the array takes O(n) time, and a single heap operation takes O(log K) time, so in the worst case, where every one of the n elements is inserted into the heap once, the total time complexity is O(n log K).
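A minimal sketch of this approach, assuming Python and its standard heapq module (which implements a min-heap); the function name top_k is just illustrative:

```python
import heapq

def top_k(nums, k):
    """Return the k largest elements of a static array using a size-k min-heap."""
    heap = list(nums[:k])                    # seed the heap with the first k elements
    heapq.heapify(heap)                      # O(k)
    for x in nums[k:]:                       # remaining n - k elements
        if x > heap[0]:                      # larger than the heap top (smallest of the current top k)
            heapq.heapreplace(heap, x)       # pop the heap top and push x: one O(log k) heap operation
    return heap                              # the k largest elements, in heap order (not sorted)

print(top_k([3, 1, 5, 12, 2, 11, 9], 3))     # prints the three largest values (9, 11, 12) in heap order
```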

Calculating Top K for dynamic data means maintaining a real-time Top K.

Take an example: a data set supports two operations, one adds data and the other asks for the current K largest elements. If we recompute the K largest elements from scratch before each query, the time complexity is O(n log K), where n is the current size of the data set.

In fact, we can always maintain a min-heap of size K. Whenever data is added to the collection, we compare it with the heap-top element. If it is larger than the heap-top element, we remove the heap top and insert the new data into the heap. If it is smaller than the heap top, we do nothing. This way, whenever we need to query the current K largest elements, we can return them in real time.
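A sketch of this idea under the same assumption (Python’s heapq); the class name TopK and its methods are illustrative:

```python
import heapq

class TopK:
    """Maintain the K largest values of a dynamically growing data set."""

    def __init__(self, k):
        self.k = k
        self.heap = []                             # min-heap holding at most k elements

    def add(self, x):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, x)           # heap not full yet: just insert, O(log k)
        elif x > self.heap[0]:                     # larger than the heap-top element
            heapq.heapreplace(self.heap, x)        # remove the heap top, insert x, O(log k)
        # otherwise x is smaller than the heap top: do nothing

    def query(self):
        return sorted(self.heap, reverse=True)     # current K largest values, largest first

topk = TopK(3)
for x in [3, 1, 5, 12, 2]:
    topk.add(x)
print(topk.query())                                # [12, 5, 3]
```

Because the heap never holds more than K elements, each add costs at most O(log K), and a query only has to read out the K elements already in the heap.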

Top K application case

Case description

Now that you’ve seen how the heap works in the scenarios above, consider this: suppose you have a log file with 1 billion search keywords. How do you quickly find the Top 10 most popular search keywords?

There are more advanced solutions to this problem, such as MapReduce. But if we limit the scenario to a single machine with 1GB of available memory, how do we solve it?

First, because users search for keywords, many of them are bound to be repeated, so the first thought is to count the frequency of each search keyword. We can use a hash table, a balanced binary tree, or some other data structure that supports fast lookup and insertion to record each keyword and its number of occurrences.

Choosing a hash table: the solution idea and its shortcomings

Assuming we choose a hash table, the solution is as follows (a code sketch follows the list):

  1. Scan the 1 billion search keywords sequentially; for each keyword, look it up in the hash table.
  2. If it exists, add one to its count; if it does not exist, insert it into the hash table with a count of 1.
  3. After all 1 billion search keywords have been iterated over, the hash table stores the distinct search keywords and how many times each occurred.
  4. Then, following the heap-based method for finding Top K described above, build a min-heap of size 10 and traverse the hash table, taking out each search keyword and its occurrence count in turn.
  5. Compare each keyword’s count with the keyword at the top of the heap. If it appears more frequently than the heap-top keyword, delete the heap-top keyword and add the more frequent keyword to the heap.
  6. After iterating through all the search keywords in the hash table, the keywords in the heap are the Top 10 most frequent search keywords.
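As a sketch of steps 1–6 (Python again, with a dict standing in for the hash table; keywords is a hypothetical iterable that yields one search keyword at a time):

```python
import heapq

def top10_search_keywords(keywords):
    """Count keyword frequencies in a hash table, then keep a size-10 min-heap of (count, keyword)."""
    counts = {}                                        # hash table: keyword -> occurrence count
    for kw in keywords:                                # steps 1-3: sequential scan and counting
        counts[kw] = counts.get(kw, 0) + 1

    heap = []                                          # min-heap of (count, keyword), at most 10 entries
    for kw, count in counts.items():                   # steps 4-6: traverse the hash table
        if len(heap) < 10:
            heapq.heappush(heap, (count, kw))
        elif count > heap[0][0]:                       # more frequent than the heap-top keyword
            heapq.heapreplace(heap, (count, kw))
    return sorted(heap, reverse=True)                  # the Top 10 (count, keyword) pairs, most frequent first
```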

Looking back at the analysis, the solution above is actually not rigorous; it has a loophole. 1 billion keywords is still a lot. Suppose that among the 1 billion search keywords, 100 million are distinct, and the average length of each keyword is 50 bytes; then storing 100 million keywords takes about 5GB of memory. On top of that, a hash table needs to keep its load factor from growing too large to avoid frequent collisions, so it will probably consume even more memory. The machine only has 1GB of available memory, so we can’t fit all the search keywords into memory at once. What optimization should we do at this point?

Hash algorithm + hash table + heap: finding the Top 10 most popular search keywords

That’s where the hash algorithm comes in: a hash function always produces the same hash value for the same data. Based on this property, we can first shard the 1 billion search keywords into 10 files. The specific process is as follows (a code sketch follows the list):

  1. Create 10 empty files named 00, 01, 02, …, 09.
  2. Iterate over the 1 billion keywords and compute a hash value for each with some hash function.
  3. Take the hash value modulo 10; the result is the number of the file the search keyword should be assigned to.
  4. After sharding the 1 billion keywords, each file will contain about 100 million keywords, or only about 10 million distinct keywords after removing duplicates. At an average of 50 bytes per keyword, that is about 500MB per file, which fits comfortably in 1GB of memory.
  5. For each file of roughly 100 million search keywords, use the hash table and heap as described above to compute that file’s Top 10.
  6. Finally, put the 10 files’ Top 10 lists together and take the 10 keywords that appear most frequently overall.
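A sketch of the sharding and merging steps (Python; the two-digit file names, the md5-based hash, and the helper names shard_keywords and overall_top10 are assumptions for illustration; the per-file Top 10 lists would come from running the previous sketch on each shard file):

```python
import hashlib
import heapq

def shard_keywords(log_path, num_files=10):
    """Split the keyword log into num_files shard files so identical keywords land in the same file."""
    shards = [open("%02d" % i, "w") for i in range(num_files)]     # files 00, 01, ..., 09
    with open(log_path) as log:
        for line in log:
            kw = line.strip()
            h = int(hashlib.md5(kw.encode()).hexdigest(), 16)      # same keyword -> same hash value
            shards[h % num_files].write(kw + "\n")                 # hash modulo 10 -> file number
    for f in shards:
        f.close()

def overall_top10(per_file_top10s):
    """Combine each shard's Top 10 (count, keyword) pairs and take the overall Top 10."""
    merged = (pair for top10 in per_file_top10s for pair in top10)
    return heapq.nlargest(10, merged)
```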