@ the Author: Runsen

Last time we introduced heap sort, this time we introduce the common application scenario TopK problem of heap sort.

Use the heap to solve the TopK problem

TopK problem is a typical application scenario of heap sort.

Let’s say we want to find the K element with the largest value, K less than 10,000, in a mass of data, say 10 billion integers. What do you do about it?

The object is Leetcode problem 215: the KTH largest element in an array.

Specific link: leetcode-cn.com/problems/kt…

Finds the KTH largest element in the unsorted array. Note that you need to find the KTH largest element of the sorted array, not the KTH distinct element.

Example 1: input:,2,1,5,6,4 [3] and k = 2 output: 5 example 2: input:,2,3,1,2,4,5,5,6 [3] and k = 4 output: 4Copy the code

Classical TopK problems include: maximum (small) K number, the first K high-frequency elements, and the KTH largest (small) element

This TopK problem is essentially a sorting problem, there are ten sorting algorithms, there are many sorting algorithms have not been introduced.

Why is the best answer to the TopK problem heap sort? In fact, in terms of space and time complexity, although quicksort is the best sorting algorithm, but for 10 billion elements from the largest to the smallest, and then output the first K element values.

However, whether we use quicksort or heapsort, when sorting, we need to read all the elements into memory. In other words, 10 billion integer elements take up about 40GB of memory, which doesn’t sound like something a typical civilian computer could do. (A typical civilian computer has less memory than that, like the 32GB ON which I write my articles.)

It is well known that both quicksort and heap sort can be O(nlogn)O(nlogn)O(nlogn) O(nlogn), but for quicksort, data is accessed sequentially. For heap sort, the data is accessed in a hop. For example, in heap sort, one of the most important operations is the heap of data. Therefore, the time complexity of quicksort is better than that of heap sort.

But quicksort is a new array, and the space complexity is O(n)O(n)O(n) O(n), much lower than heap sort O(1)O(1)O(1) O(1). For large data volumes, heap sort should be preferred.

If the built-in module of HEAPQ is used, it is a line of code to find the KTH largest element in the array. The Nlargest interface in HEAPQ is encapsulated and returns an array, which needs to be sliced.

import heapq
class Solution:
    def findKthLargest(self, nums: List[int], k: int) - >int:
        return heapq.nlargest(k,nums)[-1]
Copy the code

And, of course, the usual way to do this is to sort by hand, find the KTH largest element in the array and build the smallest heap, find the KTH smallest element in the array,

Create a minimum heap of size K for the first K elements of numS, and maintain a small top heap of size K. K nodes in the heap represent the maximum K elements, and the top of the heap is obviously the minimum value of the K elements.

So you just walk through the array, and when the binary heap is K, when you see an element that’s bigger than the top of the heap, you pop it off the top of the heap, and you push it in, and you keep maintaining the maximum K elements. At the end of the walk, the top element of the heap is the KTH largest element. Time complexity O(NlogK)O(NlogK)O(NlogK).

class Solution:
    def findKthLargest(self, nums: List[int], k: int) - >int:
        heapsize=len(nums)
        def maxheap(a,i,length) :
            l=2*i+1
            r=2*i+2
            large=i
            if l<length and a[l]>a[large]:
                large=l
            if r<length and a[r]>a[large]:
                large=r
            iflarge! =i: a[large],a[i]=a[i],a[large] maxheap(a,large,length)def buildheap(a,length) :
            for i in range(heapsize//2, -1, -1):
                maxheap(a,i,length)

        buildheap(nums,heapsize)
        for i in range(heapsize-1,heapsize-k,-1):
            nums[0],nums[i]=nums[i],nums[0]
            heapsize-=1
            maxheap(nums,0,heapsize)
        return nums[0]  
Copy the code

Conversely, if you want to minimize the first k, then use the largest heap, so the perfect solution to the TopK problem is heap sort. So only you see the KTH of the array… , immediately comes to mind heap sort.

If the data size is small, the time complexity, space complexity requirements are not high, there is no need to “high” algorithm, write a fast row is perfect.

TopK problem is like a search engine that receives a large number of user search requests every day. It will record the search keywords entered by these users and then analyze them statistically offline to get the Top10 hottest search keywords.

GitHub, portal ~, GitHub, portal ~, GitHub, Portal ~