This article originated from personal public account: TechFlow, original is not easy, for attention


Today’s article introduces a useful Python library called Heapq.

Heapq stands for heap queue. Heap and queue are both data structures, which we will introduce in detail in a later article. Today, we will only introduce the usage of HEAPQ. If you don’t know the principles of heap and queue, you can ignore them.

Before introducing its usage, we need to know the definition of priority queues. Queues are very basic and simple data structures. We can imagine that all the elements in a queue are in a row. New elements can only join the queue from the end of the queue, and elements can only exit the queue through the head of the queue. In a priority queue, each element in the queue is assigned a priority, so that the elements in the queue are automatically sorted by priority, with the highest priority placed first.

In other words, HeAPq in Python is a library that maintains priority queues, which can be easily implemented by calling it.


The largest or smallest K elements


Let’s look at a practical problem. Suppose we have N random elements, but we only care about the largest K elements or the smallest K elements. We want to extract this from the entire array, what do we do?

This problem is so common in practice that you can just give an example. For example, when a user enters a search term, we find a lot of content based on the user’s search term. We want to filter out the text users are most likely to click on based on the algorithm, and the machine learning model can give each text a predicted score. And then we need to pick the K results with the highest score. The nlargest and NSmallest interfaces in the HeAPQ library provide a convenient way to do this.

Let’s look at an example:

import heapq

nums = [14.20.5.28.1.21.16.22.17.28]
heapq.nlargest(3, nums)
# 22] [28, 28,
heapq.nsmallest(3, nums)
# [1, 5, 14]
Copy the code

Nlargest and Nsmallest of Heapq take two arguments. The first argument is K, which is the number of returned elements, and the second argument is the array passed in. Heapq returns the largest or smallest K in the array passed in.

So here’s the question, what if the element in our array is an object? What should I do?

For those of you who have seen Python custom keyword sorting, you can do this using anonymous functions, just like sorting.


Anonymous functions


As we all know, in Python you can define a function with def. Functions defined by def have function names, so they are called named functions. In addition to named functions, Python also supports anonymous functions. As the name implies, a function without a function name. In other words, it is the same as a normal function except that it has no name.

Beginners may wonder, how can a function be called without a name?

It’s normal to have this confusion because you’re used to procedural programming and don’t understand object-oriented deeply enough. In many high-level languages, everything is an object. A class, a function, an int is an object. Since functions are objects, they can also be used for passing, not only passing, but also returning. That’s the concept of functional programming, and we won’t go into that much here.

Of course, ordinary functions can also be passed, with the same effect. In programming, however, some functions are used only once. There is no need to define a separate function. Using anonymous functions is very convenient.

For example, let’s say I have a function that looks like this:

def operate(x, func):
  return func(x)
Copy the code

The operate function takes two arguments, the first being the variable x and the second a function. It calls func inside the function and returns the result of the func call. Now I’m going to do something like this, and I want to figure out what func should be based on the remainder of the integer x mod 4. If the remainder of 4 is 0, I want to raise it to the first power, if the remainder is 2, I want to square it, and so on. If we follow the normal approach, we need to implement four methods and pass them one by one.

This is certainly possible, but it is cumbersome, and if you use anonymous functions, you can greatly reduce the amount of code:

def get_result(x):
  if x % 4= =0:
    return operate(x, lambda x: x)
  elif x % 4= =1:
    return operate(x, lambda x: x ** 2)
  elif x % 4= =2:
    return operate(x, lambda x: x ** 3)
  else:
    return operate(x, lambda x: x ** 4)
Copy the code

In the above code, we defined anonymous functions with the lambda keyword, avoiding the need to define four functions for passing. Of course, there’s a simpler way to write this problem, and you can solve it with just one function.

Let’s look at the syntax for lambda to define anonymous functions, starting with the lambda keyword, which indicates that we are defining an anonymous function. And then the argument to this anonymous function, we’re only using one variable x, so we only need to write one x. If we need more than one argument, separate them with commas, or use no arguments. After the arguments are written, we separate them with a colon, followed by the result that is returned.

We can also assign an anonymous function to a variable, which can then be called as if it were a normal function:

square = lambda x: x ** 2

print(square(3))
print(operate(3, square))
Copy the code


Custom sort


So going back, if we want to sort heapq is an object. Then HeAPQ does not know which parameter in the object should be used as the sorting criterion, so at this time, we need to define a function to obtain the keyword and pass it to HeAPQ, so that the sorting can be completed.

For example, we now have a batch of computers that we want HeAPQ to sort by price:

laptops = [
    {'name': 'ThinkPad'.'amount': 100.'price': 91.1},
    {'name': 'Mac'.'amount': 50.'price': 543.22},
    {'name': 'Surface'.'amount': 200.'price': 21.09},
    {'name': 'Alienware'.'amount': 35.'price': 31.75},
    {'name': 'Lenovo'.'amount': 45.'price': 16.35},
    {'name': 'Huawei'.'amount': 75.'price': 115.65}
]

cheap = heapq.nsmallest(3, portfolio, key=lambda s: s['price'])
expensive = heapq.nlargest(3, portfolio, key=lambda s: s['price'])
Copy the code

When calling nlargest and NSmallest, we pass in an additional parameter key. We pass in an anonymous function that returns the price of the object. In other words, we want HeAPQ to sort by the price of the object.


Priority queue


In addition to returning the maximum and minimum number of K, HEAPQ also implements the interface of priority queue. We can simply call the heapq.heapify method, enter an array, and return a heap (equivalent to a priority queue) generated from that array.

We can also maintain the heap from scratch by calling heAPQ’s push and POP. Next, we will use HEAPQ to implement a priority queue by ourselves, the code is very simple, I think you should be able to learn in a moment.

First, the implementation of the priority queue:

import heapq

class PriorityQueue:
  
  def __init__(self):
    self._queue = []
    self._index =0
    
  def push(self, item, priority):
    We pass in two arguments, one is the array to store the elements, and the other is the element to store, which is a tuple.
    Priority is negative because the heap has small to large rows by default
    heapq.heappush(self._queue, (-priority, self._index, item))
    self._index += 1
  
  def pop(self):
    return heapq.heappop(self._queue)[- 1]
Copy the code

Secondly, let’s actually look at the application situation:

q = PriorityQueue()

q.push('lenovo'.1)
q.push('Mac'.5)
q.push('ThinkPad'.2)
q.push('Surface'.3)

q.pop()
# Mac
q.pop()
# Surface
Copy the code

So far, the application of HEAPQ has been introduced, but not really finished.

We need to analyze the complexity of operations in HEAPQ. We will skip the heap for the moment and look at Nlargest and NSmallest first. I found the source code of this library on Github, and in the annotation of the method, the author wrote down the complexity of the method, and after sorting, the cost of the first K is 50/50:

def nlargest(n, iterable, key=None):
    """Find the n largest elements in a dataset. Equivalent to: sorted(iterable, key=key, reverse=True)[:n] """
Copy the code

We all know the expected complexity of sortingIf you know anything about heaps, the complexity of the heap for one insert is zero. If we limit the length of the heap to K, we can only keep K elements after n inserts. The complexity per insertion is zeroN insertions, so the overall complexity is zero.

If K is a little bit smaller, it might be a little bit less expensive than sorting, but only to a limited extent. So is there a way to filter out the first K or the first K as quickly as possible without sorting?

I’ll keep it in suspense and we’ll come back to it later in the article.

Today’s article is here, if you feel that there is a harvest, please feel free to point to the attention of it, your lifting a finger is very important to me.

The resources

Python CookBook Version3

wikipedia