I. Description of the problem:

Median is the number in the middle of an ordered list. If the list length is even, the median is the average of the middle two numbers.

For example, the median of [2,3,4] is 3 and the median [2,3] is (2 + 3) / 2 = 2.5

Design a data structure that supports the following two operations:

  • Void addNum(int num) – Adds an integer from the data stream to the data structure.
  • Double findMedian() – Returns the median of all elements so far.

Example:

AddNum (1) addNum(2) findMedian() -> 1

Second, problem analysis

1. Suppose there is a data stream: [3, 1, 9, 25, 12…]

For each input data, append it at the end of the list, sort it, and then calculate the median. The process is as follows:

3 - > [3] : 1 - > 3 [3, 1] - > [1, 3] median (1 + 3) / 2 = 2.0 9 - > [1, 3, 9] - > [1, 3, 9] : 3 25 - > [1, 3, 9, 25] - > [1, 3, 9, 25] median (3 + 9) / 2 = 6.0, 12 - > [1, 3, 9, 25, 12] - > [1, 3, 9, 12, 25] : 9

The final output median list is: [3, 2.0, 3, 6.0, 9…]

2, calculation process summary: sort first, then fold in half

As shown in the figure:



It can be seen that the median can be found in two cases:

  • When the list length is odd: just take the middle number
  • When the list length is even: Calculate the average of the middle two numbers

3. Basic implementation

There are many ways to implement the algorithm, the most intuitive is to use the list sort(), and then take the median.

def running_medians_naive(iterable):
    medians = []
    values = []
    for i,x in enumerate(iterable):
        values.append(x)
        values.sort()
        if i % 2 == 0:
            medians.append(values[i//2])
        else:
            medians.append((values[i//2]+values[i//2+1])/2)
    return medians

running_medians_naive([3, 1, 9, 25, 12])

Looking at the time complexity of the implementation of the above algorithm, we can see that we need to sort() the list every time, which is time-consuming.

4, Heap implementation

Carefully observing the calculation process, we can see that taking the median is only related to 1-2 values in the center of the list, as shown in the figure:

Much like the idea of Heap Heap, you can use Heap Heap to reduce the time complexity by thinking of the left half of the median as large-top Heap max-heap and the right half as small-top Heap min-heap.

The Python implementation consists of two parts: the max-heap heap and the median fetching algorithm.

4.1, the Max – heap heap
class Heap: def __init__(self, key=lambda x:x): self.data = [] self.key = key @staticmethod def _parent(idx): return (idx-1)//2 @staticmethod def _left(idx): return idx*2+1 @staticmethod def _right(idx): return idx*2+2 def heapify(self, idx=0): while idx < len(self.data): left = Heap._left(idx) right = Heap._right(idx) mid_idx = idx if left < len(self.data) and self.key(self.data[left]) > self.key(self.data[mid_idx]): mid_idx = left if right < len(self.data) and self.key(self.data[right]) > self.key(self.data[mid_idx]): mid_idx = right if mid_idx ! = idx: self.data[mid_idx], self.data[idx] = self.data[idx], self.data[mid_idx] idx = mid_idx else: break def add(self, x): self.data.append(x) idx = len(self.data) - 1 while idx > 0: p = Heap._parent(idx) if self.key(self.data[idx]) > self.key(self.data[p]): self.data[idx], self.data[p] = self.data[p], self.data[idx] idx = p else: break def peek(self): return self.data[0] def pop(self): ret = self.data[0] self.data[0] = self.data[len(self.data)-1] del self.data[len(self.data)-1] self.heapify() return ret def __bool__(self): return len(self.data) > 0 def __len__(self): return len(self.data) def __repr__(self): return repr(self.data)
4.2. Take the median
def running_medians_heap(iterator):
    medians = []
    max_heap = Heap(lambda x:x)
    min_heap = Heap(lambda x:-x)
    for i, x in enumerate(iterator):
        if i % 2 == 0:
            max_heap.add(x)
            min_heap.add(max_heap.pop())
            medians.append(min_heap.peek())
        else:
            min_heap.add(x)
            max_heap.add(min_heap.pop())
            medians.append((max_heap.peek() + min_heap.peek()) / 2)
    return medians

running_medians_heap([3, 1, 9, 25, 12])

Specific algorithm principle, please refer to the video: https://www.bilibili.com/vide…

If you have any comments or suggestions, please contact our official account or email: bjstorm#foxmail.com