High-performance sorting that breaks memory limits

Vegetable vegetable’s raise salary application is still waiting for approval….

As a technician, technical problems still need to be solved. After the analysis of online logs, logs adopt the hourly mechanism, one log file per hour, and there are multiple log files in the same hour, that is to say, the logs at the same time may be scattered in multiple log files, which is the main reason why Y always merges. Each log file is about 500 MB, and there are about 100 of them. At this point, what should you do if you read this article? How about 2 minutes of meditation!!

Problem analysis

There are several difficulties in achieving the total requirement of Y:

How can I sort all log files by time
The total size of the log files is 500M by 100, or about 50GB, so it is impossible to load them all into memory
Frequently sort and find the smallest element as the program executes.

So what do we do? One solution is this: the heap

The solution

Heap definition

Heap (English: heap) is a general term for a special type of data structure in computer science. A heap is usually an array object that can be thought of as a tree. A heap always satisfies the following properties:

The value of a node in the heap is always no greater than or less than its parent
A heap is always a complete binary tree (a complete binary tree requires a full number of nodes except for the last one, which is left)

For a heap where each node is greater than or equal to the value of each node in the subtree, we call it the “big top heap.” A heap whose value of each node is less than or equal to the value of each node in the subtree is called a “small top heap.”

Heap implementation

Full binary trees are best stored in arrays (linked lists are also possible). Why do you say that? Using arrays to store full binary trees is very storage efficient. Because we don’t need to store Pointers to the left and right child nodes, we can find the left and right child and parent nodes of a node simply by using the index of the array.As you can see from the figure above, 0 is empty, which wastes a lot of storage space, but it is very convenient to calculate the position of the element in the array: the subtree of the element with subscript X is 2x, and the subtree of the element with subtree X is 2x+1. It’s very simple to make a heap, just follow the path of the elements, compare them up or down and switch places.

Add elements

When adding elements, we usually adjust the heap in a bottom-up way. We insert the new element in the last free place of the array, find the parent element according to the index and superscript principle of the heap, and if the value is less than the parent element (the big top heap), then swap. As shown in figure:For a large top heap, the element at the top of the heap is the largest element. After deleting this element, we need to bring the second largest element to the top of the heap. And so on, until all the elements along the path have been adjusted.

Asp.net Core emulation code

The following code can be used directly in a production environment with little or no modification

Small top heap implementation code

// </summary> // </summary> Class MinHeap<T> where T: IComparable { private T[] container; Private int capacity; private int capacity; Private int count; private int count; // Number of data stored in the heap public MinHeap(int _capacity) {container = new T[_capacity + 1]; capacity = _capacity; count = 0; Public bool AddItem(T item) {if (count >= capacity) {return false; } ++count; container[count] = item; int i = count; Container [I].CompareTo(Container [I /2]) < 0) {container[I] = container[I]; container[i] = container[i / 2]; container[i / 2] = temp; i = i / 2; } return true; Public T GetMinItem() {if (count == 0) {return default(T); } T result = container[1]; return result; } public bool DeteleMinItem() {if (count == 0) {return false; } container[1] = container[count]; container[count] = default(T); --count; UpdateHeap(container, count, 1); return true; Private void UpdateHeap(T[] a, int n, int I) {while (true) {int maxPos = I; If (I * 2 <= n &&a [I].CompareTo(a[I * 2]) > 0) {maxPos = I * 2; } if (i * 2 + 1 <= n && a[maxPos].CompareTo(a[i * 2 + 1]) > 0) { maxPos = i * 2 + 1; } if (maxPos == i) { break; } T temp = container[i]; container[i] = container[maxPos]; container[maxPos] = temp; i = maxPos; }}}Copy the code

Simulate log file contents

Class LogInfoIndex: IComparable {class LogInfoIndex: IComparable {public int FileIndex {get; set; } // Specific log file contents public LogInfo Data {get; set; } public int CompareTo(object obj) { var tempInfo = obj as LogInfoIndex; if (this.Data.Index > tempInfo.Data.Index) { return 1; } else if (this.Data.Index < tempInfo.Data.Index) { return -1; } return 0; }} class LogInfo {// public int Index {get; set; } public string UserName { get; set; }}Copy the code

Generate a mock logger

static void WriteFile() { int fileCount = 0; while (fileCount < 10) { string filePath = $@"D:\log\{fileCount}.txt"; int index = 0; while (index < 100000) { LogInfo info = new LogInfo() { Index = index, UserName = Guid.NewGuid().ToString() }; File.AppendAllText(filePath, JsonConvert.SerializeObject(info)+ "\r\n"); index++; } fileCount++; }}Copy the code

The content of the document is as follows:

The test program

static void Main(string[] args) { int heapItemCount = 10; int startIndex = 0; StreamReader[] allReader = new StreamReader[10]; MinHeap<LogInfoIndex> container = new MinHeap<LogInfoIndex>(heapItemCount); While (startIndex< heapItemCount) {string filePath = $@"D:\log\{startIndex}.txt"; System.IO.StreamReader reader = new System.IO.StreamReader(filePath); allReader[startIndex] = reader; string content= reader.ReadLine(); var contentObj = JsonConvert.DeserializeObject<LogInfo>(content); LogInfoIndex item = new LogInfoIndex() { FileIndex= startIndex , Data= contentObj }; container.AddItem(item); startIndex++; } while (true) {var heapFirstItem = container.getMinItem (); if (heapFirstItem == null) { break; } container.DeteleMinItem(); File.AppendAllText($@"D:\log\total.txt", JsonConvert.SerializeObject(heapFirstItem.Data) + "\r\n"); var nextContent = allReader[heapFirstItem.FileIndex].ReadLine(); If (string. IsNullOrWhiteSpace (nextContent)) {/ / if one of the file is read Skip the continue; } var contentObj = JsonConvert.DeserializeObject<LogInfo>(nextContent); LogInfoIndex item = new LogInfoIndex() { FileIndex = heapFirstItem.FileIndex, Data = contentObj }; container.AddItem(item); Dispose(); Dispose(); Dispose(); Dispose(); Dispose(); } console. WriteLine(" finish "); Console.Read(); }Copy the code

The results are as follows:

More interesting articles

Distributed large concurrent series

Architectural Design Series

Series of interesting algorithms and data structures

Design Pattern series

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

High-performance sorting that breaks memory limits

Problem analysis

The solution

Heap definition

Heap implementation

Further reading

Asp.net Core emulation code

Small top heap implementation code

Simulate log file contents

Generate a mock logger

The test program

High-performance sorting that breaks memory limits

Problem analysis

The solution

Heap definition

Heap implementation

Further reading

Asp.net Core emulation code

Small top heap implementation code

Simulate log file contents

Generate a mock logger

The test program

Related Posts

Java Architecture – High concurrency solution

Flink-BroadcastState

How to Implement a Service Grid with Rancher and Kong?