Vegetable vegetable’s raise salary application is still waiting for approval….

As a technician, technical problems still need to be solved. After the analysis of online logs, logs adopt the hourly mechanism, one log file per hour, and there are multiple log files in the same hour, that is to say, the logs at the same time may be scattered in multiple log files, which is the main reason why Y always merges. Each log file is about 500 MB, and there are about 100 of them. At this point, what should you do if you read this article? How about 2 minutes of meditation!!

Problem analysis

There are several difficulties in achieving the total requirement of Y:

  1. How can I sort all log files by time
  2. The total size of the log files is 500M by 100, or about 50GB, so it is impossible to load them all into memory
  3. Frequently sort and find the smallest element as the program executes.

So what do we do? One solution is this: the heap

The solution

Heap definition

Heap (English: heap) is a general term for a special type of data structure in computer science. A heap is usually an array object that can be thought of as a tree. A heap always satisfies the following properties:

  • The value of a node in the heap is always no greater than or less than its parent
  • A heap is always a complete binary tree (a complete binary tree requires a full number of nodes except for the last one, which is left)

For a heap where each node is greater than or equal to the value of each node in the subtree, we call it the “big top heap.” A heap whose value of each node is less than or equal to the value of each node in the subtree is called a “small top heap.”

Heap implementation

Full binary trees are best stored in arrays (linked lists are also possible). Why do you say that? Using arrays to store full binary trees is very storage efficient. Because we don’t need to store Pointers to the left and right child nodes, we can find the left and right child and parent nodes of a node simply by using the index of the array.As you can see from the figure above, 0 is empty, which wastes a lot of storage space, but it is very convenient to calculate the position of the element in the array: the subtree of the element with subscript X is 2x, and the subtree of the element with subtree X is 2x+1. It’s very simple to make a heap, just follow the path of the elements, compare them up or down and switch places.

  1. Add elements

When adding elements, we usually adjust the heap in a bottom-up way. We insert the new element in the last free place of the array, find the parent element according to the index and superscript principle of the heap, and if the value is less than the parent element (the big top heap), then swap. As shown in figure:For a large top heap, the element at the top of the heap is the largest element. After deleting this element, we need to bring the second largest element to the top of the heap. And so on, until all the elements along the path have been adjusted.

Further reading

  1. The top element of the small top heap is actually the smallest element of the heap, and the top element of the large top heap is the largest element of the heap. And that’s one of the great things about heap sort, taking the smallest element or the largest element is order one.
  2. Remove elements we should note, if with the method of top-down switching elements, in many cases caused a serious imbalance of heap (left and right subtrees depth is large), in order to prevent similar situation, we can put the last element mentioned pile tip, and then adjust the strategy, because the last element is always in the final stage, It doesn’t make a big difference between the left and right subtrees.
  3. For a heap with duplicate elements, one way to solve this problem is to assume that the first element is the largest, and that the last element to enter the heap is smaller than the first one, so that the search is always thorough. Another way is to store a linked list of the same elements in each element of the heap, similar to a hash table. However, this requires special treatment when deleting this element.
  4. The time complexity of removing data from the heap and inserting data into the heap is O(logn).
  5. The process of constantly tweaking the heap is essentially a sorting process, and in some scenarios, we can use the heap to sort.

Asp.net Core emulation code

The following code can be used directly in a production environment with little or no modification

Small top heap implementation code
// </summary> // </summary> Class MinHeap<T> where T: IComparable { private T[] container; Private int capacity; private int capacity; Private int count; private int count; // Number of data stored in the heap public MinHeap(int _capacity) {container = new T[_capacity + 1]; capacity = _capacity; count = 0; Public bool AddItem(T item) {if (count >= capacity) {return false; } ++count; container[count] = item; int i = count; Container [I].CompareTo(Container [I /2]) < 0) {container[I] = container[I]; container[i] = container[i / 2]; container[i / 2] = temp; i = i / 2; } return true; Public T GetMinItem() {if (count == 0) {return default(T); } T result = container[1]; return result; } public bool DeteleMinItem() {if (count == 0) {return false; } container[1] = container[count]; container[count] = default(T); --count; UpdateHeap(container, count, 1); return true; Private void UpdateHeap(T[] a, int n, int I) {while (true) {int maxPos = I; If (I * 2 <= n &&a [I].CompareTo(a[I * 2]) > 0) {maxPos = I * 2; } if (i * 2 + 1 <= n && a[maxPos].CompareTo(a[i * 2 + 1]) > 0) { maxPos = i * 2 + 1; } if (maxPos == i) { break; } T temp = container[i]; container[i] = container[maxPos]; container[maxPos] = temp; i = maxPos; }}}Copy the code
Simulate log file contents
Class LogInfoIndex: IComparable {class LogInfoIndex: IComparable {public int FileIndex {get; set; } // Specific log file contents public LogInfo Data {get; set; } public int CompareTo(object obj) { var tempInfo = obj as LogInfoIndex; if (this.Data.Index > tempInfo.Data.Index) { return 1; } else if (this.Data.Index < tempInfo.Data.Index) { return -1; } return 0; }} class LogInfo {// public int Index {get; set; } public string UserName { get; set; }}Copy the code
Generate a mock logger
static void WriteFile() { int fileCount = 0; while (fileCount < 10) { string filePath = $@"D:\log\{fileCount}.txt"; int index = 0; while (index < 100000) { LogInfo info = new LogInfo() { Index = index, UserName = Guid.NewGuid().ToString() }; File.AppendAllText(filePath, JsonConvert.SerializeObject(info)+ "\r\n"); index++; } fileCount++; }}Copy the code

The content of the document is as follows:

The test program
static void Main(string[] args) { int heapItemCount = 10; int startIndex = 0; StreamReader[] allReader = new StreamReader[10]; MinHeap<LogInfoIndex> container = new MinHeap<LogInfoIndex>(heapItemCount); While (startIndex< heapItemCount) {string filePath = $@"D:\log\{startIndex}.txt"; System.IO.StreamReader reader = new System.IO.StreamReader(filePath); allReader[startIndex] = reader; string content= reader.ReadLine(); var contentObj = JsonConvert.DeserializeObject<LogInfo>(content); LogInfoIndex item = new LogInfoIndex() { FileIndex= startIndex , Data= contentObj }; container.AddItem(item); startIndex++; } while (true) {var heapFirstItem = container.getMinItem (); if (heapFirstItem == null) { break; } container.DeteleMinItem(); File.AppendAllText($@"D:\log\total.txt", JsonConvert.SerializeObject(heapFirstItem.Data) + "\r\n"); var nextContent = allReader[heapFirstItem.FileIndex].ReadLine(); If (string. IsNullOrWhiteSpace (nextContent)) {/ / if one of the file is read Skip the continue; } var contentObj = JsonConvert.DeserializeObject<LogInfo>(nextContent); LogInfoIndex item = new LogInfoIndex() { FileIndex = heapFirstItem.FileIndex, Data = contentObj }; container.AddItem(item); Dispose(); Dispose(); Dispose(); Dispose(); Dispose(); } console. WriteLine(" finish "); Console.Read(); }Copy the code

The results are as follows:

More interesting articles

  • Distributed large concurrent series
  • Architectural Design Series
  • Series of interesting algorithms and data structures
  • Design Pattern series