Data structure must be familiar to everyone, for a mature programmer, familiar with and master data structure and algorithm is also one of the basic skills. Data structure itself is actually a collection of data stored or organized according to characteristic relations. Special structures often bring different processing efficiency in different application scenarios.

The commonly used data structures can be divided into linear structure and nonlinear structure according to the characteristics of data access. Linear structures include common linked lists, stacks, queues, etc., while nonlinear structures include trees, graphs, etc. There are many kinds of data structures. This paper will introduce and explain the commonly used data structures theoretically by means of diagrams, so as to facilitate everyone to master the basic knowledge of common data structures.

In this paper, an outline

1 array

Arrays are arguably the most basic and common data structures. Arrays are generally used to store the same type of data, which can be accessed and updated by array names and subscripts. The elements of an array are stored sequentially and consecutively in memory in the same order. The spacing of memory addresses between adjacent elements of an array is generally the size of the array data type.

2 list

Compared with arrays, linked lists add pointer fields to build chained storage data in addition to data fields. Each node in the list contains data for that node and a pointer to the address of the next node. Because the next data element is searched and accessed through a pointer, the linked list has a higher degree of freedom.

This shows that when adding or deleting nodes, only the pointer address of the previous node needs to be changed without changing other nodes. However, everything has two poles, and Pointers bring high degrees of freedom at the expense of data search efficiency and the use of extra space.

Generally common is a single linked list with a head and tail, the pointer field is linked back, but also can form a bidirectional or circular linked list.

Linked lists and arrays

Linked lists and arrays need to be selected according to their advantages and disadvantages in actual use. The similarities and differences between linked lists and arrays are also one of the most frequently discussed points in interviews. The differences between singly linked lists and arrays are compared and summarized here.

Jump table 3

As can be seen from the above comparison, although linked list improves the degree of freedom by adding pointer fields, it leads to the deterioration of data query efficiency. Especially when the linked list length is very long, the query of the data has to be queried from the beginning, which is less efficient. The creation of skip list is to solve the problem of long linked list and to speed up the query efficiency of original linked list by adding multilevel index of linked list. This can increase the time complexity of the query from O(n) to O(logn).

With the addition of multi-level indexes, hop tables can achieve efficient dynamic insertion and deletion, which is as efficient as red-black trees and balanced binary trees. Redis and levelDB are currently available to jump tables.

As can be seen from the above figure, the indexing-level pointer field not only points to the next index position, but also has a down pointer to the lower level of the linked list position, so as to achieve the purpose of jumping query.

4 the stack

Stack is a relatively simple data structure, often described in one sentence, last in first out. The stack itself is a linear table, but there is only one slot in the table that allows data in and out. This pattern can be referenced in coelenterates… That is, eating and defecating in the same mouth…

Common operations on the stack include push and pop, which correspond to pushing and pushing of data. Access to the top of the stack, determine whether the stack is empty, determine the size of the stack, and so on. Due to the last-in-first-out characteristics of the stack, it can often be used as a temporary container for data operations, to control the order of data, and can be combined with other data structures to obtain a lot of flexible processing.

5 the queue

Queues are siblings of stacks, and correspond to lifO of stacks. Queues are fifO data structures. As the name implies, data storage in queues is like queuing, with the data stored first being pushed out first. Often with the stack together, can play the greatest strength.

6 tree

As a data structure in the form of a tree, the relationship between data nodes is similar to that of a tree. A limited number of nodes are arranged according to different hierarchical relationships, thus forming the parent-child relationship between data. The common representation of a number is closer to an “upside-down tree” because it puts the roots up and the leaves down.

The data of a tree is stored in nodes, and each node has zero or more children. The node with no parent is at the top and becomes the root node; No non-root node has one and only one parent; Each non-root node can be divided into multiple disjoint subtrees.

This means that trees are hierarchical, fathers and sons are clear, families are clear; This is the main difference between a tree and a graph.

Although the tree seems to be very advanced, in fact, it can be regarded as a high version of the linked list. The realization of tree is to expand the pointer field of linked list, adding multiple addresses to point to child nodes. At the same time, the “linked list” stood up, so as to highlight the hierarchical relationship between nodes, easier to analyze and understand.

Many structures can be derived from trees. If the pointer field is set to double Pointers, the most common binary tree can be formed, that is, each node has at most two subtrees of the tree structure. According to the arrangement and number of nodes, binary trees can be further divided into complete binary trees, full binary trees, balanced binary trees, red black trees and so on.

Complete binary tree: except for the last node, the number of nodes in other layers reaches the maximum value. At the same time, the nodes in the last layer are arranged from left to right. Full binary tree: all nodes except the last one have two children.

Balanced binary trees

A balanced binary tree, also known as an AVL tree, is a binary sorted tree and has the following properties: it is an empty tree or the absolute value of the height difference between the left and right subtrees is not more than 1, and both the left and right subtrees are balanced binary trees.

Binary sort tree: is an empty tree, or: if its left subtree is not empty, the value of all nodes in the left subtree is less than the value of its root node; If its right subtree is not empty, the value of all nodes in the right subtree is greater than the value of its root node. Its left and right subtrees are also binary sort trees respectively. Tree height: maximum balance factor of node hierarchy: left subtree height – right subtree height

Binary sorting tree means that the data in the binary tree is sorted, the order is left node < root node < right node, which indicates that the middle order traversal result of binary sorting tree is ordered. (also do not understand the binary tree four traversal way [pre-order traversal, middle-order traversal, post-order traversal, hierarchical traversal] students quickly remedial remedial!

Balanced binary trees are generated to solve the linear arrangement of binary sorting trees during insertion. Since the binary sorting tree itself is ordered, when a highly ordered sequence is inserted, the generated binary sorting tree will continue to insert data on the number of words in a certain direction, resulting in the final binary sorting tree will degenerate into a linked list, so that the query and insertion efficiency of the binary tree deteriorates.

The emergence of balanced binary tree can solve the above problems, but in the construction of balanced binary tree, different adjustment methods are needed to keep the binary tree balanced after inserting data. The four main adjustments are LL (left rotation), RR (right rotation), LR (first left rotation then right rotation), and RL (first right rotation then left rotation). I’m going to introduce you to a simple single rotation operation, left and right. LR and RL are essentially just a combination of LL and RR.

After inserting a node, the node balance factor on the path should be modified along the search path. When the balance factor is greater than 1, it needs to be balanced. Starting from the node where the imbalance occurs, take the nodes directly below the two layers along the path just backtracked. If the three nodes are in a straight line, a single rotation is used for balancing; if the three nodes are in a broken line, a double rotation is used for balancing.

Sinistral: S is the node that currently needs sinistral, and E is the parent node of the current node.

The operation of left-rotation can be expressed simply as follows: rotate the left child of the current node S to the right child of the current node’s parent E, and rotate the parent E to the left child of the current node S. Available animation representation:

Dextral: S is the node that currently requires left-handed rotation, and E is the parent node of the current node. Right monotone is a mirror rotation of left monotone.

The left-handed operation can also be expressed simply as follows: rotate the left child of the current node S and the right child of E to the left child of the current node S, and rotate the current node S to the right child of the left child E. Available animation representation:

Red and black tree

In order to achieve high balance of balanced binary tree (AVL), the height difference between left and right subtrees must be less than or equal to 1. The benefit of high balance is that it provides higher search efficiency, with the worst search time complexity being O(logN). However, the cost of maintaining this high balance is that many rotations are required to achieve complex balance when inserting and deleting tree nodes. This results in inefficient insertion and deletion of AVL.

To solve this problem, can we find a structure that is efficient for both search and insert and delete? The red black tree then applied to fight.

Red-black trees have five characteristics:

  1. Every node is either red or black.

  2. The root is black.

  3. Every leaf (leaf refers to NIL pointer or NULL at the end of the tree) is black.

  4. If one node is red, then both of its sons are black.

  5. For any node, each path to NIL at the end of the leaf tree contains the same number of black nodes.

Red black tree VS balanced binary tree

In addition to the tree structures mentioned above, there are many tree structures that are widely used in database, disk storage, and other scenarios. B trees, B+ trees, etc. I’m not going to talk about it here, but I’ll talk about it next time when I talk about the principles of storage. (It’s actually laziness)

7 the heap

Once you know binomial trees, it’s easy to understand heaps. A heap is usually an array object that can be thought of as a tree. The heap is usually implemented not through pointer fields, but by building a one-dimensional array that corresponds to the parent nodes of a binary tree, so the heap is always a complete binary tree.

For the ordinal number n of any parent node (where n equals 0), the ordinal number of its children must be 2n+ 1,2 n+2, so a heap can be represented directly as an array.

Not only that, but the heap also has the property that the value of a node in the heap is always no greater than or less than its parent. The heap with the largest root node is called the maximum heap or large root heap, and the heap with the smallest root node is called the minimum heap or small root heap.

Heap is often used to realize the priority queue, in the interview often test questions are related to sorting, such as heap sorting, topK questions and so on. Since the root node of the heap is the maximum or minimum value in the sequence, the extreme value in the data sequence can be screened out during the process of heap construction and heap reconstruction, so as to achieve the purpose of sorting or selecting topK value.

8 a hash table

A hash table, also known as a hash table, is a mechanism for accessing data directly through key-value pairs. In junior high school, we learned that an operation that can obtain a corresponding y value from an x value through a function is called mapping. The implementation principle of hash table is the principle of mapping. By setting a key word and a mapping function, the address of accessing data can be directly obtained to achieve O(1) data access efficiency. In the mapping process, the preset function is a mapping table, also known as a hash function or hash function.

The definition and selection of hash function are the key to implement hash table. The following hash functions are commonly used:

Direct addressing: Take the value of a keyword or a linear function of the keyword as a hash address. Numerical analysis: Through the analysis of the data, find the less conflicting parts of the data, and construct hash addresses. For example, the student id of students, usually of the same class, the first part of which does not differ much, so the last part is used to construct the hash address. Square middle method: When it is impossible to determine which bits of the keyword are relatively evenly distributed, you can first calculate the square value of the keyword, and then take the middle bits of the square value as the hash address. This is because the squared middle bits are related to each of the keywords, so different keywords have a high probability of producing different hash addresses. Random number method: a random function that takes the random value of the keyword as the hash address. This method is usually used when the keyword length is different. The remainder p is the hash address after the keyword is divided by a number m not larger than the table length N of the hash table. This method can also be used after using other methods. The choice of m is very important for this function, so you can either take a prime number or you can just use n.

Once you’ve determined the hash function, you do get a unique value address from a key value. But there are special cases. That is, different keys may access the same address. This phenomenon is called conflict.

After the conflict occurs, it is very dangerous to overwrite or lose the data resulting from the same address when the operation is performed on different keys. Therefore, it is often necessary to use conflict resolution in designing hash tables.

There are many common conflict handling methods, including the following:

Open address (also known as open addressing) : When we need to store a value, we hash the Key and find that the address already has a value. Do not place it at this address, otherwise the previous mapping will be overwritten. At this point, a probe is performed on the calculated address and then hashed, such as moving an address back, if no one is occupied, the address is used. If the maximum length is exceeded, the total length can be mod. The address moved here is the increment of the order when the collision occurred. Rehash: After a conflict, use the other parts of the keyword to continue computing the address. If there is still a conflict, use the other parts to continue computing the address. The disadvantage of this approach is that time increases. Chain address method: chain address method is actually the Key hash after the value of the same address, do a linked list. In fact, many high-level language implementations use this approach to handle conflicts. Public overflow zone: This method is to create a public overflow zone, when there is a conflict address, the new address in the public overflow zone.

At present, the most commonly used conflict resolution method is the linked address method, which can achieve the purpose of caching the conflicting data through the combination of array and linked list.

Considering the problem caused by the long linked list, we can also use red-black tree to replace the linked list to deal with conflicting data, so as to improve the query stability of hash table.

Figure 9

Diagrams may not touch much more than the above structures, but they often appear in real application scenarios. For example, a road map in traffic, a common mind map can be seen as a concrete representation of a map.

A graph structure usually consists of vertices and edges. Vertices are usually represented by circles, and edges are the lines between these circles. Edges can also be weighted differently depending on the relationship between vertices. The default weight is 1. In addition, graphs can be divided into directed graphs and undirected graphs according to the directivity of edges.

Graph structure is very simple to represent with abstract graph lines, the relationship between vertices and edges is very clear. However, it is not easy to store the relationship between vertices and edges in specific code implementation.

Adjacency matrix

At present, the commonly used graph storage method is adjacency matrix, which stores whether two vertices are connected or the edge weight between two vertices through the two-dimensional matrix of all vertices.

The adjacency matrix can be used to obtain the relation of any two vertices directly from the two-dimensional relation. However, to store the matrix, a complete two-dimensional array is required. If the number of vertices in the graph is too many, the size of the two-dimensional array will increase dramatically, thus occupying a large amount of memory space.

According to the actual situation, the vertices in the graph are not connected between any two vertices, and not all of them need to store their edge weights. So the stored adjacency matrix actually has a lot of zeros. Although the matrix with high sparsity can be used to store key information by means of sparse representation, it increases the complexity of graph storage.

Therefore, in order to solve the above problems, an adjacency list that can only store connected vertices comes into being.

Adjacency list

In an adjacency list, each vertex of a graph is the head node of a linked list, followed by adjacent vertices that the vertex can reach directly. Compared with undirected graph, the situation of directed graph is more complicated, so it is used for example analysis here.

The adjacency list can obtain the vertices that can be reached from a vertex, thus saving the storage space of unconnected vertices. However, this is not enough. For a directed graph, the effective information in the graph includes not only the information “pointed out” from the vertex, but also the information “pointed in” from other vertices. Here, “pointing out” and “pointing in” can be expressed by the degree of out and in.

Entry: The sum of the number of times a vertex of a digraph is an endpoint. Output: The sum of the number of times a vertex of a digraph is the starting point.

It can be seen that adjacency list can only calculate the output degree of a directed graph, but not the output degree. This problem is easily solved by adding a table to store adjacent vertices that can reach a vertex. This list is called the inverse adjacency list.

Inverse adjacency list

An inverse adjacency list is similar to an adjacency list except that the vertex of the graph is linked to adjacent vertices that can reach that vertex. That is, adjacency lists follow the arrows in the graph to find adjacent vertices, and inverse adjacency lists follow the arrows in the graph to find adjacent vertices.

Curb table

A cross-linked list seems simple, requiring only the same vertex to be linked to adjacent vertices that start and end at that vertex.

But that’s not the optimal representation. Although the intermediate vertex storage space is shared in this way, the repeated vertices in the linked nodes of the adjacency list and the inverse adjacency list are not reused, but are stored again. Therefore, the representation of the figure above can be further optimized.

The cross list can be optimized to store positive and inverse adjacency lists with extended vertex structure and edge structure :(the bottom arc head can be regarded as the arrow end of the edge, and the arc tail can be regarded as the dot end of the edge)

Data: Used to store data in the vertex; Firstin pointer: used to join the list of other vertices with the current vertex as the arc head, that is, the vertex pointed in from other vertices; Firstout pointer: a linked list of other vertices with the current vertex as the arc tail, that is, the vertex pointed out from this vertex;

An edge structure identifies an edge by storing two vertices, and links adjacent vertices with Pointers that represent the two vertices respectively:

Tailvex: used to store the number of vertices as arc tails; Headvex: used to store the number of vertices as arc heads; Headlink pointer: used to link the next node that stores the vertex as the arc head; Taillink pointer: used to link the next node that stores the vertex as an arc tail;

For example, vertex A is the starting point to vertex E. So in the adjacency list, vertex A points to vertex E through edge AE (that is, edge 04), and the firstout pointer to vertex A needs to point to tailvex of edge 04. At the same time, starting from B can reach A, so in the inverse adjacency list, vertex A points to B through edge AB (i.e. edge 10), and the firstin pointer of vertex A needs to point to the arc head of edge 10, i.e. the headLink pointer. And so on.

Cross-linked lists represent the directionality of edges in a seemingly cumbersome way, increasing the directionality of Pointers between reserved vertices with as little storage as possible. The specific operation may be difficult to understand at a time. It is recommended to look at the figure above several times to understand the meaning of pointer pointing and the representation of forward and backward adjacency lists.

10 summary

The data structure is broad and profound, not as secretive as higher mathematics, nor as mysterious as quantum mechanics, but it has a powerful power in all fields of computer science. This paper attempts to theoretically introduce nine data structures by means of diagrams, but in fact it is not enough.

Even simple array, stack, queue and other structures, in the actual use and the underlying implementation will have a lot of optimization design and use skills, which means that they need to really use them flexibly, can be considered as a true sense of familiarity and mastery. But this article serves as a summary of common data structures, which you may want to revisit when you’ve forgotten some of them.