Preface

This article is a summary of Big Talk Data Structures (《大话数据结构》). When I first read the book I was quite impressed: books that speak plain human language are rare in China. But because data structures and algorithms are inherently dry, parts of the book still read like gobbledygook. This article therefore tries to extract the "human words" from it.

Contents summary

First we need to understand concepts such as the lookup table, record, keyword, primary keyword, static lookup table, and dynamic lookup table.

Sequential table lookup, though unsophisticated, is the basis for many later search techniques. Pay attention to the sentinel technique, which squeezes extra performance out of a simple algorithm that is otherwise hard to improve.

For ordered tables we focus on the idea of binary (half-interval) search, which is a qualitative leap in performance over plain sequential search, from O(n) to O(log n). We then explain two other excellent ordered searches, interpolation search and Fibonacci search; the three each have advantages and disadvantages, which you should carefully compare.

Linear index lookups: we cover the dense index, the block index, and the inverted index. Indexing technology is widely used in file retrieval, databases, search engines, and other fields, and is the basis for further study of those technologies.

The binary sort tree is the most important data structure for dynamic search. It makes insertion and deletion efficient while keeping search performance in mind. But to reach its optimal state, a binary sort tree is best built as a balanced binary tree. We therefore need to study the balanced binary tree (AVL tree) and understand how the AVL tree handles the balance problem. This part is the focus of the chapter and needs to be studied carefully.

The B-tree is a data structure designed specifically for interaction between memory and external storage. Since search performance there depends mostly on the number of disk reads, balance and the number of levels must be considered in the B-tree's design. In the explanation we first learn how to build the tree and insert and delete elements through the simplest B-tree, the 2-3 tree, then deepen the understanding with the 2-3-4 tree, and finally grasp the principle of the B-tree. After that we also introduce the design idea of the B+ tree.

The hash table is a very efficient search data structure. In principle it differs from the earlier searches: it avoids the tedium of repeated keyword comparisons and steps straight to the result. Of course, this brings the disadvantage that there is no connection between records. It should be said that hash tables are very good for data that demands high lookup performance and needs no relationships between records. When studying them, pay attention to the choice of hash function and the method of handling collisions.

1. Sequential search

Sequential search, also known as linear search, is the most basic search technique.

Its process is: starting from the first (or last) record in the table, compare each record's keyword with the given value one by one. If some record's keyword equals the given value, the search succeeds and the record is found; if no keyword equals the given value by the time the last (or first) record has been compared, the table contains no such record and the search fails.
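As a small illustration of both the plain scan and the sentinel trick mentioned in the summary, here is a hedged C sketch (the function name and sample data are mine, not the book's): slot a[0] is reserved for the key, so the loop no longer needs a bounds check.

```c
#include <stdio.h>

/* Sequential search with a sentinel. The key is parked in the spare slot
 * a[0]; the scan from the tail can then never run off the front, which
 * removes the "i >= 1" test from every iteration. Data lives in a[1..n]. */
int sequential_search(int a[], int n, int key) {
    a[0] = key;                /* set the sentinel */
    int i = n;                 /* scan from the last record toward the head */
    while (a[i] != key)
        i--;
    return i;                  /* 0 means the search was unsuccessful */
}

int main(void) {
    int a[] = {0, 5, 3, 9, 7, 2};                /* a[0] is the sentinel slot */
    printf("%d\n", sequential_search(a, 5, 9));  /* 3: found at position 3 */
    printf("%d\n", sequential_search(a, 5, 8));  /* 0: not found */
    return 0;
}
```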

2. Ordered table lookup

If we just pile books on the shelf, finding a particular one is still hard: we have to check them one by one. But if we arrange the bookshelf alphabetically, finding a book becomes relatively easy. To put it bluntly, the books are ordered, and when a linear list is ordered it always helps searching.

This section reference: zhuanlan.zhihu.com/p/31895830

(1) Binary search

Binary search is also known as half-interval search.

Its premise is that the records in the linear table must be ordered by keyword (usually ascending) and the table must use sequential storage.

The basic idea of binary search: in an ordered table, take the middle record as the comparison object. If the given value equals the middle record's keyword, the search succeeds; if the given value is less than the middle record's keyword, continue searching in the left half; if it is greater, continue in the right half. Repeat until the search succeeds, or until the search region has no records left and the search fails.
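A minimal C sketch of this idea, under the stated premise of an ascending, sequentially stored table (names are illustrative):

```c
/* Binary search over an ascending array a of length n.
 * Returns the index of key, or -1 if it is absent. */
int binary_search(const int a[], int n, int key) {
    int low = 0, high = n - 1;            /* current search region */
    while (low <= high) {
        int mid = low + (high - low) / 2; /* middle record; avoids (low+high) overflow */
        if (a[mid] == key)
            return mid;                   /* search succeeds */
        else if (a[mid] > key)
            high = mid - 1;               /* continue in the left half */
        else
            low = mid + 1;                /* continue in the right half */
    }
    return -1;                            /* region exhausted: search fails */
}
```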

(2) Interpolation search

Interpolation search is a search method that picks the probe position according to how the keyword key compares with the keywords of the largest and smallest records in the table.

Its core lies in the interpolation formula:

mid = low + (high - low) * (key - a[low]) / (a[high] - a[low])

It should be said that for long tables whose keywords are fairly uniformly distributed, the average performance of interpolation search is much better than binary search. Conversely, for extremely uneven data such as {0, 1, 2, 2000, 2001, ..., 999998, 999999}, interpolation may not be a good choice.
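A hedged sketch: interpolation search is binary search with only the probe formula changed, assuming an ascending and roughly uniform array.

```c
/* Interpolation search: estimate where key falls between a[low] and
 * a[high] instead of always probing the midpoint. */
int interpolation_search(const int a[], int n, int key) {
    int low = 0, high = n - 1;
    while (low <= high && key >= a[low] && key <= a[high]) {
        if (a[high] == a[low])                  /* avoid dividing by zero */
            return (a[low] == key) ? low : -1;
        int mid = low + (int)((long long)(high - low) * (key - a[low])
                              / (a[high] - a[low]));
        if (a[mid] == key) return mid;
        else if (a[mid] > key) high = mid - 1;
        else low = mid + 1;
    }
    return -1;
}
```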

(3) Fibonacci search

  1. Determine the length (or maximum element value) of the Fibonacci sequence to use, based on the length of the array to be searched

  2. Create a Fibonacci array of that length and fill it using F(0) = 1, F(1) = 1, F(n) = F(n-1) + F(n-2)

  3. Create a padded array whose length is the maximum value from the Fibonacci array in step 2, copy the original elements into it, and fill any remaining slots with the value of the original array's last element

  4. Perform the keyword lookup on the padded array, remembering to check whether a hit falls in the padded tail (a code sketch of these four steps follows)
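Here is a hedged C sketch of those four steps; the padding buffer, names, and 0-based indexing are my own choices, not the book's code.

```c
#include <stdlib.h>

/* Fibonacci search over an ascending array a of length n.
 * Returns an index of key in a, or -1 if absent. Assumes n < F(31). */
int fibonacci_search(const int a[], int n, int key) {
    int F[32] = {1, 1};                      /* step 2: Fibonacci sequence */
    for (int i = 2; i < 32; i++)
        F[i] = F[i - 1] + F[i - 2];

    int k = 0;
    while (F[k] - 1 < n)                     /* step 1: smallest k with F(k)-1 >= n */
        k++;

    int *b = malloc(sizeof(int) * (F[k] - 1));  /* step 3: padded copy */
    for (int i = 0; i < F[k] - 1; i++)
        b[i] = (i < n) ? a[i] : a[n - 1];       /* pad the tail with the last element */

    int low = 0, high = n - 1, result = -1;
    while (low <= high) {                    /* step 4: search the padded array */
        int mid = low + F[k - 1] - 1;        /* split at the Fibonacci point */
        if (key < b[mid]) {
            high = mid - 1;                  /* left part holds F(k-1)-1 elements */
            k -= 1;
        } else if (key > b[mid]) {
            low = mid + 1;                   /* right part holds F(k-2)-1 elements */
            k -= 2;
        } else {
            result = (mid < n) ? mid : n - 1; /* a hit in the padding maps back */
            break;
        }
    }
    free(b);
    return result;
}
```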

3. Linear index lookup

The more efficient lookup methods discussed above are all built on ordered data, but in reality many data sets grow very fast. For example, on a large microblogging site or forum, posts and replies can total in the millions every day, and some server logs likewise accumulate enormous numbers of records. Keeping all records ordered by some keyword is expensive, so such data is usually just stored in arrival order.

So for such a lookup table, how can we quickly find the data we need? The solution is indexes.

The ultimate purpose of a data structure is to speed up data processing, and an index is a data structure designed to speed up searching. Indexing is the process of associating a keyword with the record that corresponds to it. An index consists of several index entries; each entry contains at least the keyword and the storage location of its corresponding record. Indexing is an important technique for organizing large databases and disk files.

Indexes can be divided into linear indexes, tree indexes, and multilevel indexes. We introduce only linear indexing here. A linear index organizes the set of index entries into a linear structure, also called an index table. We focus on three kinds of linear index: the dense index, the block index, and the inverted index.

(1) Dense index (book entry position)

A dense index is a linear index in which every record in the data set has its own index entry.

A dense index may have to cope with vast amounts of data, so for the dense index table itself, the index entries must be sorted by keyword.

An ordered index also means that when looking up a keyword we can use binary, interpolation, Fibonacci, and other ordered search algorithms, which greatly improves efficiency.

For example, in the figure, suppose we want the record whose keyword is 18. Searching the data table on the right directly, we could only go sequentially and would need six comparisons. Searching the index table on the left instead, two binary-search steps yield the pointer for 18, which leads straight to the record.

This is clearly the advantage of a dense index. But if the data set is very large, say hundreds of millions of records, the index must be just as large. On a computer with limited memory the lookup may then have to visit the disk repeatedly, and performance drops sharply.

(2) Block index (library)

Because a dense index has exactly as many index entries as the data set has records, it costs a lot of space. To reduce the number of index entries, we can divide the data set into blocks, make the blocks ordered, and then build one index entry per block.

Block ordering divides the records of a data set into several blocks, and these blocks must satisfy two conditions:

  • Unordered within blocks: the records inside each block are not required to be ordered. Of course, ordering within blocks would help lookups, but it costs a great deal of time and space, so in general we do not require it.

  • Ordered between blocks: for example, the keywords of all records in block 2 must be greater than those in block 1, and the keywords in block 3 greater than those in block 2, because only order between blocks can bring efficiency to the lookup.

For a block-ordered data set, each block corresponds to one index entry; this is called a block index. As shown in the figure, the index entry structure of the block index we define has three data items:

  • The maximum keyword, which stores the largest keyword in the block; the benefit is that the smallest keyword in the following block is then guaranteed to be larger than this block's largest keyword;

  • The number of records in the block, for use when looping through it;

  • A pointer to the block's first data element, from which traversal of the block's records begins.

A lookup in a block index table takes two steps (sketched in code below):

  1. In the block index table, find the block that should contain the keyword. Since the blocks are ordered, binary search or interpolation gives the result easily. For example, to find 62 in the data set in the figure, from 57 < 62 < 96 we quickly learn from the index table in the upper left corner that 62 is in the third block.
  2. Follow the block's head pointer to the block itself and search the keywords there sequentially, because within a block the records may be unordered.
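A hedged sketch of those two steps in C; the index entry layout follows the three data items above, with illustrative names.

```c
/* One index entry per block, as described above. */
typedef struct {
    int max_key;   /* largest keyword stored in the block */
    int count;     /* number of records in the block */
    int start;     /* position of the block's first record in the data array */
} BlockIndex;

/* Step 1: binary-search the (block-ordered) index; step 2: scan the block. */
int block_search(const int data[], const BlockIndex idx[], int nblocks, int key) {
    int low = 0, high = nblocks - 1, blk = -1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (idx[mid].max_key >= key) { blk = mid; high = mid - 1; }
        else                         { low = mid + 1; }
    }
    if (blk < 0) return -1;          /* key exceeds every block's maximum */

    for (int i = idx[blk].start; i < idx[blk].start + idx[blk].count; i++)
        if (data[i] == key)          /* sequential scan: block may be unordered */
            return i;
    return -1;
}
```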

(3) Inverted index (search engine)

Here the word table serves as the index table, and the general structure of an index entry is:

  • The secondary keyword, such as the "English word" above;
  • The record number table, such as the "article numbers" above.

The record number table stores the record numbers of all records that share the same secondary keyword (either pointers to the records or the records' primary keywords). This indexing method is the inverted index.

Inverted indexes arise from the practical need to find records by the value of an attribute (or field, or secondary keyword). Each entry in such an index table contains an attribute value and the addresses of the records having that attribute value. Because it is not the record that determines the attribute value but the attribute value that locates the record, it is called an inverted index.

Of course, real search technology is far more complicated. For example, we may want to know not just which articles contain a keyword but also where in each article it appears, which requires improving the record number table. And when article numbers run into the hundreds of millions, we need not store full numbers: if three articles are numbered 112, 115, and 119, we can record them as "112, +3, +4", storing only the differences so that each entry takes just one or two bytes. Even the keywords themselves can be compressed: if one entry is "and" and the next is "android", the latter can be stored as "<3,roid>" (reuse the first three characters, then append "roid"), which also compresses the data. Finally, although a search may report tens of thousands of matching records, only the first 10 or 20 are actually shown; the later index records are fetched only when you click to the next page, which also greatly improves the overall efficiency of the search.

4. Binary sort tree

A binary sort tree (Binary Sort Tree) is also called a binary search tree.

It is either an empty tree or a binary tree with the following properties:

  • If its left subtree is not empty, the values of all nodes in the left subtree are less than the value of its root node.
  • If its right subtree is not empty, the values of all nodes in the right subtree are greater than the value of its root node.
  • Its left and right subtrees are themselves binary sort trees.

The purpose of constructing a binary sort tree is not sorting itself but speeding up searching for, inserting, and deleting keywords. In any case, searching an ordered data set is always faster than searching an unordered one, and the nonlinear structure of a binary sort tree also makes insertion and deletion convenient.

(1) Binary sort tree search
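The search follows directly from the three properties above. A minimal C sketch (node layout and names are my assumptions, not the book's code):

```c
#include <stdlib.h>

/* A binary sort tree node. */
typedef struct BiTNode {
    int data;
    struct BiTNode *lchild, *rchild;
} BiTNode;

/* Search: smaller keys go left, larger keys go right. Returns the node
 * holding key, or NULL when an empty subtree is reached (not found). */
BiTNode *bst_search(BiTNode *t, int key) {
    if (t == NULL || t->data == key)
        return t;
    if (key < t->data)
        return bst_search(t->lchild, key);
    return bst_search(t->rchild, key);
}
```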

(2) Binary sort tree insertion

A binary sort tree is stored as a linked structure, keeping the advantage of linked storage that no elements need to move during insertion or deletion: once the proper position is found, only pointers are modified. The time performance of insertion and deletion is therefore good.
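A hedged insertion sketch, reusing the BiTNode from the search sketch above; note that only pointers change, no elements move.

```c
/* Insert: descend as in search; when an empty link is reached,
 * hang a new node there. Returns the (possibly new) subtree root. */
BiTNode *bst_insert(BiTNode *t, int key) {
    if (t == NULL) {
        t = malloc(sizeof(BiTNode));   /* found the insert position */
        t->data = key;
        t->lchild = t->rchild = NULL;
        return t;
    }
    if (key < t->data)
        t->lchild = bst_insert(t->lchild, key);
    else if (key > t->data)
        t->rchild = bst_insert(t->rchild, key);
    /* key already present: do nothing */
    return t;
}
```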

(3) Binary sort tree deletion

We analyze node deletion in three cases (a code sketch follows the list):

  • Leaf nodes: delete directly.

  • Nodes with only a left or only a right subtree: the only child takes its parent's place.

  • Nodes with both subtrees: replace the node with its in-order predecessor (or successor) and delete that node instead. In code, we recursively search the binary sort tree T for the key and then perform the deletion at the node found.
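A hedged sketch of the three cases, again reusing BiTNode; for a node with two subtrees it copies in the in-order predecessor (the largest key of the left subtree) and deletes that node instead.

```c
/* Delete key from the tree rooted at t; returns the new subtree root. */
BiTNode *bst_delete(BiTNode *t, int key) {
    if (t == NULL) return NULL;                 /* key not found */
    if (key < t->data)
        t->lchild = bst_delete(t->lchild, key);
    else if (key > t->data)
        t->rchild = bst_delete(t->rchild, key);
    else if (t->lchild == NULL || t->rchild == NULL) {
        /* cases 1 and 2: leaf, or only one subtree; the only child
         * (possibly NULL) takes the parent's place */
        BiTNode *child = t->lchild ? t->lchild : t->rchild;
        free(t);
        return child;
    } else {
        /* case 3: two subtrees; copy the in-order predecessor up,
         * then delete the predecessor from the left subtree */
        BiTNode *pre = t->lchild;
        while (pre->rchild) pre = pre->rchild;
        t->data = pre->data;
        t->lchild = bst_delete(t->lchild, pre->data);
    }
    return t;
}
```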

(4) Balanced binary tree (AVL tree) and its implementation

A balanced binary tree is a binary sort tree in which the height difference between the left and right subtrees of each node is at most equal to 1. It is a highly balanced binary sort tree.

So what is height-balanced? It means the tree is either empty, or its left and right subtrees are both balanced binary trees and the absolute value of the difference between their depths is no more than 1.

We call the depth of a node's left subtree minus the depth of its right subtree its Balance Factor (BF). The balance factor of every node in a balanced binary tree can only be -1, 0, or 1. As soon as any node's balance factor has an absolute value greater than 1, the binary tree is unbalanced.

Look at Figure 8-7-2: why is tree 1 a balanced binary tree while tree 2 is not? This tests our grasp of the definition, whose premise is that the tree is a binary sort tree. In tree 2, 59 is bigger than 58 yet sits in 58's left subtree, which violates the binary sort tree definition. Tree 3 is not balanced because node 58's left subtree has height 2 while its right subtree is empty; the difference exceeds 1 in absolute value. Tree 4, properly adjusted, meets the definition and is therefore a balanced binary tree.

The basic idea of building a balanced binary sort tree: whenever a node is inserted during construction, check whether the insertion has broken the tree's balance; if so, find the minimal unbalanced subtree.

Then, while preserving the binary sort tree property, adjust the links among the nodes of the minimal unbalanced subtree and perform the appropriate rotation to make it a new balanced subtree.

At any node whose BF is greater than 1 or less than -1, rotate: rotate right if BF is greater than 1, rotate left if it is less than -1.
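A hedged sketch of the two single rotations, reusing BiTNode (the height/BF bookkeeping of a full AVL implementation is omitted). Note that when the taller child leans the opposite way, the book first rotates that child, giving the LR and RL double rotations.

```c
/* Right rotation: used when the left subtree is too tall (BF > 1). */
BiTNode *rotate_right(BiTNode *p) {
    BiTNode *l = p->lchild;
    p->lchild = l->rchild;   /* l's right subtree moves under p */
    l->rchild = p;           /* p becomes l's right child */
    return l;                /* l is the new root of this subtree */
}

/* Left rotation: used when the right subtree is too tall (BF < -1). */
BiTNode *rotate_left(BiTNode *p) {
    BiTNode *r = p->rchild;
    p->rchild = r->lchild;
    r->lchild = p;
    return r;
}
```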

5. Multi-path lookup tree (B-tree)

All of the data structures discussed so far live in memory, so only in-memory computation time has mattered.

But what if the data set we operate on is so large that memory cannot hold it all? For example, tens of millions of records in a database table, or tens of thousands of files on a hard disk. Processing such data requires constantly loading and unloading memory pages from storage devices such as hard disks.

Once such external storage devices are involved, the way we reckon time complexity changes: the time to access an element becomes a function of how many times the device is read, so we must account for disk access time and for how many separate accesses the algorithm makes to the device.

To find a text file on a disk holding hundreds of thousands of files, it matters greatly whether your algorithm reads the disk tens of thousands of times or only dozens of times. To reduce the number of external storage accesses, we need a new data structure.

If a node can store only one element, then when there are very many elements, either the tree's degree (the maximum number of subtrees a node can have) is very large, or the tree's height is very large, or both. That makes the number of memory and external-storage accesses very large, which clearly becomes a bottleneck in time efficiency. This forces us to break the limit of one element per node, hence the concept of the multi-way search tree.

In a multi-way search tree, each node may have more than two children, and each node may store multiple elements. Since it is a search tree, all the elements obey a definite ordering relation.

Here, how many elements each node can store and how many children it has are both critical. We therefore explain four special forms: the 2-3 tree, the 2-3-4 tree, the B-tree, and the B+ tree.

(1) 2-3 trees

A 2-3 tree is a multi-way search tree in which every node has two children (we call it a 2-node) or three children (a 3-node).

A 2-node contains one element and has two children (or no children); as in a binary sort tree, its left subtree contains elements smaller than that element and its right subtree contains larger elements. Unlike a binary sort tree, however, a 2-node either has no children or has exactly two; it cannot have only one child.

A 3-node contains two elements, one small and one large, and has three children or none. If a 3-node has children, the left subtree contains elements smaller than the smaller element, the right subtree contains elements larger than the larger element, and the middle subtree contains elements between the two. Also, all leaves of a 2-3 tree are on the same level. A valid 2-3 tree is shown in Figure 8-8-2.

A. Insertion of 2-3 trees

The real complexity of 2-3 trees lies in inserting new nodes and deleting existing ones. After all, each node may be a 2-node or a 3-node, and keeping all leaves on the same level requires fairly complex operations.

2-3 Tree insertion can be divided into three cases:

1) For an empty tree, insert a 2-node; this is easy to understand.

2) Insertion at a 2-node leaf. Since the node holds only one element, it simply needs to be upgraded to a 3-node; see Figure 8-8-3. We want to insert element 3 into the 2-3 tree on the left. It is smaller than 8 and smaller than 4, so the only place to consider is the leaf holding 1; the natural idea is to turn that node into a 3-node, as on the right. Which element goes left and which goes right depends on how the inserted element compares with the current leaf: inserting 0 instead would give a node with 0 on the left and 1 on the right.

3) Insertion at a 3-node leaf. Since a 3-node is already at the maximum capacity of a 2-3 tree node (it holds two elements), it must be split, and one of the three elements involved (the two it held plus the one inserted) moves up one level. Herein lies the complication.

In the first case, shown in Figure 8-8-4, we need to insert element 5 into the tree on the left. Traversal finds that 5 is smaller than 8 and larger than 4, so it should go into the node holding 6 and 7. The problem is that the 6-7 node is already a 3-node and cannot take more. Its parent, 4, is a 2-node, so we consider upgrading it to a 3-node, which then must have three children. This leads to splitting 6 and 7: 6 joins 4 to form a 3-node, 5 becomes its middle child and 7 its right child, as on the right side of Figure 8-8-4.

In the other case, shown in Figure 8-8-5, element 11 is inserted into the tree on the left. Traversal finds that 11 is smaller than 12 and 14 and larger than 9 and 10, so it should go into the 3-node holding 9 and 10. Likewise, the 9-10 node cannot take more, and its parent, the node holding 12 and 14, is also a 3-node that cannot accept another element. Going up once more, the parent of the 12-14 node, node 8, is a 2-node. The idea is therefore to split both the 9-10 node and the 12-14 node and upgrade root 8 to a 3-node; the final result is the right side of Figure 8-8-5.

B. Deletion of 2-3 trees

Deletion in a 2-3 tree should not be hard if the preceding insertion is well understood. It also splits into three cases, but opposite to insertion, we start with deletion from a 3-node.

1) The element to delete lies in a 3-node leaf. This is simple: just remove the element from the node, which does not affect the structure of any other node. As in Figure 8-8-7, deleting element 9 only requires turning this node into a 2-node holding element 10.

2) The element to delete lies in a 2-node, that is, a node holding only one element. Under our earlier understanding of trees we could just delete it, but the definition of a 2-3 tree forbids this. For example, in Figure 8-8-8, if we delete node 1, then node 4, originally a 2-node with two children, is left with only one child and no longer satisfies the definition.

We therefore handle the deletion of a 2-node leaf in four cases.

In case one, the node's parent is also a 2-node and has a 3-node right child. As in Figure 8-8-9, deleting node 1 only requires a left rotation: 6 becomes the parent, 4 becomes 6's left child, and 7 becomes 6's right child.

In case two, the node's parent is a 2-node and its right child is also a 2-node. As Figure 8-8-10 shows, after deleting node 1 a plain left rotation would leave no right child, so the whole tree must be reshaped. The idea: we want node 7 to become a 3-node, so we bring down element 8, which is slightly larger than 7, and then fill 8's old place with element 9, which is slightly larger than 8. That gives the middle of Figure 8-8-10, and a left rotation then produces the right side.

In case three, the node's parent is a 3-node. As in Figure 8-8-11, deleting node 10 means the parent holding 12 and 14 can no longer remain a 3-node, so that node is split, and 12 and 13 merge to become the left child.

In case four, the tree is a full binary tree, and deleting any leaf would break the 2-3 tree definition. As in Figure 8-8-12, deleting leaf node 8 (deleting any other node is similar) forces us to reduce the tree's height: the parent of 8 merges with its left subtree 6 into a 3-node, and 14 merges with 9 into a 3-node, finally giving the tree on the right.

3) The element to delete sits in a branch (non-leaf) node. In this case we usually find the element's in-order predecessor or successor and let it fill the vacated position.

If the branch node to delete is a 2-node: as in Figure 8-8-13, we delete node 4. Analysis finds its predecessor is 1 and its successor is 6; since 6 and 7 form a 3-node, we simply move 6 up to fill the position, as on the right of Figure 8-8-13.

If the branch node to delete holds an element of a 3-node: as in Figure 8-8-14, we delete 12 from the node holding 12 and 14. Analysis shows that 10, from the 3-node left child, should be raised into the vacated position.

(2) 2-3-4 trees

A 2-3-4 tree is simply an extension of the 2-3 tree that also uses 4-nodes. A 4-node contains three elements and has four children or none. If a 4-node has children, the left subtree contains elements smaller than the smallest element; the second subtree contains elements greater than the smallest element and less than the second; the third subtree contains elements greater than the second element and smaller than the largest; and the right subtree contains elements greater than the largest element.

If we build a 2-3-4 tree from the array {7, 1, 2, 5, 6, 9, 8, 4, 3}, the process is shown in Figure 8-8-15. Diagram 1 shows the result of inserting 7, 1, and 2: the three elements satisfy the definition of a single 4-node, so no split is needed yet. Inserting element 5 then exceeds what a 4-node may hold, so the node splits into the shape of diagram 2. The rest of the figure, through diagram 7, shows the 2-3-4 tree forming as elements are inserted.

Figure 8-8-16 shows the evolution of a 2-3-4 tree as nodes are deleted in the order 1, 6, 3, 4, 5, 2, 9.

(3) B-tree

A B-tree is a balanced multi-way search tree; 2-3 and 2-3-4 trees are special cases of it. The maximum number of children of a node is called the order of the B-tree, so a 2-3 tree is a B-tree of order 3 and a 2-3-4 tree is a B-tree of order 4.

For example, the 2-3-4 tree built from the nine inserted numbers above, redrawn as a B-tree, appears on the right of Figure 8-8-17. The gray squares on the left indicate the number of elements in each node.

Searching a B-tree alternates between following pointers to find a node and searching for the keyword within that node.

For example, to find the number 7, we first read the root node with elements 3, 5, and 8 from external memory (such as a hard disk) and find that 7 is not there but lies between 5 and 8. We therefore follow pointer A2 to read the node holding 6 and 7 from external memory and find the desired element.
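A hedged sketch of that alternating process; the node layout below is an assumption for illustration, and each loop iteration models one read of a node from external memory.

```c
#define MAX_KEYS 1000   /* illustrative capacity; the order would be MAX_KEYS + 1 */

typedef struct BTNode {
    int keynum;                          /* number of keys in this node */
    int key[MAX_KEYS];                   /* keys, in ascending order */
    struct BTNode *child[MAX_KEYS + 1];  /* child[i] sits between key[i-1] and key[i] */
} BTNode;

/* Search for k; on success returns the node and sets *pos to the key's
 * index inside it, otherwise returns NULL. */
BTNode *btree_search(BTNode *t, int k, int *pos) {
    while (t != NULL) {                  /* each iteration = one node read */
        int i = 0;
        while (i < t->keynum && k > t->key[i])
            i++;                         /* search within the node */
        if (i < t->keynum && k == t->key[i]) {
            *pos = i;
            return t;                    /* found in this node */
        }
        t = t->child[i];                 /* follow the pointer between keys */
    }
    return NULL;                         /* empty link reached: absent */
}
```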

As for the insertion and deletion of B trees, the method is similar to that of 2-3 and 2-3-4 trees, but the order may be larger.

As we said at the start of this section, frequent exchange of data between memory and external storage causes a bottleneck in time efficiency. How does the B-tree structure reduce the number of exchanges?

External storage such as a hard disk divides all information into equal-sized pages; each disk read or write transfers one or more complete pages. For a hard disk, a page may be 2^11 to 2^14 bytes long.

In a typical B-tree application, the amount of disk data to process is too large to fit in memory at once. So we adjust the B-tree so that its order (or the number of elements per node) matches the size of a disk page.

For example, a B-tree of order 1001 (each node holding up to 1000 keywords) and height 2 can store more than one billion keywords. As long as the root node is kept permanently in memory, finding any keyword in this tree takes at most two disk reads. It is like counting money: ordinary people count one bill at a time, while a bank teller counts five, ten, or even dozens at a time, which is of course much faster.

In this way, with limited memory, each disk access fetches the maximum amount of data. Because each B-tree node can hold far more elements than a binary tree node, a B-tree, unlike a binary tree, improves performance by reducing the number of nodes and data blocks that must be accessed. One can say the B-tree data structure exists precisely for the data exchange between internal and external storage.

(4) B+ tree

For all the B-tree's benefits, there are drawbacks. With an ordinary tree structure held entirely in memory, we can visit the elements in sorted order via an in-order traversal. In a B-tree structure, however, such a traversal bounces back and forth between nodes, meaning repeated visits to disk pages. As in Figure 8-8-18, suppose each node of the B-tree lives on a different disk page and we traverse all elements in order: page 2 → page 1 → page 3 → page 1 → page 4 → page 1 → page 5. And each time we pass through a node we must traverse the elements inside it, which is quite bad. Could there be a traversal that visits each element only once?

To solve problems such as traversing all elements, we add a new way of organizing the elements on top of the original B-tree structure: the B+ tree.

The B+ tree is a variant of the B-tree demanded by file systems. (Strictly speaking, it is no longer a tree as defined in Chapter 6.) In a B-tree, each element appears in the tree exactly once, whether in a leaf or a branch node. In a B+ tree, every element that appears in a branch node is listed again, as the in-order successor of that branch position, in a leaf node. In addition, every leaf node stores a pointer to the next leaf.

For example, Figure 8-8-19 illustrates a B+ tree: the gray keywords in the leaf nodes are the root node's keywords listed again, and all the leaf nodes are linked together.

6. Hash tables

In sequential table lookup, we said: to find a keyword, start from the head of the table and compare the key with each a[i] for equality or inequality, until a match is found and the search succeeds, returning i.

In ordered table lookup, we compare key with a[i] using "<" or ">" to halve the search range until a match returns i. Either way, our goal is to find i, which is really a relative subscript; the sequential-storage address formula LOC(a_i) = LOC(a_1) + (i - 1) × c, the memory location of the first element plus i - 1 cell widths, then gives the final memory address.

Here we notice that with the previous methods, "comparison" is unavoidable on the way to a result. But is it really necessary? Can we locate the memory position of a record directly from its keyword key?

Hashing establishes a definite correspondence f between a record's storage location and its keyword, so that each keyword key corresponds to one storage location f(key). During a search, we compute f(key) for the given key from this fixed correspondence; if the record exists in the search set, it must be at position f(key).

This correspondence f is called the hash function. Using hashing, records are stored in a contiguous block of storage called a hash table, and the storage location corresponding to a keyword is called its hash address.

(1) Hash table storage

The whole hash process is really two steps.

1) When storing, compute the record's hash address with the hash function and store the record at that address.

2) When looking up a record, compute its hash address with the same hash function and access the record at that address. It sounds almost trivial: wherever you stored it, that is where you look. Since storage and lookup use the same hash function, the results naturally agree.

(2) Hash table lookup

Hashing is best suited to problems of finding a record equal to a given value. For such lookups, simplifying away the comparison process greatly improves efficiency. But everything has two sides: hashing lacks many capabilities of conventional data structures.

For example, a keyword that many records share is not suitable for hashing. In a class of dozens of students, with both male and female, using the keyword "male" to search matches many student records, which is clearly inappropriate. Hash storage fits better when, say, student numbers or ID numbers are the keywords, since each number corresponds to exactly one student.

Likewise, hash tables are unsuited to range lookups, such as finding all students in a class aged 18 to 22. Nor can we obtain the ordering of the records, or compute results such as maximum and minimum values, from a hash table.

(3) Hash function

Two design principles for hash functions:

  1. Simple computation, saving time (if computing the address takes an algorithm ages, why use it at all?).
  2. Uniformly distributed hash addresses, to make effective use of the storage space and reduce the time spent handling collisions.

Five factors in choosing a hash function:

  1. The time required to compute the hash address.
  2. The length of the keywords.
  3. The size of the hash table.
  4. The distribution of the keywords.
  5. How frequently records are searched.

Common hash function constructors
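The book lists several constructions (direct addressing, digit analysis, mid-square, folding, division, random numbers). As a hedged sketch, here is the most common one, the division method: f(key) = key mod p, where p is usually a prime no larger than the table length. The table size below is illustrative.

```c
#define HASHSIZE 12      /* illustrative table length */

/* Division-method hash function: f(key) = key mod p, here p = HASHSIZE.
 * Choosing p prime spreads the addresses more uniformly. */
int hash(int key) {
    return key % HASHSIZE;
}
```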

(4) Methods for handling hash collisions

A collision occurs when two different keywords hash to the same address: key1 ≠ key2, yet f(key1) = f(key2). Common remedies include open addressing (such as linear or quadratic probing), rehashing, and the chained-address method.
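A hedged sketch of the simplest remedy, open addressing with linear probing: on a collision, step to the next slot until an empty one is found. NULLKEY and the table size are illustrative assumptions.

```c
#define HASHSIZE 12
#define NULLKEY  (-32768)   /* marks an empty slot */

int table[HASHSIZE];

void hash_init(void) {
    for (int i = 0; i < HASHSIZE; i++)
        table[i] = NULLKEY;
}

/* Insert: assumes the table is not full, otherwise this would loop forever. */
void hash_insert(int key) {
    int addr = key % HASHSIZE;          /* division-method hash */
    while (table[addr] != NULLKEY)      /* collision: linear probing */
        addr = (addr + 1) % HASHSIZE;
    table[addr] = key;
}

/* Search: probe the same way the insert did; -1 means not found. */
int hash_search(int key) {
    int start = key % HASHSIZE, addr = start;
    while (table[addr] != key) {
        if (table[addr] == NULLKEY)
            return -1;                  /* empty slot: key was never stored */
        addr = (addr + 1) % HASHSIZE;
        if (addr == start)
            return -1;                  /* wrapped around the whole table */
    }
    return addr;
}
```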

(5) Hash table lookup performance

Let us briefly analyze the performance of hash lookup. Without collisions, hashing is the most efficient lookup covered in this chapter, since its time complexity is O(1). Unfortunately, that "without" describes an ideal: in practice collisions are unavoidable. What, then, does the average search length of a hash lookup depend on?

1. Whether the hash function is uniform

The quality of the hash function directly affects how often collisions occur. However, since different hash functions are equally likely to collide on the same set of random keywords, we may ignore the hash function's influence on the average search length.

2. The collision-handling method

With the same keywords and the same hash function, different collision-handling methods give different average search lengths. For example, linear probing can produce clustering when handling collisions and so is clearly inferior to quadratic probing, while the chained-address method produces no clustering at all and therefore offers the best average search performance.

3. The load factor of the hash table

The so-called load factor α = (number of records filled into the table) / (length of the hash table); α measures how full the table is. If the table length is 12 and it already holds 11 records, then α = 11/12 ≈ 0.9167, and a collision is very likely when the last keyword is filled in. That is, the average search length of a hash table depends on the load factor, not on the number of records in the search set.

However large the record count n is, we can always choose a suitable load factor to keep the average search length within bounds, and then our hash lookup time really is O(1). To achieve this we usually make the hash table larger than the search set. Some space is wasted, but the search efficiency gained makes it well worth it overall.