This article is based on data Structures C Language Edition 2, edited by Weimin Yan, Dongmei Li and Weimin Wu

1. Basic concepts of search

  • Lookup table: A collection of data elements of the same type. Because the data elements in a collection are completely loose, lookup tables are a more convenient data structure, which can be implemented with linear tables, tree tables, and hash tables
  • Keyword: the value of a data element or a data item in a record, which can be used to identify a data element.
  • Primary key: a key that uniquely identifies a record
  • Sub-key: a key that identifies several records
  • To find the: Determines a record or data element in a lookup table whose key is equal to the given value, based on a given value
    • If such a record exists, the search succeeds, and the search results can give information about the entire record or indicate its position in the lookup table
    • If there is no record in the table whose key is equal to the given value, the search is unsuccessful, and the result of the search can be an empty record or a null pointer
  • Dynamic lookup table: A table that is modified (insert or delete) while searching is called a dynamic lookup table
    • If there is a record in the table whose key is equal to the given value, the lookup success is returned
    • If not, insert a record whose key is equal to the given value
  • Static lookup table: The table is not modified during lookup
  • Average search length: The expected number of keywords to be compared with a given value in order to determine the position of the record in the search table. This is called the average search length of the search algorithm when the search is successful

2, linear table search

2.1 Sequential Search

Sequential search: The system starts from one end of the table and compares the key words of the records with the specified value. If the key words of a record are the same as the specified value, the search succeeds. On the other hand, if the entire table is scanned and no key is found equal to the given value, the search failure sequence search method is applicable to the sequential storage structure of linear tables, as well as to the chain storage structure

Set up the sequence lookup of the watch

  • Each time the for loop ends, the condition is whether the current index is at the end, and the loop internally determines whether the key is the same

      int search_seq(SSTable ST,KeyType key){
          for(int i = ST.length; i >= 1; i--){
              if(ST.R[i].key == key) return i;
          }
          return 0;
      }
    Copy the code
  • Set a sentinel, such that the first element of the order table is not stored, but is used as a sentinel, and the termination condition of the for loop becomes that the record with the key is either found or actually found. This improvement can halve the average time required for a sequential lookup when st.Length >= 1000

      int search_seq(SSTable ST,KeyType key){
          ST.R[0].key = key;
          int i;
          for(i = ST.length; ST.R[i].key == key ; i--){
          }
          return i;
      }
      
    Copy the code
  • Time complexity: O(n)

  • Sequential search advantages:

    • Method is simple
    • There are no requirements for the structure of the table, which applies to sequential structures as well as chain structures
    • Keywords are available whether ordered or not
  • Sequential search disadvantage

    • The search efficiency is low when the search length is large

2.2 Split search

Split search: also known as binary search, is a high efficiency search method. The linear table must adopt the sequential storage structure. The elements in the table are arranged in order according to keywords.

In order from smallest to largest, the search process is, starting from the middle of the table:

  • If the given value is equal to the key of the intermediate record, the lookup succeeds
  • If the given value is greater than or less than the key of the intermediate record, the search is performed in the half of the table that is greater than or less than the intermediate record, and the process is repeated until the search is successful

Half search each search comparison will reduce the search scope in half, compared with sequential search, will improve the search efficiency

  • Time complexity: O(logn)
  • Split search advantages
    • The comparison times are small and the search efficiency is high
  • Split in half to find shortcomings
    • It has high requirements on the table structure and can only be used for ordered tables with sequential storage
    • This is not suitable for linear tables that require frequent changes in data elements, given that on average half of the elements are moved when inserting and deleting

2.3 Block Search

Block search, also known as index sequential search, performance between the sequential table and split search between a search method, in addition to the table itself, but also need to build an index table.

An index table is used to divide a linear table into multiple sub-tables. An index item is stored in the index table for each sub-table. Each index item contains a keyword item and a pointer item.

The index table arranges entries according to keywords, while the table stores data in A block order. For example, if the keywords of index A are larger than those of index B, the keywords of all data in INDEX A are larger than the maximum keywords in index B

  • Block search the search process
    • Determine the block where it resides, search the index in sequence or in half, and find the keyword that meets the requirements of A
    • If it is found, the record is returned. If it is not found after traversing the subtable, it indicates that there is no record corresponding to this keyword
  • Average search length: Lb + Lw, where Lb represents the average search length of the block where it is determined, and Lw represents the average search length of elements in the block
  • The advantages of block lookup:
    • When inserting and deleting elements in a table, you only need to find the corresponding block of the element to insert and delete operations in the block. The block is out of order, so it is easy to insert and delete
    • If a linear table needs both quick lookup and dynamic change, block lookup is appropriate
  • Disadvantages of block lookup
    • Depending on the index table, you need to add the index table as a new data structure
    • Index tables need to be sorted

3, tree table search

Sequential table lookups are better suited to static lookups because inserting or deleting elements requires moving multiple records. If dynamic lookup is required, it is suitable to use several special binary trees as the organization form of lookup table.

3.1 Binary sorting tree

Binary sort trees are also called binary search trees

  • Definition of binary sort tree

    • A binary sort tree is either an empty tree or a complete binary tree with the following properties
    • If its left subtree is not empty, the value of all nodes in the left subtree is less than the value of its root node
    • If its right subtree is not empty, then the value of all nodes in the right subtree is greater than the value of its root node
    • The left and right subtrees are binary sort trees respectively
  • Middle order traversing a binary sorting tree yields an ordered sequence with increasing node values

3.1.1 Binary sorting tree search

  • To find the way

    • If the binary sort tree is empty, the search fails and a null pointer is returned
    • If the binary sort tree is not empty, the root keyword is compared to the keyword being looked for
      • If the root keyword is equal to the keyword searched, the search succeeds and the node is returned
      • If the root keyword is greater than the searched keyword, the left subtree is recursively searched
      • If the root keyword is greater than the searched keyword, recursively search the right subtree
  • Average search length

    • In the worst case, this tree is a single-branch tree, consisting of n nodes, and it takes (n + 1)/2 searches on average

    • At best, this tree is a full binary tree, consisting of n nodes, with an average search length of the same order of magnitude as logn

3.1.2 Binary sort tree insertion

  • Insert the way
    • If the binary sort tree is empty, the node to be inserted is used as the root
    • If the binary sort tree is not empty, the keyword of the node to be inserted is compared with the keyword of the current root node
      • If the keyword of the current root node is less than that of the node to be inserted, insert the node into the right subtree
      • If the keyword of the current root node is greater than that of the node to be inserted, insert the node into the left subtree

3.1.3 Deletion of binary sorting tree

  • The node to be deleted is the leaf node, so you only need to change the pointer of the parent’s child to null

  • Only the left or right subtree is deleted. If the deleted node is the left subtree of the parent node, the subtree of the deleted node is the left subtree of the parent node, and the right node is vice versa

  • The deleted nodes left and right subtrees are not empty, we can start from the middle order traversal of binary sorting tree is from small to large sequence,… PL P PR … , there are two ways to deal with it

    • Linking PL to the parent of P and PR as the right subtree to the node of the original PL’s maximum keyword adds depth to the tree
    • Replace P with the node of the maximum keyword of PL. PR does not need to move. If the node of the maximum keyword already has a left subtree, move the left subtree to the right subtree of the parent node
  • Time complexity: O(logn)

3.2 Balanced binary trees

The performance of binary sorting tree search depends on the structure of the tree. If the data is ordered, the search time complexity is O(n); if the structure of the tree is reasonable, the search time complexity is O(logn). The smaller the height of the tree, the fewer times the search is required.

  • The concept of balanced binary trees
    • Balanced binary trees are either empty trees or binary sorting trees with the following characteristics
    • The depth difference between the left and right subtrees is no more than 1
    • Left and right subtrees are also balanced binomial trees
  • Balance factor: The difference between the depth of the left and right subtrees of the node. If the tree is a balanced binary tree, the balance factor of each node can only be -1, 1, 0

3.2.1 Adjustment method of balanced binary tree

Find the ancestor node whose absolute value of the balance factor nearest to the inserted node exceeds 1. The subtree rooted to this node is called the minimum unbalanced subtree, and the scope of readjustment is limited to this minimum unbalanced subtree

In general, assuming that the root of the minimum unbalanced subtree is A, the rules of adjustment after loss of balance can be divided into the following four situations

  • LL type
    • Insert the node into the left subtree of the left subtree of the left subtree of the root node B of A, and increase the balance factor of A from 1 to 2, requiring A clockwise rotation to the right
      • A’s left subtree B becomes the root of the entire tree, and A becomes the right subtree of B
      • Change the right subtree of B to the left subtree of A if B originally has A right subtree

  • The RR type
    • Insert the node into the right subtree of the right subtree of the root node B of A, and the balance factor of A increases from -1 to -2, which requires A left counterclockwise rotation
      • A’s right subtree B becomes the root of the entire tree, and A becomes the left subtree of B
      • If B originally had A left subtree, then we need to change the left subtree of B to the right subtree of A

  • LR type
    • Insert A node into the right subtree C of the left child root node B of A, and the balance factor of A increases from 1 to 2, requiring two rotations
      • Rotate C and B counterclockwise so that B becomes the left subtree of C. If C already has a left subtree, then C’s left subtree becomes B’s right subtree
      • So this is going to be the case for LL, so if YOU rotate C and A clockwise, A becomes the right subtree of C, and if C had A right subtree, then the right subtree of C becomes the left subtree of A

  • RL type
    • The node is inserted into the left subtree C of the right sub-root node B of A, and the balance factor of A changes from -1 to -2, requiring two rotations
    • I rotate C and B so that B becomes the right subtree of C, and if C had a right subtree before, make it the left subtree of B
    • I rotate C and A so that A becomes the left subtree of C, and if C already has A left subtree, make it the right subtree of A

3.3 B tree

The lookup method described earlier applies to small files stored in computer memory and is collectively known as the internal lookup method. These methods are not applicable if the file is large and stored in external memory for lookup. Internal search is based on node as a unit to search, so it needs to repeatedly exchange internal and external storage, is very time-consuming.

B-tree, also known as B-tree, is a balanced multi-fork tree suitable for external search, directory management in disk management system, and index organization in database system mostly use b-tree this data structure.

3.3.1 Definition of B-tree

  • A B-tree of order m is either an empty tree or an M-fork tree satisfying the following properties
    • Each node in the tree has at most m subtrees
    • If the root is not a leaf, there are at least two subtrees
    • All non-terminal nodes except the root have at least the entire subtree m/2 up
    • All leaf nodes appear at the same level without information, which is usually called failed nodes. The failed nodes do not exist and the Pointers to these nodes are empty. The purpose of introducing failed nodes is to facilitate the analysis of the searching ability of B-tree
    • All non-terminal nodes have at most M-1 keywords, and each node contains the number of keywords of the current node, keywords, and the number of keywords +1 pointer. In order to record the parent node, the pointer field of the parent node will also be added
      • Multiple keywords are arranged in order of size

      • All the keywords in the subtree pointing to the left pointer of the keyword are smaller than the keyword, and all the keywords in the subtree pointing to the right pointer of the keyword are greater than the keyword

  • B tree
    • Balance: Each leaf node is at the same level
    • Ordered: The keywords in each node of the tree are ordered, and the keywords in the left subtree of key word K1 are all smaller than K1, while the keywords in the right subtree are all larger than K1
    • Multipath: some nodes have one keyword and two subtrees, some nodes have two keywords and three subtrees

3.3.2 B tree Search

Search the key word in order in the node, and then go to the next node to search the key word in order after the location is determined. In this way, traverse the single node layer by layer, and finally find the key word, that is, find the leaf node, that is, not found

3.3.3 B-tree Insertion

B tree is a balanced multi-fork tree. If it is an M-fork tree, all other nodes except the root node must have at least m/2 rounded up subtrees. That is, outside the root node, every node must have at least m/ 2-1 rounded up -1 keyword, and at most M-1 keyword. For example, in a 3-fork tree, except the root node, all nodes must contain at least one keyword and at most two keywords

  • To insert a node is to add a keyword to a non-terminal node at the lowest level
    • If the number of keywords does not exceed m-1, the insertion is complete
    • More than 1 m – a keyword, it is necessary to split, the nodes in the same layer is divided into two, with intermediate keyword as boundaries, the nodes in two, the key is inserted into the middle on the parent node, if the parent node is full, split the parent node in the same way, the worst has been divided to the root node, the height of the tree plus one

3.3.4 Deleting a B tree

Delete operation is to delete a keyword and its adjacent pointer. After deletion, the tree should still meet the definition of B tree. * If the keyword of the node after deletion is less than m/2 rounded up – 1, then it needs to merge * Delete pointer * If the pointer to be deleted is null, If the pointer is to a subtree, you need to…

3.4 B + tree

B+ trees are a variant of B trees and are more suitable for file indexing systems

  • The difference between B+ tree and B tree

    • There are n subtrees that contain n keywords
    • All leaf nodes contain information about all keywords and Pointers to records containing these keywords. Leaf nodes themselves are connected sequentially depending on the size of the keywords
    • All non-terminal nodes can be considered as index parts and contain only the largest or smallest keywords in the child root node
    • B+ trees have two head Pointers, one to the root node and the other to the node where the minimum keyword is located, so there are two ways to search, one is to search sequentially from the minimum keyword, and the other is to search randomly from the root node
  • B+ tree search: if the key on a non-terminal node is equal to a given value, the search continues until the leaf node is found

  • Insertion of B+ tree: insertion is only performed on leaf nodes. When the number of keywords exceeds M, splitting is required. After splitting, the maximum keywords of two nodes are added to the parent node

  • B+ tree deletion: delete only among leaf nodes. After the maximum keyword in leaf nodes is deleted, a replacement keyword needs to be selected in non-terminal nodes. If the keyword is deleted so that the number of keywords in the node is less than m/2, the whole keyword needs to be merged

Hash table lookup

The previous linear table search, tree table search are based on keyword comparison, search time and the length of the table, when a lot of records in the table, the search needs a lot of keyword comparison, search speed will be slow. If you can establish some kind of direct relationship between the location of an element and its keywords, you can do lookups without having to compare it, or with a small number of comparisons.

The method of finding records by keywords and corresponding relations is called Hash Search. It directly calculates the address of elements by performing some operation on the keywords of elements without repeated comparison. Therefore, Hash Search is also called Hash or sequence method.

  • Hash function and hash address: establish a definite correspondence H between the record storage location P and the key, so that P =H(key), call this correspondence H hash function, p hash address
  • Hash table: A finite contiguous address space used to store the data discipline of the corresponding hash address computed by the hash function, usually a one-dimensional array with the hash address as the index of the array
  • Conflict: It is possible to reach the same hash address for different keywords. These two keywords are synonyms of each other

4.1 Hash function construction method

  • There are many ways to construct a hash function. Generally, the following factors should be considered in a case-by-case analysis
    • The length of the hash table
    • The length of the keyword
    • Distribution of keywords
    • The time required to compute the hash function
    • Record the frequency of searches
  • A good hash function has the following characteristics
    • The function calculation should be simple, and each keyword should have only one hash address corresponding to it
    • The range of functions should be within the length of the table, and the calculated hash addresses should be evenly distributed to minimize collisions
  • A constructor
    • Number analysis: know the keyword set in advance. The number of digits of each keyword is more than the number of address digits of hash table. Each keyword consists of N digits, such as K1K2… Kn, several bits with relatively uniform distribution can be extracted from keywords as hash addresses
    • Square method: select the hash function may not know all the keywords, take several bits as the hash address may not be appropriate. The middle number of digits after the square of a number is related to each digit of the number. The middle number of digits can be used as a hash address or a combination of them, which is also random
    • Folding method: Divides the keyword into several numbers with the same number of digits, and takes the superposition and carry of these numbers as the hash address
      • Shift stack: read each split number in positive order
      • Boundary overlay: S-type reads each split number
    • Remainder method: Set the length of the hash table to M, select a number p not larger than M, and use P to remove the keyword. The remainder after the division is the hash address, that is, H(key) = key % p. Appropriate p should be selected. Generally, P is the largest prime number smaller than the length of the table. This method is the most common way to construct hash functions.

4.2 Methods for Handling conflicts

In practice, it is difficult to avoid sending collisions completely, so choosing an effective method is a key problem of hashing.

The way conflicts are handled has to do with how the hash table itself is organized. According to the different organizational forms, it is usually divided into two categories: open address method and chain address method

4.2.1 Open address method

The records are stored in the hash table array. When the initial hash address of a record key H0 = H(key) conflicts, another address H1 is calculated based on H0 using appropriate methods. If H1 still conflicts, H2 is calculated based on H1 until the position without conflict is found.

Hi = (H(key) + di) % m, I = 1,2… ,k(k <= m – 1)

  • Linear detection methodWhen a conflict occurs, the table is searched from the next cell of the conflicting address in sequence. If the last location is not found, the table is searched back
    • Di = 1, 2, 3, 4,.. ,m – 1
    • As long as the linear list is not filled up, we can find the position
    • A second aggregation occurs, with two numbers with different first hash addresses competing for the same subsequent hash address
  • Secondary detection method:
    • Di = 1 ^ 2, and 1 ^ ^ 2, 2, 2-2 ^ 2,… ,+k^2,-k^2 (k <= m/2)
    • Secondary clustering can be avoided, but not necessarily found
  • Pseudo random detection method:
    • Di = sequence of pseudo-random numbers
    • Secondary clustering can be avoided, but not necessarily found

4.2.2 Chain address method

The records with the same hash address are put into a single linked list, which is called a synonym linked list. If there are M hash addresses, there are m single linked lists. The head pointer of the single linked list is stored in a one-dimensional array