This is the 15th day of my participation in the August More Text Challenge. For details, see:August is more challenging

For frequently used lookup tables, you want the ASL = 0 record to have a defined relationship between the position in the table and its keywords

HASH

define

  • According to the set hash function H(key) and the selected method of handling conflicts, a set of keywords is mapped to a finite address set (interval) with continuous address, and the “image” of keywords in the address set is used as the corresponding storage location in the table. The lookup table constructed in this way is called “hash table”.

Construction of the HASH function

  • Structure principle
    • The function itself is easy to compute
    • The calculated addresses are evenly distributed, that is, the probability of f(k) corresponding to different addresses is equal for any keyword K, in order to reduce the conflict as much as possible

Direct addressing

  • The hash function is a linear function of the keyword

    • H(key) = key
    • H(key) = a * key + b
  • This method is only suitable for: address set size = = keyword set size

  • Advantage: The hash address is the value of a linear function of the key

  • Disadvantages: To occupy continuous address space, low space efficiency

Numerical analysis

  • Assume that each keyword in the keyword set is composed of s-bit numbers (U1, U2,… , us), analyze the whole of the keyword set and extract a number of evenly distributed bits or their combination as addresses
  • This method is only suitable for preestimating the frequency of various numbers on each digit of the whole keyword


Middle square

  • The middle bits of the square value of the keyword are used as the storage address. The purpose of “square the keyword” is to “widen the difference,” while the middle of the square can be influenced by the entire keyword
  • This method is suitable for the phenomenon that each digit in the keyword has a high frequency of repeating some numbers

Folding method

  • Divide the keyword into parts and take the sum of them as the hash address. There are two methods of stacking: shift stacking and interbound stacking
  • This method is suitable for:

Keywords have a lot of digits


Except for the remainder method

  • Hash(key)=key mod p (p is an integer)
    • P ≤ M (table length)
    • P should be the largest prime number less than or equal to m

Why do you want to put a limit on P?

  • Given a set of keywords: 12, 39, 18, 24, 33, 21 if p=9, their corresponding hash function value will be:

Three, three, zero, six, six, three

  • It can be seen that if p contains the prime factor 3, all keywords containing the prime factor 3 are mapped to the address of “multiples of 3”, thus increasing the possibility of “conflict”

Random number method

  • H(key) = Random(key) (Random is pseudo-random function)
  • This method is used to construct hash functions for keywords of unequal length

consideration

  1. Execution speed (that is, the time required to calculate the hash function)
  2. The length of the keyword
  3. Size of the hash table
  4. Keyword distribution
  5. To find the frequency

The way you construct a hash function depends on the set of keywords you’re building the table and the principle is to keep the possibility of collisions as small as possible

How to handle conflicts

1. Open addressing


The basic idea

  • If there is a conflict, look for the next empty hash address. As long as the hash table is large enough, the empty hash address will always be found and the data elements will be stored

Linear detection method

  • Hi=(Hash(key)+di) mod m (1≤ I < m) M – 1, and di = I

In case of conflict, find the next empty address to save

  • Advantage: as long as the hash table is not filled, it is guaranteed to find an empty address cell for the conflicting elements
  • Disadvantages: can store the synonym of the ith hash address in the I +1 address, so that the element that should be stored in the I +1 hash address becomes the synonym of the I +2 hash address… , resulting in the “aggregation” phenomenon, reducing the search efficiency

Secondary detection method

Di = 12, -12, 22, -22… Plus or minus k2

Pseudorandom detection

Hi=(Hash(key)+di) mod m (1≤ I < m)

Where: m is the hash table length and di is a random number

Open addressing method to create a hash table steps

  • Take the key of the data element and calculate its hash function value (address). If the storage space corresponding to the address is not occupied, the element is stored. Otherwise, perform step2 to resolve the conflict
  • Computes the next storage address for the keyword key based on the selected conflict handling method. If the next storage address is still in use, continue with step2 until a usable storage address is found

Open addressed hash table storage structure

/* ------------- Open address hash table storage structure ------------- */

int hashsize[] = {997. };typedef struct{
	ElemType* elem;
	int count;  // The number of current data elements
	int sizeindex;  // HashSize [sizeIndex] indicates the current capacity
} HashTable;

#define SUCCESS 1
#define UNSUCCESS 0
#define DUPLICATE -1

Status SearchHash(HashTable H, KeyType K, int &p, int &c){
	// Look for records with key K in the open addressing hash table H
	p = Hash(K);  // Get the hash address
	while(H.elem[p].key ! = NULLKEY && !EQ(K, H.elem[p].key))
		collisiion(p, ++c);  // Obtain the next probe address p
	if(EQ(K, H.elem[p].key)) return SUCCESS;  // Return the location p of the data element to be checked
	else return UNSUCCESS;  // The search was not successful
}
Copy the code

2. Then the HASH method


  • H2(key) is another hash function, whose value should be mutually prime with m

3. Chain address method


The basic idea

  • The records of the same hash address are linked into a single list. M hash addresses are set as m single lists, and then m single list header Pointers are stored in an array to form a dynamic structure

Advantages:

  • Non – synonyms do not conflict, no “aggregation” phenomenon
  • Dynamic application of node space on linked list is more suitable for uncertain table length

Hash table lookup

For a given value K, calculate the hash address I = H(K)

  • If r[I] = NULL, the search fails
  • If r[I].key = K, the search is successful; otherwise, “find the next address Hi” until r[Hi] = NULL or r[Hi].key = K

Case v01

  • Linear detection resolves conflicts

Case v02

  • The chain address method handles conflicts

Analysis of hash table lookups

From the lookup process, we know that the average lookup length of a hash table lookup is not actually equal to zero

Factors that determine the ASL of a hash table lookup

  • Hash function of choice
  • The chosen method of conflict management
  • Degree of hash table saturation, loading factor α= size of n/m values (n — number of records, m — length of table)

The larger α is, the more records there are in the table, indicating that the table is full, the greater the possibility of conflict, and the more times of comparison in search

  1. Hash table technology has a good average performance, better than some traditional technology
  2. The chain address method is superior to the open address method
  3. The remainder method is superior to other types of functions for hashing

Examples of hash table applications

The compiler uses hash table to manage identifiers

  • How to construct a hash function
    • Converts each character in the identifier to a non-negative integer
    • Combine the resulting integers into one integer (you can add the first, middle, and last character values together, or you can add all character values together)
    • To adjust the number of results within the range of 0~ m-1, the method of modulus can be used, Ki%M (M is a prime number)