We’ve talked about linked lists, stacks and queues, and today we’re going to talk about a new data structure, hash, which is a very popular data structure. So let’s get this hash table straight together. The framework of the article is as follows

Before we talk about hash tables, let’s imagine the following scenario.

Yuen hutch through back to ancient times, with learning from modern cooking craft, opened a yuen restaurant, during the early opening, the store business is very hot, but the customer checkout is puzzled, whenever the invoicing, wife of shop-owner always according to the menu one by one to find the price (traversal search), each time for half a day, so the check place always queued up to, Customers say the user experience is not great. Yuan Chu thought this is not a way ah, let the customer always waiting, too affect the customer experience. So Yuan Kitchen will first put the menu in alphabetical order (binary search), and then search according to the initial search, so when the checkout can greatly improve the efficiency of retrieval! But? There are not many customers on weekdays, the owner’s wife can fully cope with it, but on holidays, there will be long lines. So is there a better way? That’s right! Can’t we just learn all the prices by heart? We know the price of each dish very well, and when we pay, we simply add it up. So Yuan Hutch and the owner’s wife work overtime to recite. The next time we check out, we’ll know the price right away. Since then, there has never been a long queue at the cashier’s desk, yuan’s restaurant opened accidentally became the first restaurant in the world.

Let’s take a look at the evolution history of the wife of the owner of yuan Ji Restaurant.

The process of late checkout above simulates our hash table search, so how to use the search in the computer?

Hash table lookup steps

Hash table ——- is one of the most useful basic data structures. Hash table is a data structure that accesses directly according to the value of the key code. Implementation of hash table is often called hasing. Hashing is a technique for performing inserts, deletes, and lookups in constant mean time. Let’s take a look at the hashing process.

Our entire hashing process is divided into two main steps

(1) Calculate the hash address of the record through the hash function, and store the record according to the hash address. It’s like spicy fish we put it in Sichuan, sweet and sour fish we put it in Shandong. But it’s important to note that whatever record we have we need to use the same hash function to compute the address and store it.

(2) When we search, we use the same hash function to calculate the hash address of the record and access the record at this hash address. Because we’re using a hash function to save and get, we’re going to get the same result.

We talked about a hash function in the hash process, so what is a hash function?

Let’s say that some function is f, so that

Storage location = F (keyword)

Input: Keyword Output: Storage location (hash address)

That way we can find the location of the desired record without comparing it by looking up the keyword. This storage technique is called hashing. Hash technology is to establish a definite corresponding relation F between the storage location of the record and its key, so that each key corresponds to a storage location F (key). See below

F here is what we call a hash function. We use hash technology to store records in a contiguous storage space, which is the protagonist of this article —— hash table

In the figure above, we have a hash function to map keywords to the hash table, but have you considered the case that the keywords are mapped to the same slot, which is f(k4) = f(k3). This is what we call a conflict, and k3 and k4 are called synonyms for the hash function F, and if this happens, it will let us look for errors. Fortunately, we can find effective ways to resolve conflicts.

The first thing we can do with hash functions is we can carefully design them to have as few collisions as possible, so we should follow these rules when we create hash functions

(1) It must be consistent, assuming that when you type in chili chicken, you get watching, then every time you type in chili chicken, you must also get watching. If not, hash tables are useless.

(2) The calculation is simple. Suppose we design an algorithm to ensure that all the keywords will not conflict, but the algorithm is complicated and will take a lot of time. In this case, the search efficiency will be greatly reduced, but the gain is not worth the loss. So our hash function should not take longer to compute than other lookup techniques compared to keywords, otherwise why not use other lookup techniques?

(3) Evenly distributed hash addresses We have just talked about the problems caused by conflicts, so the best solution is to distribute hash addresses as evenly as possible in the storage space. This ensures efficient use of space and reduces the amount of time spent dealing with conflicts.

Now that we know about hash tables, hash functions, etc., let’s look at some common rules for constructing hash functions. What these methods have in common is that they all change the original number into another number according to certain rules.

Hash function constructor

Direct addressing

If we design a hash table for dishes whose profits are 0-9, we can directly use it as the address, then f(key) = key;

So this is the case.

If you feel familiar with the above diagram, yes, the array we often use is actually a hash table, the key is the index index of the array, and then we directly access the elements of the array through the index index.

In addition, let’s assume that the cost of each dish is 50 yuan, then we can also use profit + cost as the address, then f(key) = key + 50. In other words, we can hash the address based on the value of the linear function.

F of key is equal to a key plus b a and b are constants

Advantages: simple, uniform, no conflict.

Application scenario: You need to know the distribution of keywords in advance. This method is suitable for small and continuous search tables

Numerical analysis method

The method is also a very simple method, is to analyze our keywords, take a paragraph, or its displacement, superposition, as the address. For example, if the first 6 digits of our student number are the same, but the last 3 digits are different, we can use the student number as the key, and the last 3 digits as our hash address. If we are still prone to conflict, we can extract the number and then process it. Our goal is to provide a hash function that allocates the keyword to each position in the hash table. Here we have a new approach, extraction, which is often used in hash functions.

Advantages: simple, uniform, suitable for the large number of keywords

Application scenario: The keyword bits are large, the distribution of the keyword is known, and several bits of the keyword are even

Folding method

In fact, this method is also very simple, is to process our keyword and then used as our hash address, the main idea is to divide the keyword from left to right into several parts of the same number of digits, and then add up, and according to the length of the hash table, take the last few bits as the hash address.

For example, if our keyword is 123456789, we divide it into three parts 123, 456, 789 and add them up to 1368. Then we take the next three digits, 368, as our hash address.

Advantages: You do not need to know the keyword situation in advance

Application scenario: Suitable for a large number of keywords

Division and hash

In the division hash method used to design the hash function, the key is mapped to one of the P slots by taking the remainder of the key divided by P. For the hash function of length M, the formula is

f(k) = k mod p (p <= m)

For example, if the hash table length is 12, that is, m = 12, and our parameter p is also set to 12, then f(k) = 100%12 = 4 when k = 100

Since you only need to do a division operation once, division hashing is very fast.

As can be seen from the above formula, the key point of this method is the value of P. If the p value is not well chosen, synonyms may easily be generated. See the situation below. We have a hash table of length 6, and if we choose 6 as the p value, it is possible that all the keywords get the address number 0.

So what are the rules that we should follow when we use division and hash to pick p values?

  • M should not be a power of 2, because if m is 2 to the p, then f of k is the lowest p number of k. Example 12%8 = 4, the binary representation of 12 is 1100, and the last three bits are 100.
  • If the hash table length is m, p is usually the smallest prime number less than or equal to the table length (preferably close to m) or a composite number that does not contain less than 20 prime factors.

Composite number: A composite number is a number of integers greater than 1 that is divisible by other numbers (except 0) besides 1 and itself.

Prime factor: A prime factor (or prime factor) in number theory is a prime number that divisible exactly into a given positive integer.

The 2, 3, and 5 here are prime factors

Again, in the example above, we choose 5 as the p value according to the rule, and let’s look again. Now we see that there’s only a conflict between 6 and 36, which is a lot better.

Advantages: High calculation efficiency, flexible

Application scenario: The keyword distribution is not known

Multiplication and hash

The multiplicative hash method for constructing hash functions consists of two main steps

  • Multiply the keyword k by the constant A(0 < A < 1) and extract the fractional part of k A
  • I’m going to multiply m times that, and I’m going to round it down

The hash function is zero

F (k) = ⌊ m(kA mod 1) ⌋

KA mod 1 here means taking the decimal part of keyA, i.e. KA – ⌊kA⌋.

Advantages: the choice of m is not particularly critical, generally choose it as a power of 2 (m = 2 ^ p,p is some integer)

Application scenario: Do not know the keyword

Square the middle

If the key is 321, its square is 103041, and the middle three bits are 030 or 304 for the hash address. For example, if the keyword is 1234, its square is 1522756, and the middle three bits are 227 as the hash address.

Advantages: flexible, wide range of application

Application scenario: The keyword distribution is unknown and the number of bits is not large.

Random number method

Therefore, the random function value of the keyword is its hash address. F of key is equal to random of key. Random here is a random function.

Advantages: Easy to implement

Application scenario: The keyword length varies

Can a string be used as a key? Of course, it is also possible, all kinds of symbols we can convert into some kind of number to treat, such as we often contact with ASCII code, so it is equally applicable.

The above are commonly used hash function construction methods, in fact, their central idea is the same, the keyword after processing into another number, and this number is our storage location, is there a sense of spy passing information.

A good hash function can help us avoid collisions as little as possible, but it can’t avoid them completely. So what should we do when we encounter conflicts? Here are some common ways to handle hash collisions.

A method to handle hash conflicts

F (key1) = f(key2); f(key1) = f(key2) No hurry. Let’s take our time and look down.

Open address method

Before we get into open address, consider the following scenario.

Yuan Ji restaurant, bell bell bell, bell bell bell telephone ring

Dapeng: Lao Yuan, book a private room for me. I will take some clients to your place to talk business today.

Yuan Hutch: dapeng, you commonly used that pack room was ordered by the person.

Dapeng: Lao Yuan, you don’t defend your righteousness. Why don’t you keep me here? Then find me an empty room.

Yuan Kitchen: Good brother

Oh, there are no telephones in ancient times, so I guess I’ll have to take a few cell phones with me.

The above scenario is actually a way to handle conflicts —– open address method

Open address method is in the event of conflict, go looking for an empty hash address, as long as the list is enough big, can always find an empty hash address, and will record, in order to use the open addressing method to insert an element that needs to continuously check the hash table, or called probe, we commonly used linear detection, detection, random detection.

Linear detection method

Let’s first look at linear detection, the formula:

F, (key) = (di) (key) + f (di MOD m = 6… M – 1)

F (key) = key mod 12; f(key) = key mod 12; f(key) = key mod 12

So let’s figure out what f of key is for each key, as shown in the table below

If f(37) = F (25) = 1, then we need to use the above formula F (37) = f(37) + 1 mod 12 = 2. This is actually how we book private rooms. So let’s see what happens when we put all of these numbers into a hash table.

We call this open address approach to conflict resolution linear detection. Let’s simulate the stored procedure of linear detection by video.

In addition, when we resolve conflicts, 48 and 37 compete for the same address although they are not synonyms, which is called accumulation. Because stacking makes us constantly dealing with conflicts, insertion and lookup efficiency are greatly reduced.

So now that we’ve seen how linear detection works, let’s think about the situation, what happens if our last digit is 34 instead of 21?

So if we continue to use linear detection, we need to get the result through constant mod. Small data is fine, but if it is large, it is too slow, but it is clear that there is an empty room in front of it. If we move forward, we only need to move once. Don’t worry, our predecessors have already worked out a solution for us

Secondary detection method

And actually, after understanding our last example, this is pretty straightforward, it’s a no-brainer, but what I did was I changed the di

Linear detection: f, (key) = (di) (key) + f (di MOD m = 6… M – 1)

Second detection: f, (key) = (f (key) + di) MOD m (di = 1 ^ 2, and 1 ^ 2, 2 ^ 2, 2 ^ 2… Q ^ 2, – q ^ 2, q < = m / 2)

Note: this is negative 1^2 instead of (-1)^2

So for 34, when di is negative 1, we can find the empty position.

The purpose of the secondary detection method is to keep keywords from clustering in a certain area. Another interesting way to do this is to calculate the displacement using a random function, so let’s see.

Random detection method

As you can see, this is a new problem, as we said in the first rule for the construction of hash functions

(1) It must be consistent, assuming that when you type in chili chicken, you get watching, then every time you type in chili chicken, you must also get watching. If not, hash tables are useless.

Yi? We use a random number as the offset, so we can’t find it when we search. Because we di is randomly generated, the random here is actually a pseudo random number, pseudo random number means, we set a random seed, same is constantly call random function can generate will not repeat sequence, we are looking for, use the same random seed, every time it get the sequence is the same, then the same di can get the same hash address.

Random Seed is a computer term that takes Random numbers as objects and takes true Random numbers (seeds) as initial conditions. The random number of the general computer is pseudo random number, with a true random number (seed) as the initial condition, and then use a certain algorithm to continuously generate random number

We can use a random number as its offset. The same random seed will get the same sequence of numbers every time.

Let’s take a look at how other functions handle hash collisions

Then the hash method

This method is also very simple, using different hash functions to find a hash address, until there is no conflict.

F,(key) = RH,(key) (I = 1,2,3,4….. k)

So RH here, which is just different hash functions, you can use all of those hash functions that we talked about before, and every time there’s a conflict, you can change the hash function, and trust that there’s always one that can resolve the conflict. This method eliminates keyword aggregation at the cost of increased computation time. Isn’t that easy?

Chain address method

Now let’s imagine the following scenario.

Yuan Ji restaurant, bell bell bell, bell bell bell telephone ring again, that dapeng again to book a room.

Dapeng: Lao Yuan ah, I will go to your place to have a meal later, or last time that private room

Yuan Chui: dapeng you next time can not say early, and no one ordered away, this time is Lao Wang ordered

Dapeng: Lao Wang is an old guy. Anyway, he is also an acquaintance. Please give me the whole table and I will put it behind him

Sorry, dear students, carrier pigeons are too expensive to buy. The above scenario simulates our new conflict-handling method, chain address method.

Up there, we’ve been in conflict, and we’ve moved. So is there any way we can stay where we are? That’s what we now call chained address method.

Remember when we talked about synonyms? If the key is different and the f(key) is the same, we store these synonyms in a single linked list, which is called a synonym subtable, and the hash table stores only the head pointer of the synonym subtable. **f(key) = key mod 12 **f(key) = key mod 12 ** When we use the chain address method, there are no more conflicts, no matter how many conflicts, we just add nodes to the synonym subtable. Now let’s look at the chain address method of storage.

The linked address method can avoid conflicts, but it also brings the performance cost of traversing a single linked list.

Public overflow zone method

Now let’s look at a new method. This time, Dapeng will come to dinner again.

Yuan Ji Restaurant…..

Yuan Kitchen: yo, this is what wind to blow you to come, zha didn’t open your big rush.

Dapeng: Oh mama ah, don’t talk so much nonsense, I am starving, you quickly find a place for me, I want to eat some rice.

Yuan Kitchen: you come, too bad, our shop is full, you go to the cabin next to watch TV, and I will call you when I have time. There are a few others in the cabin who are just as late as you.

Dapeng: TV? Watching TV?

The above scenario is to simulate our public overflow area method, which is also very well understood, you are not conflict? All of you in conflict let me get you a place to stay, so you’ll have a place to stay. We created a public overflow area for all the conflicting keywords.

So how do we do that? We first calculate the hash address through the hash function, before the basic table comparison, if not equal to the overflow table to search in order. This method of conflict resolution is very high performance with very few conflicts.

Hash table lookup algorithm (linear detection method)

Let’s look at the implementation of the hash table lookup algorithm

First need to define hash list structure and some related constants, the elem on behalf of the hash table data storage arrays, the count is representative of the current insert element number, size, on behalf of the hash table capacity NULLKEY hash initial value, then if we find the successful return to the index, if there is no element of the element is returned does not exist.

We initialize the hash table and assign initial values to the array elements.

The specific steps of the insert operation:

(1) Through the hash function (division and hash method), the key is converted to the array subscript

(2) If there is no element in the subscript, it will be inserted; otherwise, there is a conflict, and linear detection method will be used to deal with the conflict. See the notes for detailed steps

The specific steps of the search operation:

(1) Convert the key to an array subscript using a hash function (as with insertion)

(2) Find the key value by array subscript. If the key is consistent, the search is successful; otherwise, use linear detection method to continue the search.

Let’s take a look at the complete code

Hash table performance analysis

Hash lookups are the most efficient of our lookups with O(1) time if there are no conflicts, but no conflicts is ideal, so what does the average length of hash lookups depend on?

1. Whether the hash function is uniform

We said above that we can reduce collisions by designing hash functions, but since different hash functions are equally likely to collide with a set of keywords, we can ignore its effect on the average search length.

2. Methods to deal with conflicts

The same keyword, the same hash function, and different ways of dealing with conflicts will make the average search length different. For example, linear detection will sometimes accumulate, but it is not as good as quadratic detection, because the chain address method will not produce any accumulation when dealing with conflicts, so it has the best average search performance

3. Hash table loading factor

I wanted to mention the loading factor in the previous article, but it turns out that the absence of a description does not affect our understanding of hash tables, so let’s take a look at the summary of the loading factor

Loading factor α = number of records filled in the table/hash table length

The hash factor represents the degree to which the hash table is filled. The more records in the table, the greater the α and the greater the probability of conflicts. In the example we mentioned above, the length of the table is 12, and the number of entries is 6, then the α = 6/12 = 0.5, so when we fill in the element when our α is large, the possibility of conflict is very high. So the average lookup length of a hash table depends on the loading factor, not on the number of records. So all we need to do is choose the right loading factor to limit the average search length to a range.

If you can feel that this article is very carefully written, can bring you a drop of help, can you please give this article a thumbs-up? This gives me a huge incentive to write.

In addition, if you need the GIF analysis of other selected algorithm questions, you can follow yuan Chu’s algorithm Hut on wechat. I am Yuan Chu, a programmer who loves cooking and has obtained the chef’s certificate. I will always write with my heart, thank you for your support!