preface

You can do very fast lookups using hash tables. But what exactly is a hash table? A lot of people don’t talk about it, although they know that they use it a lot, and many of the built-in data structures in languages like dictionaries in Python and Hashmaps in Java are based on hash tables. But what exactly is a hash table?

What is a hash?

Hashing is the processing of data in computer science by associating the items to be retrieved by a particular function/algorithm (called a hash function/algorithm) with the index to be retrieved (called a hash, or hash value) to generate a searchable data structure (called a hash table). Also translated as hash. Old translation hash (transliteration for a person’s name) It is also commonly used as an implementation method for information security. Data fingerprints calculated from Hashing algorithms in a string of data are often used to identify whether files and data have been tampered to ensure that the files and data are indeed provided by the originator. —-Wikipedia

The hash function

All hash functions have the basic property that if two hash values are different (according to the same function), then the original input of the two hash values is also different. This property is a consequence of the deterministic nature of the hash function, which is called a unidirectional hash function.

Hash table

  • If the keyword is k, the value is stored in f(k). Thus, records can be obtained directly without comparison. The corresponding relation f is called the hash function, and the table created according to this idea is called the hash table.

  • The same hash address may be obtained for different keywords, that is, K1 ≠ K2 and F (k1)=f(k2). This phenomenon is called conflict. Keywords with the same function value are called synonyms for the hash function. To sum up, according to the hash function f (k) and the methods of dealing with conflict to map a set of keywords to a limited set of contiguous address (interval), and take the keyword in the address of “like” as recorded in the storage location in the table, the table is called a hash table, this mapping process called hash tabulation or hash, the storage location of said hash address.

  • If the probability of any keyword in the keyword set being mapped to any address in the address set is equal through the Hash function, this kind of Hash function is called Uniform Hash function. This is to make the keyword get a “random address” through the Hash function, so as to reduce conflicts.

Create a hash table

So basically, a hash table is a table that has mappings that you can use to find values by keys. Are there any examples? Sure, but it’s not fun if you just use it.

So we’re going to implement f(k), key-value mapping. Let’s try to implement this ourselves:

class Map:
    def __init__(self):
        self.items=[]

    
    def put(self,k,v):
        self.items.append((k,v))
    

    def get(self,k):
        for key,value in self.items:
            if(k==key):
                return value
Copy the code

In this way, the search time of the Map is O(n). “This is so simple that it doesn’t seem to have anything to do with key. This is sequential lookup. Are you kidding me?” This is just a warm-up, ok, so let’s do a mapping function by definition:

class Map:
    def __init__(self):
        self.items=[None] *100
    
    def hash(self,a):
        return a*1+0
    
    def put(self,k,v):
        self.items[hash(k)]=v

    def get(self,k):
        hashcode=hash(k)
        return self.items[hashcode]
Copy the code

“This hash function is a little bit simple.” Yes, it is simple, but simplicity doesn’t prevent it from being a hash function. In fact, it’s called direct addressing, and it’s a linear function: hash(k)= a*k+b

“Why is capacity 100 specified for initialization?” It’s important to point out that this is a must. You want to store and access by subscript, which is inevitable with arrays. In the JDK source code, you can also see that Java’s HashMap has an initial capacity of 16. You might say, well, your hash function, if I just set the key above 100, the program is dead. Yeah, it’s not perfect. This involves expansion, but I’ll talk about that later.

The obvious advantage of direct addressing is that it does not produce duplicate hash values. However, because it is related to the key value itself, when the key value is scattered, a large amount of storage space can be wasted. So you don’t usually use direct addressing.

To deal with conflict

What if a hash function produces a bunch of hashes, and those hashes conflict (which happens a lot in real production)? Handling collisions is a necessary step in any hash table implementation. For instance you defined a hash function: hash (k) = k mod 10 assumes that the key sequence is:,1,24,32,55,64,42,93,82,76 [15]

0 1 2 3 4 5 6 7 8 9
1 32 93 24 15 76
42 64 55
82

Down the road, there are four conflicting elements, and here are a few ways.

Open addressing

Open addressing is the search for the next free space after the conflict. Function is defined as:

Where hash(key) is the hash function, di is the incremental sequence, and I is the number of conflicts.

  • Linear detection method

Di is equal to I, or some other linear function. This is equivalent to probing the table of addresses one by one until an empty cell is found and then placed in that cell.

[15,1,24,32,55,64,42,93,82,76]

As you can see, there is no conflict until 55:

0 1 2 3 4 5 6 7 8 9
1 32 24 15

At this time, 55 is inserted, which conflicts with 15, and linear detection is applied. At this time, I =1, it can be obtained:

0 1 2 3 4 5 6 7 8 9
1 32 24 15 55

Insert 64 again, so I =3:

0 1 2 3 4 5 6 7 8 9
1 32 24 15 55 64

Insert 42, I =1:

0 1 2 3 4 5 6 7 8 9
1 32 42 24 15 55 64

Insert 93, I =5:

0 1 2 3 4 5 6 7 8 9
1 32 42 24 15 55 64 93

Insert 82, I =7:

0 1 2 3 4 5 6 7 8 9
1 32 42 24 15 55 64 93 82

Insert 76, I =4:

0 1 2 3 4 5 6 7 8 9
76 1 32 42 24 15 55 64 93 82

I found the conflict getting more and more bizarre. Therefore, the size selection of the table is also very important. In this example, 10 is chosen as the size of the table, which is prone to conflicts. In general, the more prime the number, the more evenly the mod mod is likely to be distributed.

  • Square detection

This is called squaring, and it’s the same thing, finding an empty cell and putting it in. I’m not going to do it step by step here. =

  • Pseudo-random detectiondiIs a random sequence of numbers. “Random number? What about getting? Is also a random number ah, how to ensure consistent?” So it’s a pseudo-random number.In fact, almost all the numbers we deal with in computers are pseudorandom, and as long as they are generated by deterministic algorithms, they are pseudorandom. As long as the seed is identified, the resulting sequence is the same. The sequence is the same, so that’s ok. =

The linked list method

This is another type of conflict resolution, where you hash elements at the same location, and instead of going down, you have a list at that location, and all of those elements are going to be on that list. Java’s HashMap uses this.

And then hash

If one isn’t enough, try again until the conflict doesn’t happen again.

Establish public overflow areas

The hash table is divided into two parts: the base table and the overflow table. Any element that conflicts with the base table is filled into the overflow table.

Saying so, for example, with the open address method (linear detection) :

class Map:
    def __init__(self):
        self.hash_table=[[None.None]for i in range(11)]
    
    def hash(self,k,i):
        h_value=(k+i)%11
        if self.hash_table[h_value][0]==k:
            return h_value
        if self.hash_table[h_value][0]! =None:
            i+=1
            h_value=self.hash(k,i)
        return h_value
 
    def put(self,k,v):
        hash_v=self.hash(k,0)
        self.hash_table[hash_v][0]=k
        self.hash_table[hash_v][1]=v

    def get(self,k):
        hash_v=self.hash(k,0)
        return self.hash_table[hash_v][1]
Copy the code

“Can we not fix the length of death? Eleven is not enough.”

That’s the question, so there’s another concept, which is called a load factor. The load factor is defined as: α= number of existing elements/length of table

Since the table length is a constant value, α is proportional to “the number of elements in the table”. Therefore, the larger α is, the more elements in the table are filled, the more likely conflicts will occur. On the other hand, a smaller α indicates that fewer elements are filled into the table and less conflict is likely to occur. In fact, the average lookup length of the hash table is a function of the load factor α, but different methods of handling conflicts have different functions.

So when the table reaches a certain point, the length of the table will change, namely, resize=. = Like Java’s HashMap, the payload factor is designed to be 0.75; Above 0.8, the CPU’s cache missing increases dramatically. May have a look this discussion: www.zhihu.com/question/22…

The size of the expansion is usually twice the number of inserted elements, which is what Java does.

Then above, update our map again:

class Map:
    def __init__(self):
        self.capacity=11
        self.hash_table=[[None.None]for i in range(self.capacity)]
        self.num=0
        self.load_factor=0.75
    
    def hash(self,k,i):
        h_value=(k+i)%self.capacity
        if self.hash_table[h_value][0]==k:
            return h_value
        if self.hash_table[h_value][0]! =None:
            i+=1
            h_value=self.hash(k,i)
        return h_value

    def resize(self):
        self.capacity=self.num*2 Double the number of elements
        temp=self.hash_table[:]
        self.hash_table=[[None.None]for i in range(self.capacity)] 
        for i in temp:
            if(i[0]! =None) :# Save the existing elements
                hash_v=self.hash(i[0].0)
                self.hash_table[hash_v][0]=i[0]
                self.hash_table[hash_v][1]=i[1]
 
    def put(self,k,v):
        hash_v=self.hash(k,0)
        self.hash_table[hash_v][0]=k
        self.hash_table[hash_v][1]=v
        self.num+=1                 # do not consider the case of duplicate key, specific can be optimized
        if(self.num/len(self.hash_table)>self.load_factor):# If the ratio is greater than the load factor
            self.resize()

    def get(self,k):
        hash_v=self.hash(k,0)
        return self.hash_table[hash_v][1]
Copy the code

If you look at the function above, you can see that resize is a time-consuming operation, because it is only a principle teaching, so there is no strange skill in it. Take a look at the Hash and resize methods of Java’s HashMap, as well as the design for handling collisions (JDK8 and later HashMaps use red-black trees).

I think we’re pretty much done with hashing tables. As you can see, it is essentially to solve the problem of finding time. If the search is sequential, the time complexity is O(n); A hash table, on the other hand, is O(1)! This is why hash tables are used in large data stores for lookups.