I hope you are not as helpless and pained as CaiCai: in days when the year-end bonus has shrunk to almost nothing, work still has to go on... a true picture of "ice and fire" helplessness.

Remember the user-space system that CaiCai designed not long ago? For those of you who have not seen it, the portal is here: "Design a high-performance visitor log system".

Do you remember the remaining issue? To recap: how do we determine whether the current user already has a record in the cache of user access records? A linked list, although the primary data structure for our business scenario, is not the best fit for this particular problem. We need a data structure that can access elements quickly, and that is the hash table we are going to talk about today.

Hash table

A hash table (also called a hash map) is a data structure accessed directly by key: it maps a key to a location in the table and stores the record there, which speeds up lookups. The mapping function is called a hash function, and the array holding the records is called the hash table. A hash table is, roughly speaking, what we usually call a key-value structure. Hash tables exploit an array's ability to access data randomly by index, so a hash table is an extension of an array, evolved from it; without arrays, so to speak, there would be no hash tables. Why arrays? Because accessing an array element by subscript is O(1) (you can refer to the earlier article on arrays). To access an element by array index, we must also work out how to convert the key into an index, and that conversion is the hash function we are about to discuss.

The hash function

A hash function is, simply put, a black box that converts a key into an array index, and it plays the key role in a hash table. As its name implies, it is a function: we can write it as hash(key), where key is the element's key value and hash(key) is the hash value the function computes. So what are the requirements for a hash function?

  1. The hash function evaluates to a non-negative integer value.
  2. If key1 == key2, then hash(key1) == hash(key2)
  3. If key1 ≠ key2, then hash(key1) ≠ hash(key2)

The hash value must be a non-negative integer (>= 0) because it serves as an array index, and the same key must always produce the same hash value. The first two requirements are easy to meet; the third is really only theoretical. Ideally, different keys would always yield different hash values, but in practice that is hard to achieve: if we hash an effectively unlimited number of keys into a finite array, collisions cannot be avoided. Even hash algorithms well known in the industry, such as MD5 and SHA, cannot completely avoid such conflicts, and the smaller the underlying array, the greater the chance of a collision. So a perfect hash function does not exist, and even if one did, the cost in time and effort might be beyond imagination.
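As a concrete illustration of these requirements, here is a minimal sketch of a hash function for string keys (the class name and the multiply-by-31 scheme are illustrative choices, not anything from the original system): it always returns a non-negative index within the table, and the same key always maps to the same slot.

```csharp
using System;

static class SimpleHash
{
    // Map a string key to a non-negative index in [0, tableSize).
    // Deterministic: the same key always yields the same index (requirement 2).
    public static int Hash(string key, int tableSize)
    {
        int h = 0;
        foreach (char c in key)
            h = (h * 31 + c) % tableSize; // keep the value small and non-negative
        return h;
    }
}
```

Two different keys may still land on the same index, which is exactly the collision problem discussed next.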

Hash collisions

Since no hash function can avoid collisions entirely, we must find other ways to solve the problem.

  1. Open addressing. What do we do when an insertion hits a collision? One approach is to probe from the conflicting position onward for a free slot in the array, and insert the element once one is found. It is like going to a store to buy something that is sold out: you simply try the next store that still has stock. Whatever probing method is used, the probability of collisions rises significantly once few free positions remain in the table. To keep hash table operations efficient, we generally try to ensure a certain proportion of free slots, and we use the load factor to express how full the table is.

Load factor of the hash table = number of filled slots / length of the hash table

The larger the load factor, the fewer free positions, the more collisions, and the worse the hash table performs. Suppose the hash function is f = key % 1000, as shown in the figure below.
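The open-addressing idea above can be sketched as follows, probing linearly for the next free slot and wrapping around at the end of the array; the ProbingTable class and the f = key % capacity function are illustrative assumptions, not code from the original system.

```csharp
using System;

// Minimal open-addressing (linear probing) sketch, assuming int keys
// and the hash function f(key) = key % capacity.
class ProbingTable
{
    private readonly int?[] slots;
    public int Count { get; private set; }

    public ProbingTable(int capacity) => slots = new int?[capacity];

    // Load factor = filled slots / table length
    public double LoadFactor => (double)Count / slots.Length;

    // Insert the key; on a collision, scan forward (wrapping around) for a free slot.
    public bool Insert(int key)
    {
        if (Count == slots.Length) return false; // table full
        int i = key % slots.Length;
        while (slots[i].HasValue)
            i = (i + 1) % slots.Length; // circular probe: continue from the top of the array
        slots[i] = key;
        Count++;
        return true;
    }

    public bool Contains(int key)
    {
        int i = key % slots.Length;
        for (int probes = 0; probes < slots.Length; probes++)
        {
            if (!slots[i].HasValue) return false; // an empty slot ends the probe chain
            if (slots[i].Value == key) return true;
            i = (i + 1) % slots.Length;
        }
        return false;
    }
}
```

Note how a higher load factor means longer probe chains on both insert and lookup.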

  2. Chaining (the "zipper" method). Chaining is one of the most common ways to resolve hash collisions. The basic idea is that each slot of the array points to a linked list; when hash values collide, the new element is appended to the end of that list. Lookups work the same way: hash to an array position, then search along the list. If the hash function is badly designed and too many keys share the same hash value, the lookup degenerates into a linked-list search and the time complexity degrades to O(n).
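A minimal sketch of chaining, assuming int keys and f = key % capacity (the ChainedTable class is an illustrative name, not from the original system):

```csharp
using System;
using System.Collections.Generic;

// Minimal separate-chaining sketch: each bucket holds a linked list of colliding keys.
class ChainedTable
{
    private readonly LinkedList<int>[] buckets;

    public ChainedTable(int capacity)
    {
        buckets = new LinkedList<int>[capacity];
        for (int i = 0; i < capacity; i++)
            buckets[i] = new LinkedList<int>();
    }

    // Colliding keys are simply appended to the end of the bucket's list.
    public void Add(int key) => buckets[key % buckets.Length].AddLast(key);

    // Hash to the bucket, then walk the list: O(1) on average,
    // degrading to O(n) if every key lands in one bucket.
    public bool Contains(int key)
    {
        foreach (int k in buckets[key % buckets.Length])
            if (k == key) return true;
        return false;
    }
}
```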

  3. Rehashing. Rehashing, in essence, computes multiple hashes, which inevitably requires multiple hash functions. When a collision occurs, another hash function is used to compute a new position, and so on until no conflict remains. This method is less prone to clustering, but it increases computation time.
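A common concrete form of this idea is double hashing, where a second hash function supplies the probe step. The sketch below is illustrative: both hash functions and the Rehash class name are assumptions, and for full slot coverage the capacity is normally chosen prime.

```csharp
using System;

static class Rehash
{
    static int Hash1(int key, int capacity) => key % capacity;

    // Secondary hash: never zero, so the probe always advances.
    static int Hash2(int key, int capacity) => 1 + key % (capacity - 1);

    // Return the slot where `key` would be placed, stepping by the
    // secondary hash until a conflict-free position is found.
    public static int FindSlot(int?[] slots, int key)
    {
        int capacity = slots.Length;
        int i = Hash1(key, capacity);
        int step = Hash2(key, capacity);
        while (slots[i].HasValue)
            i = (i + step) % capacity; // resolve the conflict with the second hash
        return i;
    }
}
```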

  4. Establishing a common overflow area. This scheme is described less often online and is also less commonly applied. Elements with conflicting hash values are placed in a separate container, which can be an array, a linked list, or even a queue. Whatever it is, choose that container carefully so that the benefits of the hash table are preserved.

Further reading

  1. Let me stress this once more: underneath, a hash table relies on the array's O(1) access by subscript, and a typical hash table defines a load factor to avoid excessive collisions, which brings in the array's expansion behavior: a new, larger array must be allocated and the existing elements copied into it. If, when initializing the hash table, we know how much data there will be, even roughly, we can allocate enough space in one go and avoid the cost of later expansion. In fact, when memory is tight, a plan that favors this one-time allocation also tends to beat the alternatives.
  2. In the open-addressing scheme there is a special case: what if the probe reaches the end of the array without finding a free position? That is reminiscent of a circular linked list, and an array, too, can be treated as circular: if there is no space at the end, the search continues from the top of the array.
  3. A word on deleting hash table elements. With chaining, the elements live in linked lists, so deletion has the same time complexity as in a linked list, and subsequent lookups are unaffected. An open-addressed hash table is different: if we simply empty position N, elements with the same hash value that were placed after position N can no longer be found, because the lookup method stops at the now-empty slot. (A common remedy is to mark deleted slots with a special "deleted" flag instead of truly emptying them.) This is one of the reasons chaining-based hash tables are more common.
  4. An industrial-strength hash function is required to distribute hash values as evenly as possible. This is not only about space utilization, but also about preventing too many hash codes from landing in the same slot, where lookup degrades to the O(n) of searching a linked list and the hash table loses its greatest advantage.
  5. For the chained lists I prefer a doubly linked list: once the current element has been found, its predecessor and successor are immediately at hand, so deleting an element from the hash table can be reduced to O(1).
  6. In a hash table, the position of each element is determined by the hash function, so the traversal order is generally not the insertion order, something to keep in mind in real business applications.
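Note 6 can be seen with a tiny sketch (the helper below is illustrative): keys 5 and 1 are inserted in that order, but traversing the underlying array with f = key % capacity yields 1 first, because traversal follows slot positions rather than insertion order.

```csharp
using System;
using System.Collections.Generic;

static class TraversalOrder
{
    // Place each key at slot key % capacity (no collisions in this demo),
    // then traverse the array front to back and collect the occupied slots.
    public static List<int> InsertThenTraverse(int capacity, params int[] keys)
    {
        var slots = new int?[capacity];
        foreach (int k in keys)
            slots[k % capacity] = k;

        var result = new List<int>();
        foreach (var s in slots)
            if (s.HasValue) result.Add(s.Value);
        return result;
    }
}
```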

.NET Core C# code

A few points need to be emphasized:

  1. The distributed framework used in the current project is Orleans, which is based on the Actor model, so I do not have to worry about multithreaded access to each user's access records.
  2. I did not use Hashtable as the data container, because Hashtable is prone to boxing and unboxing; the generic Dictionary avoids that.
  3. We use a doubly linked list because finding the current element also means we have found its predecessor and successor, so deleting the current element can be O(1).

The entity for a user access record

```csharp
class UserViewInfo
{
    // User id
    public int UserId { get; set; }
    // Access time, UTC timestamp
    public int Time { get; set; }
    public string UserName { get; set; }
}
```

User-space code to add access records

```csharp
class UserSpace
{
    // Maximum number of cached records
    const int CacheLimit = 1000;

    // Doubly linked list holding the access records, most recent first
    LinkedList<UserViewInfo> cacheUserViewInfo = new LinkedList<UserViewInfo>();

    // Here we use a Dictionary, a hash table variant, to achieve fast access.
    // It stores the list node rather than the value, so removal from the list
    // is O(1) (LinkedList<T>.Remove(T value) would have to search in O(n)).
    // The initial capacity is set above the cache limit to avoid resizing.
    Dictionary<int, LinkedListNode<UserViewInfo>> dicUserView =
        new Dictionary<int, LinkedListNode<UserViewInfo>>(1250);

    public void AddUserView(UserViewInfo uv)
    {
        if (dicUserView.TryGetValue(uv.UserId, out LinkedListNode<UserViewInfo> node))
        {
            // If it exists, move the user's record from its current position
            // to the head of the cache
            cacheUserViewInfo.Remove(node);
            cacheUserViewInfo.AddFirst(node);
        }
        else
        {
            // If it does not exist, add it to the head of the cache
            // and to the hash table
            dicUserView.Add(uv.UserId, cacheUserViewInfo.AddFirst(uv));
        }

        // Check whether the cache exceeds the limit each time
        if (cacheUserViewInfo.Count > CacheLimit)
        {
            // Remove the last element from the cache and from the hash table.
            // The linked list keeps pointers to its first and last nodes,
            // so finding either of them is O(1).
            var lastItem = cacheUserViewInfo.Last.Value;
            dicUserView.Remove(lastItem.UserId);
            cacheUserViewInfo.RemoveLast();
        }
    }
}
```

Follow the author for a better-formatted version and more content.