I hope you are not as helpless and pained as CaiCai: in days when the year-end bonus has shrunk to almost nothing, work still has to go on... a true picture of "ice and fire" helplessness.

Remember the user-space system that CaiCai designed not long ago? For those of you who have not seen it, the portal is here: "Design a high-performance visitor log system".

Do you remember the remaining issue? To recap: how do we determine whether the current user already has a record in the cache of user access records? A linked list, although the primary data structure for our business scenario, is not the best fit for this particular problem. We need a data structure that can access elements quickly, and that is the hash table we are going to talk about today.

Hash table

A hash table (also called a hash map) is a data structure accessed directly by key: it maps a key to a location in the table and stores the record there, which speeds up lookups. The mapping function is called a hash function, and the array holding the records is called the hash table. A hash table is, roughly speaking, what we usually call a key-value structure. Hash tables exploit an array's ability to access data randomly by index, so a hash table is an extension of an array, evolved from it; without arrays, so to speak, there would be no hash tables. Why arrays? Because accessing an array element by subscript is O(1) (you can refer to the earlier article on arrays). To access an element by array index, we must also work out how to convert the key into an index, and that conversion is the hash function we are about to discuss.

The hash function

A hash function is, simply put, a black box that converts a key into an array index, and it plays the key role in a hash table. As its name implies, it is a function: we can write it as hash(key), where key is the element's key value and hash(key) is the hash value the function computes. So what are the requirements for a hash function?

  1. The hash function evaluates to a non-negative integer value.
  2. If key1 == key2, then hash(key1) == hash(key2)
  3. If key1 ≠ key2, then hash(key1) ≠ hash(key2)

The hash value must be a non-negative integer (>= 0) because it serves as an array index, and the same key must always produce the same hash value. The first two requirements are easy to meet; the third is really only theoretical. Ideally, different keys would always yield different hash values, but in practice that is hard to achieve: if we hash an effectively unlimited number of keys into a finite array, collisions cannot be avoided. Even hash algorithms well known in the industry, such as MD5 and SHA, cannot completely avoid such conflicts, and the smaller the underlying array, the greater the chance of a collision. So a perfect hash function does not exist, and even if one did, the cost in time and effort might be beyond imagination.
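As a concrete illustration of these requirements, here is a minimal sketch of a hash function for string keys (the class name and the multiply-by-31 scheme are illustrative choices, not anything from the original system): it always returns a non-negative index within the table, and the same key always maps to the same slot.

```csharp
using System;

static class SimpleHash
{
    // Map a string key to a non-negative index in [0, tableSize).
    // Deterministic: the same key always yields the same index (requirement 2).
    public static int Hash(string key, int tableSize)
    {
        int h = 0;
        foreach (char c in key)
            h = (h * 31 + c) % tableSize; // keep the value small and non-negative
        return h;
    }
}
```

Two different keys may still land on the same index, which is exactly the collision problem discussed next.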

Hash collisions

Since no hash function can avoid collisions entirely, we must find other ways to solve the problem.

  1. Open addressing. What do we do when an insertion hits a collision? One approach is to probe from the conflicting position onward for a free slot in the array, and insert the element once one is found. It is like going to a store to buy something that is sold out: you simply try the next store that still has stock. Whatever probing method is used, the probability of collisions rises significantly once few free positions remain in the table. To keep hash table operations efficient, we generally try to ensure a certain proportion of free slots, and we use the load factor to express how full the table is.

Load factor of the hash table = number of filled slots / length of the hash table

The larger the load factor, the fewer free positions, the more collisions, and the worse the hash table performs. Suppose the hash function is f = key % 1000, as shown in the figure below.
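The open-addressing idea above can be sketched as follows, probing linearly for the next free slot and wrapping around at the end of the array; the ProbingTable class and the f = key % capacity function are illustrative assumptions, not code from the original system.

```csharp
using System;

// Minimal open-addressing (linear probing) sketch, assuming int keys
// and the hash function f(key) = key % capacity.
class ProbingTable
{
    private readonly int?[] slots;
    public int Count { get; private set; }

    public ProbingTable(int capacity) => slots = new int?[capacity];

    // Load factor = filled slots / table length
    public double LoadFactor => (double)Count / slots.Length;

    // Insert the key; on a collision, scan forward (wrapping around) for a free slot.
    public bool Insert(int key)
    {
        if (Count == slots.Length) return false; // table full
        int i = key % slots.Length;
        while (slots[i].HasValue)
            i = (i + 1) % slots.Length; // circular probe: continue from the top of the array
        slots[i] = key;
        Count++;
        return true;
    }

    public bool Contains(int key)
    {
        int i = key % slots.Length;
        for (int probes = 0; probes < slots.Length; probes++)
        {
            if (!slots[i].HasValue) return false; // an empty slot ends the probe chain
            if (slots[i].Value == key) return true;
            i = (i + 1) % slots.Length;
        }
        return false;
    }
}
```

Note how a higher load factor means longer probe chains on both insert and lookup.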

  2. Chaining (the "zipper" method). Chaining is one of the most common ways to resolve hash collisions. The basic idea is that each slot of the array points to a linked list; when hash values collide, the new element is appended to the end of that list. Lookups work the same way: hash to an array position, then search along the list. If the hash function is badly designed and too many keys share the same hash value, the lookup degenerates into a linked-list search and the time complexity degrades to O(n).
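A minimal sketch of chaining, assuming int keys and f = key % capacity (the ChainedTable class is an illustrative name, not from the original system):

```csharp
using System;
using System.Collections.Generic;

// Minimal separate-chaining sketch: each bucket holds a linked list of colliding keys.
class ChainedTable
{
    private readonly LinkedList<int>[] buckets;

    public ChainedTable(int capacity)
    {
        buckets = new LinkedList<int>[capacity];
        for (int i = 0; i < capacity; i++)
            buckets[i] = new LinkedList<int>();
    }

    // Colliding keys are simply appended to the end of the bucket's list.
    public void Add(int key) => buckets[key % buckets.Length].AddLast(key);

    // Hash to the bucket, then walk the list: O(1) on average,
    // degrading to O(n) if every key lands in one bucket.
    public bool Contains(int key)
    {
        foreach (int k in buckets[key % buckets.Length])
            if (k == key) return true;
        return false;
    }
}
```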

  3. Rehashing. Rehashing, in essence, computes multiple hashes, which inevitably requires multiple hash functions. When a collision occurs, another hash function is used to compute a new position, and so on until no conflict remains. This method is less prone to clustering, but it increases computation time.
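A common concrete form of this idea is double hashing, where a second hash function supplies the probe step. The sketch below is illustrative: both hash functions and the Rehash class name are assumptions, and for full slot coverage the capacity is normally chosen prime.

```csharp
using System;

static class Rehash
{
    static int Hash1(int key, int capacity) => key % capacity;

    // Secondary hash: never zero, so the probe always advances.
    static int Hash2(int key, int capacity) => 1 + key % (capacity - 1);

    // Return the slot where `key` would be placed, stepping by the
    // secondary hash until a conflict-free position is found.
    public static int FindSlot(int?[] slots, int key)
    {
        int capacity = slots.Length;
        int i = Hash1(key, capacity);
        int step = Hash2(key, capacity);
        while (slots[i].HasValue)
            i = (i + step) % capacity; // resolve the conflict with the second hash
        return i;
    }
}
```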

  4. Establishing a common overflow area. This scheme is described less often online and is also less commonly applied. Elements with conflicting hash values are placed in a separate container, which can be an array, a linked list, or even a queue. Whatever it is, choose that container carefully so that the benefits of the hash table are preserved.

Further reading

  1. Let me stress this once more: underneath, a hash table relies on the array's O(1) access by subscript, and a typical hash table defines a load factor to avoid excessive collisions, which brings in the array's expansion behavior: a new, larger array must be allocated and the existing elements copied into it. If, when initializing the hash table, we know how much data there will be, even roughly, we can allocate enough space in one go and avoid the cost of later expansion. In fact, when memory is tight, a plan that favors this one-time allocation also tends to beat the alternatives.
  2. In the open-addressing scheme there is a special case: what if the probe reaches the end of the array without finding a free position? That is reminiscent of a circular linked list, and an array, too, can be treated as circular: if there is no space at the end, the search continues from the top of the array.
  3. A word on deleting hash table elements. With chaining, the elements live in linked lists, so deletion has the same time complexity as in a linked list, and subsequent lookups are unaffected. An open-addressed hash table is different: if we simply empty position N, elements with the same hash value that were placed after position N can no longer be found, because the lookup method stops at the now-empty slot. (A common remedy is to mark deleted slots with a special "deleted" flag instead of truly emptying them.) This is one of the reasons chaining-based hash tables are more common.
  4. An industrial-strength hash function is required to distribute hash values as evenly as possible. This is not only about space utilization, but also about preventing too many hash codes from landing in the same slot, where lookup degrades to the O(n) of searching a linked list and the hash table loses its greatest advantage.
  5. For the chained lists I prefer a doubly linked list: once the current element has been found, its predecessor and successor are immediately at hand, so deleting an element from the hash table can be reduced to O(1).
  6. In a hash table, the position of each element is determined by the hash function, so the traversal order is generally not the insertion order, something to keep in mind in real business applications.
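Note 6 can be seen with a tiny sketch (the helper below is illustrative): keys 5 and 1 are inserted in that order, but traversing the underlying array with f = key % capacity yields 1 first, because traversal follows slot positions rather than insertion order.

```csharp
using System;
using System.Collections.Generic;

static class TraversalOrder
{
    // Place each key at slot key % capacity (no collisions in this demo),
    // then traverse the array front to back and collect the occupied slots.
    public static List<int> InsertThenTraverse(int capacity, params int[] keys)
    {
        var slots = new int?[capacity];
        foreach (int k in keys)
            slots[k % capacity] = k;

        var result = new List<int>();
        foreach (var s in slots)
            if (s.HasValue) result.Add(s.Value);
        return result;
    }
}
```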

.NET Core C# code

A few points need to be emphasized:

  1. The distributed framework used in the current project is Orleans, which is based on the Actor model, so I do not have to worry about multithreaded access to each user's access records.
  2. I did not use Hashtable as the data container, because Hashtable is prone to boxing and unboxing; the generic Dictionary avoids that.
  3. We use a doubly linked list because finding the current element also means we have found its predecessor and successor, so deleting the current element can be O(1).

The entity for a user access record

```csharp
class UserViewInfo
{
    // User id
    public int UserId { get; set; }
    // Access time, UTC timestamp
    public int Time { get; set; }
    public string UserName { get; set; }
}
```

User-space code to add access records

```csharp
class UserSpace
{
    // Maximum number of cached records
    const int CacheLimit = 1000;

    // Doubly linked list holding the access records, most recent first
    LinkedList<UserViewInfo> cacheUserViewInfo = new LinkedList<UserViewInfo>();

    // Here we use a Dictionary, a hash table variant, to achieve fast access.
    // It stores the list node rather than the value, so removal from the list
    // is O(1) (LinkedList<T>.Remove(T value) would have to search in O(n)).
    // The initial capacity is set above the cache limit to avoid resizing.
    Dictionary<int, LinkedListNode<UserViewInfo>> dicUserView =
        new Dictionary<int, LinkedListNode<UserViewInfo>>(1250);

    public void AddUserView(UserViewInfo uv)
    {
        if (dicUserView.TryGetValue(uv.UserId, out LinkedListNode<UserViewInfo> node))
        {
            // If it exists, move the user's record from its current position
            // to the head of the cache
            cacheUserViewInfo.Remove(node);
            cacheUserViewInfo.AddFirst(node);
        }
        else
        {
            // If it does not exist, add it to the head of the cache
            // and to the hash table
            dicUserView.Add(uv.UserId, cacheUserViewInfo.AddFirst(uv));
        }

        // Check whether the cache exceeds the limit each time
        if (cacheUserViewInfo.Count > CacheLimit)
        {
            // Remove the last element from the cache and from the hash table.
            // The linked list keeps pointers to its first and last nodes,
            // so finding either of them is O(1).
            var lastItem = cacheUserViewInfo.Last.Value;
            dicUserView.Remove(lastItem.UserId);
            cacheUserViewInfo.RemoveLast();
        }
    }
}
```

Follow the author for a better-formatted version and more content.