Algorithm chatter: Learn Hash

>>>> 😜😜😜 Github: 👉 github.com/black-ant CASE Backup: 👉 gitee.com/antblack/ca…

A. The preface

In MySQL, Hash indexes can directly locate data faster than b-tree indexes.

This begs the question, how are hashes generated and addressed?

In response to this question, I tried to learn Hash again.

2. Hash principle

2.1 the Hash briefly

Objective:

A Hash algorithm is used to compress a set of characters to a fixed length. Since every file on a computer is ultimately just data that can be represented in binary form, a Hash algorithm can perform complex calculations on that data and output a fixed length string as the result of the calculation

USES:

Information summary, file checksum, integrity, security check, information encryption, file index, file history judgment

Basic usage

The most basic and common use of a Hash is with an array index

For columns 11,12,13,14,15, make them modulo 10 to get {1,2,3,4,5}, which will be stored in an array or hash table {1,2,3,4,5}.

Hash conflicts and Hash functions

Hash conflict: The same object can have only one Hash result. Different objects may have the same Hash result

For example, 21, 22,23 above can also be modulo to 1,2,3, which is a Hash conflict.

Hash conflicts can be resolved using Hash algorithms and data structures. Common solutions are 1) split chaining; 2) Open address method, see more about it later

2.2 Hash Algorithm Case – MD5

The Hash function is a Hash algorithm. MD5 is the most common Hash algorithm in the industry. This is a typical case and highlights the key points of the Hash algorithm.

Recommend an original # MD5 algorithm principle and implementation

// MD5 functions:Generates a fixed length for each input value128A string of bits and calculates the determined output values in several loops using standard one-way operations// The main points of the MD5 algorithm are as follows:
- Step 1: data filling -step2: Adds the message length -step3Initializes the variable -step4: Data processing// noun: big end and small endThere are two ways to store these two bytes in memory: one is to store low-order bytes at the starting address, which is called little-endian byte order; Another approach is to store high-order bytes at the start address, which is called big-endian byte orderCopy the code

The overall process is also clear in the document. Here is a flow chart to make our thinking clearer:

Step 1 and Step 2: Data filling and length supplement

Step 2: Data calculation

Step 3: Integrate after cyclic processing

It can be seen that the result of Hash is a set of labeled numbers and a set of the same operation method, which are compressed to get the data. From more to less, conflicts are bound to occur

MD5 status quo

After knowing the principle of MD5, it is not difficult to understand the current situation of MD5, MD5 is no longer unbreakable. Fixed length means collisions, and MD5 has such a long timeline that it is possible to decipher the corresponding source data through exhaustive history.

Evolution of Hash discrete functions

For the exhaustive state of MD5, there are more Hash functions to choose from

// MD2 ,MD3 , MD4 , MD5- MD5 indicates message-digest Algorithm5Information - summary algorithm5) - Fixed output128 位


// Secure Hashing Algorithm (SHA1)The first proposed standard fixed the length of the output value at160Bit, compared with MD5, SHA1 mainly increases the output length, the number and complexity of one-way operation, and cannot really solve the attack problem// SH2
- SHA-2 SHA-224, SHA -256, SHA -384And SHA -512

// SHA3: Algorithm based on Sponge Construct- Uses random permutations to absorb and output data while providing randomness for future input values to be used in hash algorithms// SHA256- The token protocol (proof of work) requires the SHA256 algorithm to be run twice, which is designed to resist length extension attacks// RIPEMD-160 :
- 160Bit encryption hash function/ / CRC series


// Mac: indicates the Message authentication code- Hash function with the key as part of the messageCopy the code

For the results, check out the website Hasj Online Computing

Hash function classification

/ / add the Hash

// 位运算 Hash

/ / multiply Hash

/ / division Hash

/ / the Hash table

Copy the code

Why is a HashMap called a HashMap?

So how does a HashMap implement its own Hash? How do you look up when you look up?

As stated in the HashMap source documentation, the HashMap implementation provides constant time performance for the basic operations (GET and PUT), and hash functions disperse elements appropriately into buckets. Iteration over the collection view takes time proportional to the “capacity” of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings).

So how does a HashMap do this?

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if((tab = table) ! =null && (n = tab.length) > 0 &&
        >>> first = TAB [(n-1) &hash]
        (first = tab[(n - 1) & hash]) ! =null) {
        
        // If the first element in the array position is the element to be searched based on the hash, it is returned. At this time of O (1)
        if (first.hash == hash && // always check first node((k = first.key) == key || (key ! =null && key.equals(k))))
            return first;
        if((e = first.next) ! =null) {
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
                
            // We can see from this that we still loop to see if the corresponding Hash matches
            do {
                if(e.hash == hash && ((k = e.key) == key || (key ! =null && key.equals(k))))
                    return e;
            } while((e = e.next) ! =null); }}return null;
}
Copy the code

Therefore, the use of Hash in Java is mainly through array + hash subscript to achieve O(1) time complexity.

Object HashCode:

// s[0]*31^(n-1) + s[1]* 31^(n-2) +... + s[n-1]
// 0-n indicates the first character to the last character of the string
public int hashCode(a) {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
Copy the code

Secondary disturbances and Poisson distribution

Most of us are familiar with these two concepts. What is the problem solved by the secondary perturbation and Poisson distribution of a HashMap

Poisson distribution:

The compromise between improving space utilization and reducing query cost is mainly poisson distribution, and 0.75 is the minimum collision

Check out this article: blog.csdn.net/ccnt_2012/a…

Simply put, it is not a waste of space, and can avoid insufficient space.

Secondary disturbance:

In this way, the randomness of the lower part of the Hash value is increased to make the distribution more uniform, so as to improve the randomness and uniformity of the index position of the corresponding array storage, and finally reduce the Hash conflict. Only two times is enough, which has achieved the purpose of both high and low parts participating in the operation

(hash1(key) + I * hash2(key)) % TABLE_SIZE where hash1() and hash2() are hash functions and TABLE_ size is the size of the hash TABLE.

If Hash conflicts are large, you can add I to resolve them.

Let’s look at a systematic approach to resolving Hash conflicts

Separated chaining method

Chaining: The idea is to make each cell of a hash table point to a linked list of records with the same hash function value

To put it simply, the same Hash is worth generalizing into a linked list.

/ / advantages:
1) easy to implement2) The hash table never fills up3) is less sensitive to hash functions or load factors4) is used when the number and frequency of keys inserted or deleted is not known/ / disadvantages:
1) Links do not cache well because key stores use linked lists.2Space waste (some parts of the hash table are never used)3If the chain is longer, the search time can be o (n) in the worst case (PS: always indexing inward)4Use extra space for linksCopy the code

Open addressing

In an open address, all elements are stored in the hash table itself. Therefore, at all times, the size of the table must be greater than or equal to the total number of keys

To put it simply, open addressing is analogous to multiple processing until an empty slot is found:

If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S 

// select * from 'search';
- Search(k): When looking for an element, check all entries until the desired element is found, or the element is not in the table -Delete (k)The slot for the delete key is specifically marked "delete". An insert operation can insert items in a deleted slot, but the search does not stop in the deleted slot.Copy the code

Relevant reference www.geeksforgeeks.org/hashing-dat…

5. What is the Hash index of MySQL?

Now that we know about HashMap, let’s look at MySQL’s Hash index.

Let’s look at the concept of a Hash index:

// MySQL supports Hash indexes:MyISAM doesn't support Hash indexes, and InnerDB isn't actually a full Hash index -- it's a self-built Hash based on the B-tree. InnerDB doesn't actually have a Hash index, it does in MySQL8.0This is also reflected in official documents: +----------------+--------------------------------+ | Storage Engine | Permissible Index Types | +----------------+--------------------------------+ | MyISAM | BTREE | | InnoDB | BTREE | | MEMORY/HEAP | HASH, BTREE | | NDB | BTREE, HASH (see note in text) | +----------------+--------------------------------+/ / features:- Each index unit computes a Hash value. - Hash indexes include key values, Hash codes, and Pointers. - Hash indexes are compact and fast/ / disadvantages:N/A Hash index has Hash conflicts and cannot be created on columns that are repetitive. N/A Hash index stores Hash values and therefore requires a second search. N/A Hash index Does not support external sorting. N/A Hash index only supports accurate searchCopy the code

As you can see, the features and disadvantages of MySQL are mostly Hash features.

The Hash index in the MySQL InnerDB is an adaptive Hash indexAdaptive Hash

Adaptive hash indexes enable InnoDB to perform more in-memory database/server like operations without sacrificing transactional features or reliability on systems with appropriate workload and sufficient memory combinations in the buffer pool. The adaptive hash index is enabled by the innodb_adaptive_hash_index variable, or turned off at server startup by: -- skip-Innodb-adaptive-hash-index// Adaptive index features:- Using the prefix of an index key to build a Hash index - Adaptive Hash Allows hot data to be built into a Hash tableCopy the code

6. Hash’s role in security

We said that hashing is irreversible, so you can’t reconstruct the contents of a file just by using the hash algorithm and the result of the file hash. However, it can determine whether two files are the same without knowing their contents.

Early antivirus software

Traditional anti-virus solutions rely entirely on hash values to determine if a file is malicious without examining the contents or behavior of the file, which is done by keeping the hash values of an internal database belonging to known malware. When scanning the system, the AV engine computes a hash value for each executable on the user’s machine and tests for a match in its database.

However, this method is relatively easy to crack, even if there is a difference of one character, the overall Hash will be a big difference

Signature and Abstract

Signatures and digests are the most common Hash operations, which effectively verify that a file, website, or download is real. They can also be used in certificates and SSL

Hash search

Since the length of the Hash is fixed, you can save the Hash of the file to realize the history of the file and determine whether the file exists or existed in the past

Hashes index data, hash values that can be used to map data to individual “buckets” in a hash table. Each bucket has a unique ID that acts as a pointer to the original data. This creates an index that is much smaller than the original data, making it possible to search and access values more efficiently.

Password encryption

If the password cannot be decrypted, encrypt it in the same way as the verification

Block chain

Both Bitcoin and Ethereum have certain applications for Hash

The miner’s attention focused on a series of figures. This number is appended to the hash content of the previous block, which is then hashed. If the new hash is less than or equal to the target hash, it is accepted as the solution, the miner is rewarded, the block is added to the blockchain, and the validation process of the block chain transaction relies on the encrypted data using algorithmic hashes.

conclusion

So far, I have a preliminary understanding of Hash. The article is not deep and contains some information. I have sorted out and learned relevant knowledge points as much as possible, hoping to be helpful to others

Reference Documents:

Blog.csdn.net/z_ryan/arti…

zhuanlan.zhihu.com/p/106941474

Fangjian0423. Making. IO / 2016/03/12 /…

Segmentfault.com/a/119000000…