Speaking of hashing, two things usually come to mind: hash tables and hash functions. We're not going to talk about hash tables today, just hash functions.

Definition

Here is Baidu Encyclopedia's definition of a hash function:

A hash algorithm transforms an input of arbitrary length into a fixed-length output, called the hash value. This transformation is a compressing map: the space of hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output, and it is impossible to recover a unique input from a hash value. Simply put, it is a function that compresses a message of arbitrary length into a fixed-length message digest.

The definition of a hash function can be phrased in many ways.

Properties

Both Wikipedia and Baidu Encyclopedia mention a few properties of hash functions:

1. Determinism

The same input always produces the same hash value; equivalently, if two hash values are different, the two original inputs must also be different.

2. Collision

The inputs and outputs of a hash function do not correspond one-to-one. If two hash values are equal, the two inputs may well be the same, but they may also be different.

3. Irreversibility

The last property concerns whether a hash is reversible, and here the two sources differ:

[Screenshot: Wikipedia – Hash function]

[Screenshot: Baidu Encyclopedia – hash function]

Wikipedia states explicitly that "the hash function must be irreversible", while Baidu Encyclopedia's wording is ambiguous, which makes the latter seem lax by comparison. I prefer Wikipedia's version: irreversibility is required.

4. Confusion

In the “Properties of hash functions” section, Wikipedia also mentions a point:

A hash function with strong confusion: take some input, compute its hash value, then partially change the input; the new hash value will be completely different.

There are two phrases here: "strong confusion" and "completely different". What do they mean?

Let's start with a concept: the essence of the avalanche effect is the strict avalanche criterion: when any single input bit is flipped, each bit of the output changes with 50% probability.

Another concept: Hamming distance. (It has two common transliterations in Chinese; the name is not important, the meaning is.)

The Hamming distance of two code words is the number of bit positions in which they differ. For example, 10101 and 00110 differ in the first, fourth, and fifth bits, so their Hamming distance is 3.
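As a quick sketch (my addition, assuming 64-bit values), Java can compute the Hamming distance with an XOR and a bit count:

    // Hamming distance = number of differing bits = bit count of the XOR
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }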

Back to hash functions: if, after "partially changing the input value", the Hamming distance between the old and new hash values is about half the hash length (i.e., roughly half the bits differ), then each bit changes with 50% probability. Such a hash function has strong confusion.

Examples of hash functions

Common hash functions include cryptographic hash functions such as MD5 and the SHA family; CRC should also count as a hash. Both kinds are used for data verification, and cryptographic hashes are additionally used for digital signatures, access authentication, and other security purposes. But we won't go deep into cryptographic hashes today. We'll focus on the following two hashes:

BKDRHash

You have probably seen this hash function before, even if you didn't know its name. String.hashCode() in the JDK implements it:

    public int hashCode() {
        int h = hash;
        final int len = length();
        if (h == 0 && len > 0) {
            for (int i = 0; i < len; i++) {
                h = 31 * h + charAt(i);
            }
            hash = h;
        }
        return h;
    }

If you define a class and let the IDE auto-generate hashCode(), the implementation looks similar:

    public static class Foo{
        int a;
        double b;
        String c;
        
        @Override
        public int hashCode() {
            int result;
            long temp;
            result = a;
            temp = Double.doubleToLongBits(b);
            result = 31 * result + (int) (temp ^ (temp >>> 32));
            result = 31 * result + (c != null ? c.hashCode() : 0);
            return result;
        }
    }

Why is it always "31"? Why multiply and add iteratively like this? This article discusses some of the rationale: parsing and extending the BKDRHash algorithm of the hash table, and several experts on Zhihu have done their own analysis. From the comparison scores in the second link, BKDRHash is simple to implement yet effective (low collision rate).
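One arithmetic fact worth noting (my addition, not a claim from the linked articles): 31 is an odd prime, and multiplying by 31 reduces to a shift and a subtraction, which JIT compilers can exploit:

    // 31 * h needs no multiply instruction:
    static int times31(int h) {
        return (h << 5) - h; // 32*h - h == 31*h
    }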


A low collision rate lets a hash be used not just for hash tables but also to index objects. The most common example is MD5: some websites use the MD5 of a file as the key to retrieve the file. DiskLruCache also uses MD5 keys, though the MD5 is usually computed over the URL (as in OkHttp and Glide) rather than the file itself. MD5 produces a 128-bit digest, so if there are not many objects to identify, the collision probability is very low. When that probability is far below the probability of hardware failure, treating MD5 as a key can be considered reliable. For websites storing massive numbers of files, however, MD5 keys are not recommended. Incidentally, a UUID is also 128 bits, just with a few hyphens inserted for readability.
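For illustration, a minimal sketch of deriving an MD5 hex key from a URL in Java (the general pattern only; not the actual code of DiskLruCache, OkHttp, or Glide):

    import java.security.MessageDigest;

    public static String md5Key(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(url.getBytes("UTF-8")); // 128-bit digest
        StringBuilder sb = new StringBuilder(32);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xFF)); // hex-encode each byte
        }
        return sb.toString(); // 32-character key
    }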

Volley is an example of using BKDRHash to index objects: it derives a cache filename from the key like this:

private String getFilenameForKey(String key) {
    int firstHalfLength = key.length() / 2;
    String localFilename = String.valueOf(key.substring(0, firstHalfLength).hashCode());
    localFilename += String.valueOf(key.substring(firstHalfLength).hashCode());
    return localFilename;
}

Since the JDK's hashCode() returns an int, concatenating two of them gives this function roughly 64-bit precision. Strictly speaking, though, it is not a hash function: String.valueOf() makes the output length variable, which by definition disqualifies it, even though the idea is very close. An equivalent fixed-length version looks like this:

    public static String getFilenameForKey(String key) {
        byte[] bytes = key.getBytes();
        int h1 = 0, h2 = 0;
        int len = bytes.length;
        int firstHalfLength = len / 2;
        for (int i = 0; i < firstHalfLength; i++) {
            byte ch = bytes[i];
            h1 = 31 * h1 + ch;
        }
        for (int i = firstHalfLength; i < len; i++) {
            byte ch = bytes[i];
            h2 = 31 * h2 + ch;
        }
        long hash = (((long) h1 << 32) & 0xFFFFFFFF00000000L) | ((long) h2 & 0xFFFFFFFFL);
        return Long.toHexString(hash);
    }

Its effect is roughly equivalent to a BKDRHash with 64-bit precision, which looks like this:

    public static long BKDRHash(byte[] bytes) {
        long seed = 1313; // 31 131 1313 13131 131313 etc..
        long hash = 0;
        int len = bytes.length;
        for (int i = 0; i < len; i++) {
            hash = (hash * seed) + bytes[i];
        }
        return hash;
    }
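A hypothetical usage sketch (the URL is made up): hash a cache key and hex-encode it as a filename:

    public static void main(String[] args) {
        byte[] input = "https://example.com/image.png".getBytes();
        System.out.println(Long.toHexString(BKDRHash(input)));
    }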

I wrote a program to compare the collision rates of the two; the two-halves version collides more often than the 64-bit BKDRHash. (Space is limited, so I won't post the test code; interested readers can try it themselves.)

A 32-bit hash, of any kind, will almost always collide once the data set (random data) reaches 10^6. A 64-bit hash, as long as it is not too weak, rarely collides even on tens of millions of inputs when the inputs are reasonably long (say, 20-byte random arrays). (I didn't try hundreds of millions; my machine can't handle it.)
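A back-of-the-envelope check with the birthday approximation (standard math, not part of the original tests): among n random hashes drawn from 2^bits possible values, the expected number of colliding pairs is about n² / 2^(bits+1):

    // Birthday approximation: expected colliding pairs ≈ n^2 / (2 * 2^bits)
    static double expectedCollisions(double n, int bits) {
        return n * n / (2 * Math.pow(2, bits));
    }
    // expectedCollisions(1e6, 32) ≈ 116     -> 32-bit collisions near-certain
    // expectedCollisions(1e7, 64) ≈ 2.7e-6  -> 64-bit collisions very unlikely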

I was also a fan of BKDRHash and used it in a project for a while (as a cache key). Then I saw one of the answers in the Zhihu discussion mentioned above:
[Screenshot: an answer from the Zhihu discussion]
I modified the test case to use random data of variable length, e.g. 1-30 random bytes, and tested the 64-bit BKDRHash written above. The result: collisions appeared with a data set of only 50,000.

We already knew that BKDRHash has no confusion (for example, if the last byte of the input increases by 1, the hash value also just increases by 1). Still, because the implementation is simple and my test results at the time seemed acceptable, I used it as a cache key; after all, Volley does the same. In practice this is usually fine: the inputs tend to be long and there aren't many files to cache (hundreds or thousands), so collisions are unlikely. But it never felt quite safe (see Murphy's law), and I soon switched to another hash function.
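A quick demonstration of that missing confusion, using the BKDRHash above (the strings are hypothetical; any two inputs whose last byte differs by 1 behave the same way):

    public static void main(String[] args) {
        byte[] a = "hello-worlc".getBytes(); // last byte 'c'
        byte[] b = "hello-world".getBytes(); // last byte 'd' = 'c' + 1
        System.out.println(BKDRHash(b) - BKDRHash(a)); // prints 1
    }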

MurmurHash

When I first saw this hash function, I was surprised by its name. But some think the name is cute:
[Screenshot: comments on the name]
sites.google.com/site/murmur… — MurmurHash's homepage; below is the 64-bit version, MurmurHash64A:


uint64_t MurmurHash64A(const void * key, int len, unsigned int seed) {
    const uint64_t m = 0xc6a4a7935bd1e995;
    const int r = 47;

    uint64_t h = seed ^ (len * m);

    const uint64_t * data = (const uint64_t *)key;
    const uint64_t * end = data + (len / 8);

    while (data != end) {
        uint64_t k = *data++;

        k *= m;
        k ^= k >> r;
        k *= m;

        h ^= k;
        h *= m;
    }

    const unsigned char * data2 = (const unsigned char *)data;

    switch (len & 7) {
    case 7: h ^= uint64_t(data2[6]) << 48;
    case 6: h ^= uint64_t(data2[5]) << 40;
    case 5: h ^= uint64_t(data2[4]) << 32;
    case 4: h ^= uint64_t(data2[3]) << 24;
    case 3: h ^= uint64_t(data2[2]) << 16;
    case 2: h ^= uint64_t(data2[1]) << 8;
    case 1: h ^= uint64_t(data2[0]);
            h *= m;
    }

    h ^= h >> r;
    h *= m;
    h ^= h >> r;

    return h;
}

Overall it is not very complicated; compared with BKDRHash, it just does more mixing inside the loop. Also, C++ can treat the input as an array of uint64_t and consume 8 bytes per iteration, which is faster on longer arrays. For Java, however, this is more troublesome:

public static long hash64(final byte[] data) {
    if (data == null || data.length == 0) {
        return 0L;
    }
    final int len = data.length;
    final long m = 0xc6a4a7935bd1e995L;
    final long seed = 0xe17a1465;
    final int r = 47;
    long h = seed ^ (len * m);
    int remain = len & 7;
    int size = len - remain;
    for (int i = 0; i < size; i += 8) {
        long k = ((long) data[i] << 56) +
                ((long) (data[i + 1] & 0xFF) << 48) +
                ((long) (data[i + 2] & 0xFF) << 40) +
                ((long) (data[i + 3] & 0xFF) << 32) +
                ((long) (data[i + 4] & 0xFF) << 24) +
                ((data[i + 5] & 0xFF) << 16) +
                ((data[i + 6] & 0xFF) << 8) +
                (data[i + 7] & 0xFF);
        k *= m;
        k ^= k >>> r;
        k *= m;
        h ^= k;
        h *= m;
    }
    switch (remain) {
        case 7: h ^= (long) (data[size + 6] & 0xFF) << 48;
        case 6: h ^= (long) (data[size + 5] & 0xFF) << 40;
        case 5: h ^= (long) (data[size + 4] & 0xFF) << 32;
        case 4: h ^= (long) (data[size + 3] & 0xFF) << 24;
        case 3: h ^= (data[size + 2] & 0xFF) << 16;
        case 2: h ^= (data[size + 1] & 0xFF) << 8;
        case 1: h ^= (data[size] & 0xFF);
            h *= m;
    }
    h ^= h >>> r;
    h *= m;
    h ^= h >>> r;
    return h;
}

The implementation above references: github.com/tnm/murmurh… and github.com/tamtam180/M…

Let's run a test: 10^7 random byte arrays, each 1-30 bytes long. The results:

| Hash function | Collisions | Time (ms) |
| ------------- | ---------- | --------- |
| BKDRHash      | 1673       | 1121      |
| CRC64         | 1673       | 1331      |
| MurmurHash    | 0          | 1119      |

This test includes another 64-bit hash, CRC64. Surprisingly, CRC64 and BKDRHash had exactly the same collision count (I ran it many times, always identical), presumably because both lose information to overflow in similar ways. Runtime is about the same across the three. If the array length is changed to 1-50 bytes, the collision rates of BKDRHash and CRC64 drop somewhat, while MurmurHash runs slightly faster than the other two.
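The article doesn't specify which CRC64 variant was tested; for reference, here is a minimal bit-by-bit sketch of one common variant (the ISO reflected polynomial; real implementations use lookup tables for speed):

    // Bitwise CRC-64 with the ISO 3309 reflected polynomial
    static long crc64(byte[] data) {
        long crc = 0xFFFFFFFFFFFFFFFFL;
        for (byte b : data) {
            crc ^= (b & 0xFFL);
            for (int i = 0; i < 8; i++) {
                // shift right; XOR in the polynomial if the dropped bit was 1
                crc = (crc >>> 1) ^ (((crc & 1L) != 0L) ? 0xD800000000000000L : 0L);
            }
        }
        return ~crc;
    }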

I would guess MurmurHash's popularity is due not only to its low collision rate, but also to an important feature: strong confusion. Let's write a function to compute the Hamming distance:

private static int getHammingDistance(long a, long b) {
    int count = 0;
    long mask = 1L;
    while (mask != 0) {
        if ((a & mask) != (b & mask)) {
            count += 1;
        }
        mask <<= 1;
    }
    return count;
}

Construct random byte arrays, flip one bit at a time, and measure the Hamming distance between the MurmurHash values before and after: it averages between 31 and 32. For a 64-bit hash that is right around 50%, which indicates the avalanche effect. Put another way, MurmurHash spreads its outputs uniformly, which matters: used in a hash table or a Bloom filter, for example, elements will be evenly distributed.
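A minimal sketch of that experiment, assuming the hash64() and getHammingDistance() defined above (the array length and trial count are arbitrary):

    static double avgBitsChanged(int trials) {
        java.util.Random rnd = new java.util.Random();
        long total = 0;
        for (int t = 0; t < trials; t++) {
            byte[] data = new byte[16];
            rnd.nextBytes(data);
            long before = hash64(data);
            data[rnd.nextInt(16)] ^= 1 << rnd.nextInt(8); // flip one random bit
            total += getHammingDistance(before, hash64(data));
        }
        return total / (double) trials; // ≈32 for a 64-bit avalanche hash
    }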

For a hash function with the avalanche effect, collision probability follows the birthday problem (see "Birthday attack").

[Table lost in formatting: collision probability as a function of hash size and number of inputs]
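The math behind such tables is standard: for an ideal hash with d possible values, the probability of at least one collision among n inputs is approximately p ≈ 1 − e^(−n(n−1)/(2d)), and the 50% point lies near n ≈ 1.177·√d. For a 64-bit hash, d = 2^64, so a 50% collision chance needs roughly 5 × 10^9 inputs, consistent with the test results above.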
Conclusion

Hashing is used throughout computing, and some properties of hash functions can help solve engineering problems; hence this "rambling talk". I hope it gives readers some inspiration. Please point out anything I've gotten wrong.