This is the fourth day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge 2021.

A Detailed and Easy-to-Understand Introduction to the Min-Hash Algorithm [2]

The minimum hash operation

Imagine that we take the characteristic matrix built from the original sets we need to compare (each set is one column of the matrix) and randomly shuffle its rows P times. After each shuffle, for every set (that is, every column of the matrix), we find the row index of the first row whose value is 1, and append that index to a new set (each original set gets its own new set). This new set is the hash signature of the original set. It can be expressed as follows:


$$sig(C_i) = \text{the row index of the first value } 1 \text{ in column } C_i \text{ after each of the } P \text{ shuffles} \tag{3}$$

That is the minimum hash operation. If you are still a little confused, let's work through an example.

(The following example is adapted from Stanford University course materials.)

  1. Suppose there are three original sets Ci, each of shape (5,), that we need to compare. First combine them by column into a (5, 3) matrix, where Ri denotes the row index:
  2. Suppose we want to reduce the dimension to 3, that is, each original set Ci is reduced to shape (3,). Then we set P = 3 and shuffle the rows three times:

     In the first shuffle, the rows keep their original order (1 2 3 4 5). The first 1 of C1 appears in R1, so we add 1 to S1, the hash signature of the original set C1; the other two sets are handled the same way. In the second shuffle, the first 1 of C1 appears in R4, so we add 4 to S1. The third shuffle works just like the first two.
  3. Finally, we compute the Jaccard coefficient of the hash signatures Si obtained by applying the min-hash operation to the Ci, and compare it with the similarity between the original sets (Col_Col is the Jaccard coefficient of the original sets; Sig_Sig is the Jaccard coefficient of the hash signatures, which is the Min-Hash coefficient of the original sets):
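The steps above can be sketched in a few lines of NumPy. Since the concrete matrix from the original figure is not reproduced in the text, the membership values below are made up for illustration; the shape (5, 3), the P = 3 shuffles, and the "index of the first 1 per column" rule follow the description above.

```python
import numpy as np

# Hypothetical (5, 3) characteristic matrix: column j is the original set Cj,
# row Ri marks whether element i belongs to that set. (Values are invented;
# the actual Stanford example matrix is in the lost figure.)
C = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])

rng = np.random.default_rng(0)
P = 3  # number of row shuffles = length of each hash signature

signatures = []
for _ in range(P):
    order = rng.permutation(C.shape[0])  # one random row shuffle
    shuffled = C[order]
    # For each column, the (1-based) index of the first row whose value is 1.
    first_ones = [np.flatnonzero(shuffled[:, j])[0] + 1
                  for j in range(C.shape[1])]
    signatures.append(first_ones)

sig = np.array(signatures)  # shape (P, 3): column j is the signature S_j
print(sig)
```

Each original 5-dimensional column has been reduced to a 3-dimensional signature, as the text describes.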

It should be easy to see that, in the example above, the Jaccard coefficients of the original sets are close to the Min-Hash coefficients, but not identical. This is because the three original sets in this example are low-dimensional, with only 5 dimensions. In practical applications, the sets we face usually have millions of dimensions; at that scale, statistically speaking, the Min-Hash coefficient equals the Jaccard coefficient in expectation.

This brings us to the question left open in the previous section: why do we believe that the Jaccard distances between the sets after the minimum hash operation are still the same as the Jaccard distances between the original sets? We prove it as follows.

First, from the definition of the Jaccard coefficient, we know that it is simply the proportion of elements shared by two sets (their intersection) among all their elements (their union).

Let's consider just sets C1 and C2. Each row of their two columns must fall into one of three cases:

  1. Both C1 and C2 have value 1 in this row; call the number of such rows X.
  2. One of C1 and C2 has value 1 in this row and the other has value 0; call the number of such rows Y.
  3. Both C1 and C2 have value 0 in this row; call the number of such rows Z.

Obviously, X counts the rows in the intersection of C1 and C2, and X + Y counts the rows in their union.

So Jaccard(C1, C2) = X / (X + Y).
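As a quick check of this counting argument, here is a minimal sketch with two hypothetical membership columns (the values are made up for illustration):

```python
import numpy as np

# Two hypothetical set-membership columns over the same 6 elements.
C1 = np.array([1, 0, 1, 1, 0, 1])
C2 = np.array([1, 1, 0, 1, 0, 0])

X = int(np.sum((C1 == 1) & (C2 == 1)))  # case 1: both 1 (intersection rows)
Y = int(np.sum(C1 != C2))               # case 2: exactly one is 1
Z = int(np.sum((C1 == 0) & (C2 == 0)))  # case 3: both 0 (in neither set)

jaccard = X / (X + Y)
print(X, Y, Z, jaccard)  # -> 2 3 1 0.4
```

Note that the Z rows (both 0) belong to neither set, so they appear in neither the intersection nor the union; only X and Y matter for the Jaccard coefficient.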

Next, compute P[min-hash(C1) = min-hash(C2)]. After a random row shuffle, scan from top to bottom: the signatures agree exactly when the first row of type 1 or type 2 that we encounter is a type-1 row (both values 1). Since the rows are in uniformly random order, the probability that a type-1 row comes before every type-2 row is X/(X+Y). That is, P[min-hash(C1) = min-hash(C2)] = X/(X+Y).
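We can verify this probability empirically with a small Monte Carlo sketch, reusing the same two hypothetical columns as above (Jaccard = 2/5 = 0.4) and repeating the random shuffle many times:

```python
import numpy as np

rng = np.random.default_rng(42)

# Same hypothetical membership columns; Jaccard(C1, C2) = 2 / (2 + 3) = 0.4.
C1 = np.array([1, 0, 1, 1, 0, 1])
C2 = np.array([1, 1, 0, 1, 0, 0])

X = int(np.sum((C1 == 1) & (C2 == 1)))
Y = int(np.sum(C1 != C2))
true_jaccard = X / (X + Y)

trials = 100_000
hits = 0
for _ in range(trials):
    order = rng.permutation(len(C1))
    a, b = C1[order], C2[order]
    # min-hash value = index of the first 1 in each shuffled column
    if np.flatnonzero(a)[0] == np.flatnonzero(b)[0]:
        hits += 1

estimate = hits / trials
print(true_jaccard, estimate)  # estimate should be close to 0.4
```

Over many random shuffles, the fraction of shuffles where the two min-hash values agree converges to the Jaccard coefficient, which is exactly the claim being proved.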

Why do we compute the probability that, scanning downward, we hit a type-1 row before a type-2 row? Recall what we add to the hash signature of each original set: the index of the first row whose value is 1. Type-3 rows (both 0) can never supply that index for either set, so the two signatures agree precisely when the first relevant row is of type 1.

(This proof draws on Stanford University's course materials and Dasda's article.)

This property is a little tricky to understand; the Stanford slides even call it the "Surprising Property". Although it requires some knowledge of probability theory, I believe you can grasp it with a little thought. If you have any questions, feel free to leave a comment in the comments section, or just send me a private message.

Well, is that the end of our discussion of the min-hash algorithm? Not quite. Remember why we use min-hash? Dimensionality reduction. And what is the purpose of dimensionality reduction here? To reduce computational complexity! But now we face a problem: shuffling the matrix to obtain hash signatures does reduce the cost of computing similarities afterward, yet shuffling the rows of a huge matrix is itself a very expensive operation!

So in the next section, we'll look at the min-hash algorithm as it is used in practice! Don't take this lesson lightly, because the practical min-hash algorithm builds directly on it. See you next time!