This is the third day of my participation in the November Gwen Challenge. Check out the details: the last Gwen Challenge 2021

Basic ideas of the Min-hash algorithm

In today’s section, we’ll give a general introduction to the basic idea of the Min-hash algorithm.

As we mentioned in the previous section, “The Min-Hash algorithm is an improved Jaccard distance with the function of dimensionality reduction”. What does that mean? In this section, we will give an overall description of the basic idea of the Min-hash algorithm based on the explanation of this sentence, and further explain the implementation of the algorithm in the following sections.

First, let’s explain the key phrase in the first half of the sentence: “Improve on Jaccard distance.” This sentence means that the min-hash algorithm ultimately compares the similarity between sets through the idea of Jaccard distance, that is, formula (1) is used. So what’s the difference between Min-Hash and Jaccard distance, and what’s the improvement? This should be combined with the following sentence: “Advanced Jaccard distance with dimensionality reduction”. Obviously, the min-hash algorithm needs to perform a minimum hash operation on Ci and Cj before comparing the set similarity. And why it can achieve dimensionality reduction will be explained in detail in the next section) to reduce the dimension of the set and obtain its hash signature SIG (Ci) and SIG (Cj), which is the dimensionality reduction function mentioned in the above definition. At this time, we can compare the Jaccard distance between SIG (Ci) and SIG (Cj). We can summarize the formula of min-hash algorithm as follows:


m i n h a s h ( C i . C j ) = J a c c a r d ( s i g ( C i ) . s i g ( C j ) ) (2) minhash(C_i,C_j)=Jaccard(sig(C_i) ,sig(C_j)) \tag 2

Now we need to pay attention to the question, why do we think that the Jaccard distance between the set after the minimum hash operation is the same as the Jaccard distance between the original set? Don’t worry, we’ll prove this problem in detail in the next section, after we explain the definition of minimum hashing in detail.