
Introduction to LSH algorithm

Before introducing the Min-hash algorithm, we must briefly introduce the concept of LSH (Locality Sensitive Hashing).

LSH is one of the most popular approximate nearest neighbor search algorithms; intuitively, approximate nearest neighbor search means finding objects similar to a given query object. It is mainly used to mine similar data from massive datasets, and can be applied to text similarity detection, web search, and other fields.

The LSH pipeline can be roughly divided into three steps:

  1. Shingling: convert text documents into set representations (usually Boolean vectors)
  2. Min-hashing: convert high-dimensional vectors into low-dimensional signatures, then compute similarity on the signatures
  3. Locality-sensitive hashing: focus on pairs of signatures that are likely to come from similar documents (candidate pairs)

(The first of the three steps above, shingling, is the vectorization of text; it is a very large topic and will be covered in a separate series later.)
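To make the shingling step concrete, here is a minimal sketch that turns a document into the set of its k-character shingles. The choice of k = 3 and the sample string are my own illustrative assumptions, not from the article.

```python
def shingles(text, k=3):
    # The set of all contiguous k-character substrings of the document.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

doc = "abcdab"
print(sorted(shingles(doc)))  # ['abc', 'bcd', 'cda', 'dab']
```

Each document then becomes a set of shingles, which is exactly the set representation the later steps operate on.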

So the Min-hash algorithm is one step of the LSH pipeline. Its main job is to convert a high-dimensional input vector (possibly hundreds of thousands of dimensions or more) into a low-dimensional vector (the dimension-reduced vector is called a signature), and then compute similarity on the low-dimensional vectors, which reduces computational cost and improves efficiency.

Now that we know the purpose of Min-hash, we need to focus on how it achieves this.

Jaccard distance

Before we get into the min-hash algorithm, we need to learn one more important concept, the Jaccard distance.

As we know, there are many measures for the similarity of two sets, such as Euclidean distance and cosine similarity. The Jaccard distance is another such measure, and its basic formula is as follows:


$$Jaccard(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$$
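The formula translates directly into code. A minimal sketch, with two example sets of my own choosing:

```python
def jaccard(a, b):
    # |intersection| / |union|, straight from the formula.
    return len(a & b) / len(a | b)

c1 = {"a", "b", "c"}
c2 = {"b", "c", "d"}
print(jaccard(c1, c2))  # 2 / 4 = 0.5
```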

To fix notation: the "sets" mentioned above ($C_i$, $C_j$ in the formula) can be thought of as the columns of a matrix, where the rows represent the elements of the set (in general, anything that is to be encoded as a Boolean vector).

For example:


$$Jaccard(C_1, C_2) = 2/5 = 0.4$$
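The same computation can be done directly on Boolean column vectors. The original matrix behind the 2/5 example is not shown in the text, so the two columns below are a hypothetical pair that happens to give the same result:

```python
# Two hypothetical Boolean columns over 5 rows (elements).
c1 = [1, 1, 1, 0, 1]
c2 = [1, 0, 0, 1, 1]

both = sum(x & y for x, y in zip(c1, c2))    # rows where both columns are 1
either = sum(x | y for x, y in zip(c1, c2))  # rows where at least one is 1
print(both / either)  # 2 / 5 = 0.4
```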

Although the Jaccard distance itself is an uncomplicated concept, the cost of computing it between sets grows rapidly as the dimension of the sets (and the number of set pairs) increases. Therefore, we have to consider a question: how can we reduce the computational complexity?

Remember the purpose of the Min-Hash algorithm mentioned at the end of the previous section? Exactly: Min-Hash is essentially a dimensionality-reduced way of estimating the Jaccard similarity.
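As a preview of why this works, here is a sketch under my own assumptions (a small integer universe, random permutations as the hash functions, and 200 signature entries; none of these choices come from the article): each set is summarized by the minimum rank a random permutation assigns to its elements, and the fraction of matching signature entries estimates the Jaccard similarity.

```python
import random

def minhash_signature(s, permutations):
    # One signature entry per permutation: the smallest rank of any element.
    return [min(perm[x] for x in s) for perm in permutations]

random.seed(0)
universe = list(range(100))
n_hashes = 200
perms = []
for _ in range(n_hashes):
    order = universe[:]
    random.shuffle(order)
    perms.append({x: rank for rank, x in enumerate(order)})

a = set(range(0, 60))   # true Jaccard with b: 20 shared / 80 total = 0.25
b = set(range(40, 80))
sig_a = minhash_signature(a, perms)
sig_b = minhash_signature(b, perms)

# Fraction of positions where the two signatures agree.
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / n_hashes
print(estimate)  # close to the true Jaccard similarity of 0.25
```

The signatures have only 200 entries regardless of how large the original vectors are, which is exactly the dimensionality reduction described above.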

Ok, with that in mind, we can now start working on the Min-hash algorithm.