Text similarity algorithms (unsupervised)

1. Jaro distance

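Given two text strings $s_1$ and $s_2$, their Jaro distance $d_j$ is defined as:

$$d_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases}$$

where $m$ is the number of matching characters between the two strings, $|s_i|$ is the length of string $s_i$, and $t$ is the number of transpositions.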

Two characters, one from $s_1$ and one from $s_2$, are considered a match when they are the same and their positions differ by no more than $d = \left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$.

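For example, with $s_1$ = “DIXON” and $s_2$ = “DICKSONX”, the match distance is $d = \lfloor 8/2 \rfloor - 1 = 3$. For the character at index $i$ of one string, only positions $\max(0, i-d)$ through $\min(i+d, \text{xLen})$ of the other string are compared (if $s_1$ is the string laid out along the horizontal axis and scanned, xLen is the length of $s_1$). This gives $m = 4$ matching characters.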

Every character in $s_1$ is compared with the characters of $s_2$ that lie within distance $d$. The number of transpositions $t$ is the number of matched characters that appear in a different order in the two strings, divided by two. Here the matched characters are “DION” in $s_1$ and “DION” in $s_2$, in the same order, so $t = 0$. With $|s_1| = 5$ and $|s_2| = 8$, this gives:

$$d_j = \frac{1}{3}\left(\frac{4}{5} + \frac{4}{8} + \frac{4 - 0}{4}\right) \approx 0.767$$
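The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `jaro` and the handling of empty strings are choices made for this sketch.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro distance d_j as defined above (1.0 means identical strings)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0  # assumption of this sketch: two empty strings are identical

    # Maximum index gap at which equal characters still count as a match.
    d = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i in range(len1):
        # Window of s2 indices from max(0, i - d) to min(i + d, len2 - 1).
        for j in range(max(0, i - d), min(i + d + 1, len2)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Walk the matched characters of both strings in order and count
    # the positions where they disagree; t is half of that count.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro("DIXON", "DICKSONX"), 4))  # 0.7667
```

Each character of $s_2$ can be matched at most once, which is why the sketch tracks a `matched2` flag per position instead of only counting equal characters.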

Reference: rosettacode.org/wiki/Jaro_d…

2. SIF (PCA-like)

The first step is to multiply each word vector in the sentence by its own weight. The weight is $\frac{\alpha}{\alpha + p(w)}$, where $\alpha$ is a constant and $p(w)$ is the frequency of the word, so high-frequency words receive lower weights. Summing the weighted vectors gives a temporary sentence vector.

The second step is to compute the first principal component $u$ of the matrix formed by all sentence vectors in the corpus, and to subtract from each sentence vector its projection onto $u$ (similar to PCA). The projection of a vector $v$ onto a vector $u$ is defined as:

$$\mathrm{proj}_u(v) = \frac{u \cdot v}{\|u\|^2}\, u$$

Code implementation:
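A minimal NumPy sketch of the two steps, under stated assumptions: the function name `sif_embeddings`, the input layout (token lists plus two dictionaries), and the default `alpha=1e-3` are hypothetical choices for illustration. The sentence matrix is not mean-centered before the SVD, which is also an assumption of this sketch.

```python
import numpy as np


def sif_embeddings(sentences, word_vectors, word_freq, alpha=1e-3):
    """Weighted-sum sentence vectors with the first principal direction removed.

    sentences:    list of token lists
    word_vectors: dict mapping token -> 1-D vector
    word_freq:    dict mapping token -> relative frequency p(w)
    alpha:        smoothing constant (hypothetical default)
    """
    dim = len(next(iter(word_vectors.values())))

    # Step 1: weight each word vector by alpha / (alpha + p(w)) and sum.
    rows = []
    for tokens in sentences:
        vec = np.zeros(dim)
        for w in tokens:
            if w in word_vectors:  # unknown tokens are skipped
                weight = alpha / (alpha + word_freq.get(w, 0.0))
                vec += weight * np.asarray(word_vectors[w], dtype=float)
        rows.append(vec)
    X = np.vstack(rows)

    # Step 2: take the first right singular vector of X as u (unit length),
    # then subtract each row's projection onto u: v - (v . u) u.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]
    return X - np.outer(X @ u, u)


# Toy usage with made-up vectors and frequencies.
vectors = {"cat": [1.0, 0.0, 0.5], "dog": [0.9, 0.1, 0.4], "the": [0.2, 0.2, 0.2]}
freqs = {"cat": 0.001, "dog": 0.001, "the": 0.05}
emb = sif_embeddings([["the", "cat"], ["the", "dog"], ["cat", "dog"]], vectors, freqs)
print(emb.shape)  # (3, 3)
```

Because `u` has unit length, the projection formula reduces to $(v \cdot u)\,u$, which is what `np.outer(X @ u, u)` computes for every row at once. The resulting vectors can then be compared with cosine similarity.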
