preface

At present, there are many different schemes to calculate sentence similarity, such as the method based on semantic dictionary, the method based on the same word, the method based on statistics and the method based on edit distance. This article begins with the basics of edit distance.

Edit distance

The edit distance is the minimum number of editing operations needed to convert one string to another. Includes insert characters, replace characters, and delete characters. The smaller the editing distance, the greater the similarity.

For example, if we want to convert what to WHERE, we might change a -> e, then t -> r, to wher, and finally add e, and we’re done. Because each step can be inserted, deleted or replaced, how to get a minimum cost in the end, this is to use dynamic programming to solve.

Suppose we have two strings of length I and j, and set the edit distance to Edit (I,j). Next, if their final characters are equal, the edit distance is actually equal to Edit (i-1, J-1). If the final character is not equal, then we can make it equal by insert or replace, but the cost varies from operation to operation. If the insert is edit(I, J-1)+1 or eidit(i-1,j)+1, and if the replace is Edit (i-1, J-1)+1.

To solve the

It can be expressed by the following dynamic equation:

Take the previous example, from what to WHERE, assuming that the two characters are null, the corresponding cost is as follows:

Edit distance empty w h a t
empty 0 1 2 3 4
w 1
h 2
e 3
r 4
e 5

After dynamic programming:

Edit distance empty w h a t
empty 0 1 2 3 4
w 1 0 1 2 3
h 2 1 0 1 2
e 3 2 1 1 2
r 4 3 2 2 2
e 5 4 3 3 3

In the whole dynamic programming process, the minimum cost is always selected, so the final 3 is the minimum cost of the two strings, that is, the editing distance.

github

https://github.com/sea-boat/TextAnalyzer/blob/master/src/main/java/com/seaboat/text/analyzer/distance/CharEditDistance.j ava

implementation

public class EditDistance {

	public static int getEditDistance(String s, String t) {
		int d[][];
		int n;
		int m;
		int i;
		int j;
		char s_i;
		char t_j;
		int cost;

		n = s.length();
		m = t.length();
		if (n == 0) {
			return m;
		}
		if (m == 0) {
			return n;
		}
		d = new int[n + 1][m + 1];
		for (i = 0; i <= n; i++) {
			d[i][0] = i;
		}
		for (j = 0; j <= m; j++) {
			d[0][j] = j;
		}
		for (i = 1; i <= n; i++) {
			s_i = s.charAt(i - 1);
			for(j = 1; j <= m; j++) { t_j = t.charAt(j - 1); cost = (s_i == t_j) ? 0:1; d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1); d[i][j] = Math.min(d[i][j], d[i - 1][j - 1] + cost); }}return d[n][m];
	}

	public static void main(String[] args) {
		EditDistance ed = new EditDistance();
		System.out.println(ed.getEditDistance("what"."where")); }}Copy the code

————- Recommended reading ————

My 2017 article summary – Machine learning

My 2017 article summary – Java and Middleware

My 2017 article summary – Deep learning

My 2017 article summary — JDK source code article

My 2017 article summary – Natural Language Processing

My 2017 Article Round-up — Java Concurrent Article


Talk to me, ask me questions:

The public menu has been divided into “reading summary”, “distributed”, “machine learning”, “deep learning”, “NLP”, “Java depth”, “Java concurrent core”, “JDK source”, “Tomcat kernel” and so on, there may be a suitable for your appetite.

Why to write “Analysis of Tomcat Kernel Design”

Welcome to: