“This is the 12th day of my participation in the First Challenge 2022. For details: First Challenge 2022.”


What problem is the KMP algorithm used to solve

The KMP algorithm was discovered by D.E.Knuth, J.H.Morris and V.R.Pratt at the same time, so the algorithm is abbreviated from the names of the three authors

The problem KMP is used to solve is: given a text consisting of N characters, a string consisting of m (m<= N) characters, find the substring from the text, if the substring exists, return the index of the first matching substring leftmost element in the text, otherwise return -1

The string match in the figure above returns 4

Violent solution to string matching

The idea of the violent solution of string matching is very simple, align the string with the first M characters of the text, and then match corresponding characters from left to right, until all m characters are matched successfully; Otherwise, move the text one bit to the right, start at the beginning, continue matching, and so on until the algorithm ends

The procedure for violence matching is as follows

public class SubStr { public static int getIndexOf(String str1, String str2) { int n = str1.length(); int m = str2.length(); for (int i = 0; i <= n - m; i++) { int j = 0; while (j < m && str1.charAt(j + i) == str2.charAt(j)) { j++; if (j == m) return i; } } return -1; } public static void main(String[] args) { String str1 = "abcab124agf"; String str2 = "124agf"; System.out.println(SubStr.getIndexOf(str1, str2)); System.out.println(str1.indexOf(str2)); }}Copy the code

The time complexity of violent matching algorithm is O(Mn)O(mn)O(mn)

Problems with violence algorithms

The brute force algorithm, while elegant and simple to write, is too time intensive because the text only advances one character at a time. Consider the following situation:

In this case, A and T do not match, and the idea of A violent match should be the following transformation

But in fact, we already know that part of the text is already “ABCABC”, and part of the string is also “ABCABC”, so the best transformation strategy for the next matching should be as follows:

In this way, we can use the known matching information to expand the text’s moving steps, rather than moving character by character, which is the essence of the KMP algorithm

Next, KMP algorithm is introduced

KMP algorithm

The maximum common length of a prefix and suffix

Before we learn KMP, we need to understand a concept: longest prefix and suffix. Let’s go straight to the concept of an example, given the following string

Note that prefixes do not contain the last character and suffixes do not contain the first character (for reasons explained below).

For A with subscript 2:

  • The prefixes are: A,AB

(because B is the last character of A with subscript 2)

  • Suffixes: B,AB(Because coordinate 0 A is the first character)

So for A with subscript 2, the longest common length of prefix and suffix is 0, because A and B are not equal

For B with subscript 3:

  • The prefixes are: A, AB,ABA

(because A is the last character of B with subscript 3)

  • Suffixes are: A, BA,ABA(Because coordinate 0 A is the first character)

So for B with subscript 3, the longest common length of prefix and suffix is 1, because prefix A and suffix A are equal

Here, we explain why prefixes do not contain the last character and suffixes do not contain the first character. For the purpose of our current operation is to get a character all the length of the longest prefix and postfix public string If the prefix can contain the last character, suffix can contain the first character, then obviously the longest prefix and suffix public the length of the values listed in the table below for the current character, it is totally meaningless!!!!!

Let’s do one last example for B with subscript 4:

  • The prefixes are A, AB, ABA,ABAB

(because B is the last character of B with subscript 4)

  • Suffixes: B, AB, A, BAB,ABAB(Because coordinate 0 A is the first character)

So for B with subscript 4, the longest common length of prefix and suffix is 2, because the prefix AB is equal to the suffix AB, which is of length 2

In particular, we specify that the maximum common length of the prefix and suffix for a character with subscript 0 is -1

Therefore, we can get an array of given strings

This array represents the maximum common length of the prefix and suffix of the corresponding character. We call this array nexts. For example, C with subscript 5 has a maximum common length of 0 for the prefix and suffix

Now that we know what nexts is, let’s assume that we have an algorithm that can find a nexts array for a string. (The algorithm will be explained later, but we will use nexts arrays first.) How does nexts array speed up string matching

nextsUse of arrays in KMP

It is divided into the following three cases

  1. When text characters and substring characters correspond to equal

You just need to put a pointer to the textiAnd a pointer to the stringjIncrement each of them by 1, namelyi++.j++If the first character of the string does not match the text, move the pointer to the right by one character. The pointer to the string remains at 0, i.ei++ 3. If the above two conditions are not met, that is, a character in the substring fails to match the corresponding position character of the text, and the character is not the first character in the substring

In this case, j=nexts[j],nexts is the longest common length array of prefix and suffix of “ABABBC” obtained above. Transform the pointer to the substring to the value of the nexts array that the pointer points to

The value of the nexts array for the character B that j points to is 2, so we point j to the subscript 2 of the substring and continue the backward comparison.

As shown in the figure below

Let’s explain why KMP can skip certain characters and speed up the search as shown in the figure below, assuming that all characters before X and Y are matched successfully, and that the longest prefix and longest suffix of Y represent the range as an ellipse

According to the KMP algorithm, the substring will then change the current pointer according to the value of Y’s nexts, as shown in the figure below

The e question is why the position between I and k does not need to be used as a matching position, but is skipped directly? Here we use contradiction to prove that if the position p between I and k can be matched, the following situation will occur

In fact, the value of nexts corresponding to character Y is not as long as the “theoretical longest prefix” deduced by us (it can be seen only from the image width), so it is contradictory. Therefore, the rationality of j=nexts[j] is proved

At this point, the core of the KMP algorithm has been introduced, understand the above content, the code is very easy

/** 1. If str2 is in str1, if str2 is in str1, if str2 is in str1, if str2 is in str1, if str2 is in str1, then return str2. Public class KMP {public static int getIndexOf(String str1, String str2) { if (str1 == null || str2 == null || str2.length() < 1 || str1.length() < str2.length()) { return -1; } int i = 0; int j = 0; Int [] nexts = getNext(str2); int[] nexts = getNext(str2); while (i < str1.length() && j < str2.length()) { if (str1.charAt(i) == str2.charAt(j)) { i++; j++; } else if (j == 0) { i++; } else { j = nexts[j]; } } return j == str2.length() ? i - j : -1; }}Copy the code

nextsArray

We specify that the first two values of the Nexts array are -1 and 0, so we continue to populate the Nexts array starting at the subscript 2

Let’s say the value of i-1 is cn, and now let’s find the value of i-1 and from that we know the longest prefix and the longest suffix, as indicated in the figure above. We just need to determine if the value of cn is the same as the value of i-1

  1. If equal
if (str.charAt(i - 1) == cn) {
   nexts[i++] = ++cn;
}
Copy the code
  1. If it’s not equal

cnThe character of the position is not equal toi-1Position of the character, then willcnMove to thenexts[cn]Indicated position

Then compare, nexts[i++] = ++cn if equal, otherwise continue to transform cn until cn can no longer move forward, to the third case

  1. Other situations

The STR (cn)! If STR [i-1] &&cn <= 0, nexts[I ++]=0

The complete code to get the Nexts array is as follows:

    public static int[] getNext(String str2) {
        int len = str2.length();
        if (len == 1) {
            return new int[]{-1};
        }

        int[] nexts = new int[len];
        nexts[0] = -1;
        nexts[1] = 0;

        int cn = 0;
        for (int i = 2; i < len; i++) {
            if (str2.charAt(i - 1) == cn) {
                nexts[i++] = ++cn;
            } else if (cn > 0) {
                cn = nexts[cn];
            } else {
                nexts[i++] = 0;
            }
        }
        return nexts;
    }
Copy the code

KMP algorithm code

/** * if str2 is in str1, return the index of the starting character index in str1. Public class KMP {public static int getIndexOf(String str1, String str2) { if (str1 == null || str2 == null || str2.length() < 1 || str1.length() < str2.length()) { return -1; } int i = 0; int j = 0; int[] nexts = getNext(str2); while (i < str1.length() && j < str2.length()) { if (str1.charAt(i) == str2.charAt(j)) { i++; j++; } else if (j == 0) { i++; } else { j = nexts[j]; } } return j == str2.length() ? i - j : -1; } public static int[] getNext(String str2) { int len = str2.length(); if (len == 1) { return new int[]{-1}; } int[] nexts = new int[len]; nexts[0] = -1; nexts[1] = 0; int cn = 0; for (int i = 2; i < len; i++) { if (str2.charAt(i - 1) == cn) { nexts[i++] = ++cn; } else if (cn > 0) { cn = nexts[cn]; } else { nexts[i++] = 0; } } return nexts; } public static void main(String[] args) { String str1 = "abcab124agf"; String str2 = "abcab124agf"; System.out.println(getIndexOf(str1, str2)); System.out.println(str1.indexOf(str2)); }}Copy the code