Problem Description:

Given two strings S= “s1, s2, s3… Sn “and T=” T1 t2 t3… Tm “, the process of finding substring T in the main string S is called pattern matching, and T is called pattern. If the match is successful, the position of T in S is returned. Pattern matching is the most frequently used operation in text processing, anti-virus software, operating systems, compilation systems, database systems, and search engines.

Conventional thinking

BF algorithm, namely brute force matching and brute force solving, is used to compare the first character of the main string S with the first character of the mode T. If they are equal, the subsequent characters of the two will be compared. Otherwise, the comparison starts with the next character of primary string S and the first character of pattern T. Repeat the process until all characters in S or T are compared. If all characters in T are compared, the match is successful and the start of the match is returned. Otherwise, the match fails. The time complexity of this algorithm is O(n*m).

KMP algorithm

thought

Before describing the KMP algorithm, pay homage to D.E.Knuth, J.H.Morris, and V.R.Pratt! This algorithm improves BF algorithm, and the starting point of improvement is that the main string is not backtracked.

As shown in the figure below, at this time S[0 3]=T[0 3], but S[4]≠T[4]. According to the idea of BF algorithm, the next step will start from the next character of the main string S, that is, S[1] will be compared with T[0]. However, we find that S[1]=T[1] and T[1]≠T[0], so S[1]≠T[0]. There is no need to compare T[0] with S[1] and S[2], and since S[3]=T[0], we can start from S[4] and T[1], that is, the main string subscript I is not backtracked, and the mode T subscript j is backtracked to 1 for comparison. In the case of large data volume, this measure will save a lot of time.

Derivation of pattern string formula

When we find S[I]≠T[j] in pattern matching, our next operation is to find a K, so that S[I] and T[k] can continue to compare, that is, the main string does not backtrace, pattern string backtrace to find the correct pattern matching. Observe the characteristics when the parts match correctly, as shown in the figure below.

For pattern matching until partial matching is correct, we have the following figure.

To sum up, we have an expression only for the pattern string T, i.e

T[0]… T[k-1]=T[j-k]… T[j-1]

The specific meaning of this formula is whether there is a k value, so that the prefix and suffix can be exactly the same before the j subscript in the pattern string T, and the length of the prefix and suffix are k.

When we find S[I]≠T[j] in pattern matching, we want to find a value of k and compare S[I] with T[k], so that pattern matching can be better (save time and avoid useless work). As for the search of k value, after analyzing the pattern matching situation, we found that k value is only related to the pattern string T and the subscript j value, and has nothing to do with the main string. To sum up, we can use the next array to store the corresponding K value with the subscript J. Let’s look at the value formula of the next array first.

Let’s look at the various cases of formulas,

Next [0]=-1; next[0]=-1; next[0]=-1;

2. When j=1, there is no k satisfying 1≤k<j, or it is easy to know according to the explanation above the formula, next[1]=0;

3. If j≠0, 1, we will search for T[0]… T[k-1]=T[j-k]… T[j-1] Max k value and store, if not present set next[j] = 0.

After solving the next array, pattern matching becomes easy. We set the variable I = j = 0, where I represents the subscript of the main string S, namely S[I], and j represents the subscript of the pattern string T, namely T[j], and then start our algorithm flow:

Compare whether S[I] is equal to T[j], if so, then the variables I and j are self-incremented (I ++, J ++); If not, set j = next[j]. Repeat the process until I or j reaches the end of the string. If I reaches the end of the main string S, no match for pattern T can be found. If j traverses to the tail of pattern T, the pattern is successfully matched and the subscript is (i-j), which means that the matching string of pattern T is found from the position where the subscript of main string S is (i-j).

Int Kmp(string mainString,string pattern){int I = 0; int j = 0; while (i < mainString.length()){ if (j==-1 || mainString[i]==pattern[j]){ i++; j++; if (j == pattern.length()) return (i-j); } else { j = next[j]; } } return -1; // cannot find a match for pattern T}Copy the code

Next Array creation process (core)

If we code according to the formula, we get the following code:

int* next_solve(string pattern){
    int length = pattern.length();
    int *next = new int[length];
    next[0] = -1;
    for (int j=1; j<length; j++){
        int max = 0;
        for (int k=1; k<j; j++){
            string str1 = pattern.substr(k,0);
            string str2 = pattern.substr(k,j-k);
            if (str1 == str2)
                max = k;
        }
        next[j] = max;
    }

    return next;
}
Copy the code

However, we find that the method of solving the next array seems to be inefficient (time complexity is O(m^2)). When the pattern string T is long enough, pattern matching will take a lot of time to solve the next array. Is there a better way to solve the next array? Let’s take a look at some really efficient but obscure code:

int* next_solve(string pattern){
    int length = pattern.length();
    int *next = new int[length];
    next[0] = -1;

    int j = 0;
    int k = -1;
    while (j < length-1){
        if (k==-1 || pattern[k]==pattern[j]){
            next[++j] = ++k;
        } else {
            k = next[k];    // ?
        }
    }

    return next;
}
Copy the code

Regardless of the specific meaning and validity of this code, let’s first look at the time complexity, O(m)! If this code works correctly, the time benefit is huge. Let’s take a look at what this code means layer by layer.

The first is the processing of special cases, namely, next[0]=-1 and next[1]=0, which will not be repeated here in order to meet the requirements of the formula.

The main idea of the algorithm:

Next [j+1] is solved by using the known next[j], k and pattern string T.

If we know next[j], we have T[0]… T[k-1]=T[j-k]… T[J-1], at this time, we compare T[k] with T[j], and we will divide into the following two cases.

1, T = T [k] [j].

As shown in the figure, when T[k]=T[j], we use the previous T[0]… T[k-1]=T[j-k]… T[j-1] can be obtained, T[0]… T[k-1]+T[k]=T[j-k]… T[j-1]+T[j], namely T[0]… T[k]=T[j-k]… T[j], so next[j+1]=k+1, we achieve the purpose of solving next[j+1] by using the known next[j], k and pattern string T.

2 [j], T [k] indicates T

The code for this explanation is “K = next[k]”, which is short and concise, and the most anti-human of all! But don’t worry, we’re going to use diagrams and explanations to get you through what “k = next[k]” really means. Let’s start with “k = next[k]”.

In the process of pattern matching, we find that T[k]≠T[j], at this point we let k = next[k] = 2(readers can simply think for themselves),

At this point, the comparison between T[k] and T[j] is continued, and it is found that T[k]=T[j], and the previous T[0]… T[k-1]=T[j-k]… T get T [0] [1]… T[k]=T[j-k]… T[j], T[j+1] = k+1 = 3.

But how can a simple statement like “k = next[k]” achieve such a powerful function?

Let’s use a diagram to illustrate this process.

1. Green represents equal elements, i.e. T[0]… T[k-1]=T[j-k]… T [j – 1);

2, pink represents in the green element, there exists next[k] elements are equal to each other, let k’ = next[k], satisfy T[0]… T[k’-1]=T[k-k’]… T [k – 1], T [j] – k… T[j-k+k’-1]=T[j-k’]… T[j-1] is derived from these three formulas

T[0]… T[k’-1]=T[j-k’]… T[j-1]

T[j] =k’+1=next[k]+1; T[j] =next[k]+1; If not, repeat procedure 3 by making k “= next[k’].

At this point, the next array is explained.

LeetCode example exercises

Attach a code

class Solution { public: int strStr(string haystack,string needle){ if (needle.length() == 0) return 0; int *next = next_solve(needle); int i = 0,j = 0; while (i < haystack.length()){ if (j==-1 || haystack[i]==needle[j]){ i++; j++; if (j == needle.length()) return (i-j); } else { j = next[j]; } } return -1; } int* next_solve(string pattern){ int length = pattern.length(); int *next = new int[length]; next[0] = -1; int j = 0; int k = -1; while (j < length-1){ if (k==-1 || pattern[k]==pattern[j]){ next[++j] = ++k; } else { k = next[k]; } } return next; }};Copy the code

Running results: