KMP algorithm learning (details)

KMP algorithm, also known as “watching porn” algorithm, is a highly efficient string matching algorithm. But because it was so hard to understand, it remained a mystery for a long time. While there is a lot of material available online, few good blogs can state it clearly and simply. Here, a few good blogs on the Internet (see the last), try their best to strive for the idea and implementation of KMP algorithm clearly.

The task completed by KMP algorithm is: given two strings O and f, length n and m respectively, judge whether F appears in O, and return the position of occurrence if it does. The conventional approach is to traverse every position of A and then match b from there, but this approach is O(nm) complexity. KMP algorithm reduces the complexity of matching to O(n+m) by an O(m) preprocessing.

KMP algorithm thought

We first describe the idea of the KMP algorithm with a graph. Looking for f in string O, we need to move the string f forward when the two strings are not equal at position I. The conventional approach is to move forward one bit at a time, but it doesn’t take into account the fact that the previous I-1 bits have already been compared, so it’s not very efficient. In fact, if we calculate some information ahead of time, it is possible to move forward a bit at a time. Let’s say we know from the information we have that we can move k bits forward, let’s analyze what happens to f before and after the shift. We can draw the following conclusion:

The A string is A prefix to f.
The b-string is a suffix for f.
A string is equal to B string.

Therefore, after moving forward k bits, you can continue to compare position I on the premise that the first i-1 positions of f meet the following requirements: Prefix A with i-K-1 length is the same as suffix B. Only then can we move k places forward and continue the comparison from the new position.

Therefore, the core of KMP algorithm is to calculate the maximum length of the common part of the prefix and suffix of the string before each position of the string F (excluding the string itself, otherwise the maximum length is always the string itself). Once you have the maximum common length for each position of f, you can use that maximum common length to quickly compare with the string O. When we compare two strings with different characters each time, we can move the string f forward (matched length – maximum common length) bit based on the maximum common length, and then continue to compare the next position. In fact, the string f moves forward only conceptually, as long as we compare f and O after the maximum common length when comparing.

Next array calculation

Having understood the basic principles of the KMP algorithm, the next step is to obtain the maximum common length for each position of the string f. This maximum common length is referred to as the next array in the introduction to algorithms. One thing to notice here is that the next array represents length, starting at 1; But when iterating through the original string, the subscripts still start at 0. Suppose we now have next[1], next[2],… Next [I], representing the maximum common length of prefix and suffix for strings of length 1 through I, is now required for next[I +1]. Next [I +1] is equal to next[I] plus 1 if the characters at position I and position next[I] are the same (subscripts start at zero). If the characters at two positions are different, we can split the string of length next[I] to get its maximum common length next[next[I]], and then compare it with the characters at position I. This is because the length next[I] prefix and suffix can be split into upper constructs. If position next[next[I]] and position I have the same characters, then next[I +1] is equal to next[next[I]] plus 1. If not, you can continue splitting the string of length next[next[I]] until the string length is 0. From this we can write the code to evaluate the next array (Java version) :

1 public int[] getNext(String b) 2 { 3 int len=b.length(); 4 int j=0; 5 6 int next[]=new int[len+1]; 7 next[0]=next[1]=0; 8 9 for(int i=1; i<len; I ++)// I indicates the subscript of the string, starting at 0 10 {//j indicates the value of next[I] at the beginning of each loop, as well as the next position to compare 11 while(j>0 &&b.charat (I)! =b.charAt(j))j=next[j]; 12 if(b.charAt(i)==b.charAt(j))j++; 13 next[i+1]=j; 14 } 15 16 return next; 17}Copy the code

The important thing to note in this code is that the next array we took represents the maximum common length of the f prefix for strings of length 1 through m, so we need to allocate one more space. While iterating through the string f, we still start at subscript 0 (positions 0 and 1 are next values 0, so we put them outside the loop) and end at m-1. The structure of the code is consistent with the above explanation, which uses the previous next value to calculate the next value.

String matching

After evaluating the next array, we can use the next array to find the occurrence of string f in string O. The matching code is very similar to the code for evaluating the next array, because the matching process is the same as the next array. Assuming that the first I position of the string f matches the string O starting at some position, now compare the I +1 position. If the I +1 position is the same, then compare the I +2 position; If the I +1 position is different, a mismatch occurs, and we still split the string of length I to get its maximum common length, next[I], and continue comparing the two strings from next[I]. This process is the same as finding the next array, so it matches the following code (Java version) :

1 public void search(String original, String find, int next[]) { 2 int j = 0; 3 for (int i = 0; i < original.length(); i++) { 4 while (j > 0 && original.charAt(i) ! = find.charAt(j)) 5 j = next[j]; 6 if (original.charAt(i) == find.charAt(j)) 7 j++; 8 if (j == find.length()) { 9 System.out.println("find at position " + (i - j)); 10 System.out.println(original.subSequence(i - j + 1, i + 1)); 11 j = next[j]; 12} 13} 14}Copy the code

One thing to note about the code above is that we reassign j every time we get a match.

The complexity of the

The complexity of KMP algorithm is O(n+m), which can be solved by means of amortization analysis. For details, please refer to introduction to the algorithm.

The resources

1. KMP algorithm summary

2. Detailed explanation of KMP algorithm

3. The KMP algorithm

4. Understanding and implementation of KMP algorithm

Open source implementation

If you want to actually use this algorithm, I will give you an example: Java Notepad

PS:

And finally, I’ll give you a couple more pictures, just to make sense of it.

Koch curve

Its structure repeats itself

KMP algorithm thought

Next array calculation

String matching

The complexity of the

The resources

Open source implementation

PS:

Related Posts

System Architecture Design Notes (33) – Defining system problems and resolution models

The input box has a background color problem when filling automatically

What happens when you update your DNS