【 Data structure and algorithm 】KMP algorithm in detail

KMP algorithm

1. The introduction of

The most famous application of KMP algorithm is the substring problem.

Topic:

Now we have str1=” abCD1234EFg “and str2=”1234”. How can we determine if str2 is a substring of str1?

Note that the substring must be consecutive, for example “1234F” is not a substring of “abCD1234EFg”, but “1234E” is.

Analysis:

In this case, if the violent solution is used, the attempt method is to try str1 from left to right to see if each character can match str2.

Violent solution In an extreme case, the time complexity can be very high. For example, str1=”1111111112″, str2=”11112″. To use the brute force method, str1 is matched five times from the first character 1 and found unmatched five times from the second character 1, and so on until the last section of str1 is matched. If STR1 is long N and STR2 is long M, then the time complexity is O(NM), which means that every step of STR1 needs to traverse the whole str2.

If the KMP algorithm is used to solve the problem, the return type is not Boolean, but the subscript of the first character of the substring in STR1, or -1 if not included.

The core idea of KMP and the violent solution is the same, they both try str2 from left to right for each character in STR1, but KMP is accelerated.

Violent solution:

public static boolean findSubStr(String str1, String str2) {
    if (str1 == null || str2 == null || str1.length < str2.length) {
        return false;
    }

    return process(str1.toCharArray(), str2.toCharArray());
}

public static boolean process(char[] c1, char[] c2) {
    The // flag defaults to false, so that str1 and str2 will return false if they have no intersecting characters
    boolean flag = false;

    // walk through str1
    for (int i = 0; i < c1.length; i ++) {
        // Try to match every character in str1 to str2
        if (c1[i] == c2[0]) {
            // As long as one character is matched, we expect the match to be successful
            flag = true;
            Str1 and str2 are matched character by character
            int k = i + 1;
            for (int j = 1; j < c2.length; j ++, k ++) {
                // If there is only one character different during the matching process, the matching fails
                if(c1[k] ! = c2[j]) { flag =false;
                    break; }}}}return flag;
}
Copy the code

2. Maximum matching length

Before using KMP algorithm to solve the substring problem, we first need to understand the concept of the longest prefix and suffix matching length.

The matching length of the longest prefix and suffix is the information that each character in the substring (STR2) needs to carry, independent of the corresponding character, but related to all the characters preceding the corresponding character.

For example, if we have a string “abbabbk”, we want to know the matching length of the longest prefix followed by the suffix of the k character.

First we need to get the string “abbabb” before the k character. Then we find the maximum matching length of the prefix and suffix of the string “abbabb” is 3. The k character carries 3 information.

Neither prefix nor suffix can be taken to the whole string.

If the character is preceded by no string, the information carried by the character is -1.

The character at position 0 of the string is artificially assigned to carry -1 and the character at position 1 carries 0.

3. Speed up the process

The key to using KMP to speed up this problem is that each character in STR2 carries the maximum matching length of prefix and suffix.

Suppose str1 = “… oabbcefabbckbc……” , str2 = “abbcefabbcg…” . The first character o displayed in STR1 is bit i-1, assuming that the current str1 matches STR2 from bit i-1 (the previous matching process is ignored).

Use the classic method:

Characters after the i-th bit of STR1 are matched with characters of STR2 in sequence. When str1 matches the i-th bit of STR1, it is found that the characters of str2 cannot match the i-th bit of STR1, and the matching operation from the i-th bit fails. Thus, STR1 abandons the possibility that the i-th bit will match the substring successfully, and the matching pointer jumps from the i-th + 10 bit of STR1 to the i-th + 1 bit. Characters after the i-th + 1 bit of STR1 are then matched with characters at the 0th bit of STR2 (str2 slides one bit to the right). And so on.

Acceleration using KMP algorithm:

[-1, 0, 0, 0, 0, 1, 2, 3, 4……] .

According to the maximum matching length of prefixes and suffixes, the maximum matching length of the first 11 bits is 4 corresponding to the 10th character. Therefore, it can be determined that the maximum prefix and suffix before the 10th character G of STR2 are “ABbc”.

Str1’s matching pointer is left untouched, and character matching continues with str2’s fourth bit (the last bit of the largest prefix).

4. Analyze the

To understand what KMP’s second action really means? You can see this by looking at the comparison object twice.

Just like the classical method, str1 does not attempt the possibility that the I + 1 bit can be matched successfully after the initial matching operation of str1 fails. However, STR1 does not attempt the possibility that the I + 2 bit can be matched successfully. Instead, it directly attempts the probability that the I + 6 bits will match (equivalent to str2 sliding right (the subscript of the first digit in the largest suffix – the subscript of the first digit in the largest prefix) bits), meaning that STR1 will match directly from str2’s largest suffix in STR1.

But the I + 6 bits of STR1 and the 0 bits of STR2 are confirmed to be overlapped in the last round of matching, both of which have the maximum prefix/suffix of “ABBC”. In order to accelerate better (constant acceleration), the actual comparison can directly skip the coincidence part and start the comparison.

So instead of trying to match from the I + 6 position, STR1 matches directly from the I + 11 position, because the I + 6 to the I + 9 position of STR1 coincides with the 0 to 3 position of STR2. Only after bit I + 10 of STR1 can a match fail.

So now we need to prove a point: why does a match from the I + 1 bit of STR1 to the I + 5 bit of str1 always fail?

5. Proof

Why does a match from the I + 1 bit of STR1 to the I + 5 bit of str1 always fail?

Array [-1, 0, 0, 0, 0, 1, 2, 3, 4……] The relevant.

Abstract the above proof problem: the proof that any k between the i-th and j-th bits of STR1 [I +1, J-1] starts to match is bound to fail.

The proof assumes that the I to (x-1) bits of STR1 are exactly the same as the 0 to (y-1) bits of str2.

Using proof by contradiction, we assume that the KTH bit of str1 starts matching, and the result is a successful match.

We know that no matter what bit str1 matches from, it matches from the 0th bit of str2, and if the match is successful, it means that the segment from the KTH bit of STR1 to the X bit of str1 must be exactly the same as the segment from the 0th bit of str2 to the x-k bit of str2.

So the period from the KTH bit of str1 to the x-1 bit must be exactly the same as the period from the 0th bit of str2 to the x-k-1 bit.

And since only the x-position of STR1 and the y-position of STR2 were different in the last round of matching, the k-position to (x-1) position of STR1 must be exactly the same as the (y-x + K) position to (y-1) position of STR2.

By equivalence swap, the segment from the 0th bit of STR2 to the (x-k-1) bit must be exactly the same as the segment from the (y-x + K) bit of str2 to the (y-1) bit.

Since the KTH bit of str1 is less than the JTH bit, the distance from the KTH to the X bit of str1 must be larger than the distance from the JTH to the X bit.

So the corresponding segment from the 0th bit of str2 to the (x-k-1) bit must be longer than the longest prefix of the original y-th bit.

Because the distance from the 0th bit of str2 to the x-k-1 bit is the same as the distance from the y-x + k bit of str2 to the y-1 bit.

The corresponding (y-x + K) bit to (y-1) bit of str2 must therefore be longer than the longest suffix of the original y-bit.

And since the period from the 0 digit of str2 to the (x-k-1) digit is also the prefix of the Y digit, and the period from the (y-x + k) digit of str2 to the (y-1) digit is also the suffix of the Y digit.

So the Y bit of the original STR2 has a prefix and suffix longer than the longest prefix and suffix. Since the longest prefix and suffix have been given, they contradict each other and the hypothesis is not valid.

6. Complete the process

In fact, the entire KMP process is str2 moving all the way to the right.

We put aside the process of KMP constant optimization and carefully analyze the essential process of KMP acceleration.

Why is KMP fast? Take str1 a~ E for example, using the classical method requires 17 comparisons, while using KMP only requires 4 comparisons. If constant acceleration of KMP is added, the comparison will be faster in 4 comparisons.

Among them, ①, ②, ③ and ④ are the comparative positions of STR1 and STR2 at the beginning of the four matches, and actually represent the complete acceleration process of KMP to a~ E in STR1.

When the matching pointer points to position ① of str1, the matching fails until e and w do not match.

Then find the longest prefix and suffix of w in STR2: abbsabb. The matching pointer points to position ② and starts matching until it is found that e and T do not match, so position ② fails to match.

Then find the longest prefix and suffix of t in STR2: ABB. The matching pointer points to position ③ until it is found that e and S do not match, so position ③ fails to match.

Then find the longest prefix and suffix of s in STR2: none. The matching pointer points to position ④ and starts to match. It is found that e and A do not match. Therefore, position ④ fails to be matched.

In STR1, a~ E have all been matched with STR2, and str2 cannot be matched at any position. Therefore, the matching pointer points to the last digit of E to start matching, and the operation of position ① continues in a cycle until the last digit of STR1 ends.

7. To achieve

The implementation process includes constant acceleration.

// The return value is the starting index of the substring in str1
public static int findSubStr(String str1, String str2) {
    if (str1 == null || str2 == null || str1.length() < str2.length()) {
        return -1;
    }

    return process(str1.toCharArray(), str2.toCharArray());
}

public static int process(char[] str1, char[] str2) {
    // i1 is a pointer to str1, i2 is a pointer to str2
    int i1 = 0;
    int i2 = 0;

    // Get the longest prefix and suffix for all str2 characters
    int[] next = getMaximalPrefixAndSuffix(str2);

    // The matching process is terminated whenever i1 is out of bounds or i2 is out of bounds (i1 and i2 may also be out of bounds)
    while (i1 < str1.length && i2 < str2.length) {
        // There are only three cases of KMP acceleration
        if (str1[i1] == str2[i2]) {
            Str1 and str2 continue matching in parallel
            i1 ++;
            i2 ++; // i2 will move forward only if the matching characters are equal
        } else if (i2 == 0) { // next[i2] == -1
            // The position of the string is not the same, but the matching pointer of str2 is at the position of 0, indicating that i2 jumps to the position of 0, which means that there is no position of str1 before str2 can be matched successfully
            // The str1 match pointer moves back one bit to start the next match with str2
            i1 ++;
        } else {
            // The corresponding position is different, and the matching pointer of str2 is not at 0, so i2 needs to jump to the next bit of the longest prefix for the next comparison
            // This process is the KMP constant acceleration operationi2 = next[i2]; }}// If i2 is out of bounds, then str2 is matched successfully in str1, and the first neutron string of STR1 is returned
    // If i1 is out of bounds, then none of str1 can match str2 successfully and -1 is returned
    return i2 == str2.length ? i1 - i2 : -1;
}

// The essence of counting the maximum prefix is to determine the maximum prefix for each bit of STR2. The estimated maximum prefix will shrink each round if no match is successful until it shrinks to 0, indicating that there is no maximum prefix at this position
public static int[] getMaximalPrefixAndSuffix(char[] str2) {
    // If str2 has only one character, it must be next[0]
    if (str2.length == 1) {
        return new int[] {-1};
    }

    // If there is more than one character, assign next[0] and next[1] manually
    int[] next = new int[str2.length];
    next[0] = -1;
    next[1] = 0;

    // str2 cursor, starting from str2[2] and filling in next
    int i = 2;

    // prefix is the subscript of the last digit of the longest c[I] prefix currently most likely
    int prefix = 0;

    // Next is fully filled when I is out of bounds
    while (i < str2.length) {
        if (str2[i - 1] == str2[prefix]) {
            Str2 [I] = str2[I] = str2[I] = str2[I]
            // The i-th bit is successfully matched
            next[i ++] = ++ prefix;
        } else if (prefix > 0) {
            // If the first digit of str2[I] is not the same as the last digit of str2[I], the longest prefix must be shrunk and prefix must jump forward
            // prefix needs to jump to the last digit of the longest c[prefix] prefix
            // The current I bit fails, and the next round continues to match the I bit
            prefix = next[prefix];
        } else {
            Str2 [I] has no longest prefix and is set to 0
            // The i-th bit is successfully matched
            next[i ++] = 0; }}return next;
}
Copy the code

Complexity 8.

First demonstrate the complexity of the while loop in the process method:

while (i1 < c1.length && i2 < c2.length) {
    if (c1[i1] == c2[i2]) {
        i1 ++;
        i2 ++; 
    } else if (i2 == 0) { 
        i1 ++;
    } else{ i2 = next[i2]; }}Copy the code

By looking at the code, we can see that i1 and i2 increase in the first branch; I1 increases in the second branch; In the third branch i2 goes down.

To estimate the complexity of while, we need to assume two quantities, the first of which is i1 and the second is i1-i2.

Assuming that str1 has length N, then i1 and i1-i2 are both at maximum N.

We’re going to look at the effects of each of the three branches of the loop on these two quantities.

The first branch of the loop, i1 and i2 increase together. So i1 goes up, i1 minus i2 stays the same.

The second branch of the loop, i1 increases. So i1 goes up, i1 minus i2 goes up.

The third branch of the cycle, i2 decreases. So i1 stays the same, i1 minus i2 goes up.

You can only go one branch at a time, so if you add the ranges of these two things together, the maximum range is 2N (go the second branch, and both are N).

None of the three branches reduces either quantity, so the number of times the loop occurs is tied to the superposition of the range of changes between the two quantities, which is the number of times the while loop, so the total while loop cannot exceed 2N.

Therefore, it can be proved that the time complexity of while is linear and is O(N).

Then prove getMaximalPrefixAndSuffix method of complexity:

public static int[] getMaximalPrefixAndSuffix(char[] str2) {
    if (str2.length == 1) {
        return new int[] {-1};
    }

    int[] next = new int[str2.length];
    next[0] = -1;
    next[1] = 0;

    int i = 2;
    int prefix = 0;

    while (i < str2.length) {
        if (str2[i - 1] == str2[prefix]) {
            next[i ++] = ++ prefix;
        } else if (prefix > 0) {
            prefix = next[prefix];
        } else {
            next[i ++] = 0; }}return next;
}
Copy the code

The complexity of the estimated getMaximalPrefixAndSuffix, we also need to assume two quantities, the first is I, second quantity is the I – prefix.

Assuming that str2 is of length M, the maximum value of I is M, and the maximum value of i-prefix is also M.

The first branch of the loop, I and prefix are added together. So I increases and i-prefix stays the same.

The second branch of the loop, prefix, is reduced. So I stays the same and i-prefix increases.

The third branch of the cycle, I increases. So I increases, i-prefix increases.

Each cycle takes only one branch, so if you add the ranges of these two quantities together, the maximum range is 2M (go the third branch, and both are N).

So you can prove getMaximalPrefixAndSuffix time complexity is linear, for O (M).

The overall time complexity is:

Since M must be less than or equal to N, the overall time complexity of KMP is O(N).