The Knuth-Morris-Pratt string lookup algorithm (abbreviated as KMP algorithm) can find the occurrence of a word W within a main text string S. This algorithm avoids rechecking previously matched characters by making use of the discovery that the word itself contains enough information at the time of the mismatch to determine where the next match will begin.

Here’s an example

As can be seen from GIF, KMP algorithm does not carry out any rollback of the pointer to the main string after the matching failure. What it cares about is the processing of the pointer to the search term.

Careful as you may have sensed, the above processing is abstract (generic), a simulation where you don’t need to know exactly what the main string is.

GIF simulated pointer K mismatch in character C, through such a pretreatment, in the actual matching, if encountered this situation, we just need to calmly move the search term pointer to E, and then continue to match.

By the way, why do search terms stop at K = E? In the process of moving, AB successfully matched, but did not know ‘? If you move the search term forward with the exact character represented by ‘, you may miss a match.

Now ABEABC is still used as the search term, and then look at several exercises

So let’s summarize the pattern. In both cases, the K pointer does not move to the starting point, but stops halfway. ABEAB can be found? ABEAB has the same character AB at the head and tail as ABEABC before the mismatch character.

ABEA? In the character ABEA before the mismatch with ABEAB, the same character A exists in the head and tail.

The K pointer moves to exactly the last character of the prefix when mismatched

First, understand two concepts: prefix and suffix. “Prefix” means all header combinations of a string except the last character; Suffixes refer to all combinations of the ends of a string except the first character.

From Ruan Yifeng
String matching KMP algorithm

Now that we know about prefixes and suffixes, let’s define them again. When the character pointed to by pointer J is mismatched with the character pointed to by pointer K, the character before the mismatch exists a prefix set and a suffix set, we can get

K '= Max (0 ~ k-1 prefix set ∩ 0 ~ k-1 suffix set

The meaning of k’ is clear from the formula (the maximum value of the intersection of prefix and suffix). On the other hand, if the search term subscript is calculated from 0, when k is mismatched, we just need to move K to k’ to continue matching.

The KMP algorithm usually puts the calculated K ‘into the next array and stores next[k] = k’. When we mismatch pointer k in actual combat, we just need to revert the pointer k to k’, i.e. K = k’ = next[k].

Indeed, as we said before, it is clear from the definition that computing the next array does not require the participation of the main string at all, but is completely a process of self-matching search words to calculate k’ = Max (prefix set ∩ 0~k-1 suffix set).

Although this definition is rigorous and easy to understand, it is not well described in computer language. Let’s take a look at the computer-friendly derivation of the Next array. This is probably the hardest part of the whole KMP algorithm to understand

Next array derivation

By definition of the next array, we can have

If next[j] = k, then w[0 ~ k-1] = w[j-k ~ J-1]

It should be understood that the relationship between the two is sufficient and necessary, that is, if w[0 ~ k-1] = w[J-K ~ J-1], then next[j] = k

The case in the figure below is one that satisfies the definition: Next [6] = 2

I don’t know how to prove this, because it’s defined by the next array, so I don’t need to prove it.

Now that we know that next[j] = k, it stands to reason that next[j+1] can be solved in two ways

w[k] == w[j]

According to the derivation above, when w[k] == w[j], w[0 ~ k] = w[j-k ~ j], then we can get next[j+1] = k +1

w[k] ! = w[j]

If w [k]! = w[j] = w[j] = w[j] The following is an example that fits our discussion

It can be seen that k and j are mismatched when looking for the maximum prefix and suffix intersection of the string ABEFABA

If this happens in the KMP algorithm, another k = next[k], and then again compare w[k] to w[j]. So here’s the question

  1. Why is it that when W [K]! = w[j], let k = next[k], instead of k = k-1 or something else?

    When w[k] is mismatched with w[j], k must move at least to next[k] so that k continues to match j of the main string. This is the definition of the next array, and we’re just using it now

  2. w[k] ! = w [j], so the other k = next [k], ‘if this w/k] = = w [j], [0 ~ how to prove that w k’] [j – k = = w ~ j]? (In pink)

    k' = next[k]getw[0 ~ k'-1] == w[k-k' ~ k-1]

    From next[j] = k, w[0 ~ k-1] == w[J-K ~ J-1]

    Because w [0] ~ k – 1 = = w [j – k ~ j – 1] [k – so w k ‘= = w ~ k – 1] [j-k’ ~ j – 1)

    This is an emotional proof, and I cannot use the formula to prove it for the time being

    So w[0 ~ k’-1] == w[j-k’ ~ J-1]

    W [0 ~ k’] == w[j-k’ ~ j]

    From w[0 ~ k’] == W [j-k’ ~ j], we get next[j+1] = k’ +1

    This is the case if w[k’] == w[j], but in most cases w[k’]! = w[j], which is discussed in the implementation of the algorithm.

Algorithm implementation

Next array implementation

private function getNext($word): array
{
    $next = [- 1];
    $len = strlen($word);
    $k = - 1;
    $j = 0;

    while ($j < $len - 1) {
        if ($k == - 1 || $word[$j] == $word[$k]) {
            $next[++$j] = ++$k;
        } else{ $k = $next[$k]; }}return $next;
}
Copy the code

Next [0] = -1 -1 is a special flag to facilitate judgment. W [k] at the top! = w[j], k = next[k] and then to determine whether w[k] is equal to w[j], if not, k = next[k] and so on. But there’s always an end to the cycle, and there’s something like this at the end, where k = 0, w[k]! = w[j], according to the algorithm k = next[0] = -1.

Therefore, when we see k = -1, we can know that w[0 ~k] does not have the intersection of prefix and suffix, i.e. Max (0~k prefix set ∩ 0~k suffix set) = 0, so we can set next[k+1] = 0

The algorithm above, for brevity, sets the special value to -1, so that an if, else can cover all three cases, and of course you can write it the same way

if($k == - 1) {
    $next[++$j] = 0;
} elseif ($word[$j] == $word[$k]) {
    $next[++$j] = ++$k;
} else {
    $k = $next[$k];
}

Copy the code

Detailed implementation with test cases github.com/weiwenhao/a…

In fact, KMP algorithm is not often used in string matching. Although its time complexity is O(m+ N), its performance is not much worse than that of the naive algorithm. In the process of learning, it should also be found that partial matching is rare. It has to be said that KMP algorithm is very difficult to understand, too many details are easy to fall into a situation of robbing Peter to pay Paul, all kinds of horns drill to stop. But the idea of the state machine and the derivation of the next array are worth studying.