Problem description

String matching is one of the most common problems in development: find the location of a shorter string inside a longer string. For example, find the position of the string (P=ababaca) in the string (T=bacbababaabcbab). (T) is called the *main string*, and (P) is called the *pattern string*.

This problem is old and recurring, so there are many algorithms to solve it.


Brute force solution

The one that usually comes to mind is the naive matching algorithm, also known as the brute force solution. Simply put, it tries to match (P) against every possible position in (T), one by one. For example, (T=badcab), (P=dca):

```
badcab
dca      -- compare dca with bad, no match
 dca     -- compare dca with adc, no match
  dca    -- compare dca with dca, match, return current position 2
```

The matching code is as follows:

```c++
int search(const string& T, const string& P) {
    if (P.empty()) {
        return 0;
    }
    if (P.size() > T.size()) {  // pattern longer than the main string
        return -1;
    }
    for (size_t i = 0; i + P.size() <= T.size(); ++i) {
        for (size_t j = 0; j < P.size(); ++j) {
            if (T[i + j] != P[j]) {
                break;
            }
            if (j == P.size() - 1) {
                return i;
            }
        }
    }
    return -1;  // -1 indicates no match
}
```

Let (n = T.size()) and (m = P.size()); the complexity of this algorithm is obviously (O(nm)).

This algorithm is simple, effective, and hard to get wrong. For the most part, the brute force solution will suffice. On my PC, an algorithm whose complexity comes out below (1e7) once the parameters are plugged in finishes in less than 1 second.

Hash solution (RK, Rabin-Karp)

In the brute force solution, the inner loop checks whether the substring at each position matches the pattern string. This comparison is what can be optimized.

By mapping the 26 characters a-z to 0-25, every length-(m) substring of the main string can be treated as a base-26 integer. The value of the next window can be computed from the previous one in constant time, so comparing it against the value of the pattern string (P) removes the inner loop. The algorithm complexity is (O(n)).

The disadvantage of this algorithm is that the integer may overflow. The solution is to use some other hash, for example hashing a string to the sum of its characters, but this creates hash collisions, which require comparing the strings character by character after a hash hit. If there are a lot of collisions, the strings get compared every time, so the worst-case time complexity of such an approach is (O(nm)). In practice, unless the pattern string is unusual, hash collisions are rare. Inspired by the Bloom filter, I once used 8 different hash functions and required all of them to match at the same time; the pattern strings were short, so there were basically no collisions and it was very fast. The downside was constantly being asked “what on earth is this?”. Later someone replaced it with the KMP algorithm, and two years after the feature went live my version was found to have a bug; I also ended up on the review list. A clever algorithm done badly is worse than a brute force solution.
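For reference, here is a minimal Rabin-Karp sketch along the lines described above; it is my own illustration, not the code from that incident. It assumes lowercase a-z input, keeps the rolling hash modulo a large prime to avoid overflow, and verifies every hash hit with a direct comparison to guard against collisions (the function name rk and the constants are arbitrary):

```c++
#include <string>
using namespace std;

int rk(const string& T, const string& P) {
    const long long kBase = 26;           // alphabet size for a-z
    const long long kMod = 1000000007LL;  // large prime keeps hash values small
    int n = T.size(), m = P.size();
    if (m == 0) return 0;
    if (m > n) return -1;

    long long hp = 0, ht = 0, power = 1;  // power = kBase^(m-1) % kMod
    for (int i = 0; i < m; ++i) {
        hp = (hp * kBase + (P[i] - 'a')) % kMod;
        ht = (ht * kBase + (T[i] - 'a')) % kMod;
        if (i < m - 1) power = power * kBase % kMod;
    }
    for (int i = 0; i + m <= n; ++i) {
        if (hp == ht && T.compare(i, m, P) == 0) {  // verify on a hash hit
            return i;
        }
        if (i + m < n) {  // roll the window: drop T[i], append T[i+m]
            ht = ((ht - (T[i] - 'a') * power % kMod + kMod) * kBase + (T[i + m] - 'a')) % kMod;
        }
    }
    return -1;
}
```

With the modulus in place the hash never overflows, and the final comparison keeps the answer correct even when two different windows happen to hash to the same value.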

BM algorithm (Boyer-Moore)

Let’s take a closer look at (T=abcacabdc), (P=abd) and try to match using the brute force solution.

```
   abcacabdc
1: abd        -- no match
2:  abd       -- no match
3:   abd      -- no match
4:    abd     -- no match
5:     abd    -- no match
6:      abd   -- match!
```

In the first round of matching, the character (c) appears in the main string but does not appear in the pattern string. In fact, the pattern string can be moved directly past the (c) in the main string.

In the next alignment, the pattern string’s (d) lines up with an (a) in the main string; they do not match, and brute force would move 1 byte to the right. But if we know that the last position of (a) in the pattern string is 0, we can save time by moving 2 bytes to the right so that the (a) of the main string is aligned directly with the (a) of the pattern string.

Bad character rule

In the brute force solution, the pattern string and the main string are compared in increasing index order, while the BM algorithm compares in decreasing order, i.e. backwards from the end of the pattern string. When a character fails to match, that character in the main string is called a “bad character”, such as the (c) in round 1 of the example above, or the rightmost (a) in round 4.

When the bad character (c) does not exist in the pattern string, the pattern string can be moved directly past the (c) for the next round of matching:

```
abcacabdc
abd
   abd
```

In the next round of matching, the bad character is (a), whose last position in the pattern string is 0, so the pattern string can be moved 2 bytes to the right, aligning the (a) of the pattern string with the bad character (a) of the main string:

```
abcacabdc
   abd
     abd
```

In general, when a bad character occurs, we move the pattern string to the right until the rightmost occurrence of that character in the pattern string lines up with the bad character in the main string; if the character does not appear in the pattern string at all, we move the pattern string completely past it.

The bad character rule is obviously correct, but it is not enough on its own. For (T=aaaaaaaaaaaaaaaaaa), (P=baaa), using only the bad character rule is the same as using the brute force solution.
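To make the rule concrete, here is a minimal sketch of a matcher that uses only the bad character rule (the function name bm_bad_char_only is mine; the full BM implementation with both rules comes later). When the rule would shift the pattern string left or not at all, it falls back to moving 1 step right, which is exactly why it degenerates to brute force on the (T=aaa...a), (P=baaa) case above:

```c++
#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Matching with the bad character rule only (illustrative sketch, not full BM).
int bm_bad_char_only(const string& T, const string& P) {
    if (P.empty()) return 0;
    if (P.size() > T.size()) return -1;

    // bc[c] is the last index of character c in P, or -1 if c does not appear
    vector<int> bc(256, -1);
    for (int i = 0; i < (int)P.size(); ++i) {
        bc[(unsigned char)P[i]] = i;
    }

    int i = 0;
    while (i <= (int)T.size() - (int)P.size()) {
        int j;
        for (j = P.size() - 1; j >= 0; --j) {  // compare from back to front
            if (T[i + j] != P[j]) break;       // T[i+j] is the bad character
        }
        if (j < 0) return i;                   // full match at position i
        // Align the last occurrence of the bad character in P with T[i+j];
        // if that occurrence is to the right of j, just move 1 step.
        i += max(1, j - bc[(unsigned char)T[i + j]]);
    }
    return -1;
}
```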

Good suffix rule

The good suffix rule is similar to the bad character rule. Let’s observe the matching of (T=abcacabcbcbacabc), (P=abcabc).

```
   v  __
abcacabcbcbacabc
  abcabc
     abcabc
      -- --
```

You can see that after “cabc” has been matched, a bad character (a) appears. Setting the bad character rule aside, the pattern string can be moved 3 bytes to the right, so that the earlier “bc” in the pattern string lines up with the “bc” of the suffix that has already been matched. If the bad character rule were used here, the last occurrence of (a) in the pattern string is to the right of the mismatch, so it would want to move the pattern string left; the best it can do is revert to brute force and move 1 byte.

The idea is to preprocess the pattern string in advance: for each suffix, record where else that suffix occurs in the pattern string and the longest prefix equal to it. When a bad character occurs, move the pattern string right until another occurrence in the pattern string lines up with the already matched good suffix. If more than one substring of the pattern string matches the good suffix, the rightmost one is chosen.

So far, it is not hard to see that the good suffix rule works in much the same way as the bad character rule.

But what happens when no other substring of the pattern string matches the good suffix? Do we just move 1 byte to the right, like brute force? Not necessarily. If some suffix of the good suffix is equal to a prefix of the pattern string, the next best thing is to move the pattern string right until that prefix lines up with the equal suffix; this is exactly what happened in the example above, where the prefix “abc” of the pattern string ends up aligned with the tail of the matched “cabc”.

What if some substring in the middle of the pattern string is also equal to that prefix/suffix? It does not need to be considered. The premise of this case is that the good suffix has no other occurrence in the pattern string, so even if that middle substring were aligned with the matching part, the alignment could not grow into the good suffix and we would only have to move right again.

BM algorithm implementation

The bad character rule is easy to implement: build a table that records the last occurrence of each character in the pattern string.

The question is how to implement the good suffix rule efficiently.

We introduce a (suffix) array, where (suffix[i]) gives the starting position of another substring in the pattern string that matches the suffix of length (i), or -1 if there is none. For example, (P=cabcab):

| Suffix substring | Length (i) | suffix[i] | prefix[i] |
|------------------|------------|-----------|-----------|
| b                | 1          | 2         | false     |
| ab               | 2          | 1         | false     |
| cab              | 3          | 0         | true      |
| bcab             | 4          | -1        | false     |
| abcab            | 5          | -1        | false     |

The (suffix) array handles the case where the good suffix has another matching substring in the pattern string. A (prefix) array is also needed for the case where no other substring matches the good suffix: (prefix[i]) indicates whether the suffix of length (i) is equal to a prefix of the pattern string.

The implementation code is as follows:

```c++
// character set size
const static int kMaxChar = 256;

// Bad character rule: bc[c] is the last index of character c in P, or -1 if absent
vector<int> generate_bad_char(const string& P) {
    vector<int> bc(kMaxChar);
    fill(bc.begin(), bc.end(), -1);  // initialize
    for (int i = 0; i < (int)P.size(); ++i) {
        bc[(unsigned char)P[i]] = i;
    }
    return bc;
}

// Good suffix rule: suffix[k] is the start of another substring of P that matches
// the suffix of length k (-1 if none); prefix[k] is true if that suffix is also a prefix
void generate_good_suffix(const string& P, vector<int>& suffix, vector<bool>& prefix) {
    suffix = vector<int>(P.size());
    prefix = vector<bool>(P.size());
    fill(prefix.begin(), prefix.end(), false);
    fill(suffix.begin(), suffix.end(), -1);
    for (int i = 0; i < (int)P.size() - 1; ++i) {  // P[0..i]
        int j = i;
        int k = 0;  // length of the common suffix
        while (j >= 0 && P[j] == P[P.size() - 1 - k]) {  // extend the common suffix of P[0..i] and P
            --j;
            ++k;
            suffix[k] = j + 1;
        }
        if (j == -1) {  // the entire prefix P[0..i] matches a suffix of P
            prefix[k] = true;
        }
    }
}

// How far to move right when the good suffix is P[j+1..m-1]
int move_gs(int j, int m, const vector<int>& suffix, const vector<bool>& prefix) {
    int k = m - 1 - j;       // length of the good suffix
    if (suffix[k] != -1) {   // another matching substring exists in P
        return j - suffix[k] + 1;
    }
    for (int r = j + 2; r <= m - 1; ++r) {
        if (prefix[m - r]) {  // a suffix of the good suffix equals a prefix of P
            return r;
        }
    }
    return m;
}

int bm(const string& T, const string& P) {
    if (P.empty()) {
        return 0;
    }
    if (P.size() > T.size()) {  // pattern longer than the main string
        return -1;
    }
    auto bc = generate_bad_char(P);
    vector<int> suffix;
    vector<bool> prefix;
    generate_good_suffix(P, suffix, prefix);
    int i = 0;  // current matching position
    while (i <= (int)T.size() - (int)P.size()) {
        int j;
        for (j = P.size() - 1; j >= 0; --j) {  // compare the pattern from back to front: P[j]..P[0]
            if (T[i + j] != P[j]) {
                break;  // the bad character is T[i+j]
            }
        }
        if (j < 0) {
            return i;
        }
        int x = j - bc[(unsigned char)T[i + j]];  // shift given by the bad character rule
        int y = 0;
        if (j < (int)P.size() - 1) {  // a good suffix exists
            y = move_gs(j, P.size(), suffix, prefix);
        }
        i = i + max(x, y);
    }
    return -1;
}
```

BM algorithm complexity

The complexity analysis of the BM algorithm is difficult. “Tight Bounds on the Complexity of the Boyer-Moore String Matching Algorithm” proved that the number of character comparisons made by the BM algorithm is at most (3n).

KMP algorithm (Knuth-Morris-Pratt)

I’ve never actually implemented the KMP algorithm in my work, but there is no shortage of people who, having just read about KMP or red-black trees, sit down in the interviewer’s seat and ask for a hand-written implementation. Even though the algorithm is rarely needed directly, it is still instructive, because similar algorithms such as the Aho-Corasick (AC) automaton often do have to be implemented by hand.

The idea of the KMP algorithm is similar to the BM algorithm: it compares the pattern string from front to back and, when it encounters a bad character, moves the pattern string right until a prefix of the pattern string matches a suffix of the good prefix that has already been matched. Look at the following example.

```
     v
ababaeabac
ababacd
---
 \ \
  ---
  ababacd
```

When the bad character (e) is encountered, the good prefix (ababa) of (P) has already been matched. We notice that:

```
ababa
  ababa
```

This good prefix has a longest prefix that matches its own suffix, and when we move (P) to the right, we can move directly so that this longest match lines up. Because it is the longest such match, any smaller shift would require an even longer prefix-suffix match, which does not exist, so moving straight to the longest match never skips a possible match.

The implementation requires a (next) array, where (next[i]) is the end index of the longest prefix of (P) that matches a proper suffix of (P[0..i]). For example, for the string (P=ababacd):

| P[0..i] | i | next[i] | Explanation        |
|---------|---|---------|--------------------|
| a       | 0 | -1      | none               |
| ab      | 1 | -1      | none               |
| aba     | 2 | 0       | P[0] == P[2]       |
| abab    | 3 | 1       | P[0..1] == P[2..3] |
| ababa   | 4 | 2       | P[0..2] == P[2..4] |
| ababac  | 5 | -1      | none               |

Computing the (next) array uses a dynamic-programming-like approach:

  • (next[i]=k) is equivalent to (P[0..k]=P[i-k..i]). If, in addition, (P[k+1]=P[i+1]), then (P[0..k+1]=P[i-k..i+1]), i.e. (next[i+1]=k+1).
  • If (P[k+1]\neq P[i+1]), a shorter prefix-suffix match is tried. Let (m=next[k]), so that (P[0..m]=P[i-m..i]). If (P[m+1]=P[i+1]) then (next[i+1]=m+1). And so on.

The implementation is as follows:

```c++
vector<int> gen_next(const string& P) {
    vector<int> next(P.size());
    if (P.empty()) {
        return next;
    }
    next[0] = -1;
    int k = -1;  // end index of the longest prefix-suffix match so far
    for (int i = 1; i < (int)P.size(); ++i) {
        while (k != -1 && P[k + 1] != P[i]) {
            k = next[k];  // fall back to a shorter prefix-suffix match
        }
        if (P[k + 1] == P[i]) {
            ++k;
        }
        next[i] = k;
    }
    return next;
}

int kmp(const string& T, const string& P) {
    if (P.empty()) {
        return 0;
    }
    if (P.size() > T.size()) {  // pattern longer than the main string
        return -1;
    }
    auto next = gen_next(P);
    int j = 0;  // number of characters of P matched so far
    for (int i = 0; i < (int)T.size(); ++i) {
        while (j > 0 && T[i] != P[j]) {
            j = next[j - 1] + 1;  // fall back using the next array
        }
        if (T[i] == P[j]) {
            ++j;
        }
        if (j == (int)P.size()) {
            return i - (int)P.size() + 1;
        }
    }
    return -1;
}
```

The complexity of the KMP algorithm is not hard to work out. The while loop in building the (next) array takes a bit of thought. Think of it this way: in each step, (k) either increases by 1 or decreases by at least 1. The only increase is (++k), which is executed at most (m-1) times across the for loop. The decrease happens in the while loop ((k = next[k])), and since (k) can never go below its minimum value of (-1), it can decrease at most (m-1) times in total. So the time complexity of building the (next) array is (O(m)).

Similarly, the complexity of matching against (T) is (O(n)). The total time complexity of KMP is (O(n+m)).
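As a quick sanity check, a small driver like the following (hypothetical; it assumes the search, bm, and kmp snippets above are pasted into one file together with these headers) can run all three implementations on the earlier examples and should print the same positions:

```c++
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// ... the search, bm, and kmp definitions from the sections above go here ...

int main() {
    // T=badcab, P=dca: expected position 2; T=abcacabdc, P=abd: expected position 5
    cout << search("badcab", "dca") << " "
         << bm("badcab", "dca") << " "
         << kmp("badcab", "dca") << endl;
    cout << search("abcacabdc", "abd") << " "
         << bm("abcacabdc", "abd") << " "
         << kmp("abcacabdc", "abd") << endl;
    return 0;
}
```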