preface

A string, also known as a string, is a finite sequence composed of 0 or more characters. A string can also be stored in two ways: sequence storage and chain storage. Finding and locating sub-string problems in the main string (pattern matching) is one of the most important operations in the string, and different algorithms have different efficiency. Today we will compare the two modes of learning string matching:

  • Naive Pattern Matching algorithm (Brute-Force algorithm, abbreviated as BF algorithm)

  • KMP pattern matching algorithm

Naive Pattern Matching algorithm (BF algorithm)

BF algorithm is a conventional algorithm in pattern matching, and its idea is:

  • The first round: Compares the first character in the substring with the first character in the main string
    • If equal, the primary string is compared to the second character of the substring
    • If not, conduct a second round of comparison
  • Round 2: The first character in the substring is compared to the second character in the main string……
  • Round N: Keep comparing until all match

Illustration:

The first round:

The second round:

. Same principle, omit intermediate steps

The fifth round:

The sixth:

Code implementation:

After reading the text and legend, let’s start to implement such an algorithm

The above steps are summarized as follows:

Each character of the main string matches the beginning of the substring. If the match succeeds, the substring matches the next bit of the main string, and if the match fails, the substring matches the next bit of the main string. Obviously, we can implement such an algorithm by using two Pointers to one character of the main string and one character of the substring

Returns the position at which the substring first appeared in the main string on success, -1 on failure, and 0 if the substring is empty

int String::bfFind(const String &s, int pos) const {
    // Pointers to primary and substrings, I primary and j substrings
    int i, j;
    // The primary string is smaller than the substring, the match fails, and curLenght is the length of the string
    if (curLength < s.curLenght)
        return - 1;
    
    while (i < curLength && j < s.curLength) {
        // The corresponding characters are equal, and the pointer moves backward
        if (data[i] == s.data[j])
            i+, j++;
       	else {	// The corresponding characters are not equal
            i = i -j + 1;	// The main string pointer moves
            j = 0; // The substring starts at the beginning
        }
        // Return the position of the substring in the main string
        if (j >= s.curLength) 
            return (i - s.curLength);
        else return - 1; }}Copy the code

Note: the code only reflects the idea of the algorithm, the specific definition is not given

This algorithm is easy to understand, but there is a big disadvantage, that is, it requires multiple backtracking, inefficient, if the main string is 000000000001 and the substring is 00001, it means that each round to compare the last character of the substring will fail to match, is there a better way? The following KMP pattern matching algorithm is a good solution to this problem

KMP pattern matching algorithm

If only a few data operations, you may even feel that BF algorithm is also ok, at least it is easy to write, after all, can run is a good program, but once the amount of data increases, you will find that there are some “useless” really will greatly slow down your speed

The KMP pattern matching algorithm is proposed by D.E.Knuth, J.H.Morris and V.R. Prate. It is an improvement on the naive pattern matching algorithm. The core is to use the information after the failure of matching to minimize the number of sub-main string matching, which is reflected in the main string pointer moving backward and the sub-string pointer backtracking

Illustration:

The following shows the process of the naive pattern matching algorithm. Let’s see which steps can be omitted if using the idea of the KMP algorithm

In ①, the first five elements are matched with each other, and the sixth element fails to be matched. According to BF algorithm, the operation ② ③ is directly carried out. However, we can find that the first three elements in the substring are not the same, but have been matched with the main string in ①. Therefore, if the substring matches the second and third elements in the main string, it must not match, so the ② and ③ in the figure can be omitted

The first and second elements in the substring ab and the fourth and fifth elements ab are identical, and the fourth and fifth elements ab have matched the fourth and fifth elements in the main string. This means that the first and second elements ab must match the fourth and fifth elements in the main string. So step ④ ⑤ can be omitted

If you follow this line of thinking, the above example only needs to execute ① and ⑥

Next Array value derivation

(a) whether the main string pointer needs to be backtracked

BF -①②③④⑤⑥ and KMP -①⑥. If we now assume that there are two Pointers, I and J, which point to the position of the main string and its substring respectively, we can see from the above figure that the main string pointer, that is, the value of I in ① state, points to the position of 6. In ②③④⑤, it points to 2345, while in ⑥ it still points to 6

This shows that the naive pattern matching algorithm, the I value of the main string will continuously backtrace, but the KMP pattern matching algorithm will omit this unnecessary backtrace, so the number of execution is reduced

(2) Summary of substring pointer optimization

Since the primary pointer does not backtrace, the substring pointer can also be optimized. There are two cases in which we use two examples:

  • If the substring is abcDEF and the main string is abcdexabcdef, when the first round matches the sixth character f and x, the match fails. In this case, if the match follows the naive pattern, the first element of the substring, A, needs to be compared with the bcDE of the main string respectively. However, since there is no identical element before the element of the substring f, And matches the main string, so it is impossible for a to match the elements 2-5 in the main string, that is, BCDE. All of these parts can be omitted, and let A directly match the x in the main string

  • If the substring is abcabx and the main string is abcababcax, in the first round, the first five elements of the substring match each other, but the sixth element is in the wrong position, and we follow the naive pattern of matching, we need to take the first element of the substring, A, and match each element after a in the main string, but the first three characters of the substring are not equal. According to our experience in the first case, skip these steps, all we take direct substring a compared with primary string a fourth element is ok, but we found that the location of the error in the substring before the x series abcab prefixes and suffixes are ab, since the first round of the time, have a match, it means that, The first and second elements ab in the substring must be equal to the fourth and fifth elements ab in the main string. Therefore, this step can also be omitted, which can directly compare the c after the substring prefix AB with the one starting from A. This is the detailed idea in the example above

Conclusion: So we get the rule that the value of a substring pointer depends on how similar the substring elements are to each other

To apply this to concrete code, we can define the j value of the substring position as a Next array with the same length as the substring


  • Case 1: when j = 0, next[j] = -1, which means that the substring pointer fails to match the element with subscript 0, and the substring cannot be backtracked (j cannot assign -1). At this point, the main string pointer is moved one bit later, and the substring does not perform the next round of comparison

  • Case 2: The same prefix T0 T1… Tj-k and tJ-K +1… Tj-1, the substring pointer goes back to next[j] = k, and then makes the next comparison, for example: the substring abcabc has the same prefix and suffix AB, so the substring pointer goes back to C

  • Case 3: In the matched substring, if there is no equal prefix and suffix, the main string pointer does not move, and the substring pointer goes back to j = 0, and the next comparison is made

Example: primary string S = “ABC520ABC520ABCD”, substring T = “ABC520ABCD”, using KMP algorithm to match the process

Substring next array

j 0 1 2 3 4 5 6 7 8 9
substring a b c 5 2 0 a b c d
next[j] -1 0 0 0 0 0 0 1 2 3

As can be seen, when the pointer I = 9 and j = 9, the match fails, and next[9] = 3, so the substring pointer goes back to j = 3, which is the position of element 5, for a second round of comparison, and then exactly all matches succeed

(3) Next array algorithm implementation

void Stirng::getNext(const String &t, int *next) {
    int i = 0, j = - 1;
    next[0] = - 1;
    while (i < t.curLength - 1) {
        if ((j == - 1) || t[i] == t[j]) {
            ++i, ++j;
            next[i] = j;
        }else{ j = next[j]; }}}Copy the code

KMP algorithm code implementation

With the next array in place, we are ready to implement the KMP algorithm

Returns the location of the substring’s first occurrence in the main string on success, -1 on failure, and 0 on empty substring

int String::kmpFind(const String &t, int pos) {
    // It is not allowed to apply for arrays of size 0
    if (t,curLength == 0) return 0;
    // If the primary string is smaller than the substring, the match fails
    if(t.curLength < t.curLength) return - 1;
    // primary pointer I, subpointer j
    int i = 0, j = 0;
    int *next = new int[t.curLrngth];
    getNext(t,next);
    while (i < curLength && j < t,curLength) {
        if (j == - 1 || data[i] == t.data[j])	//情况12
            i++, j++;
   		else	//情况3
           	j = next[j];
    }
    delete []next;
    if (j > t.curLength)
        return (i - t.curLength)
    else
        return - 1;
}
Copy the code

KMP pattern matching algorithm improved

There is a special case that makes us have to consider the improvement of KMP algorithm

If you have multiple consecutive repeating elements in the substring, for example, the main string S= “aaABCde” and the substring T= “aaaaax” are not moving in the main string pointer, moving the substring pointer to compare these values is actually a lot of useless work, because the first five elements in the substring are the same a, so we can omit these repeated steps

void String::getNextVal(const String &t, int *nextVal) {
    int i = 0, j = - 1;
    nextVal[0] = - 1;
    while (i < t.curLength - 1) {
        if ((k == - 1) || (t[i] == t[j])) {
            ++i, ++j;
            if(t[i] ! = t[j]) nextVal[i] = j;else
                nextVal[i] = nextVal[j];
        }
        elsej = nextVal[j]; }}Copy the code

The core of this improvement is increased for string t and t [j] [I] is equal, equal directly to nextVal [j] values assigned to nextVal [I]

conclusion

In BF algorithm, when the main string and the substring do not match, the Pointers of the main string and the substring need to be backtracked, so the algorithm has high time complexity O(nm) and space complexity O(1) note: Although its time complexity is O(nm), its execution time is approximately O(n+m) in general applications, so it is still used

KMP algorithm uses the structural similarity of substrings to design the next array, which achieves the effect of no backtracking of the main string and greatly reduces the comparison times, but correspondingly sacrifices the storage space. The time complexity of KMP algorithm is O(n+m) and the space complexity is O(n).

The end:

If there are any deficiencies or mistakes in the article, please leave a comment and share your thoughts. Thank you for your support!

If it helps you, follow me! If you prefer the way of reading articles on wechat, you can follow my official account

We don’t know each other here, but we are working hard for our dreams

A adhere to push original development of technical articles of the public number: ideal more than two days