preface

String matching problem: give you two arbitrary string string A = “afhasoidfhaiodfaodfnoahfadfnad”; String B = “dfaod”; Let you figure out where the string B first appears in the string A, and if it doesn’t, return -1.

Note: Normally string A is called the main string and string B is called the pattern string.


String matching is a very common problem, but there are many kinds of algorithms involved, and each has its own characteristics. The relevant algorithms are as follows:

  1. BF algorithm (not Boy friend, brute Force), brute Force method, is the simplest, most direct and least efficient.
  2. The RK algorithm, which is an improved version of BF, turns the matching string into the hash value of the matching string.
  3. KMP algorithm, textbook level algorithm, higher efficiency, a little round.
  4. BM algorithm, which is said to be used for command + F shortcut search, is very efficient, about 3 ~ 5 times that of KMP.
  5. Sunday algorithm, an optimized version of BM, is extremely efficient and is also known as the fastest string matching algorithm.

The following is mainly for these algorithms to say their respective principles and implementation.

BF algorithm

BF algorithm also called violence, listen to the name of the know, simple and rough. The main idea of this algorithm is to compare from scratch, character by character, until all matches or the main string ends. And that’s the algorithm that comes to mind when we see this problem.

The code is as follows:

#import <Foundation/Foundation.h>

#define NO_FOUND -1
/ / method of violence
int findSubstringInStringBF(char *str, char *subStr)
{
    if(! *str || ! *subStr) {return NO_FOUND;
    }
    int i = 0, j = 0;
    char *p;
    while (str[i])
    {
        // Compare whether the first letter is the same
        if (str[i] == subStr[0]) {
            // Start matching
            p = str + i;
            j = 1;          // The first character has already been compared
            // All characters exist and are the same
            while (p[j] && subStr[j] && p[j] == subStr[j]) {
                j++;
            }
            
            if(! subStr[j]) {// The substring matches exactly
                return i;
            }
            if(! p[j]) {// The main string is at the end of the string, after which the string length is not enough, there is no need to compare
                return NO_FOUND;
            }
        }
        i++;
    }
    return NO_FOUND;
}
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        int a = findSubstringInStringBF("aaabcabcde"."abcd");
        if (a >= 0) {
            printf("Found substring existing index is: %d\n", a);
        }
        else {
            printf("The string \n was not found"); }}return 0;
}
Copy the code

Fairly RK algorithm

The introduction to the algorithm (which year? Who put it forward? …). Forget it. These things don’t mean anything other than shit. Understand the following method??

Let’s get down to business

The idea of algorithms

As you can see from the introduction, the RK algorithm turns contrast characters into contrast hash values. Here’s an example:

And now we have the pivot string A is equal to theta"abcdefg", mode string B ="abc"; Violence method: to compare 3 times respectively, namely contrast a, contrast B, contrast C, all matched, the match is successful. So can we just compare it once? At this time, I thought of usinghash. RK algorithm: since the length of the pattern string is 3, the main string can be divided into many substrings of length 3. ABC, BCD, CDE, DEF, efG calculate their respectivelyhashValues arehash(abc),hash(bcd),hash(cde),hash(def),hashWill these (efg)hashValue with mode string BhashValues are compared ifhashValues don't match, so these two strings definitely don't match; ifhashValues match, don't get too excited, because it might existhashConflict, and then a character by character comparison to verify again.Copy the code

Advantages and disadvantages of the algorithm

Advantages: If the hash value is different, it can be passed directly, preventing the occurrence of aaAAAAAAAAAAAB and AAAAAAAAAAAA matching. .

Disadvantages: It takes time to compute hash values and the hash value may exceed the maximum limit of the int type. Therefore, you need to design a good hash algorithm to avoid hash conflicts.

To optimize the

Calculates the hash value of the next string using the hash value of the previous string.

First of all, how do you evaluate a string to a number? We know decimal numbers:

So, can I think of 26 letters as base 26? A = 1, b = 2

So, the question is, how do you convert a to 1??

char a = 'a';
int aInt = a - 'a' + 1;     // We add 1 because we want to convert a to 1 instead of 0
// In the same way, b, C, d... can be calculated in this way.
Copy the code

Back to the question above, how do you optimize the hash?

So let’s go back to decimal



then

I believe that after reading these two formulas, you should understand what the optimization method is, right? That is, the hash value of the previous substring is used to calculate the hash value of the next substring to achieve optimization effect.

Code implementation

#import <Foundation/Foundation.h>

#define NO_FOUND -1
long hashStr(char *str, int length)
{
    char *p = str;
    long num = 0;
    for (int i = 0; i < length; i++) {
        int a = p[i] - 'a' + 1;
        // I didn't use optimization here, because I found it would make a calculation error
        // There is no 0 in base 26.
        num += a * pow(26, length - 1 - i);
    }
    return num;
}
// Determine whether the two strings are equal within the specified length
BOOL isEqual(char *str1, char *str2, int length)
{
    int strl1 = (int)strlen(str1);
    int strl2 = (int)strlen(str2);
    if (strl1 < length || strl2 < length) {
        return NO;
    }
    int i = 0;
    for (; i < length; i++) {
        if(str1[i] ! = str2[i]) {break; }}if (i == length) {
        return YES;
    }
    return NO;
}
// The RK algorithm matches the string
int substringIndexFromString(char *str, char *subStr)
{
    if(! *str || ! *subStr) {return NO_FOUND;
    }
    int  strLen         = (int)strlen(str);             // The main string length
    int  subLen         = (int)strlen(subStr);          // Substring length
    long subHash        = hashStr(subStr, subLen);      / / substring hash
    
    int i = 0;
    while (i <= strLen - subLen) {
        if (subHash == hashStr(str, subLen) && isEqual(str, subStr, subLen)) {
            return i;
        }
        i++;
        str++;
    }
    return NO_FOUND;
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        
        int a = substringIndexFromString("aaabcabcde"."abcd");
        if (a >= 0) {
            printf("Found substring existing index is: %d\n", a);
        }
        else {
            printf("The string \n was not found"); }}return 0;
}
Copy the code

KMP algorithm

This algorithm may not be that easy to understand, but it’s not too difficult.

First, the key to this algorithm is the traceback array of pattern strings.

A traceback array of pattern strings

It is called a traceback array of pattern strings because the array is computed from pattern strings, and different pattern strings compute different results.

Backtrace: To go back, under certain conditions, to the position of an existing character.

Direct say concept may not understand very much, take an example below should understand.

Example: Primary string: abcabcabcabf Substring: abcabf// Also called pattern stringA b C A B f substring String0 1 2 3 4 5The subscript0 0 0 1 2 0Traceback array (saves the traceback index, regardless of where it came from, will be explained later)// Start matching strings
1If abCAB does not match, the second c and f of the main string do not match2, will look for the backindex of the letter before f, find the subscript is2
3, from the subscript is2Compare the second c in the main string4, matches, moves back, that is, the previous ab in the substring is skipped5, continue to match, to the third C and F comparison, found that the mismatch again.6, repeat steps3
7, repeat steps4
8, f and f match, and the substring ends, the match is successfulCopy the code

As we can see from this example, because there is a duplicate ab in the pattern string, the character f after the second ab fails to match, but we can also know from this example that the character AB before f is matched successfully, otherwise it will not match the position f. Now that AB matches successfully, is it necessary to compare AB again next time? I don’t have to, so I can go back to the character before f and find the subscript 2 for b, and then I can see that the pattern string for b is C, so I start right at C and then I skip ab before C.

Compute traceback array

Again, take the pattern string above as an example:

Again, take the substring a, B, C, a, B, and f above1The first a is the first element, so the backtrace subscript is0
a b c a b f
0
2Set I to the position of the first B, and set j to the position of the first A, and compare whether the elements of I and j match j I A B C A B F0
3, find no match, and j has no element before it, so the traceback index of the first b is0I moves back a j I a B C A B F0 0
4C is the same0
j     i
a b c a b f
0 0 0
5, found that I and j matched (the first a and the second A), in this case, the backtracking subscript corresponding to the second A is j +1, I and j both move backwards j I a B C a B F0 0 0 1
6And the first5J I A B C A B F0 0 0 1 2
7In this case, I and j do not match, but j still has elements in front of j, then j is backtracked to j -1The corresponding backtrace subscript, j minus1That corresponds to b, and the backindex of B is b0
j         i
a b c a b f
0 0 0 1 2
8At this time, I and j still do not match. Since j is not in front of it at this time, the backtracking subscript of f is set as0
j         i
a b c a b f
0 0 0 1 2 0

9There is no element after I, indicating that the traceback array of the substring has been evaluated.Copy the code

Code implementation

#import <Foundation/Foundation.h>

#define NO_FOUND -1

int* getBacktrackArr(char *str, int strLen)
{
    // Traceback the array
    int *btArr = malloc(sizeof(int) * strLen);
    btArr[0] = 0;       // The first one is 0
    int j = 0;
    for (int i = 1; i < strLen; i++) {
        if (str[i] == str[j]) {
            btArr[i] = j + 1;
            j++;
        }
        else {
            if (j == 0) {
                btArr[i] = 0;
            }
            else {
                j = btArr[j - 1];
                // at this point I cannot be moved backwards, so use I -- to cancel out I ++ of this roundi--; }}}return btArr;
}

int findSubstringInStringKMP(char *str, char *subStr)
{
    if(! *str || ! *subStr) {return NO_FOUND;
    }
    // Get the traceback array of substrings
    int subLen = (int)strlen(subStr);
    int *btArr = getBacktrackArr(subStr, subLen);
    // Iterate over characters
    char *p = str;
    int i = 0, j = 0;
    while (p[i] && j < subLen) {
        if (p[i] == subStr[j]) {
            i++;
            j++;
        }
        else {
            if (j == 0) {
                i++;
            }
            else {
                j = btArr[j - 1]; }}}free(btArr);
    if (j == subLen) {
        return i - subLen;
    }
    
    return NO_FOUND;
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        int a = findSubstringInStringKMP("abccbddfaaabcabcabcabcabcabxasabc"."abcabcabx");
        if (a >= 0) {
            printf("Found substring existing index is: %d\n", a);
        }
        else {
            printf("The string \n was not found"); }}return 0;
}
Copy the code

BM algorithm

After looking at the previous KMP algorithm, we can find that the designer of this algorithm is really quite awesome, too clever. However, this algorithm is not used in the actual development of much, at most, that is, postgraduate entrance examination and interview may encounter.

So why isn’t such a clever algorithm used in real development? The efficiency is too slow, the next to say BM algorithm is said to be 3 ~ 5 times the efficiency of KMP (of course, these are the author heard others say, there should be a cow test the efficiency of these two algorithms, anyway, I have not personally tested 😁, wrong don’t look for me).

Principle of BM algorithm

A lot of other articles say this algorithm is easier to understand than KMP, but I don’t think so. The KMP algorithm I studied the steps of string matching twice to understand the principle, but this thing, two days of reading books, reading other people’s blogs, finally understand how this algorithm does string matching.

Let’s start with a couple of things that make this algorithm unique

Matching order From the previous several algorithms can be seen, the matching order is from left to right, this algorithm is not the same, is the reverse, from right to left start matching.

/ / such as
// Main string: abcdefabcdefg
// substring: abcdefg
// For positive matching, abcdef must be matched successfully for 6 times and failed for 7 times
// Reverse matching, however, requires only one attempt to determine the match failure.
// Of course, this is not useful because we can't be sure what the string to match is
/ / such as
// Main string: afedcbagfedcba
// substring: gfedcba
Copy the code

Bad character Bad character is the first character in the main string that does not match the pattern string in each round of string matching.

For example, primary string: abcdefg substring: abD// The c in the main string is the bad characterSubstring: abbce// The d of the main string is a bad character (not e, because e matched successfully)
Copy the code

A good suffix is a string of characters successfully matched in the process of reverse matching.

For example, primary string: abcdefg substring: bacde// cde is a good suffix, b is a bad character
Copy the code

The question is, what is the use of these things? Let’s talk about the role of bad characters in string matching. (Don’t bother drawing, just use code blocks)

For example, primary string: abcdefg Substring: def offset: defIf c and f do not match, c is a bad character. If c is a bad character, check whether c exists in the substring
// The substring is offset back to the next character in c, because c and the character before c cannot matchAnother example is primary string: aaaabCD Substring: abcd Offset: abcdIf a and d do not match, a is a bad character, and the substring contains a
// The result is, and a is in position 0, so the string fails to match d in position 3
// Therefore offset 3-0 charactersAnother example is primary: baabaaab Substring: aaab Offset: aaabThe last string b and a do not match. The primary string b is a bad character. Find the substring B
// if b is in position 3, the offset is 0-3 characters
W/o? What the hell? Wrong? How can you still move forward?
Copy the code

There’s nothing wrong with the third example, it should be that way. And this is the disadvantage of bad character matching, so in addition to bad character matching, but also say good suffix matching.

First statement, bad character and good suffix both matching method belongs to the BM algorithm, but their is no relationship between them, and he said, even if your algorithm USES only one of them, also can, just bad characters method will exist in the example above the forward migration problem (suffix) can be used independently.

Here is the good suffix, which is the key to the author’s two days.

// Again, as an exampleFor example, primary string: babacabdeabxxxx Substring: cabdeab Alignment: cabdeab// The last c character does not match the last e character, and the suffix is ab, then we need to find the substring
// In addition to the good suffix itself, is there any other substring with the same suffix
// At the first character of the substring (c is the 0th character), we need to put the suffixes ab and ab in the main string
// The ab we found in the substring is aligned and continues to match. (If there is more than one ab in the substring,
// Take the last ab except for the suffix itself, because this is the smallest offset value, no leakage)Another example is primary string: aabbdabcddabcxxxx sub-string: abcddabc align: abcddabc// Find the dABC in the substring other than the good suffix itself
// I can't find it. << important: is the substring of the suffix found the same as the prefix? >>
// Substrings with good suffixes are: ABC, BC,c. Da, AB don't count because they're not suffixes,
// It doesn't make sense. Next, you can see that the substring ABC with the good suffix dabc happens to have the same prefix as the substring
// At this point, align them and proceed to the next round of matching.For example, primary string: aabbefGABcdefgxxxx substring: abcdefg align: abcdefg// There is no other EFG in the substring
// The efG substring does not have the same prefix as the substring. What about this?
// This is the simplest case, just offset the entire substring. There's no way that's going to match.
Copy the code

The question then arises, since there are three different cases of good suffix matching, is there a priority among them? What if all these things come together? (PS: The third can’t happen at the same time as the first two)

For example: Primary string: aabbabdabcdABXXXX Sub-string: abdabcdab Aligned: abdabcdab// Align with case 1Alignment: abdabcdab// Align with case 2
// In this case, the good suffix is dab. Repeated DAB in substring (case 1)
// dab substring ab is the same as the substring prefix (case 2)
// The alignment result of the two cases is as shown above.
Copy the code

At this point, I believe you already know the principle of bad character matching and suffix matching, right?

pretreatment

Can you start writing code if you know how matches work? Think much, next is this algorithm most fine (e) wonderful (xin) place.

Think of the two cases of good suffixes above

When a good suffix appears, I need to look in the substring to see if there is a repeat of the suffix, and if so, I also need to know where the repeat position is? Sounds good, isn’t it to find the position of suffix repetition? How can I find it? Can the length of a good suffix be determined?

In case two, there is a good suffix, the good suffix is not repeated in the substring, so look for the substring of the good suffix is the same as the prefix? How to find? How many characters are uncertain about a good suffix, let alone a substring of a good suffix?

Because it is difficult to directly find the one in line with the above two situations, we need to preprocess the substring first, so that we can easily find the location we need.

Good postfix preprocessing code is as follows:

/// preprocess the suffix
/// @param pStr pattern string
/// @param pLen mode string length
/// @param suffix returns the position at which pattern string suffix characters are repeated in the pattern string (excluding the original position, which is -1 if the suffix is not repeated)
/// for example: pattern string pStr = "abcdefg"; Since the suffix G occurs only at the end, all elements of the suffix array of the pattern string are -1, meaning that there is no duplicate suffix
/// for example: pattern string pStr = "abcdefa"; Since suffix A is also present in position 0 and has only 1 character, suffix[1] = 0
/// for example: pattern string pStr = "abcdeab"; Since suffix B exists at position 1 and a before b exists at position 0, suffix[1(suffix 1 character)] = 1(the subscript 1 character is B), suffix[2] = 0;
/// for example: pattern string pStr = "abcabab"; Suffix [1] = 4, suffix[2] = 3; (not ones and zeros)
///
/// @param prefix whether the suffix substring is the same as the prefix (note: this is a bool array)
/// for example: pattern string pStr = "abcdefg"; Since the suffix G only appears in the last position, that is, there is no case of the same prefix, so both are flase;
/// for example: pattern string pStr = "abcdefa"; Prefix [1] = true; prefix[1] = true;
/// for example: pattern string pStr = "abcdabc"; Prefix [3] = true; prefix[3] = true; // 3 represents the suffix length 3
///
void goodSuffix(char *pStr, int pLen, int *suffix, bool *prefix)
{
    // Initialize the suffix and prefix
    for (int i = 0; i < pLen; i++) {
        suffix[i] = - 1;     // There is no duplicate position by default
        prefix[i] = false;  // The prefix and suffix do not exist by default
    }
    // j indicates the first character (starting from 0), k indicates the last character (starting from 1)
    int j = 0, k = 0;
    for (int i = 0; i < pLen - 1; i++) {
        j = i;
        k = 1;
        // j represents the first character in the pattern string, pLen -k represents the last character
        while (j >= 0 && pStr[j] == pStr[pLen - k]) {
            // Save the position where the KTH suffix repeats
            suffix[k] = j;
            // All move forward one character
            j--;
            k++;
        }
        // Check whether the suffix of length k is the same as the prefix
        if (j == - 1) {
            prefix[k] = true; }}}Copy the code

Notes are very detailed, should be able to understand, there are places that do not understand welcome comments.

Similarly, if good suffixes can be preprocessed, can I also use bad characters? Sure, and it’s not that complicated.

Bad character preprocessing code is as follows:

// Count bad characters by pattern string
int* badChar(char *pStr, int pLen) {
    // Bad character array
    int *bc = malloc(sizeof(int) * SIZE);
    // Initially set all values to -1, indicating that the character does not exist in the pattern string
    for (int i = 0; i < SIZE; i++) {
        bc[i] = - 1;
    }
    
    // Iterate over the pattern string
    // To find the position of character C in the pattern string
    // For example, if the pattern string is abcdef, the position of d in the pattern string is 3, if the character is repeated, only the position of the last occurrence is recorded
    // This is the bad character rule
    for (int i = 0; i < pLen; i++) {
        // The ASCII code corresponding to the i-th character in the pattern string
        int ascii = (int)pStr[i];
        bc[ascii] = i;          // Save the position of this character in the pattern string
    }
    
    return bc;
}
Copy the code

Here is the string matching code:

#import <Foundation/Foundation.h>

#define SIZE 256    // The number of characters in the character set
#define NO_FOUND -1

/* ----------- here put the above two pretreatment methods can be -------------- */

// String matching method
int findStringIndexWithSubstring(char *str, char *subStr, int stl, int ssl)
{
    // Bad character array
    int *bc = badChar(subStr, ssl);
    / / the suffix
    int *suffix = malloc(sizeof(int) * ssl);
    // Whether the suffix of different length is the same as the prefix, the subscript of the array indicates the length of the suffix
    bool *prefix = malloc(sizeof(bool) * ssl);
    // Populate the data to suffix and prefix
    goodSuffix(subStr, ssl, suffix, prefix);
    
    // Start matching strings
    int i = 0, j = 0;
    // The number of characters offset each time
    int os1 = 0, os2 = 0;
    // If I > stl-ssl, the remaining characters are not as long as the substring, and the match cannot be successful
    while (i <= stl - ssl) {
        // BM algorithm matches backwards, starting from the last character of the pattern string
        for (j = ssl - 1; j >= 0; j--) {
            // STR [I + j] represents the characters to be matched in the main string, and subStr[j] represents the characters to be matched in the pattern string (substring)
            if(str[i + j] ! = subStr[j]) {// Do not want to wait, match failed, exit the loop
                break; }}// if j == -1, the string was successfully matched, not by breaking the loop, but by the loop condition
        if (j == - 1) {
            // Free up space
            free(bc);
            free(suffix);
            free(prefix);
            return i;
        }
        // if j >= 0, start counting offset characters
        // 1. Calculate the bad character in STR [I + j] according to the bad character rule
        os1 = j - bc[str[i + j]];
        // 2. Follow the good suffixes rule, if there are good suffixes
        os2 = 0;
        if (j < ssl - 1) {
            // Good suffix string length
            int k = ssl - 1 - j;
            // If k length suffixes have repeated strings
            if(suffix[k] ! =- 1) {
                os2 = j - suffix[k] + 1;
            }
            else {
                int x = 0;
                // x represents the starting subscript of the well-suffixed substring, ssl-x represents the length of the well-suffixed substring
                for (x = j + 1; x < ssl; x++) {
                    if (prefix[ssl - x]) {
                        os2 = x;
                        break; }}// Indicates that break is not used to exit
                if (x == ssl) {
                    If there is no string identical to a good suffix, and no prefix or substring identical to a good suffix, offset the length of the entire pattern string
                    os2 = ssl;      // Skip all of them}}}// Calculate the size of the actual offset (take the largest of the bad characters and the suffix)
        int offset = os1 > os2 ? os1 : os2;
        
        // Offset indicates the size of offset
        i = i + offset;
    }
    // No match for the string, free up space, and return -1
    free(bc);
    free(suffix);
    free(prefix);
    
    return NO_FOUND;
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        char *mainStr = "aabcaababcaabcbabcdeaabc";
        char *subStr  = "aababcaa";
        
        int a = findStringIndexWithSubstring(mainStr, subStr, (int)strlen(mainStr), (int)strlen(subStr));
        printf("%d\n", a);
    }
    return 0;
}
Copy the code

Sunday algorithm (known as the fastest string matching algorithm)

This algorithm can be said to be an improved version of BM algorithm, and the idea of BM algorithm is somewhat similar, but the difference is that it is matched from front to back and from left to right.

The Sunday algorithm is no longer concerned with bad characters and suffixes, but with the next character of the last character in the main string.

Algorithm principle

The principle of this algorithm is much easier to understand than BM algorithm. But the author’s language summary ability is not strong, or illustrate for example.

Example Main string: substring searching Substring: search offset: search/ / offset 1Offset: search/ 2 / migration
// The first match is being made
// The matching string in the main string is substr, so the last character is r
// the next character of r is I.
// Find if there is an I character in the substring, offset the substring length by +1
// the second match starts with the n after I
// 
// The second match is being made
// The main string is ng sea, so the last character is a
// The next character of a is r.
// If there is r in the string, align r with r in the string.
// Get offset 2
Copy the code

Code implementation

The Algorithm implementation of Sunday algorithm is also relatively simple.

The code is as follows:

#import <Foundation/Foundation.h>

#define SIZE 256
#define NO_FOUND -1

// Substring offset table
// Next array similar to KMP
int* offsetTable(char *str, int len)
{
    int *os = malloc(sizeof(int) * SIZE);
    for (int i = 0; i < SIZE; i++) {
        // The default offset is len + 1
        os[i] = len + 1;
    }
    for (int i = 0; i < len; i++) {
        os[str[i]] = len - i;
    }
    return os;
}
// Find the index of the substring
int findSubstringIndex(char *str, char *subStr, int stl, int ssl)
{
    // Get the offset array
    int *os = offsetTable(subStr, ssl);
    int i = 0, j = 0;
    while (i <= stl - ssl) {
        j = 0;
        // Matches the string
        while (str[i + j] == subStr[j]) {
            j++;
            // The match is successful
            if (j == ssl) {
                returni; }}// Failed to match
        i = i + os[str[i + ssl]];
    }
    
    return NO_FOUND;
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        char *mainStr = "aabcaababcaabcbabcdeaabc";
        char *subStr  = "abcd";
        
        int a = findSubstringIndex(mainStr, subStr, (int)strlen(mainStr), (int)strlen(subStr));
        printf("%d\n", a);
    }
    return 0;
}
Copy the code

conclusion

This article provides five solutions to the classic string matching problem, and illustrates the principles of these five solutions.

The article addresses the https://juejin.cn/post/6844904158865129486