Writing in the front

In computer science, Boyer-Moore string search algorithm is a very efficient string search algorithm. It was designed by Bob Boyer and J Strother Moore in 1977. This algorithm preprocesses only the searched string (keyword), not the searched string. Although the execution time of the Boyer-Moore algorithm also depends linearly on the size of the searched string, it is usually a fraction of that of other algorithms: it does not need to compare characters in the searched string one by one, but skiffs some of them. Generally, the longer the search keyword, the faster the algorithm. Its efficiency comes from the fact that for every failed match attempt, the algorithm is able to use this information to exclude as many unmatched locations as possible.

Before studying BM algorithm, I had already mastered KMP algorithm, so I suggest those who have not mastered it to learn it first, step by step, and you can read my article on KMP algorithm. The reason why we say step by step is that BM algorithm, in most cases, performs better than KMP algorithm, so most of the time, it is learned as KMP advanced algorithm. The BM algorithm matches from the end of the pattern string and has a worst-case time complexity of $O(N) $. Data show that in practice, it is higher than the actual efficiency of KMP algorithm, and can be about 3-5 times faster, which is worth learning. When learning BM algorithm, I have found a lot of materials and encountered many excellent articles. However, I have not encountered any articles that both explain the principle and realize the code, and it is not easy to find the Java version. Therefore, I am going to stand on the shoulders of the great god and write this article. Since some pictures take up more time, I will directly quote some pictures from the blog. Those who are interested can read their articles directly.

Principle of BM algorithm

BM algorithm defines two rules:

  • Bad character rule: When a character in the text string does not match a character in the pattern string, we call the mismatched character in the text string a bad character, and the pattern string needs to move to the right.Number of digits moved = position of bad character in pattern string - position of bad character in pattern string to the right. In addition,If the "bad character" is not included in the pattern string, the right-most position is -1.
  • Rule of good suffixes: When characters are mismatched,Backshift digit = position of good suffix in pattern string - position of good suffix in pattern string last occurrenceAnd,-1 if the good suffix does not appear again in the pattern string.

The following examples illustrate the BM algorithm. For EXAMPLE, given A text string “HERE IS A SIMPLE EXAMPLE”, and A pattern string “EXAMPLE”, we now want to find whether the pattern string IS in the text string, and if so, return the position of the pattern string in the text string.

  • First, align the “text string” header with the “pattern string” header and compare from the tail.S“And”E“It doesn’t match. At this point,”SThis is called a bad character, or mismatched character, which corresponds to the first character in the pattern string6position And”S“Not included in pattern string”EXAMPLEIn the middle (is equivalent to the most right appearing position is- 1), which means you can move the pattern string back6 - (1) = 7Bit, thus moving directly to”S“Is the last one.

  • Again, we start at the tail and we find.”P“And”E“It doesn’t match, so”P“Is” bad character “. However,”P“Contained in pattern string”EXAMPLE“. Because of the”PThe “bad character” corresponds to the first character in the pattern string6(from0Start number), and the right-most position in the pattern string is4, so, move the pattern string back6-4 = 2Bit, two “P” aligned.



  • Compare in turn, and get”MPLE“Matches, called” good suffixes, “are strings with matching tails. Note that”MPLE“、”PLE“、”LE“、”E“All good suffixes.

  • Found that”I“And”A“Mismatch:”I“Is a bad character. If the bad character rule is followed, the pattern string should be moved back2 - (1) = 3position The question is, is there a better way to move it?



  • A better shift is to take advantage of the suffix rule: when characters are mismatched,Backshift digit = position of good suffix in pattern string - position of good suffix in pattern string last occurrence, and if the good suffix does not appear again in the pattern string- 1. All “good suffixes” (MPLE, PLE, LE, E), only”E“In the”EXAMPLE“The head appears, so moves back6-0 = 6position As you can see, the “bad character rule” can only move3Bit, “good suffix rule” can be moved6position Move the larger of the two rules each time. The move bits of these two rules relate only to the pattern string, not to the original text string.

  • Continuing the comparison from the end, “P” does not match “E”, so “P” is a “bad character”, which is moved 6-4 = 2 bits by the “bad character rule”. Because is the last digit is mismatched, has not obtained the good suffix.

Good suffixes deepen understanding

It can be seen from the above that BM algorithm is not only efficient, but also clever and easy to understand. The bad character rule is relatively easy to understand, and the good suffixes, if you don’t understand them, I’m going to go ahead and give you an example here, just to make sense of it.

  • If there are good suffixes in the pattern string that have already been matched, align the target string with the good suffixes and match from the last element of the pattern string.



  • If the suffix cannot be found, find the longest prefix that matches, and align the target string with the longest prefix (if it exists). Pattern string [m-s, m] = pattern string [0, s].



  • If there is no substring matching the good suffix at all, move the entire pattern string right.

Implement good character rules first

BM is pretty easy to understand, and in fact if you’ve studied KMP before you will feel the same way, KMP is not too hard to understand, but the point is how to implement the next array. BM algorithm is also very easy to understand the principle, but how to achieve, there is no set of standard code. But you can study other people’s code and implement a code that is as lean as possible. Again, step by step, let’s implement good character rules first. The code for good character rules is as follows. I will add comments where necessary to the code to help understand. Code is the best teacher.

public static void getRight(String pat, int[] right) {
	// Create an array of pattern string character positions, initialized to -1, which is used to record pattern string
	//, the relative position of each character in the pattern string, where 256 is used directly, also
	// is the maximum number of ASCII characters, of course, if your string is limited to 26
	//, you can also use 26 directly
    for (int i = 0; i < 256; i++) {
        right[i] = -1;
    }
    // It is worth mentioning that in this way you can find if the pattern string exists the same
    // character, then in the right array, records the position of the right-most character
    for (int j = 0; j < pat.length(); j++) { right[pat.charAt(j)] = j; }}public static int Search(String txt, String pat, int[] right) {
    int M = txt.length();// The main string length
    int N = pat.length();// The length of the pattern string
    int skip;// Used to record skipping several characters
    for (int i = 0; i < M - N; i += skip) {
        skip = 0;// Remember to initialize to 0 every time you enter the loop
        for (int j = N - 1; j >= 0; j--) {
        	// not equal, which means bad characters, move according to the rules above
            if(pat.charAt(j) ! = txt.charAt(i + j)) { skip = j - right[txt.charAt(i + j)];// Skip is less than 1 because bad characters are at the far right of the pattern string, probably
                // If the j points to the right of the character, it has been crossed.
                if (skip < 1) 
                    skip = 1;
                break; }}// If skip is equal to 0, no bad characters appear, so
        // If the match is successful, return the position of the current character I
        if (skip == 0)
            return i;
    }
    return -1;
}
Copy the code

Full BM implementation

The above code is not difficult to understand, I believe you have understood, so the next also do not have to separate good suffix implementation, directly on the complete implementation code. Because the full BM implementation compares the bad character rule with the good suffix rule, whichever moves more characters is used. As usual, I’ll comment the following code as much as POSSIBLE.

public static int pattern(String pattern, String target) {
    int tLen = target.length();// The main string length
    int pLen = pattern.length();// The length of the pattern string

	// If the pattern string is longer than the main string, there is no comparability, return -1
    if (pLen > tLen) {
        return -1;
    }

    int[] bad_table = build_bad_table(pattern);// Get an array of bad character values, as shown below
    int[] good_table = build_good_table(pattern);// Get an array of good suffix values

    for (int i = pLen - 1, j; i < tLen;) {
        System.out.println("Jump position:" + i);
        // This is different from the above implementation of bad characters. We solved the bad characters and good suffixes in advance
        // The corresponding array of values, so we only need to compare in the side loop. The other thing I want to make clear is that here
        // Skip is not used to record the position skipped, and moves directly against the moving pointer I in the main string
        for (j = pLen - 1; target.charAt(i) == pattern.charAt(j); i--, j--) {
            if (j == 0) {// Returns the first character of the pattern string, indicating that the match was successful
                System.out.println("Match successful, location:" + i);
                // if you want to match more than one pattern string, break out of the loop and make i++
                // Because you can't just skip the entire string that has already been matched, you might lose the match.
// i++; // Multiple matches
// break;
                returni; }}// If bad characters are present, then the array of bad characters and good suffixes is used
        i += Math.max(good_table[pLen - j - 1], bad_table[target.charAt(i)]);
    }
    return -1;
}

// Character information table
public static int[] build_bad_table(String pattern) {
    final int table_size = 256;// The type of character is explained above
    int[] bad_table = new int[table_size];// Create an array to record the number of characters that should be skipped when bad characters occur
    int pLen = pattern.length();// The length of the pattern string

    for (int i = 0; i < bad_table.length; i++) {
        bad_table[i] = pLen;  
        // Default initialization is all matching string lengths, because bad characters in the main string do not appear in the pattern string
        // Skip the length of the pattern string
    }
    for (int i = 0; i < pLen - 1; i++) {
        int k = pattern.charAt(i);// Record the current ASCII character values
        Bad_table is stored according to the ASCII value of a character
        // Bad characters appear in the right-most position, as mentioned above when implementing bad characters. But think about it
        // Why is the value of the bad character stored here the right-most bad character relative to the pattern string last
        // The position of the character? Why is that? The first thing you need to understand is what I means, this I is not an I here, it's an I up here
        // The "I" in the loop of the pattern function, which we call "I" for convenience. This "I" is magic, although "I" is
        // The pointer on the main string, but since skip is not used in the loop, I is matched with j
        // move, which means that, in a sense, I can also be directly positioned relative to the pattern string,
        // If you understand this, you can understand the behavior of I in this loop.

		// If you think about it carefully, let's consider the case, if the pattern string is the most
        // The last character, which is the first character of the match, has a bad character, then this time, straight
        // Move the value so that the rightmost character is correct for the bad character. So if not the first one
        // bad characters? This kind of circumstance you think carefully, this kind of circumstance also means appeared good suffix
        // case, suppose we set the right-most character to the bad character
        bad_table[k] = pLen - 1 - i;
    }
    return bad_table;
}

// Matches the offset table
public static int[] build_good_table(String pattern) {
    int pLen = pattern.length();// Mode string length
    int[] good_table = new int[pLen];// Create an array and store the suffix values
    // Used to record the relative position of the latest prefix, initialized to the pattern string length, because it means that the current suffix string is empty
    // Understand the meaning of lastPrefixPosition
    int lastPrefixPosition = pLen;

    for (int i = pLen - 1; i >= 0; --i) {
        if (isPrefix(pattern, i + 1)) {
        // If the current position has a prefix match, the current position is recorded
            lastPrefixPosition = i + 1;
        }
        good_table[pLen - 1 - i] = lastPrefixPosition - i + pLen - 1;
    }

    for (int i = 0; i < pLen - 1; ++i) {
    // Calculates the length of the string that matches the suffix at the specified position
        int slen = suffixLength(pattern, i);
        good_table[slen] = pLen - 1 - i + slen;
    }
    return good_table;
}

// The prefix matches
private static boolean isPrefix(String pattern, int p) {
    int patternLength = pattern.length();// Mode string length
    // where j starts at the first character in the pattern string, I starts at the specified character position, and loops through the current specified position p
    // Whether the following string matches the pattern string prefix
    for (int i = p, j = 0; i < patternLength; ++i, ++j) {
        if(pattern.charAt(i) ! = pattern.charAt(j)) {return false; }}return true;
}

// The suffix matches
private static int suffixLength(String pattern, int p) {
    int pLen = pattern.length();
    int len = 0;
    for (int i = p, j = pLen - 1; i >= 0 && pattern.charAt(i) == pattern.charAt(j); i--, j--) {
        len += 1;
    }
    return len;
}
Copy the code

Let’s take a look at the above code. Here’S an example for the above code. The values of the two tables after calculation are as follows:



conclusion

Actually if you understand the above, I believe you will have a kind of excitement, may also have a question, is the BM algorithm, if the bad character and good suffix rules are realized, looking at the code quantity how much, and at the time of calculating displacement of two arrays, equivalent to do so many preparation work, won’t affect efficiency. It is difficult for me to answer this question. According to the specific implementation of different languages, the execution efficiency of the code will also be affected. However, it is certain that BM algorithm has obvious advantages when the length measure of character matching is very large. After all, CV data sets tend to be in the millions. BM algorithm is worth learning.