BF,BM,KMP after reading do not understand, I will go to haircut!

Note: if the video in the article is not playable, you can go to check the original text, I think even without watching the video should also be able to read.

In order to ensure the code rigor, all the codes in the article are in leetCode brush questions website AC, you can rest assured to eat.

Birthdays of the emperor, national celebration, yuen restaurant remember as the day the first hotel, so was selected as the food supply of the celebration, the celebration for yuen restaurant is an unprecedented challenge, after all, is the first time for the emperor to celebrate birthday, great sin is not taken off his head, the ji yuan decorate the restaurant is in tension. Suddenly there is a store 2 panic zhang Zhang ran to Yuan Kitchen in front of the report, in the end what happened, let the store 2 so panic?

Inside Yuan Ji Restaurant

Shopkeeper: Oh, no, shopkeeper, something serious has happened.

Yuan Kitchen: what happened, say slowly, so flustered, into what propriety. (The store has been open for a long time, and the shelf has come out.)

Shopkeeper: The emperor ordered 666 dishes according to our menu, but our chef of west Lake fish in vinegar went home to get married on leave, I don’t know if the emperor ordered this dish, if we order this dish, we can’t make it, then our restaurant will be finished.

(Yuan Chu listened to, scared to sit on the ground, slow half a day said)

Yuan Hutch: don’t say so much, quick look for me emperor point of dish inside, have this dish!

After a long search and many checks, it was confirmed that the Emperor had not ordered the dish. Everyone in the restaurant breathed a sigh of relief

With the example above, let’s take a quick look at string matching.

String matching: Suppose S and T are given two strings. The process of finding pattern string T in the main string S is called string matching. If pattern string T is found in the main string S, it is said to have been matched successfully, and the function returns the position where T first appears in S.

Ex. :

In the figure above, we are trying to find the position where the pattern T = baab first appears in the main string S = abcabaabcabac, which is the red shaded part. The position where T first appears in S is subscript 4 (the first subscript of the string is 0), so return 4. If the pattern string T does not appear in the main string S, -1 is returned.

The algorithm to solve the above problem is called string matching algorithm, and today we are going to introduce three string matching algorithms, remember to clock in, you may not be asked in the interview.

BF algorithm Brute Force

The algorithm is easy to understand, which is that we compare the pattern string to the main string, and continue to compare the next character if they are consistent, until the whole pattern string is compared. If not, move the pattern string back one bit and start the comparison again from the first part of the pattern string. Repeat the procedure. Let’s look at the GIF parsing of this method.

Video,

Because you can’t place the video, so if you want to watch the video, you can go to the official account, where the video is

Through the above code is not a moment to understand this algorithm, let’s use this algorithm to solve the following classic problem.

Leetcdoe 28. Implementing strStr()

Topic describes

Given a Haystack string and a Needle string, find the first position in the Haystack string where the Needle string appears (starting from 0). If none exists, -1 is returned.

Example 1:

Input: haystack = “hello”, needle = “ll” Output: 2

Example 2:

Input: haystack = “aaAAA “, needle = “bba” Output: -1

title

It’s actually pretty easy to understand, but we need to pay attention to a few things, like what we should return if our pattern string is 0, what we should return if our pattern string is longer than the main string. So let’s look at the problem code.

The subject code

class Solution {
    public int strStr(String haystack, String needle) {
        int haylen = haystack.length();
        int needlen = needle.length(); 
        // Special circumstances
        if (haylen < needlen) {
            return -1;
        }
        if (needlen == 0) {
            return 0;
        }
        / / the string
        for (int i = 0; i < haylen - needlen + 1; ++i) {
            int j;
            / / pattern string
            for (j = 0; j < needlen; j++) {
                // The main string pointer is moved back one bit
                if(haystack.charAt(i+j) ! = needle.charAt(j)) {break; }}// The match is successful
            if (j == needlen) {
                returni; }}return -1; }}Copy the code

Let’s look at another algorithm of BF algorithm (show back), in fact, the principle is the same, is to modify the code, as long as it is after seeing our GIF, this can also be understood, we can combine the following code notes and GIF to understand.

class Solution {
    public int strStr(String haystack, String needle) {
        // I stands for main string pointer, j mode string
        int i,j;
        // The main string length and the mode string length
        int halen = haystack.length();
        int nelen = needle.length();
        // Loop condition, where only I grows
        for (i = 0 , j = 0; i < halen && j < nelen; ++i) {
            // If it is the same, move the j pointer
            if (haystack.charAt(i) == needle.charAt(j)) {
                ++j;
            } else {
                // In case of a mismatch, redirect j to the head of the pattern string and redirect I to the next character at the beginning of the match
                i -= j;
                j = 0; }}// The index is returned on success, and -1 is returned on failure.
        int renum = j == nelen ? i - nelen : -1;
        returnrenum; }}Copy the code

BM algorithm (Boyer Moore)

We just talked about the BF algorithm, but the BF algorithm is flawed, as in the following case

As shown in the figure above, if we use the BF algorithm, every time we encounter a character that does not match, we move the pattern string one bit to the right, and then match again from the beginning. Let’s observe that each character in our pattern string ABCdex is different, but the first time we match the string, ABCDE matches successfully and fails at x. And because the pattern strings are different bits, we don’t have to move one bit to the right every time and recompare, we can skip some steps. The following figure

We can skip some of these steps and go straight to the following one. So what are our principles?

Bad character rule

Our previous BF algorithm was compared from front to back, while BM algorithm was compared from back to front. Let’s look at the specific process, and we still use the above example.

BM algorithm compares from the back to the front. At this time, we find that the first character of the comparison does not match, and we call the character of the main string bad character, that is, F. After finding the bad character, we search for whether the character (f) is contained in the pattern string T. At this point we simply move the pattern string right one bit after the bad character. The following figure

What if we find a bad character in a pattern string?

At this time, our bad character is F, and we find that there is bad character F in the pattern string. We need to move the pattern string T to align the f in the pattern string with the bad character. See below.

Then we continue to compare from right to left and find that D is a bad character, so we need to align d and bad characters in the pattern string.

So let’s think about the case where there are multiple bad characters in the pattern string.

So why would we want the right-most corresponding element to match a bad character? Let’s see what happens if we don’t follow this rule in the example above.

If we don’t follow our rules above, we miss our true match. Our main string contains babac, but it did not match, so we should follow the rule that the right-most corresponding character is opposed to the bad character.

We have introduced three kinds of movement cases above, namely, no character corresponding to bad character is found in the pattern string below, one corresponding character is found, and two characters are found. So in each of these three cases, we’re moving different numbers, so how do we figure out how to move numbers? Let’s subscript the characters in the figure. See below

So let’s think about that.

At this point this situation is certainly not, not to move to the right, and may even move to the left, so do we have any way to solve this problem? Read on.

Good suffixes rule

Good suffixes are actually very easy to understand, we said before that BM algorithm is compared from right to left, so let’s look at this example.

Here if we move by the bad character is not reasonable, then we can use the good suffix rule, so what is a good suffix?

BM algorithm is compared from right to left. When bad characters are found, CAC has matched successfully and bad characters are found in the red shadow. At this time, the CAC that has been successfully matched is our good suffix. At this time, we take it to look in the pattern string. If we find another string matching the good suffix, we will slide another string matching the good suffix to the position aligned with the good suffix.

If it feels like a mouthful, it doesn’t matter, let’s look at the picture below, red for bad characters, green for good suffixes

So that’s the case, but let’s think about the case below

As we said above, if you don’t find a good suffix in the header of the pattern string, you can also find a substring with a good suffix. But why emphasize the head?

Let’s take a look at the situation

But what happens when we find substrings with good suffixes in the head?

Let’s take a look at the specific execution process of an example through a GIF

video

Having said that, the rule of bad characters and good suffixes is finished. Bad characters are easy to understand, so let’s summarize the good suffixes

1. If the pattern string has a good suffix, either the middle or the head can be moved according to the rules. If the good suffix appears more than once in the pattern string, the right-most good suffix is used as the baseline.

2. If the head of the pattern string contains a substring with good suffixes, it can be moved according to the rules. If the middle part contains a substring with good suffixes, it cannot be moved.

3. If there is a mismatch at the end of the pattern string, that is, there is no good suffix, then move according to the bad character. Some articles here are not mentioned, which needs special attention.

Boyer R S, Moore J s. A fast string searching algorithm [J]. Communications of the ACM, 1977,10:762-772.

Before we began say bad character, isn’t it possible negative situation, namely moves to the left, so we in order to solve this problem, we can calculate good suffix and bad character sliding backwards digit * * (in the case of good suffix not 0) * *, and then take the largest of the two Numbers, as a model series of digits sliding backwards.

It’s a hell of a picture to draw. Now let’s take a look at the algorithm code, it’s a bit long, I’ve annotated it and it’s AC on the website, if you’re interested, but you’re not interested in understanding bad characters and good suffixes. You can skip to the KMP section

class Solution {
    public int strStr(String haystack, String needle) {
        char[] hay = haystack.toCharArray();
        char[] need = needle.toCharArray();
        int haylen = haystack.length();
        int needlen = need.length;
        return bm(hay,haylen,need,needlen);
    }
    // Use to calculate the number of moving digits in the case of bad characters
    private static void badChar(char[] b, int m, int[] bc) {
        / / initialization
        for (int i = 0; i < 256; ++i) {
            bc[i] = -1;
        }
        //m is the length of the pattern string. If there are two as, the last one overwrites the first one
        for (int i = 0; i < m; ++i) {
            int ascii = (int)b[i];
            bc[ascii] = i;/ / the subscript}}// use the suffix to calculate the number of moving digits
    private static void goodSuffix (char[] b, int m, int[] suffix,boolean[] prefix) {
        / / initialization
        for (int i = 0; i < m; ++i) {
            suffix[i] = -1;
            prefix[i] = false;
        }
        for (int i = 0; i < m - 1; ++i) {
            int j = i;
            int k = 0;
            while (j >= 0 && b[j] == b[m-1-k]) {
                --j;
                ++k;
                suffix[k] = j + 1;
            }
            if (j == -1) prefix[k] = true; }}public static int bm (char[] a, int n, char[] b, int m) {

        int[] bc = new int[256];// Create an array to hold the subscript of the rightmost character
        badChar(b,m,bc);
        // The right-most array of suffixes of various lengths
        int[] suffix_index = new int[m];
        // Check if it is a header, true if it is a header
        boolean[] ispre = new boolean[m];
        goodSuffix(b,m,suffix_index,ispre);
        int i = 0;// The first matching character
        // Note the end condition
        while (i <= n-m) {
            int j;
            // Find bad characters from back to front
            for (j = m - 1; j >= 0; --j) {
                if(a[i+j] ! = b[j])break;
            }
            // Pattern string traversal completed, matching successfully
            if (j < 0) {
                return i;
            }
            // What should I do if the match fails
            // Find the number of digits moved under the bad character rule, which is the right subscript of our bad character subscript
            int x = j - bc[(int)a[i+j]];
            int y = 0;
            // In the case of good suffixes, calculate the number of moving digits in the case of good suffixes, if there is no good suffixes, follow the bad character
            if (y < m-1 && m - 1 - j > 0) {
                y = move(j, m, suffix_index,ispre);
            }
            / / move
            i = i + Math.max(x,y);

        }
        return -1;
    }
    // j stands for bad character subscript
    private static int move (int j, int m, int[] suffix_index, boolean[] ispre) {
        // Good suffix length
        int k = m - 1 - j;
        // If there is a good suffix of length k, return the moving digit,
        if(suffix_index[k] ! = -1) return j - suffix_index[k] + 1;
        // Find the maximum length of the substring with a good prefix, starting with the longest substring
        for (int r = j + 2; r <= m-1; ++r) {
            // If it is a header
            if (ispre[m-r] == true) {
                returnr; }}// If there is no string with a good suffix match, or if the header is a good suffix substring, move to the m bit, which is the length of the matching string
        returnm; }}Copy the code

Let’s understand the two arrays that we’re using in our code, because the two regular moves are only related to the pattern string, not the main string, so we can figure out the moves in each case in advance and store them in the array.

KMP Algorithm (Knuth-Morris-Pratt)

We just talked about BM algorithm, which is not very easy to understand, but you can certainly understand it if you pay attention to it, and let’s look at a new algorithm, which is required for the postgraduate entrance exam. In fact, the essence of BM and KMP algorithm is the same, you understand BM and then understand KMP that is a matter of minutes.

Let’s start with an example

video

In order to make it easier for readers to understand, we changed the pointer movement into mode string movement, which is the same relative to the main string movement, and the re-comparison is from the pointer position.

Through the above example is not soon can understand the idea of KMP algorithm, but the difficulty of KMP is not here, but more thinking, carefully look at the understanding is also very easy.

In the example above we mentioned a noun, longest common prefix and suffix. What does this mean? Let’s use a simpler example to illustrate this.

At this point, we failed to match in the red shadow, and the green part is the successful part, so we observe the successful part.

Let’s look at all the prefixes for the successful part

Our longest common prefix and suffix is shown below, so we need to move like this

Ok, so the core principles of KMP are pretty much settled, but the question is, how do we know what the longest common prefix and suffix is? How do you know how many places to move?

Just now we said in BM that the number of digits we move has nothing to do with the main string, only with the pattern string, and with the value of our BC,suffix and prefix arrays. We can know how many digits we move each time by these arrays. In fact, KMP also has an array, which is called the next array. So what does this next array store?

Next array stores our longest common prefix suffix, prefix ending character subscript. If it feels a little awkward, let’s go through an example.

Now that we know about the next array, it’s easy to implement our KMP algorithm, and let’s take a look at what the next array does.

The rest of the story goes without saying. It’s exactly the same. Let’s translate this example into the animation that we started with.

Because you can’t place the video, so if you want to watch the video, you can go to the official account, where the video is

Now let’s take a look at the code, marked with detailed comments, take a look.

Note: Many textbooks have inconsistent representations of the next array

class Solution {
    public int strStr(String haystack, String needle) {
        // Two special cases
        if (needle.length() == 0) {
            return 0;
        }
        if (haystack.length() == 0) {
            return -1;
        }
        / / char array
        char[] hasyarr = haystack.toCharArray();
        char[] nearr = needle.toCharArray();
        / / the length
        int halen = hasyarr.length;
        int nelen = nearr.length;
        // return the subscript
        return kmp(hasyarr,halen,nearr,nelen);

    }
    public int kmp (char[] hasyarr, int halen, char[] nearr, int nelen) {
        // Get the next array
        int[] next = next(nearr,nelen);
        int j = 0;
        for (int i = 0; i < halen; ++i) {
            // Find the mismatched character, then move the pointer according to the next array, move to the maximum common prefix,
            // The last bit of the prefix has the same meaning as the movement pattern string
            while (j > 0&& hasyarr[i] ! = nearr[j]) { j = next[j -1] + 1;
                // If the length is exceeded, you can return that it does not exist
                if (nelen - j + i > halen) {
                    return -1; }}// If they are the same, move the pointer back to compare the next character
            if (hasyarr[i] == nearr[j]) {
                ++j;
            }
            // Iterate over the entire pattern string, returning the starting index of the pattern string
            if (j == nelen) {
                return i - nelen + 1; }}return -1;
    }
    // This section is difficult to understand, do not want to see the students can ignore the general meaning, or debug yourself, see how it works
    // I will write a comment for each step
    public  int[] next (char[] needle,int len) {
        // Define the next array
        int[] next = new int[len];
        / / initialization
        next[0] = -1;
        int k = -1;
        for (int i = 1; i < len; ++i) {
            // We now know the longest prefix for [0,i-1], but we need to backtrack if k+1 points to a different value than I
            // Next [k] is used to record the length of the longest common suffix of the substring.
            Next [k+1] = next[k+1]
            while(k ! = -1 && needle[k + 1] != needle[i]) {
                k = next[k];
            }
            // In the same case, the next digit of k is the same as I, and we already know the longest prefix and suffix of [0,i-1]
            // K - 1 is the same as I
            if (needle[k+1] == needle[i]) {
                ++k;
            }
            next[i] = k;

        }
        returnnext; }}Copy the code

I am Yuan Chu, a programmer who loves cooking, and a young man who loves using dynamic graphic algorithm. The new man asks for support. This article really wrote for a long time, if you feel good, please point a like, we can also go to my public number to see all my articles, each has a GIF analysis, the public number: Yuan Chu’s algorithm hut