This is the fifth day of my participation in the August More text Challenge. For details, see: August More Text Challenge

We learned the principle of BM algorithm in BM algorithm, today we will look at how to implement the code.

Let’s look at how the BM algorithm code is implemented.

In the “bad character rule”, when we encounter bad characters, we need to calculate the backward digit AI-bi, where Bi is the key point of calculation. If we take the bad characters and look in order in the pattern string, this can be done, but it is less efficient. We can use an array of size 256 to record the position of each character in the pattern string. The subscript of the array corresponds to the Ascii encoding of the character, and the array stores the position of the character in the pattern string.

SIZE = 256
def generateBC(b, m):
    bc=[-1]*SIZE
    for i in range(m):
        accii=ord(b[i])
        bc[accii]=i
    return b
Copy the code

We first write the “bad character rule” in BM algorithm, and ignore the “good suffix rule”, and ignore the case where ai-bi is negative.

Def bm(a,n,b,m): # def bm(a,n,b,m): # def bm(a,n,b,m): # def bm(a,n,b,m): # def bm(a,n,b,m): # def bm(a,n,b,m): # def bm(a,n,b,m): # def bm(a,n,b,m): # def bm(a,n,b,m): # If (a[I +j]! I = I + (j - BC [ord(a[I +j])]) return-1 = I + (j - BC [ord(a[I +j])]) return-1Copy the code

So far, we have implemented the code framework for the “bad character rule”, and all that remains is the code logic to add the “good suffix rule” to it. Before we go any further, let’s briefly review the key points of the good suffix rule.

  • In the pattern string, look for another substring that matches the good suffix.
  • In the suffix substring of good suffixes, find the longest one that matches the prefix substring of the pattern string.

One way to think about it is that, since the good suffix is also a suffix substring of the pattern string, we can preprocess the pattern string by calculating the position of each suffix substring in the pattern string and the corresponding position in the other matched substring before matching the pattern string to the main string. This preprocessing process is a bit complicated. You can read it several times and draw a picture on the paper.

Let’s first look at how to represent the different suffix substrings in the pattern string. Since the position of the last character in the suffix substring is fixed, with the subscript m-1, we only need to record the length. We can uniquely identify a suffix substring by its length.

Let’s introduce a key array, suffix, next. The subscript k of the suffix array represents the length of the suffix substring, and the corresponding position of the array stores the starting subscript position of the substring {u*} that matches the suffix {u} in the pattern string.

One point to note is that if there are multiple substrings in the pattern string that match the suffix substring, we choose the beginning of the last substring so that we don’t slide too far and miss a possible match.

Let’s look at the second rule of suffixes, which is to find the longest suffixes that match the prefix substring of the pattern string in the substring of good suffixes. Next, let’s introduce another array variable, prefix, to record whether the suffix of the pattern string matches the prefix of the pattern string.

 

Now, how do we assign values to these two arrays? We take the substring with the subscript 0~ I (I can be 0 to m-2) and the whole pattern string, and find the common suffix substring. If the length of the common suffix substring is k, then we record suffix[k]=j (j represents the starting subscript of the common suffix substring). If j is equal to 0, that is, the common suffix substring is also the prefix substring of the pattern string, then we record prefix[k]=true. Let’s look at the code implementation.

def generateSP(b,m): suffix= [-1]*m prefix= [False]*m for i in range(m-1): J = I k = 0 # public suffix substring length while (> = 0 and b [j] j = = b - k] [m - 1) : j = k = k + 1 # j j - 1 + 1 said public suffix substring in the b [0, I] starting subscript suffix [k] = j + 1 if (j = = 1) : Prefix [k] = True return suffix,prefixCopy the code

Now let’s see how do we calculate the number of digits that we slide back in the pattern string according to the good suffix rule? Let’s say the length of the suffix is k. Let’s take the suffix and look for a matching substring in the suffix array. If suffix[k] is not equal to -1 (-1 means no matching substring), then we move the pattern string back j-suffix[k]+1 bit (j means the character subscript in the pattern string corresponding to the bad character). If suffix[k] is equal to -1, there is no other substring fragment in the pattern string that matches the suffix well. Let’s use this rule. B [r, m-1] (r from j+2 to m-1), k=m-r. If prefix[k] is true, we can move the pattern string back r bit. If neither rule finds a good match for the suffix and its suffix substring, we move the entire pattern string back m bits. So far, we have finished talking about the good suffix rules. Now we can insert the code of the good suffix rules into the previous framework and get the complete version of BM algorithm. `

Def generateBC(b, m): BC =[-1]*SIZE for I in range(m): accii=ord(b[i]) bc[accii]=i return bc def generateSP(b,m): suffix= [-1]*m prefix= [False]*m for i in range(m-1): J = I k = 0 # public suffix substring length while (> = 0 and b [j] j = = b - k] [m - 1) : j = k = k + 1 # j j - 1 + 1 said public suffix substring in the b [0, I] starting subscript suffix [k] = j + 1 if (j = = 1) : Def moveSP(j,m,suffix,prefix) def moveSP(j,m,suffix,prefix): K =m-1-j if suffix[k]! =-1: return j-suffix[k]+1 for r in range(j+2,m): if prefix[m-r]==True: return r return m def bm(a,n,b,m): BC =generateBC(b,m) suffix, prefix = generateSP(b,m) I = 0 while I <=n-m: For j in range(m-1,-2,-1): # if(a[I +j]! =b[j]): j break if(j<0) X = j-bc [ord(a[I +j])] y = 0 if j < m-1: y = moveSP(j, m, suffix, prefix) i = i + max(x, y) return -1Copy the code

So that’s it, BM algorithm.

If you want a PDF version of this article, you can follow the public account “Programmer senior” private message me.

More hardcore knowledge. Please follow the public number “programmer senior”.