This is the third day of my participation in the August More Text Challenge

Today we are going to talk about string matching algorithms. String matching is often used in our work, and it is also a common question in interviews. There are many algorithms for string matching, so there is a lot to learn in depth. We will have a series of articles that try to make string matching algorithms as clear as possible.

Today we are going to focus on the single pattern string matching algorithm – where one string is compared to another. Before we begin, for the sake of further explanation, let’s clarify two definitions, namely main string and pattern string. If we are looking for string B of length M in string A of length n, then A is the primary string and B is the pattern string, where n>m. Let’s start with the simplest BF algorithm.

BF algorithm

BF algorithm is also called violent matching algorithm, is the most direct, simple algorithm. The so-called violent matching algorithm is to fix the main string, and then the pattern string moves forward, one by one, until a matching substring is found in the main string. As shown in the figure below.

def bf(a, b): n = len(a) m = len(b) if n <= m: return 0 if b == a else -1 for i in range(n-m+1): for j in range(m): if a[i+j] == b[j]: if j == m-1: return i else: continue else: break return -1 if __name__ == '__main__': A = 'cbdac b =' ac 'start = bf (a, b) print (' result:' start) # # # # # # # # # output result: 3Copy the code

As we can see from the code above, the worst-case time complexity of this algorithm is O(n*m).

Fairly RK algorithm

The full name of RK algorithm is Rabin-Karp algorithm. It was named after its two inventors, Rabin and Karp. The idea of the RK algorithm is to compare the Hash values of two strings to determine if they are equal. In our BF algorithm, if the length of the main string is N and the length of the pattern string is M, we need to compare n-M +1 substring and pattern string by violence to find the substring matching the main string and pattern string. When comparing substring and pattern string, bit by bit comparison is required, so the time complexity of BF algorithm is high, which is O(N*M). The idea of RK algorithm is: hash the n-M +1 substring separately, and then compare the size with the hash value of the pattern string. If the hash value of a substring is equal to the pattern string. That means that the corresponding substring matches the pattern string (let’s ignore the hash conflicts for now). Because hash comparisons are fast, substring and pattern string comparisons are more efficient. As shown in the figure below.

However, the hash algorithm computes the hash value by iterating through each character in the substring. Although the efficiency of substring and pattern string comparison is improved, the overall efficiency of the algorithm is not improved. So how to improve the efficiency of the hash algorithm to calculate the substring hash value? This requires designing a more efficient hash algorithm. Assuming that the character set of the string to be matched contains only K characters, we can represent a substring with a k-base number, which is converted to a decimal number and used as the hash value of the substring. Let’s take an example. If we are dealing with a string that contains only the 26 lowercase letters a-Z, we map a-z to the 26 numbers 0 to 25, with a for 0, b for 1, and so on. So the hash of the string “CDB” is:

 Hash("cdb")=c*26*26+d*26+b=2*26*26+3*26+1=1431
Copy the code

This hash algorithm has a feature that in the main string, the adjacent two substrings S[i-1] and S[I] (where I represents the starting position of the substring in the main string), the corresponding hash value calculation formula is intersection, that is to say, we can quickly calculate the hash value of S[I] according to the hash value of S[i-1]. So let’s write it out in a formula.

 

We can take 26^0, 26^1, 26^2…… 26 to the m minus 1 is computed and stored in an array of length m, and the power in the formula corresponds to the index of the array. When we need to compute 26 to the x, we can just take the value of the array with the index x, and we’ll save time.

RK algorithm mainly includes the calculation of substring hash value and the comparison between pattern string hash value and substring hash value. Because we can design a special hash algorithm to compute the hash of all the substrings only by scanning the main string once, the time to compute the hash of the substring is O(n).

The time complexity of the comparison between the pattern string hash value and each substring hash value is O(1), and the total time complexity of the comparison between n-m+1 substring hash value is also O(n). Therefore, the overall time complexity of RK algorithm is O(n).

Another problem to note is that if the hash we computed above is too large for the computer to represent, what can we do about it? The hash algorithm we just designed is hashing free, that is, a string corresponds to a hexadecimal number one by one, and the hash value of different strings must be different. In fact, we can sacrifice allowing hash collisions in order to keep hash values in the range of integer data. So what’s the hash algorithm going to do at this point? Well, it’s very simple. We can take the magnitude of a large prime number. This can lead to hash collisions. If there is a hash collision, we need to compare the hash value of a substring with the hash value of the pattern string.

Ok, let’s stop here for today. In the next article, we will talk about more efficient string matching algorithms BM algorithm and KMP algorithm.