String matching algorithm

BF algorithm

BF is short for Brute Force; the algorithm is also called the brute-force matching algorithm, or the naive matching algorithm. As the name suggests, this algorithm matches strings in a brute-force way. It is simple and easy to understand, but its performance is correspondingly low.

Before discussing the algorithm, let me define two concepts that will come up repeatedly: the main string and the pattern string. They are easy to understand through an example: if we look for string B inside string A, then A is the main string and B is the pattern string. Let the length of the main string be n and the length of the pattern string be m. Since we look for the pattern string in the main string, n ≥ m.

As the simplest and most brute-force string matching algorithm, the idea of the BF algorithm can be summed up in one sentence: in the main string, check each of the n-m+1 substrings of length m starting at positions 0, 1, 2, …, n-m, and see whether any of them matches the pattern string.
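
To make this concrete, here is a minimal sketch of the BF algorithm in Python (the function name bf and the convention of returning the first match index or -1 are my own choices, mirroring the reference code later in this article):

def bf(main, pattern):
    """Brute force: try every one of the n-m+1 starting positions."""
    n, m = len(main), len(pattern)
    for i in range(n - m + 1):
        if main[i:i+m] == pattern:   # compare up to m characters
            return i                 # index of the first match
    return -1                        # no match found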

In extreme cases, say the main string is "aaaa…aa" (a long run of repeated a's) and the pattern string is "aaaaab", we compare up to m characters at each of the n-m+1 starting positions, so the worst-case time complexity of this algorithm is O(n*m).

In theory, the BF algorithm has a high time complexity of O(n*m), but in practical development it is still a fairly common string matching algorithm. Why? There are two reasons.

First, in most real software development, neither the pattern string nor the main string is very long. And when matching the pattern string against a substring of the main string, we can stop as soon as we hit a mismatched character, without comparing all m characters. So although the theoretical worst case is O(n*m), the algorithm is, statistically, much more efficient than that in most cases.

Second, the idea of the naive string matching algorithm is simple, and so is its implementation. Simplicity means mistakes are less likely, and any bugs that do occur are easier to expose and fix. In engineering, simplicity is preferred as long as performance requirements are met. This is the KISS (Keep It Simple, Stupid) design principle.

So, in real software development, for the most part, a naive string matching algorithm will suffice.

RK algorithm

The idea of the RK algorithm is as follows: we hash each of the n-m+1 substrings of the main string with a hash algorithm, and then compare the results one by one with the hash value of the pattern string. If a substring's hash value equals the pattern string's, we conclude the substring matches the pattern string (hash conflicts are ignored here; we will get to them later). Because a hash value is a number, and comparing numbers for equality is very fast, the efficiency of comparing the pattern string with each substring improves.

However, when we hash a substring, we need to iterate over every character in the substring. Although the efficiency of comparing pattern strings to substrings is improved, the overall efficiency of the algorithm is not improved. Is there a way to make the hash algorithm more efficient at calculating substring hashes?

Special hash algorithm

And that's where the hash algorithm gets clever. Assuming the character set of the strings to be matched contains only k characters, we can treat a substring as a number in base k, and convert it to a decimal number to use as the substring's hash value. This is a little abstract, so here is an example that should make it clear.

For example, if the strings to be processed contain only the 26 lowercase letters a to z, we can represent a string in base-26 notation. We map the 26 characters a to z to the 26 numbers 0 to 25: a is 0, b is 1, and so on, with z being 25.

In decimal notation, the value of a number is the sum of each digit times the corresponding power of 10. In base-26, for strings over the 26 characters a to z, we compute the hash the same way; we simply change the base from 10 to 26.
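
For example (a worked calculation of my own, using the mapping above where a is 0, b is 1, c is 2), the hash of the three-character string "cba" is computed just like a three-digit decimal number, only in base 26: hash("cba") = 2*26^2 + 1*26^1 + 0*26^0 = 1352 + 26 + 0 = 1378.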

For the rest of the explanation, I will assume that strings contain only the 26 lowercase characters a to z, represent each string in base-26, and use the decimal value of that base-26 number as its hash value.

From this example we can derive the following rule: two adjacent substrings S[i-1] and S[i] (where i is the starting position of the substring in the main string, and the substring length is m) have overlapping hash computations; that is, we can use the hash value of S[i-1] to quickly compute the hash value of S[i]. Expressed as a formula, where h(i) is the hash of the substring starting at position i and d(j) is the 0-25 digit of the character at position j of the main string:

h(i) = (h(i-1) - 26^(m-1) * d(i-1)) * 26 + d(i+m-1)

There is one small detail here: the 26^(m-1) term. We can compute it more efficiently with a lookup table. We precompute 26^0, 26^1, 26^2, …, 26^(m-1) and store them in an array of length m, with the exponent in the formula corresponding to the array index. When we need 26^x, we read the value at index x directly, saving computation time.
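
Putting the digit mapping, the rolling formula, and the power table together, here is a minimal sketch of the substring hash computation (the function name rk_hashes is my own; it assumes the main string contains only lowercase a-z and that 1 ≤ m ≤ n):

def rk_hashes(main, m):
    """Compute the hashes of all n-m+1 length-m substrings in O(n) time."""
    n = len(main)
    pow26 = [1] * m                 # lookup table: pow26[x] == 26**x
    for x in range(1, m):
        pow26[x] = pow26[x-1] * 26

    d = [ord(c) - ord('a') for c in main]   # map each character to its 0-25 digit
    hashes = [0] * (n - m + 1)
    for j in range(m):              # hash of the first substring, computed directly
        hashes[0] = hashes[0] * 26 + d[j]
    for i in range(1, n - m + 1):   # each later hash comes from the previous one
        hashes[i] = (hashes[i-1] - pow26[m-1] * d[i-1]) * 26 + d[i+m-1]
    return hashes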

Complexity analysis

The whole RK algorithm consists of two parts, calculating the substring hash value and comparing the pattern string hash value with the substring hash value.

The first part we have already analyzed: by designing a special hash algorithm, we can compute the hash values of all substrings by scanning the main string just once, so the time complexity of this part is O(n).

Comparing the pattern string's hash value with one substring's hash value takes O(1), so comparing it against all n-m+1 substrings takes O(n) in total.

Therefore, the overall time complexity of RK algorithm is O(n).

Hash conflicts

In the scheme above we only compare the hash values of the pattern string and the substring, and treat equal hash values as a match. However, when hash conflicts exist, the hash values of a substring and the pattern string may be equal even though the strings themselves do not match.

Actually, the solution is simple: when we find that a substring's hash value equals the pattern string's hash value, we additionally compare the substring itself with the pattern string.
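
A minimal sketch of the full RK search with this verification step, reusing the hypothetical rk_hashes helper above. (Note that with Python's arbitrary-precision integers this exact base-26 hash never actually collides; real implementations reduce the hash modulo a fixed size, which is when the verification becomes essential.)

def rk(main, pattern):
    """RK search: compare hashes first, verify characters only on a hash hit."""
    n, m = len(main), len(pattern)
    if m == 0 or n < m:
        return -1
    p_hash = 0
    for c in pattern:               # hash of the pattern string
        p_hash = p_hash * 26 + (ord(c) - ord('a'))
    for i, h in enumerate(rk_hashes(main, m)):
        if h == p_hash and main[i:i+m] == pattern:  # verify to rule out conflicts
            return i
    return -1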

Therefore, the conflict probability of the hash algorithm should be kept low. A large number of conflicts degrades the time complexity and efficiency of the RK algorithm; in the extreme case where the substring must be compared with the pattern string character by character every time, the time complexity degrades to O(n*m). But do not be too pessimistic: in general there are few conflicts, and the RK algorithm is more efficient than the BF algorithm.

BM algorithm

Core idea

During the matching of the pattern string against the main string, when a character of the pattern string does not match the main string, we can skip alignments that definitely cannot match and slide the pattern string forward by several positions at once.

Principle analysis

The BM algorithm consists of two parts: the bad character rule and the good suffix rule.

Bad character rule

The BM algorithm uses an unusual matching order: it compares the pattern string against the main string in reverse, from the last character of the pattern string toward the first.

Matching backwards from the end of the pattern string, when we find a character that does not match, we call that character in the main string a bad character.

When a mismatch occurs, let si be the index in the pattern string at which the bad character was compared. If the bad character also appears in the pattern string, let xi be its index there; if it does not appear, let xi be -1. The pattern string then slides forward by si - xi positions. (Note that the indices here are indices of characters in the pattern string.)

If the bad character appears in multiple places in the pattern string, we choose the rightmost occurrence when computing xi, so that the pattern string does not slide too far and skip over possible matches.

Using the bad character rule, the BM algorithm achieves a very low time complexity of O(n/m) in the best case. For example, with main string "aaabaaabaaabaaab" and pattern string "aaaa", the pattern string can slide four positions forward each time. So the BM algorithm is very efficient on main strings and pattern strings with characteristics like these.

However, the bad character rule alone is not enough. For example, with main string "aaaaaaaaaaaaaaaa" and pattern string "baaa", si is 0, xi is 3, and si - xi = -3: instead of sliding the pattern string forward, it would move it backwards. That is why the BM algorithm also needs the good suffix rule.

I think it is perfectly fine to add a condition here: only apply the bad character rule when si is greater than xi, so the number of positions moved can never be negative.


Good suffix rule

We call the suffix that has already been matched {u}. We take {u} and search for it in the pattern string. Two cases can occur:

  • If another substring {u*} is found in the pattern string that matches {u}, we slide the pattern string so that {u*} aligns with the position of {u} in the main string.
  • If no matching substring is found, we check whether some suffix substring of the good suffix matches a prefix substring of the pattern string:
    • If none exists, slide the pattern string completely past the good suffix.
    • If one exists, find the longest suffix substring of the good suffix that matches a prefix substring of the pattern string, say {v}, and slide the pattern string so that the matching prefix aligns with {v}.


How do bad characters and good suffixes work together?

We compute the number of positions to slide forward under the good suffix rule and under the bad character rule separately, then take the larger of the two as the number of positions the pattern string actually slides.

Code implementation

#! /usr/bin/python
# -*- coding: UTF-8 -*-

SIZE = 256

def bm(main, pattern):
    """BM algorithm matching rules: 1. bad character rule; 2. good suffix rule."""
    assert type(main) is str and type(pattern) is str
    n, m = len(main), len(pattern)

    if n <= m:
        return 0 if main == pattern else -1

    # bc
    bc = [-1] * SIZE
    generate_bc(pattern, m, bc)

    # gs
    suffix = [-1] * m
    prefix = [False] * m
    generate_gs(pattern, m, suffix, prefix)

    i = 0
    while i < n-m+1:
        j = m - 1
        while j >= 0:
            if main[i+j] != pattern[j]:
                break
            else:
                j -= 1

        # the entire pattern has been matched
        if j == -1:
            return i

        # 1. The BC rule computes the backward number of digits
        x = j - bc[ord(main[i+j])]

        # 2. The gs rule computes the backshift number
        y = 0
        if j != m - 1:    # a good suffix exists
            y = move_by_gs(j, m, suffix, prefix)

        i += max(x, y)

    return -1

def generate_bc(pattern, m, bc):
    """Generate the bad character hash table: bc[ord(char)] = rightmost index of char in pattern."""
    for i in range(m):
        bc[ord(pattern[i])] = i

def generate_gs(pattern, m, suffix, prefix):
    """ Good suffix preprocessing """
    for i in range(m-1):
        k = 0   # length of the common suffix of pattern[:i+1] and pattern
        for j in range(i, -1, -1):
            if pattern[j] == pattern[m-1-k]:
                k += 1
                suffix[k] = j
                if j == 0:
                    prefix[k] = True
            else:
                break

def move_by_gs(j, m, suffix, prefix):
    """Compute the shift given by the good suffix rule. Three cases:
    1. the whole good suffix appears elsewhere in pattern;
    2. some suffix substring of the good suffix matches a prefix of pattern;
    3. otherwise, move past the whole pattern.
    """
    k = m - 1 - j           # j is the index of the bad character; k is the length of the good suffix

    if suffix[k] != -1:             # 1. the whole good suffix still appears elsewhere in pattern
        return j - suffix[k] + 1
    else:
        for r in range(j+2, m):     # 2. Suffix substring searches from long to short
            if prefix[m-r]:
                return r
        return m                    # 3. otherwise move past the whole pattern

if __name__ == '__main__':
    print('--- search ---')
    m_str = 'dfasdeeeetewtweyyyhtruuueyytewtweyyhtrhrth'
    p_str = 'eyytewtweyy'
    print('[Built-in Functions] result:', m_str.find(p_str))
    print('[bm] result:', bm(m_str, p_str))

Performance analysis

Let's first analyze the memory consumption of the BM algorithm. The whole algorithm uses three extra arrays: the size of the bc array is related to the character set size, and the sizes of the suffix and prefix arrays are related to the pattern string length m.

If we are dealing with string matching over a large character set, the bc array consumes a lot of memory. Because the good suffix rule and the bad character rule are independent, in memory-constrained environments we can use only the good suffix rule and drop the bad character rule, avoiding the bc array's memory cost, though the efficiency of the BM algorithm drops somewhat.

The time complexity of the BM algorithm is O(n*m) in the worst case and O(n/m) in the best case.

KMP algorithm

The KMP algorithm tries to find a rule: during the matching of the pattern string and the main string, when a bad character is encountered, can we use the good prefix that has already been compared to slide the pattern string forward by several positions at once?

We just need to find, among all suffix substrings of the good prefix, the longest one that can match a prefix substring of the good prefix. For ease of expression, we call that suffix substring the longest matching suffix substring, and the corresponding prefix substring the longest matching prefix substring.

Assuming the longest matching prefix substring is {v} with length k, and the bad character corresponds to the character with index j in the pattern string, we can slide the pattern string forward by j - k positions at once, and then continue matching.

How do we find the longest matching prefix and suffix substrings? Notice that this question does not involve the main string at all, only the pattern string itself. So we can precompute the answers and use them directly while matching the pattern string against the main string.

The KMP algorithm builds an array that stores, for each prefix of the pattern string (each potential good prefix), the ending index of its longest matching prefix substring. We call this array the next array; many books call it the failure function. The array index is the index of the ending character of each prefix, and the array value is the ending index of that prefix's longest matching prefix substring.

Failure function calculation method

We compute next[i] from the already-known values next[0], next[1], …, next[i-1]:

  1. First take j = next[i-1], the ending index of the longest matching prefix substring for position i-1.
  2. Compare the next character: if pattern[j+1] matches pattern[i], assign next[i] = next[i-1] + 1, because the prefix substring ending at next[i-1] is already the longest.
  3. If they do not match, recurse to find the next-longest matching prefix substring. The key observation is that next[next[i-1]] is the ending index of the next-longest matching prefix substring.
  4. The recursion exits when the character after the candidate prefix substring matches, or when we reach -1. Reaching -1 means no matching prefix substring exists, and we compare against the first character of pattern.

The so-called next-longest matching prefix substring is found by treating the original longest matching prefix substring as a good prefix and finding its own longest matching prefix substring.
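
As a concrete example (my own, following the convention above that next[i] is the ending index of the longest matching prefix substring, or -1 if none), the next array of the pattern "ababa" is:

prefix   i   next[i]
a        0   -1   (no proper suffix matches a prefix)
ab       1   -1
aba      2    0   (suffix "a" matches prefix "a", which ends at index 0)
abab     3    1   (suffix "ab" matches prefix "ab", which ends at index 1)
ababa    4    2   (suffix "aba" matches prefix "aba", which ends at index 2)

As the code below notes, next[4], the value for the full pattern, is never actually used.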

Code implementation

#! /usr/bin/python
# -*- coding: UTF-8 -*-

def kmp(main, pattern):
    """KMP string matching."""
    assert type(main) is str and type(pattern) is str

    n, m = len(main), len(pattern)

    if m == 0:
        return 0
    if n <= m:
        return 0 if main == pattern else -1

    # next array
    next = get_next(pattern)

    j = 0 # j is the length of a good prefix
    for i in range(n):
        # in pattern[:j], fall back from long to short to find the longest prefix substring
        # that matches the suffix substring; j > 0 means a good prefix already exists,
        # and main[i] != pattern[j] means its next character does not match
        while j > 0 and main[i] != pattern[j]:
            j = next[j-1] + 1   # if next[j-1] == -1, fall back to matching from the first character

        if main[i] == pattern[j]:
            if j == m-1:  # the whole pattern string has been matched
                return i-m+1
            else:
                j += 1
    return -1


def get_next(pattern):
    """Compute next[i] from next[0], next[1], ..., next[i-1]:
    1. take j = next[i-1], the ending index of the longest matching prefix substring at i-1;
    2. if the next character matches (pattern[j+1] == pattern[i]), then next[i] = next[i-1] + 1;
    3. otherwise recurse: next[next[i-1]] ends the next-longest matching prefix substring, and so on;
    4. the recursion exits when the next character matches, or when j == -1, meaning no matching
       prefix substring exists and the current character is compared with the first character of pattern.
    PS: next[m-1] is never used, so it can be skipped. (Alternatively, shift the next array right by one.)
    """
    m = len(pattern)
    next = [-1] * m

    next[0] = -1

    # next[m-1] is never used, so we only compute up to index m-2
    for i in range(1, m-1):
        j = next[i-1]       # take the longest prefix substring matched at i-1
        while j != -1 and pattern[j+1] != pattern[i]:
            j = next[j]     # fall back to the next-longest matching prefix substring, next[next[i-1]]

        # when j == -1, pattern[0] is compared with the current character;
        # otherwise the loop exit guarantees pattern[j+1] == pattern[i]
        if pattern[j+1] == pattern[i]:  # next character matches: extend the longest prefix substring by one
            j += 1
        next[i] = j

    return next


if __name__ == '__main__':
    m_str = "aabbbbaaabbababbabbbabaaabb"
    p_str = "abbabbbabaa"

    print('--- search ---')
    print('[Built-in Functions] result:', m_str.find(p_str))
    print('[kmp] result:', kmp(m_str, p_str))

Complexity analysis

The space complexity is easy to analyze: the KMP algorithm only needs one extra next array, whose size equals the pattern string length. So the space complexity is O(m), where m is the length of the pattern string.

The KMP algorithm consists of two parts. The first part is to build the next array, and the second part is to match with the next array. Therefore, we will analyze time complexity from these two parts respectively.

Let’s analyze the time complexity of the first part.

In the code that computes the next array, the outer for loop runs i from 1 to m-2, so the loop body executes m-2 times. Inside the for loop there is a while loop. If we knew the average number of while iterations per for iteration, say k, the time complexity would be O(k*m). But the number of times the while loop executes is not easy to count directly, so we abandon this approach.

Instead, we can track the reference variables i and j. i increases from 1 toward m, and j never increases by more than 1 per iteration of the for loop, so the cumulative increase of j is less than m. Meanwhile, j = next[j] inside the while loop strictly decreases j (since next[j] < j). Because j cannot increase by more than m in total, j = next[j] cannot execute more than m times overall. Therefore the time complexity of computing the next array is O(m).

Let’s look at the time complexity of the second part. The method of analysis is similar.

i goes from 0 to n-1, and j never grows by more than 1 per iteration, so j cannot grow by more than n in total. Could the while-loop statement j = next[j-1] + 1 leave j unchanged? No: since next[j-1] is definitely less than j-1, this statement actually decreases j. j cannot grow by more than n in total, so it also cannot be decreased more than n times; hence this statement executes fewer than n times overall, and this part takes O(n).

Combining the two parts, the time complexity of the KMP algorithm is O(m+n).

Trie tree

A Trie tree is also known as a dictionary tree. As the name implies, it is a tree structure. It is a data structure specialized for string matching, used to quickly find a string within a collection of strings.

The core idea of the Trie tree is trading space for time: it uses the common prefixes among strings to reduce query time and improve efficiency.

Basic properties of Trie tree:

  • The root node contains no information, and each node except the root node contains only one character.
  • A path from the root node to the red node represents a string (note: not all red nodes are leaf nodes).

The main operations of the Trie tree

Trie tree construction (inserting a string):

For each string, start at the root and, for each character in turn, follow (or create) the corresponding child branch until the whole string has been consumed, then mark the last node red to indicate that a string ends at that node.

Query a string from the Trie tree:

Traverse the Trie tree from the root downward following the characters of the string. If at some point the required child node does not exist, or the traversal finishes but the last node is not marked red, the string is not in the tree. If the last node is marked red, the string exists.

Storage structure of Trie tree

For each node of the Trie tree, we allocate an array whose length equals the character set size. The array indices correspond to the characters of the character set, and the array stores pointers to the child nodes.

If the character set is the 26 lowercase letters, the array has length 26: the pointer to the child for 'a' is stored at index 0, the child for 'b' at index 1, and so on up to the child for 'z' at index 25. If a character has no child node, we store NULL at the corresponding index.

When looking up a string in the Trie tree, we find the matching child pointer quickly by subtracting the ASCII code of 'a' from the ASCII code of the character. For example, 'd' minus 'a' is 3, so the pointer to the 'd' child is stored at index 3 of the array.
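
A minimal sketch of this classic array-based node, assuming the lowercase a-z character set (the class name ArrayTrieNode is my own; the reference implementation later in this article uses ordered arrays instead, for the memory reasons discussed below):

class ArrayTrieNode:
    """Classic Trie node: one child pointer slot per character of the a-z set."""
    def __init__(self):
        self.children = [None] * 26     # index 0 holds 'a', index 25 holds 'z'
        self.is_ending_char = False     # red mark: a string ends at this node

    def get_child(self, c):
        return self.children[ord(c) - ord('a')]    # e.g. 'd' maps to index 3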

Problems and Optimization

If we store children in this classic way, an array of child pointers per node, then for strings over the 26 characters a to z every node must maintain an array of length 26, even when it only has three or four children. And if the strings contain not just lowercase letters but also uppercase letters, digits, or even Chinese characters, far more storage is needed.

The problem is that Trie trees waste a lot of memory when there are not many duplicate prefixes.

We can trade a little query efficiency by replacing the array in each node with another data structure for storing the child pointers. Which data structure? There are many options: ordered arrays, skip lists, hash tables, red-black trees, and so on.

Suppose we use an ordered array whose size is the number of children, with the pointers sorted by the characters of the children they point to. When querying, we can binary-search for the child pointer matching a given character. However, inserting a string into the Trie tree becomes a bit slower, because we must keep the array ordered.

Trie trees compared with hash tables and red-black trees

In fact, the string matching problem is, generally speaking, a data lookup problem. We have already covered several data structures that support efficient operations on dynamic data sets, such as hash tables, red-black trees, and skip lists. These can also be used to find a string in a set of strings. Here we pick two of them, hash tables and red-black trees, and compare them with Trie trees to see their strengths, weaknesses, and application scenarios.

In the scenario just described, finding a string in a set of strings, the Trie tree actually does not perform very well. It places strict requirements on the strings to be processed.

  • First, the character set of the strings cannot be too large. As mentioned earlier, a large character set wastes a lot of storage space, and even when optimized, the price is reduced query and insert efficiency.
  • Second, the prefixes of the strings should overlap a lot; otherwise the space consumption grows substantially.
  • Third, if you want to solve a problem with a Trie tree, you have to implement one from scratch and make sure it is bug-free. This is over-engineering a simple problem and is generally not recommended.
  • Fourth, data blocks strung together by pointers are not contiguous in memory, and Trie trees rely heavily on pointers, so they are not cache-friendly, which hurts performance.

Taking these points together, for finding strings in a set of strings we tend to use hash tables or red-black trees in real projects. With these two data structures we do not need to implement anything ourselves; we can just use the ready-made libraries provided by the programming language.

At this point you may be wondering: if I tell you to use red-black trees or hash tables instead, are Trie trees useless? Was this lesson in vain?

In fact, Trie trees are simply not well suited to exact-match lookups; hash tables or red-black trees solve that problem better. Trie trees are good at finding strings that match a prefix.

Code implementation

#! /usr/bin/python
# -*- coding: UTF-8 -*-

class Node:
    def __init__(self, c):
        self.data = c
        self.is_ending_char = False
        # Use ordered arrays to reduce space consumption and support more characters
        self.children = []

    def insert_child(self, c):
        self._insert_child(Node(c))

    def _insert_child(self, node):
        """Insert a child node, keeping self.children ordered by character."""
        v = ord(node.data)
        idx = self._find_insert_idx(v)
        length = len(self.children)

        if idx == length:
            self.children.append(node)
        else:
            self.children.append(None)
            for i in range(length, idx, -1):
                self.children[i] = self.children[i-1]
            self.children[idx] = node

    def has_child(self, c):
        return self.get_child(c) is not None

    def get_child(self, c):
        """Binary-search for the child node holding character c; return it or None."""
        start = 0
        end = len(self.children) - 1
        v = ord(c)

        while start <= end:
            mid = (start + end)//2
            if v == ord(self.children[mid].data):
                return self.children[mid]
            elif v < ord(self.children[mid].data):
                end = mid - 1
            else:
                start = mid + 1
        # return None if not found
        return None

    def _find_insert_idx(self, v):
        """Binary-search for the insertion position in the ordered children array."""
        start = 0
        end = len(self.children) - 1

        while start <= end:
            mid = (start + end)//2
            if v < ord(self.children[mid].data):
                end = mid - 1
            else:
                if mid + 1 == len(self.children) or v < ord(self.children[mid+1].data):
                    return mid + 1
                else:
                    start = mid + 1
        # v < self.children[0]
        return 0

    def __repr__(self):
        return 'node value: {}'.format(self.data) + '\n' \
               + 'children:{}'.format([n.data for n in self.children])


class Trie:
    def __init__(self):
        self.root = Node(None)

    def gen_tree(self, string_list):
        """Create the trie tree:
        1. iterate over the characters of each string, starting from the root,
           creating child nodes where they do not exist;
        2. mark the end node of each string red (is_ending_char).
        """
        for string in string_list:
            n = self.root
            for c in string:
                if n.get_child(c) is None:
                    n.insert_child(c)
                n = n.get_child(c)
            n.is_ending_char = True

    def search(self, pattern):
        """Search:
        1. traverse the characters of the pattern string starting from the root;
           if a required child node does not exist, return False;
        2. if the last node reached is red, return True, otherwise False.
        """
        assert type(pattern) is str and len(pattern) > 0

        n = self.root
        for c in pattern:
            if n.get_child(c) is None:
                return False
            n = n.get_child(c)

        return n.is_ending_char


if __name__ == '__main__':
    string_list = ['abc', 'abd', 'abcc', 'accd', 'acml', 'P@trick', 'data', 'structure', 'algorithm']

    print('--- gen trie ---')
    print(string_list)
    trie = Trie()
    trie.gen_tree(string_list)

    print('\n')
    print('--- search result ---')
    search_string = ['a', 'ab', 'abc', 'abcc', 'abe', 'P@trick', 'P@tric', 'Patrick']
    for ss in search_string:
        print('[pattern]: {}'.format(ss), '[result]: {}'.format(trie.search(ss)))

Summary

A Trie tree is a data structure for fast string matching. If the strings used to build it do not share many common prefixes, the Trie tree is overall a memory-hungry data structure, a space-for-time solution.

That said, as long as we are not memory sensitive, or the memory consumption stays within an acceptable range, string matching in a Trie tree is very efficient: the time complexity is O(k), where k is the length of the string being matched.

The strength of Trie trees, however, is not looking up dynamic data sets, which is better served by hash tables or red-black trees. The greatest strength of a Trie tree is searching for strings by prefix match, for example the keyword-suggestion feature in search engines, which is also the classic application scenario of Trie trees.

Application scenarios: keyword suggestions in search engines, auto-completion in input methods, code auto-completion in IDE editors, URL auto-completion in browsers, and so on.

AC automata

Sensitive word filtering with a Trie tree

A single-pattern string matching algorithm matches one pattern string against one main string, that is, it finds one pattern string in a main string. A multi-pattern string matching algorithm matches multiple pattern strings against one main string, that is, it finds multiple pattern strings in one main string.

The Trie tree supports multi-pattern string matching. How do we use it to filter sensitive words?

We preprocess the sensitive word dictionary into a Trie tree. This preprocessing only needs to be done once; if the dictionary is updated dynamically, say a sensitive word is added or removed, we just update the Trie tree incrementally.

When a user enters some text, we treat the input as the main string. Starting from its first character (call it c), we match it down the Trie tree. When we reach a node that ends a pattern string, or we hit a mismatched character partway, we move the matching start position in the main string forward by one, to the character after c, and re-match from the root of the Trie tree.
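
A minimal sketch of this naive Trie-based scan, reusing the Trie class from the previous section (the function name and the (start, end) result tuples are my own choices):

def naive_trie_filter(main, trie):
    """For each start position, walk down the Trie; on a mismatch,
    restart from the next position in the main string."""
    hits = []
    for start in range(len(main)):
        node = trie.root
        for i in range(start, len(main)):
            node = node.get_child(main[i])
            if node is None:            # mismatched character: give up this start position
                break
            if node.is_ending_char:     # a pattern string ends here
                hits.append((start, i))
    return hits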

This Trie-based processing is somewhat similar to the BF algorithm for single-pattern matching. Recall that for single-pattern matching, the KMP algorithm improved on BF by introducing the next array, so that on a mismatch the pattern string slides forward as many positions as possible. Borrowing this optimization idea, can the multi-pattern Trie tree be improved to match more efficiently? This is where the AC automaton algorithm comes in.

Classical multi-pattern string matching algorithm: AC automata

An AC automaton is essentially a Trie tree with a KMP-like next array added, except that the array is built on the tree: each node gets what we call a Fail pointer. When a mismatched character is encountered, we jump to the position the Fail pointer points to and continue matching from there.

Constructing an AC automaton involves two operations:

  • Build the multiple pattern strings into a Trie tree
  • Build the Fail pointers on the Trie tree

Building Fail Pointers

Every node in the Trie tree has a Fail pointer; the Fail pointer of root is NULL.

Suppose we have matched from the root down to a node p; the characters along the path form a good prefix. If a suffix substring of this good prefix can match a prefix substring of some pattern string, we call it a matchable suffix substring. Among all matchable suffix substrings, we find the longest one and call it the longest matchable suffix substring. The Fail pointer of node p then points to the node corresponding to the last character of the pattern-string prefix substring matched by this longest matchable suffix substring.

Computing the failure pointer of each node may seem complicated. But if we arrange nodes of the same depth on the same level of the tree, a node's failure pointer can only point to a node at a smaller depth than its own.

As with the KMP algorithm, when we want the failure pointer of a node, we can derive it from the already-computed failure pointers of shallower nodes. In other words, we can compute the failure pointers level by level: building the failure pointers is a level-order traversal of the tree.

First, the failure pointer of root is NULL. Once we have the failure pointer of a node p, how do we find the failure pointers of its children?

Suppose the failure pointer of node p points to node q, and let pc be a child of p. We check whether the character held by pc can also be found among the children of q. If q has a child qc holding the same character as pc, the failure pointer of pc points to qc.

If no child of q holds the same character as pc, we set q = q->fail and repeat the search, until q is root. If even then no child with the same character is found, the failure pointer of pc points to root.

For example, given four pattern strings c, bc, bcd, and abcd, and main string abcd, we can build the AC automaton from them.

Code implementation

#! /usr/bin/python
# -*- coding: UTF-8 -*-

from trie_ import Node, Trie   # the Node and Trie classes from the Trie implementation above
from queue import Queue

class ACNode(Node):
    def __init__(self, c: str):
        super(ACNode, self).__init__(c)
        self.fail = None
        self.length = 0

    def insert_child(self, c: str):
        self._insert_child(ACNode(c))

class ACTrie(Trie):
    def __init__(self):
        self.root = ACNode(None)

def ac_automata(main: str, ac_trie: ACTrie) -> list:
    root = ac_trie.root
    build_failure_pointer(ac_trie)

    ret = []
    p = root
    for i, c in enumerate(main):
        while p != root and not p.has_child(c):
            p = p.fail

        if p.has_child(c):  # a char matched, try to find all potential pattern matched
            q = p.get_child(c)
            while q != root:
                if q.is_ending_char:
                    ret.append((i-q.length+1, i))
                    # ret.append(main[i-q.length+1:i+1])
                q = q.fail
            p = p.get_child(c)

    return ret


def build_failure_pointer(ac_trie: ACTrie) -> None:
    root = ac_trie.root

    # queue: [(node, node.length) ....]
    node_queue = Queue()
    node_queue.put((root, root.length))

    root.fail = None
    while not node_queue.empty():
        p, length = node_queue.get()
        for pc in p.children:
            pc.length = length + 1
            if p == root:
                pc.fail = root
            else:
                q = p.fail
                # same as kmp
                while q != root and not q.has_child(pc.data):
                    q = q.fail

                # cases now:
                # 1. q == root and q.has_child(pc.data)
                # 2. q != root and q.has_child(pc.data)
                # 3. q == root and not q.has_child(pc.data)

                if q.has_child(pc.data):
                    pc.fail = q.get_child(pc.data)
                else:
                    pc.fail = root
            node_queue.put((pc, pc.length))


if __name__ == '__main__':
    ac_trie = ACTrie()
    ac_trie.gen_tree(['fuck', 'shit', 'TMD', 'all'])

    print('--- ac automata ---')
    m_str = 'Fuck you, what is that shit, what the fuck you are'
    print('original str : {}'.format(m_str))

    filter_range_list = ac_automata(m_str, ac_trie)
    str_filtered = m_str
    for start, end in filter_range_list:
        str_filtered = str_filtered.replace(str_filtered[start:end+1], '*' * (end+1-start))

    print('after filtered: {}'.format(str_filtered))

Complexity analysis

First, we need to build the sensitive words into an AC automaton, which includes building the Trie tree and building the failure pointers.

As mentioned in the previous section, building the Trie tree takes O(m*len), where len is the average length of the sensitive words and m is their number. What is the time complexity of building the failure pointers? Here is a loose upper bound. Let the total number of nodes in the Trie tree be k. For each node, the most time-consuming part of building its failure pointer is the q = q->fail loop: each execution of that statement decreases the depth of the node q points to by 1, and the height of the tree does not exceed len, so each node's failure pointer costs O(len), and building all failure pointers costs O(k*len). But AC automaton construction is preprocessing: once built, it is not frequently updated, so it does not affect the runtime efficiency of sensitive word filtering.

Let’s see, what is the time complexity of matching with AC automata?

Similar to the failure-pointer analysis, the for loop iterates over each character of the main string, and the most expensive part inside it is the while loop, which costs O(len) per character. So the total matching time complexity is O(n*len), where n is the length of the main string. Since sensitive words are usually short, and this bound is very loose, the actual cost is close to O(n) in practice, so AC automaton-based sensitive word filtering performs very well.

The time complexity of multiple pattern string matching using Trie tree is O(n*len).

You might say that, in terms of time complexity, the AC automaton matches no more efficiently than a plain Trie tree. In fact, because the fail pointer points to the root node most of the time, the AC automaton's matching efficiency is, in most cases, far better than the loose bound just computed. Only in extreme cases does the AC automaton degrade to the same performance as the Trie tree.