preface

We use search engines almost every day to search for information. I believe you must have noticed such a detail: when you type a certain character, there will be multiple recommendation words at the bottom of the search box, as follows: After you type “python”, there will be a lot of recommended search text with python as the prefix. How does this work?

The title of the article gives the answer, yes, with the Trie tree. This article will be from the following aspects to brief the Trie tree principle, in order to let you have a more comprehensive understanding of Trie tree.

  • What is a Trie tree
  • Trie tree implementation
  • How to implement automatic prompt for search string
  • Talk about a Trie tree

I believe you must have a harvest

What is a Trie tree

Trie tree, also known as the prefix tree, field in the tree, or a word search tree, is a kind of tree structure, is also a hash table of variant, it is a specialized processing field string that matches the data structure, to solve a set of strings in the collection quickly find a string of questions, mainly by search engines used for text word frequency statistics.

Draw key points: fast string matching, word frequency statistics.

1. Fast string matching

So if you want to look for the existence of a string multiple times in a string like A, to, tea, TED, ten, I, in, inn, what do you do? The intuitive idea is to use a hash, and that’s fine, but if the hash function is well designed, If the hash function is not well designed, it is easy to cause conflicts and degenerate into comparisons between strings. In addition, many words in English have common prefixes. For example, tea, TED and ten have common prefixes te. There is no doubt that these common prefixes are equivalent to repeating many times, comparing space.

A Trie tree can solve these two problems. Let’s look at how a Trie tree is represented. Take the string A, to, tea, TED, ten, I, in, inn as an example.

If you want to find a string, start from the root node and take one character from the string to find it. You can see that its search time is O(N) (N is the length of the string), which is still very fast (English words are generally short).

2, word frequency statistics As long as add a counter on each node, through words, all the end of the string the calculator is added 1 a character corresponding nodes, such as a, an, and construct the Trie tree as follows, each node calculator is 1, on behalf of the node to store character for termination character word one respectively.

As you can see from the previous diagram of the Trie tree, the Trie tree is essentially a prefix tree that extracts the common prefix (if any) of the string to quickly match the string.

Using the Trie tree to find strings is much more efficient with prefix matching!

From the diagram above, we can get the following characteristics of the Trie tree

  1. The root node contains no characters and each node except the root node contains only one character
  2. From the root node to a node, the characters passing through the path are concatenated to be the string corresponding to the node. The characters “T” and “O” pass from the root node to node O, as shown above, so it represents the word to.
  3. All children of each node contain different characters, which ensures that the same prefix can be reused.

So how do WE represent the Trie tree

Trie tree implementation

If the string is composed of 26 lowercase letters, then the Trie tree should be a 26-fork tree, with each node containing 26 child nodes, as follows

As you can see in the figure above, 26 child nodes can be represented by an array of size 26, so the Trie tree is shown as follows


/** * 26 letters */
static final int ALPHABET_SIZE = 26;

/** * Nodes in the Trie tree represent */
static class TriedNode {
    /** * dedicated to the root node, store "/" */
    public char val;

    /** * The number of strings in which this node character is the terminating character */
    public int frequency;

    /** * the child node to which the node points */
    TriedNode[] children = new TriedNode[ALPHABET_SIZE];

    public TriedNode(char val) {
        this.val = val; }}/** * Trie tree */
static class TrieTree {
    private TriedNode root = new TriedNode('/'); / / the root node
}
Copy the code

Now that we have the Trie tree representation, let’s look at the two main operations on the Trie tree

  1. Construct a Trie tree from a set of strings
  2. Looks for the existence of a string in the Trie tree

Let’s look at how to construct a Trie tree from a set of strings. How to construct a Trie tree from a single word. Suppose we take the word “and” and look at what a Trie tree looks like

Note: The numbers in the figure represent the positions of the elements in the array

As you can see, the main steps for building a Trie tree are as follows

  1. Build the root node that holds an array of 26 elements
  2. Traversing the string “and”
  3. When iterating over the first character a, assign the first element of the array above to an instance of TriedNode (assuming its name is a)
  4. When iterating over the second character n, assign the element of the TriedNode array at node A with subscript N-a = 13 (ASCII for A is 97, ASCII for N is 110) to an instance of TriedNode (assuming its name is n)
  5. Similarly, when iterating through the third character D, assign the fourth element (subscript 3) of the TriedNode array at node N to an instance of TriedNode (let’s say d), and increase the frequency of its node by one, indicating that this character is terminated by one more string.

From the above analysis, it is not difficult to write the code that builds a Trie tree from a string, as follows

/** * Trie tree */
static class TrieTree {
    private TriedNode root = new TriedNode('/'); / / the root node

    /** * Build Trie tree with String as condition *@param s
     */
    public void insertString(String s) {
        TriedNode p = root;
        for (int i = 0; i < s.length(); i++){
            char c = s.charAt(i);
            int index  = c-'a';
            if (p.children[index] == null) {
                p.children[index] = new TriedNode(c);
            }
            p = p.children[index];
            //Process char} p.frequency++; }}Copy the code

The Trie tree is constructed, and it is much easier to find the existence of a string in the Trie tree. The Trie tree iterates through the string to see if the element at the corresponding array position of each character is empty. If yes, it does not exist

/** * finds if the string is in the original string collection *@param s
 * @return boolean
 */
public boolean findStr(String s) {
    TriedNode p = root;
    for (int i = 0; i < s.length(); i++){
        // The character currently traversed
        char c = s.charAt(i);
        int index  = c-'a';
        if (p.children[index] == null) {
            // If the array element corresponding to the character position is empty, the character must not exist, terminating the character traversal
            return false;
        }
        // If so, continue to iterate through the string
        p = p.children[index];
    }
    return true;
}
Copy the code

Since frequency also stores the number of words in the node, if a string is eventually found in the Trie tree, you can look up the number of strings at will.

How to realize search string automatic prompt function

With Trie tree, I believe it is not difficult to solve the problem at the beginning. First of all, the search engine constructs a Trie tree according to the user’s search terms. Assuming that the search terms are A, to, tea, TED, ten, I, in, Inn, the Trie tree constructed is

Therefore, when the user enters “te” in the search box, according to the characteristics of Trie tree, the strings prefixed with TE include tea, TED and ten, and these three strings should be displayed in the prompt words of the search box. There’s a little bit of a problem here, the average search box only displays 10 search terms, but there can be more than 10 strings prefixed by the user’s input, so which 10 should be displayed? The simplest rule is to display the 10 most searched strings, and then the problem becomes a TopK problem, To maintain a small top heap with 10 elements, do the following

  1. Find all strings containing the prefix in the tree based on the prefix entered by the user
  2. We know that the search times of strings are saved in the node, so we can use the small top heap to calculate the 10 strings that are searched the most, and then we can get the prompt words that are finally displayed to the user.

Note: here to calculate TopK is to use the small top heap, not the big top heap oh, inClassical data structures and algorithms behind search enginesThe small Top heap is the largest TopK value, and the large Top heap is the smallest TopK value. Since we are looking for the largest Top 10 search terms, we should use the small Top heap.

Consider the following phenomenon: when we enter a search term, the search engine may give us a prompt that is not prefixed with the string that the user typed

How does that work? It actually uses the idea of string edit distance, where one string can be changed into another string by adding, deleting, changing, and looking up characters

As illustrated, brekfa becomes breakfa with the addition of a

Obviously, the less times you add, delete, change and check, the more efficient it is. After editing the least number of characters into another valid string, you can find the prompt word in the Trie tree with the string as the prefix.

Of course, there must be a lot of work behind a search engine like Google to display these results in real time. But the principles are pretty much the same.

Talk about a Trie tree

We can see from the previous introduction that the use of Trie trees does play an important role in quickly finding strings and word frequency statistics, but there is no free lunch. If the character set is large, the use of Trie trees can be a waste of space

Each node maintains an array of 26 elements, with a total of 4 arrays. That is, 26 x 4 = 104 elements of space are allocated, but only three elements of space (a, N, d) are actually allocated, wasting 101 space, and the space rate is very low. Therefore, it is generally more suitable for repeated string prefixes. Of course, it can also be considered to optimize the Trie tree as follows, which can save some space

Of course, this optimization also increases the difficulty of coding code, so it depends on the situation.

In addition, if Trie tree is used, we generally need to code by ourselves, which has a high requirement on the coding ability of engineers. Therefore, we generally suggest whether to use Trie tree as follows:

  1. If it is a string of exact matching search, we generally recommend the use of hash table or red black tree to solve, after all, many languages have a ready-made library, do not need to implement their own, use
  2. If you need a prefix match lookup, a Trie tree is more appropriate

conclusion

In this article, through the search engine is a brief overview of the string the implementation principle, I believe you should understand, it is important to note the usage scenario, recommend more in need of prefix matching search with Trie tree, otherwise, such as general exact match to find more recommended use hash table and these very mature red and black tree data structure, After all, these two data structure implementations are generally implemented in the class library, do not need to implement, try not to duplicate the wheel.

Thank you for reading, welcome to pay attention to the public number to communicate together, common progress!