Introduction

Recently I have been studying artificial intelligence, and in this first stage I want to focus on how chatbots are implemented. After getting a general picture of the key techniques behind chatbot design, I decided to start with Google's well-known open-source project word2vec. Huffman coding is an important piece of background for understanding how word2vec works.

What is a Huffman tree

As we all know, the tree is an important nonlinear data structure in computer science: a structure in which data elements are organized by branching relations. The data elements in a tree are called nodes, and a collection of disjoint trees is called a forest.

  • Path: in a tree, the route from a node down to any of its children or descendants is called a path;
  • Path length: the number of branches on a path. If a tree has N levels, the path length from the root to a node on the Nth level is N-1;
  • Node weight: if a node in the tree is assigned a meaningful positive value, that value is called the weight of the node;
  • Weighted path length of a node: the product of the length of the path from the root to the node and the node's weight;
  • Weighted path length of the tree: the sum of the weighted path lengths of all the leaf nodes of the tree.

A binary tree is an ordered tree in which each node has at most two subtrees, called the left subtree and the right subtree; "ordered" means that the left and right subtrees may not be swapped. Given N weights as N leaf nodes, construct a binary tree: if its weighted path length is the minimum possible, the binary tree is called an optimal binary tree, which is exactly the Huffman tree we study today.
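To make the definition concrete, here is a small sketch (plain Python, helper names are my own) comparing the weighted path lengths of two shapes of a binary tree with three leaves:

```python
# Weighted path length (WPL) of a tree = sum over leaves of weight * depth.
def wpl(leaves):
    """leaves: list of (weight, depth) pairs for the leaf nodes."""
    return sum(weight * depth for weight, depth in leaves)

# Three leaves with weights 1, 2, 3.
# Shape A: the heaviest leaf (3) sits at depth 1, the others at depth 2.
shape_a = [(3, 1), (2, 2), (1, 2)]
# Shape B: the lightest leaf (1) sits at depth 1 instead.
shape_b = [(1, 1), (2, 2), (3, 2)]

print(wpl(shape_a))  # 3*1 + 2*2 + 1*2 = 9
print(wpl(shape_b))  # 1*1 + 2*2 + 3*2 = 11
```

Shape A, which places the heaviest leaf nearest the root, has the smaller weighted path length and is the optimal binary tree for these weights.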

Huffman tree construction

Given N weights [X1, X2, X3, …, Xn] as the N leaf nodes of a binary tree, we construct the Huffman tree as follows:

  • 1. Regard [X1, X2, X3, …, Xn] as a forest of N trees, each consisting of a single node;
  • 2. Select the two trees whose roots have the smallest weights and merge them into a new tree, whose root weight is the sum of the weights of its left and right children;
  • 3. Remove the two selected trees from the forest and add the newly merged tree to it;
  • 4. Repeat steps 2 and 3 until only one tree remains in the forest: that tree is the Huffman tree.
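The four steps above can be sketched in Python (class and function names are my own; the standard library's heapq plays the role of the forest, always yielding the two lightest trees):

```python
import heapq

class Node:
    def __init__(self, weight, left=None, right=None):
        self.weight = weight
        self.left = left
        self.right = right

def build_huffman_tree(weights):
    # Step 1: a forest of single-node trees, one per weight.
    # (The running index breaks ties so heapq never compares Node objects.)
    forest = [(w, i, Node(w)) for i, w in enumerate(weights)]
    heapq.heapify(forest)
    counter = len(forest)
    # Steps 2-4: repeatedly pop the two lightest trees and merge them;
    # the merged root's weight is the sum of its children's weights.
    while len(forest) > 1:
        w1, _, t1 = heapq.heappop(forest)
        w2, _, t2 = heapq.heappop(forest)
        heapq.heappush(forest, (w1 + w2, counter, Node(w1 + w2, t1, t2)))
        counter += 1
    return forest[0][2]

# The six word frequencies from the example below.
root = build_huffman_tree([15, 8, 6, 5, 3, 1])
print(root.weight)  # 38, the sum of all leaf weights
```

Each pass through the loop performs one merge, so N leaf weights take exactly N-1 merges before a single tree remains.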

For example, take the sentence "I watch the flag-raising at Tian'anmen Square in Beijing". Segment it into the words "I", "at", "Beijing Tian'anmen", "square", "watch" and "flag-raising", and assume they occur with frequencies 15, 8, 6, 5, 3 and 1 respectively. We construct a Huffman tree with these 6 words as leaf nodes and the corresponding word frequencies as node weights.

[Figure: the five merge steps that construct the Huffman tree for the six words]
Through five merge steps we obtain this Huffman tree. As the figure shows, the higher a word's frequency, the closer it sits to the root.



During construction, the nodes added by merging are marked as blue nodes in the figure. Every merge consumes two trees and creates one new node, so if the number of leaf nodes is N, the constructed Huffman tree contains N-1 newly added nodes. The example above has 6 words, so 5 new nodes are added.

Huffman coding

In data communication, the text to be transmitted must be converted into a binary string, i.e. represented as some arrangement of 0s and 1s. Suppose the message to be sent is "after data ear are art area". It uses the characters "a, e, r, t, f, d", which occur 8, 4, 5, 3, 1 and 1 times respectively. The simplest way to distinguish six characters is equal-length coding: fix each code at three bits, encode "a, e, r, t, f, d" as 000, 001, 010, 011, 100, 101, and let the receiver decode the message three bits at a time. The code length therefore depends on the number of distinct characters in the message: if 26 different characters may appear, the fixed code length is 5, since 2^5 = 32 ≥ 26.

However, we always want the total length of the transmitted message to be as short as possible, and in practice characters are used with very different frequencies. If a, b and c occur far more often than x, y and z, we can design the code so that frequent characters get short codes and rare characters get long codes, optimizing the encoding of the message as a whole. For such variable-length codes to be decodable, they must be prefix codes: no character's code may be a prefix of any other character's code. Such a code can be generated from a binary tree whose leaf nodes are the characters of the character set, with each character's frequency as its weight. The smaller the weight, the deeper the leaf, so low-frequency characters get long codes and high-frequency characters get short codes. Minimizing the weighted path length of this tree then minimizes the total length of the transmitted message.
Therefore, the problem of finding the shortest encoding of the message is transformed into the problem of building the Huffman tree whose leaf nodes are the characters of the character set, with character frequencies as weights. The binary prefix code designed from a Huffman tree is called a Huffman code: it satisfies the prefix-code condition and also guarantees the shortest total length of the encoded message.
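As a sanity check on that claim, here is a sketch (helper names are my own) that computes Huffman code lengths for the letters of "after data ear are art area" and compares the encoded message length against fixed 3-bit codes:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a Huffman tree over freqs."""
    # Each heap entry carries the depth (so far) of every symbol below it;
    # the running index breaks ties so heapq never compares the dicts.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging pushes every symbol in both subtrees one level deeper.
        merged = {s: d + 1 for s, d in d1.items()}
        merged.update({s: d + 1 for s, d in d2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

message = "after data ear are art area".replace(" ", "")
freqs = Counter(message)          # a:8, r:5, e:4, t:3, f:1, d:1
depths = huffman_code_lengths(freqs)
huffman_bits = sum(freqs[s] * depths[s] for s in freqs)
fixed_bits = 3 * len(message)     # equal-length 3-bit codes
print(huffman_bits, fixed_bits)   # 51 vs 66
```

The Huffman code transmits the same 22-character message in 51 bits instead of 66, and no prefix code can do better.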

So far, our Huffman trees and Huffman codes have followed two conventions: 1. the node with the higher weight becomes the left child and the node with the lower weight the right child; 2. a left child is encoded as 1 and a right child as 0. Huffman coding is also used in the word2vec tool. It takes the words of the training corpus as leaf nodes and the number of times each word occurs in the corpus as its weight, constructs the corresponding Huffman tree, and thereby assigns each word a Huffman code, with the heavier child node encoded as 1 and the lighter child node encoded as 0. Following this principle, we can mark the codes on the Huffman tree constructed above and obtain the Huffman codes for "I watch the flag-raising at Tian'anmen Square in Beijing": "I" - 0, "at" - 111, "Beijing Tian'anmen" - 110, "square" - 101, "watch" - 1001, "flag-raising" - 1000. As you can see, the higher a word's frequency, the shorter its code.
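Under the two conventions above (the heavier of the two merged subtrees gets bit 1, the lighter gets bit 0), the codes in this example can be reproduced with a short sketch (the function name is my own):

```python
import heapq

def word2vec_huffman_codes(freqs):
    """Huffman codes under the convention that, of the two subtrees
    merged at each step, the heavier gets bit 1 and the lighter bit 0."""
    # Heap entries: (weight, tie-breaker, {word: partial code}).
    heap = [(w, i, {word: ""}) for i, (word, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, light = heapq.heappop(heap)   # lighter subtree -> prepend 0
        w2, _, heavy = heapq.heappop(heap)   # heavier subtree -> prepend 1
        merged = {word: "0" + code for word, code in light.items()}
        merged.update({word: "1" + code for word, code in heavy.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"I": 15, "at": 8, "Beijing Tian'anmen": 6,
         "square": 5, "watch": 3, "flag-raising": 1}
codes = word2vec_huffman_codes(freqs)
print(codes["I"], codes["at"], codes["watch"])  # 0 111 1001
```

Because the bits are prepended as the merges proceed bottom-up, the final bit added for each word is the one nearest the root, matching the codes read off the tree in the figure.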