1 Huffman tree

1.1 Introduction

While learning to program, you have probably run into something like the following: record the exam scores of a class of students, and output an evaluation based on each score.

if (score < 60) {
    result = "Fail";
} else if (score < 70) {
    result = "Pass";
} else if (score < 80) {
    result = "Medium";
} else if (score < 90) {
    result = "Good";
} else {
    result = "Excellent";
}

Suppose that, according to statistics or data from previous years, each score range occurs with the following proportion:

Score       0 ~ 59   60 ~ 69   70 ~ 79   80 ~ 89   90 ~ 100
Proportion  5%       15%       40%       30%       10%

You can see that scores in the 70s and 80s make up the majority. For the code above, the first two if branches have a low hit rate, so most inputs pass through several wasted comparisons. Usually this is no problem, but if the data volume is large and each comparison is expensive, the structure of the decision process is worth optimizing.

We treat each if judgment as a node, and the process becomes a binary tree:

The depth of a tree was introduced in the earlier section on trees: the root is at level 1, its children at level 2, and so on. Here we analyze node depth in terms of paths. The number of branches (edges) between a node and the root is called the path length of that node. The path length of a tree is the sum of the path lengths from the root to every node.

The path length of each node in the figure is as follows:

Node         A   A'   B   B'   C   C'   D   E
Path length  1   1    2   2    3   3    4   4

The path length of the entire tree is: 1 + 1 + 2 + 2 + 3 + 3 + 4 + 4 = 20.

In this scenario, the path length corresponds to the number of if judgments: the path length of D is 4, meaning four judgments are needed to reach result D.

Combining this with the statistics above, we multiply the path length of each possible result (i.e., each leaf node) by its probability of occurrence (i.e., its weight), then sum the products to obtain the weighted path length (WPL) of the tree.

Note the distinction: the path length of the tree includes the path lengths of all nodes (the root's own path length is 0).

When calculating the WPL of the tree, only the weighted path lengths of the leaf nodes are counted.

Node         A   B    C    D    E
Path length  1   2    3    4    4
Weight       5   15   40   30   10

The WPL of the whole tree is: 1 x 5 + 2 x 15 + 3 x 40 + 4 x 30 + 4 x 10 = 315.

Since the weights are percentages, this number divided by 100 gives the average path length under these probabilities, i.e., the average number of if executions needed to reach a result: 3.15. Intuitively, a chain of four ifs averages 3.15 comparisons. If we instead test C, D, B, E, and A in order of descending probability, it is easy to guess that the average must fall below 3.15.

In addition, we can construct a Huffman tree to minimize the weighted path length, obtaining the following tree:

Node         A   B    C    D    E
Path length  4   3    1    2    4
Weight       5   15   40   30   10

The WPL of the entire tree is: 4 x 5 + 3 x 15 + 1 x 40 + 2 x 30 + 4 x 10 = 205.

Compared with the earlier 315, the average number of if executions drops from 3.15 to 2.05, a significant improvement.

So how do you build a Huffman tree? First, let's look at what a Huffman tree is and how Huffman coding works.

1.2 Introduction to Huffman trees

Huffman tree: given n weights as n leaf nodes, construct a binary tree. If the weighted path length (WPL) of the tree reaches the minimum, the tree is called an optimal binary tree, also known as a Huffman tree. A Huffman tree is the tree with the smallest weighted path length, and nodes with higher weights are closer to the root.

Huffman coding: in a computer, each character is stored as a binary code. Encoding every character with the same number of bits is called fixed-length coding. For example, the string "Hello world!" requires 12 bytes, or 96 bits, to store. In practice, however, characters occur with different probabilities: letters such as A, E, I, O, and U appear often, while X and Z are rare. Huffman coding is a variable-length coding method based on these probabilities: characters that occur more often get shorter codes, and characters that occur less often get longer codes.

Huffman coding is the ancestor of data compression. Although better compression methods exist today, its ideas remain influential. Huffman coding is widely applied: JPEG files, for example, use Huffman coding for the final compression step, and with the development of digital TV it became a major method of video signal compression. The appearance of Huffman coding ended the era in which entropy coding could not approach the shortest possible encoding, making it a very important lossless coding method. (Adapted from Baidu Baike.)

1.3 Construction of Huffman tree

In the scenario below, we encode English letters with three binary bits each, so a text of N characters requires sending 3N bits.

Assume that the occurrence probability of each letter is shown as follows through statistical probability:

Character    A     B    C     D     E     F
Probability  27%   8%   15%   15%   30%   5%

Then we can construct a Huffman tree to produce a variable-length code.

The steps to construct a Huffman tree:

  1. Take the two nodes with the lowest weights from the set, build a binary tree from them, and remove them from the set.
  2. Add the root of that binary tree to the set as a new node; its weight is the sum of its two children's weights.
  3. Repeat steps 1 and 2 until only one node remains in the set.

The construction follows two conventions:

  • The node with the lower weight becomes the left subtree and the one with the higher weight the right subtree. (When the weights are equal, either order works.)
  • A left branch is coded 0 and a right branch 1.

Follow the steps to start building a Huffman tree:

  1. Take the two nodes with the lowest weights, B:8 and F:5, and construct a binary tree; the smaller weight becomes the left subtree and the larger the right subtree.









At this point the set is as shown above. N1 is the newly added node, and its weight is 8 + 5 = 13.

  2. Take the two smallest nodes from the new set, N1:13 and C:15, and construct a binary tree.





  3. Take the two smallest remaining nodes, D:15 and A:27, and construct a binary tree. N2:28 is larger than both, so it does not participate in this step; A and D form a new subtree N3:42.





  4. Take the two smallest nodes from the set, N2:28 and E:30, and construct a binary tree.





  5. The Huffman tree is completed from the remaining two nodes in the set, N3:42 and N4:58.





With the Huffman tree constructed, coding each left branch as 0 and each right branch as 1 yields the new code values:

The probability and codes for each letter are as follows:

Character      A     B      C     D     E     F
Probability    27%   8%     15%   15%   30%   5%
Original code  000   001    010   011   100   101
Huffman code   01    1001   101   00    11    1000

Based on the new encoding value, send the following message: BADCADFEED (A:2 B:1 C:1 D:3 E:2 F:1)

Fixed-length binary: 001 000 011 010 000 011 101 100 100 011 (30 bits)

Huffman binary: 1001 01 00 101 01 00 1000 11 11 00 (25 bits)

It can be clearly seen that Huffman coding effectively compresses the data at this probability.

2 Code implementation

2.1 Structure Definition

/// Huffman tree node
typedef struct {
    int weight;     // Weight
    int flag;       // Flag: 1 once the node has been added to the tree
    int parent;     // Parent index (-1 if none)
    int lch, rch;   // Left and right child indexes (-1 if none)
} HaffmanNode;

// Maximum code length; for n leaves the longest code is n - 1 bits.
// A macro is used because a struct array member needs a compile-time constant in C.
#define CODE_LEN 3

/// Character code structure
typedef struct {
    int code[CODE_LEN]; // Array holding the code bits
    int length;         // Code length
    int weight;         // Weight of the encoded character
} HaffmanCode;

Because the number of nodes in a Huffman tree is fixed (2n - 1 for n leaves), we use a sequential (array) storage structure for the binary tree, which makes accessing nodes convenient.

Besides the three elements required by a binary tree node, each Huffman tree node also stores a parent index, a weight, and a flag bit; this makes the later upward traversal to collect the codes convenient.

In addition to defining tree nodes, a coding structure is also defined to hold encoded information, since Huffman codes are variable-length codes.

2.2 Constructing Huffman tree

/// Build a Huffman tree from a weight array
/// @param weight array of weights
/// @param n number of weights (leaf nodes)
/// @param haffmanNodes output array of 2n - 1 nodes storing the Huffman tree sequentially
Status buildHaffmanTree(int weight[], int n, HaffmanNode *haffmanNodes) {
    // 1 Initialize all 2n - 1 nodes; the first n hold the leaf weights
    for (int i = 0; i < 2 * n - 1; i++) {
        if (i < n) {
            haffmanNodes[i].weight = weight[i];
        } else {
            haffmanNodes[i].weight = 0;
        }
        // Initialize the other members
        haffmanNodes[i].parent = -1;
        haffmanNodes[i].lch = -1;
        haffmanNodes[i].rch = -1;
        haffmanNodes[i].flag = 0;
    }
    // 2 Perform n - 1 merges, each creating one new branch node
    // Record the current two minimum weights
    int min1, min2;
    // Record the array indexes of the two minimum weights
    int index1, index2;
    for (int i = 0; i < n - 1; i++) {
        // 2.1 Initialize the weights and indexes
        min1 = min2 = UINT16_MAX; // larger than any weight (from <stdint.h>)
        index1 = index2 = 0;
        // 2.2 Among the n leaves and the i branch nodes created so far,
        // find the two minimum weights not yet added to the tree
        for (int j = 0; j < n + i; j++) {
            HaffmanNode node = haffmanNodes[j];
            if (node.weight < min1 && node.flag == 0) {
                // Weight is less than min1 and not yet in the tree:
                // demote min1/index1 to min2/index2
                min2 = min1;
                index2 = index1;
                // min1 and index1 record the new minimum weight node
                min1 = node.weight;
                index1 = j;
            } else if (node.weight < min2 && node.flag == 0) {
                // Weight is less than min2 and not yet in the tree
                min2 = node.weight;
                index2 = j;
            }
        }
        // 2.3 The two minimum-weight nodes are found; construct a new node
        haffmanNodes[n + i].weight = haffmanNodes[index1].weight + haffmanNodes[index2].weight;
        haffmanNodes[n + i].lch = index1;
        haffmanNodes[n + i].rch = index2;
        // 2.4 Update the two nodes found
        // Set their parent
        haffmanNodes[index1].parent = n + i;
        haffmanNodes[index2].parent = n + i;
        // Set the flag to mark them as added to the tree
        haffmanNodes[index1].flag = 1;
        haffmanNodes[index2].flag = 1;
    }
    return SUCCESS;
}

2.3 Obtaining the Code

/// Get the code values from the Huffman tree
/// @param haffmanNodes array, sequentially stored Huffman tree
/// @param n Number of arrays
/// @param haffmanCodes codes array
Status getHaffmanCode(HaffmanNode haffmanNodes[], int n, HaffmanCode haffmanCodes[]) {
    // Declare an encoding structure to hold temporary data
    HaffmanCode code;
    // Record the index record traversed up, encoding 0 or 1 by judging the left and right subtrees
    int child, parent;
    // Get the encoding values of n leaf nodes in turn
    for (int i = 0; i < n; i++) {
        code.length = 0;
        code.weight = haffmanNodes[i].weight;
        // Walk upward from the leaf; the root's parent is -1
        child = i;
        parent = haffmanNodes[child].parent;
        while (parent != -1) {
            if (child == haffmanNodes[parent].lch) {
                // is the left subtree, add the encoding value 0
                code.code[code.length] = 0;
            } else {
                // is the right subtree, add the encoding value 1
                code.code[code.length] = 1;
            }
            // The encoding length is increased by one
            code.length++;
            // Continue traversing to the next level
            child = parent;
            parent = haffmanNodes[child].parent;
        }
        // The collected code runs from leaf to root, i.e., in reverse order, so reverse it
        int maxIndex = code.length - 1;
        for (int j = 0; j <= maxIndex; j++) {
            haffmanCodes[i].code[j] = code.code[maxIndex - j];
        }
        // Save weights and lengths
        haffmanCodes[i].weight = code.weight;
        haffmanCodes[i].length = code.length;
    }
    return SUCCESS;
}

2.4 Usage

int mainHaffmanTree(void) {
    // Array of weights
    int n = 4;
    int weight[] = {2, 4, 5, 7};
    // Create the node array that stores the Huffman tree sequentially: 2 * n - 1 nodes in total,
    // n leaf nodes for the weights plus n - 1 newly constructed branch nodes
    HaffmanNode *haffmanNodes = malloc(sizeof(HaffmanNode) * (2 * n - 1));
    // Create the array of code values
    HaffmanCode *haffmanCodes = malloc(sizeof(HaffmanCode) * n);
    // Build the Huffman tree
    buildHaffmanTree(weight, n, haffmanNodes);
    // Get the encoding value from the Huffman tree
    getHaffmanCode(haffmanNodes, n, haffmanCodes);
    // Prints the encoding value
    int wpl = 0;
    for (int i = 0; i < n; i++) {
        HaffmanCode code = haffmanCodes[i];
        printf("Weight: %d code:", code.weight);
        for (int j = 0; j < code.length; j++) {
            printf("%d", code.code[j]);
        }
        printf("\n");
        wpl += code.weight * code.length;
    }
    printf("Huffman tree weighted path length (WPL) is %d\n", wpl);
    free(haffmanNodes);
    free(haffmanCodes);
    return 0;
}

Print result:

Weight: 2 code:110
Weight: 4 code:111
Weight: 5 code:10
Weight: 7 code:0
Huffman tree weighted path length (WPL) is 35

The Huffman tree structure is easily obtained from the weight array, as shown in the figure below. Reading the code value for each weight off the tree structure matches the printed result.