• Understand the DEFLATE Compression behind the zip and gzip Formats
  • Originally by Dornhoth
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: JohnieXu
  • Proofread by Jessica and Shixi-Li

Understanding the compression algorithms behind the zip and gzip formats

As we all know, every byte of data uploaded or downloaded over the Internet costs bandwidth, which is to say money. Although dozens or hundreds of compression algorithms exist, the most popular is probably ZIP. The gzip format has a similar name to ZIP but is different; it is also widely used, since it is one of the three standard HTTP compression specifications. Although different compression formats suit different scenarios, both of these are built on DEFLATE, a lossless data compression algorithm that combines the LZ77 algorithm with Huffman coding.

LZ77

DEFLATE is based on the LZ77 algorithm, a lossless compression technique originally designed for text.

Compression

The LZ77 algorithm achieves compression by replacing repeated data with a reference to a matching occurrence that appeared earlier in the stream, which both the encoder and the decoder have already seen.

Rather than searching the entire text for repeated letters at once, the algorithm uses a fixed-size search buffer, say 20 letters (in a real-world scenario this buffer is tens of kilobytes). It encodes the text letter by letter: for each letter, it first checks whether that letter appears anywhere among the 20 letters in the preceding buffer. If a match is found, the offset d between the current letter and the matched letter is recorded, completing the first stage of encoding that letter. Next, the algorithm tries to extend the match: it compares the letter following the current one with the letter following the match in the buffer, and keeps extending as long as consecutive letters agree, until either all 20 buffer letters have been matched or the next letters differ. Finally, the run of matched letters at the current position is replaced by the pair of the offset d into the buffer and the number of consecutive matched letters l. This completes the second stage of encoding.

Let’s use this example to see how it works:

ERTOORTORR

The most obvious idea is simply to replace the second occurrence of O with a marker pointing to the first occurrence of O, or the second occurrence of RTO with a marker pointing to the first RTO.

To describe the process in more detail, suppose we set the buffer size to 4 and picture the buffer as a sliding window moving right along the text:

1) [....E]RTOORTORR
2) [...ER]TOORTORR
3) [..ERT]OORTORR
4) [.ERTO]ORTORR
5) [ERTOO]RTORR -> ERTO(1, 1)RTORR
6) E[RTOOR]TORR -> ERTO(1, 1)(4, 3)RR
7) ERTO[ORTOR]R -> ERTO(1, 1)(4, 3)(3, 1)R
8) ERTOO[RTORR] -> ERTO(1, 1)(4, 3)(3, 1)(1, 1)

The sliding window moves right one step per iteration, and in the first four steps no letter repeats anything in the window. At step 5, the letter O has already appeared in the window, so we check the letter to its right, R, and find that the letter to the right of the matched O in the window is not R; the match cannot be extended, and the second O is replaced by (1, 1) (meaning: the matched letter lies 1 position back in the window, and the match is 1 letter long). At step 6, the letter R matches the letter four positions back in the window. We check the next letter, T, and find that RT also matches; the next letter extends the match to RTO; the letter after that no longer matches, so the extension stops there. The match lies 4 positions back and is 3 letters long, so the three matched letters are replaced by (4, 3). At step 7, the letter R matches at offset 3, but RR does not match; at step 8, the nearest matching R is found at offset 1. The encoding of the whole text is:

ERTO(1, 1)(4, 3)(3, 1)(1, 1)
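The matching process above can be sketched in a few lines of Python. This is a toy illustration, not real DEFLATE: the function name, the tiny window size, and the output format (a list of literal letters and (distance, length) pairs) are choices made for this example, and a production encoder uses a window of tens of kilobytes and much faster match-finding:

```python
def lz77_compress(text, window_size=4):
    """Toy LZ77: emit literal letters and (distance, length) pairs."""
    output = []
    i = 0
    while i < len(text):
        best_dist, best_len = 0, 0
        start = max(0, i - window_size)
        # Scan the window right to left so the nearest match wins ties.
        for j in range(i - 1, start - 1, -1):
            length = 0
            while (i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        if best_len > 0:
            output.append((best_dist, best_len))  # replace the matched run
            i += best_len
        else:
            output.append(text[i])                # no match: keep the letter
            i += 1
    return output

print(lz77_compress("ERTOORTORR"))
# ['E', 'R', 'T', 'O', (1, 1), (4, 3), (3, 1), (1, 1)]
```

Running it on the example text reproduces the eight steps traced above.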

Decompression

Compressed text is made up of a series of such (d, l) pairs interleaved with literal letters; only the pairs need to be resolved back into letters. During decompression the literal letters stay the same, and each pair is converted into the letters it points to. Here is a decompression example:

abc(3, 2)(1, 1)

The letters abc remain the same. The pair (3, 2) means: move 3 positions to the left of the current position and copy 2 letters from there, which expands to ab. The text is now abcab(1, 1); the last pair means: move 1 position to the left and copy 1 letter, which expands to b. The decompressed text is abcabb.
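This expansion can be sketched directly. Assuming a token stream of literal letters and (distance, length) pairs (the format and function name are illustrative), each pair is resolved against the output produced so far:

```python
def lz77_decompress(tokens):
    """Rebuild text from literal letters and (distance, length) pairs."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                # Copy one letter at a time, so the copied region may
                # overlap the region still being written.
                out.append(out[-distance])
        else:
            out.append(token)
    return "".join(out)

print(lz77_decompress(["a", "b", "c", (3, 2), (1, 1)]))  # abcabb
```

Feeding it the encoded example from the previous section gives back the original text, ERTOORTORR.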

Huffman coding

After eliminating repeated letters in the text with LZ77, Huffman encoding is used for a second compression. This method reduces the overall length of text by substituting shorter codes for more commonly used letters and longer codes for less commonly used letters.

Let’s see how it works with a simple sample text.

Compression

EFTUPOEERRREOOPRRUTUTTEEE

In this example, we want to compress the text losslessly. A letter normally takes up 8 bits, so this 25-letter text is 200 bits long. Notice that the letter F appears only once, while the letter E appears seven times. Huffman coding exploits this by shortening the codes of the most frequent letters, reducing the total length of the text.

To compress the text with Huffman coding, we first count the frequency of each letter. For the example above, the frequencies are:

Frequency: E: 7, R: 5, T: 4, U: 3, O: 3, P: 2, F: 1
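This frequency table is easy to verify, for instance with Python's standard `collections.Counter`:

```python
from collections import Counter

# Tally each letter of the sample text.
freq = Counter("EFTUPOEERRREOOPRRUTUTTEEE")
print(freq.most_common())
# [('E', 7), ('R', 5), ('T', 4), ('U', 3), ('O', 3), ('P', 2), ('F', 1)]
```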

We use the letters of the text as the leaf nodes of a binary tree, and that tree will determine each letter's code. Starting with the two least frequent letters, P and F, we make them the bottom leaf nodes and give their parent node the sum of their frequencies:

                                (3)
                               /   \
                             P(2)  F(1)

We repeat the process for the next least frequent letters, pairing U with O and R with T, and leave the most frequent letter, E, on its own for now:

       (6)                      (9)                      E(7)
      /   \                    /   \
    U(3)  O(3)               R(5)  T(4)

Next, these trees are combined as children of larger trees: the root frequencies are sorted in ascending order, and the trees with the lowest root frequencies become children of a new tree first. Here the U-O tree (root 6) and the P-F tree (root 3) are joined:

                                (9)
                               /   \
                              /     \
                            (6)     (3)
                           /   \   /   \
                          U     O P     F

At this point three trees remain, with root frequencies 9, 9 and 7 (the first 9 belongs to the tree created in the previous step, the second to the R-T tree, and 7 to the lone leaf E). Again, the two with the lowest root frequencies become the children of a new tree:

                               (16)
                              /    \
                             (9)    E
                            /   \
                           /     \
                         (6)     (3)
                        /   \   /   \
                       U     O P     F

Now only two trees remain: the one with root value 16, and the one with root value 9 and leaf nodes R and T. Joining them as children of a final root gives:

                                   (25)
                                  /    \
                                 /      \
                               (16)     (9)
                              /    \   /   \
                             (9)    E R     T
                            /   \
                           /     \
                         (6)     (3)
                        /   \   /   \
                       U     O P     F

Now all that is left is to encode the text using this tree. To find a letter's code, walk from the root node down to that letter, writing a 1 for every left branch taken and a 0 for every right branch, in order. For example, reaching the letter E takes one left branch and then one right branch, so the code of E is 10. The letter U takes four left branches and is coded 1111; F takes two left branches followed by two right branches, so its code is 1100. Notice that the letter E, which appears very frequently in this example, gets fewer code bits than the letter F, which appears only once. Encoding the whole text this way produces:

10110000111111011110101001010110111011101101010111110011110000101010

The compressed text is only 68 bits long, far less than the original 200 bits.
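The tree and the resulting codes can be checked mechanically. Below, the final tree from the diagram above is written as nested (left, right) tuples, and the code table is derived by walking it with 1 for a left branch and 0 for a right branch, the convention used in this example; the representation and function name are illustrative:

```python
# The final tree from the example, as nested (left, right) tuples;
# leaves are letters.
TREE = (((("U", "O"), ("P", "F")), "E"), ("R", "T"))

def codes_from_tree(node, prefix=""):
    """Walk the tree, appending '1' per left branch, '0' per right branch."""
    if isinstance(node, str):        # a leaf: the path so far is the code
        return {node: prefix}
    left, right = node
    table = codes_from_tree(left, prefix + "1")
    table.update(codes_from_tree(right, prefix + "0"))
    return table

codes = codes_from_tree(TREE)
print(codes["E"], codes["U"], codes["F"])  # 10 1111 1100

encoded = "".join(codes[ch] for ch in "EFTUPOEERRREOOPRRUTUTTEEE")
print(len(encoded))  # 68
```

The concatenated codes reproduce exactly the 68-bit string shown above.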

Decompression

Suppose we receive such a piece of compressed text and want to decompress it into something readable. We know that an uncompressed character occupies 8 bits, but after Huffman encoding a character no longer occupies a fixed 8 bits, so we cannot tell whether a given run of bits (for example, 011) represents one character, two characters, or three. How, then, is the compressed text decoded?

There is nothing magical about this step: to decode accurately, you need the binary tree that was built during encoding. There are two ways to obtain it. The first is to store the tree together with the compressed text as part of the output, which may make the result larger than the original text for short inputs. The second is to use a predefined tree: the frequency of each letter in English is well known, and a tree can be built from those frequencies in advance. Compressing a given text with such a predefined tree may be less effective than building a tree from that text's own letter frequencies, but the tree itself does not have to be stored in the compressed file. Both schemes have their advantages and disadvantages.
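Given the tree, decoding is just a walk: start at the root, follow a branch for each bit, and emit a letter every time a leaf is reached. A minimal sketch, reusing the nested-tuple representation of this example's final tree (the function name is illustrative):

```python
# The final tree from this example, as nested (left, right) tuples.
TREE = (((("U", "O"), ("P", "F")), "E"), ("R", "T"))

def huffman_decode(bits, tree):
    """Follow '1' to the left child, '0' to the right; emit at each leaf."""
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "1" else node[1]
        if isinstance(node, str):   # reached a letter
            out.append(node)
            node = tree             # restart from the root
    return "".join(out)

compressed = ("1011000011111101111010100101011011101110"
              "1101010111110011110000101010")
print(huffman_decode(compressed, TREE))  # EFTUPOEERRREOOPRRUTUTTEEE
```

Because no letter's code is a prefix of another's, the walk is never ambiguous: the 68 bits decode back to the original 25 letters.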


Although this article does not go into the details and implementations of the various compression algorithms, you should now have a general idea of how text is compressed into the zip and gzip formats. Hopefully this article satisfies your curiosity about the mystery of compression algorithms 🙂


* Technically, the ZIP compression format supports the use of other compression algorithms, but DEFLATE is one of the most commonly used.

If you find any mistakes in this translation or other areas that could be improved, you are welcome to revise the translation and open a PR in the Nuggets Translation Project, for which you can also earn reward points. The permanent link at the beginning of this article is the Markdown link to this article on GitHub.


The Nuggets Translation Project is a community that translates high-quality technical articles from English and shares them on Juejin (Nuggets). The content covers Android, iOS, front end, back end, blockchain, product, design, artificial intelligence and other fields. For more high-quality translations, please follow the Nuggets Translation Project, its official Weibo, and its Zhihu column.