Compression algorithms

Understanding compression algorithms

We’ve all had the experience of compressing and decompressing files: we use file compression to reduce the size of files when they become too large. For example, the upload limit for files on WeChat is 100 MB. If I have a folder that is too large to upload, I can compress it to under 100 MB, and then it can be uploaded.

In addition, when we save photos taken with a camera to a computer, a compression algorithm is also used; the compressed file format is generally JPEG.

So what is a compression algorithm? How is one defined? Before we get to the algorithms, we first need to know a little about how files are stored.

File storage

A file is a form of storing data on a storage medium such as a disk. The most basic unit of data stored in a file is the byte. The size of a file is expressed as xxx KB or xxx MB because files are stored and measured in bytes (B).

A file is a collection of byte data. One byte (8 bits) can take 256 distinct values, from 0000 0000 to 1111 1111 in binary. If the data stored in a file is text, it is a text file; if it is graphics, it is an image file. In either case, the bytes of a file are stored consecutively.
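We can see this directly in code. Below is a minimal illustration (assuming Python) that encodes a short string to bytes and prints each byte as 8 bits; every value falls in the 0–255 range a single byte can hold.

```python
data = "ABC".encode("ascii")       # a text "file" is just a byte sequence
print(len(data))                   # 3 bytes, one per character
print([f"{b:08b}" for b in data])  # each byte is 8 bits: 01000001, ...
```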

Definition of compression algorithm

A file is ultimately a collection of byte data, so we can now give a definition of a compression algorithm.

A compression algorithm is an algorithm that compresses data. It consists of two steps: compression and decompression.

In essence, it is an algorithm that reduces the space a file’s bytes occupy without changing the nature of the original file.

Based on this definition, compression algorithms can be divided into different types:

Lossy and lossless

Lossless compression: the compressed data can be reconstructed without distortion, restoring the original data exactly. It is used where data accuracy is strictly required, such as compressing executable files, ordinary documents, and disks, as well as some multimedia data. The achievable compression ratio is relatively low. Examples include difference coding, RLE, Huffman coding, LZW coding, and arithmetic coding.

Lossy compression: there is distortion; the original data cannot be recovered exactly, and the reconstructed data is only an approximation of the original. It can be used where perfect accuracy is not required, such as multimedia data compression, and the achievable compression ratio is relatively high. Examples include predictive coding, perceptual audio coding, fractal compression, wavelet compression, and JPEG/MPEG.
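The defining property of lossless compression is the exact round trip. Here is a minimal sketch using Python’s standard zlib module (note that zlib implements DEFLATE, a combination of LZ77 and Huffman coding, rather than one of the algorithms listed above):

```python
import zlib

original = b"AAAAAABBCDDEEEEEF" * 100
compressed = zlib.compress(original)

print(len(original), "->", len(compressed))     # 1700 -> a few dozen bytes
assert zlib.decompress(compressed) == original  # restored exactly, bit for bit
```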

Symmetric and asymmetric

If encoding and decoding have roughly the same complexity and take about the same time, the method is symmetric, and most compression algorithms are symmetric. There are also asymmetric ones, which are generally hard to encode but easy to decode, such as Huffman coding and fractal coding. The encoding methods used in cryptography are the opposite: easy to encode and very hard to decode.

Intra-frame and inter-frame

In video coding, intra-frame and inter-frame encoding are used at the same time. Intra-frame encoding is completed independently within a single frame of the image, like static image encoding such as JPEG. Inter-frame coding, on the other hand, references preceding and following frames during encoding and decoding, exploiting the temporal redundancy between frames; MPEG is an example.

Real-time

In some multimedia applications, data must be processed or transmitted in real time (for example, live digital audio and video recording, MP3/RM/VCD/DVD, video/audio on demand, live network broadcasts, video telephony, and video conferencing). The codec delay is generally required to be at most 50 ms. This demands simple, fast, efficient algorithms and high-speed or specialized CPU/DSP chips.

Scalability

Some compression algorithms can simultaneously handle multimedia data at different resolutions, different transmission rates, and different quality levels, such as JPEG 2000 and MPEG-2/4.

These concepts are abstract; they are mainly meant to give you an overview of how compression algorithms are classified. Next, we will analyze the characteristics and advantages of several commonly used compression algorithms concretely.

Understanding several commonly used compression algorithms

Mechanism of the RLE algorithm

Let’s take a formal look at file compression. First, let’s try to compress a text file containing the 17 half-width characters AAAAAABBCDDEEEEEF. Although this text has no practical meaning, it illustrates the compression mechanism of RLE well.

Since half-width characters are stored as one byte each, the size of the file is 17 bytes, as shown in the figure.

(Here’s a question for the reader to ponder: why does a 17-character file that is only 17 bytes in size take up more space than that on disk? This question will not be discussed in this article.)

So, how do you compress this file? Any compression approach will do, as long as it makes the file smaller than 17 bytes.

The most obvious way to compress it, which you may already have figured out, is to deduplicate repeated characters: store each character together with the number of times it repeats. Compressed this way, the file above looks like this:

From the figure, we can see that the 17 characters AAAAAABBCDDEEEEEF are successfully compressed into the 12 characters A6B2C1D2E5F1. That is 12/17 ≈ 70%, a compression ratio of about 70%. The compression succeeded.

This compression method, which expresses the file contents as character × repeat count pairs, is called the RLE (Run-Length Encoding) algorithm. RLE is a good compression method that is often used to compress fax images and the like. Since an image file is in essence also a collection of byte data, the RLE algorithm can be used to compress it.
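As a concrete illustration, here is a minimal RLE sketch in Python. The function names and the digit-based count format are choices made for this example, not a standard:

```python
import re

def rle_encode(text):
    """Compress text by replacing each run of a repeated character
    with the character followed by its run length."""
    out = []
    run_char, run_len = None, 0
    for ch in text:
        if ch == run_char:
            run_len += 1
        else:
            if run_char is not None:
                out.append(f"{run_char}{run_len}")
            run_char, run_len = ch, 1
    if run_char is not None:
        out.append(f"{run_char}{run_len}")
    return "".join(out)

def rle_decode(encoded):
    """Expand each (character, count) pair back into a run.
    Assumes the data characters themselves are not digits."""
    return "".join(ch * int(n) for ch, n in re.findall(r"(\D)(\d+)", encoded))

print(rle_encode("AAAAAABBCDDEEEEEF"))  # A6B2C1D2E5F1 (12 characters)
print(rle_decode("A6B2C1D2E5F1"))       # AAAAAABBCDDEEEEEF
```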

Disadvantages of the RLE algorithm

The RLE compression mechanism is simple, so RLE programs are relatively easy to write, and implementing RLE this way is a good way to get a feel for compression. However, RLE only works well for specific sequences of data. The following table summarizes RLE compression results:

| File type  | Size before compression | Size after compression | Compression ratio |
|------------|-------------------------|------------------------|-------------------|
| Text file  | 14862 bytes             | 29065 bytes            | 199%              |
| Image file | 96062 bytes             | 38328 bytes            | 40%               |
| EXE file   | 24576 bytes             | 15198 bytes            | 62%               |

As you can see from the table above, after compressing the text file with RLE the data did not shrink but grew, to almost twice its original size! This is because long runs of identical consecutive characters are rare in ordinary text.

As we discussed above, RLE compression works well only for consecutive runs of the same byte. What if the data is a series of different characters? For example, ABCDEFGHIJKLMNOPQRSTUVWXYZ, the 26 English letters, should be 26 bytes. Compressing it with RLE yields A1B1C1D1E1F1G1H1I1J1K1L1M1N1O1P1Q1R1S1T1U1V1W1X1Y1Z1, which is 52 bytes: the size did not decrease but doubled! This is obviously not what we want, so in this case we should not use RLE compression.
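Using the rle_encode sketch from above, this worst case is easy to reproduce:

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
encoded = rle_encode(alphabet)            # every run has length 1
print(len(alphabet), "->", len(encoded))  # 26 -> 52: the output doubles
```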

Huffman algorithm and Morse code

Next, we introduce another compression algorithm: the Huffman algorithm. Before you can understand the Huffman algorithm, you must let go of the assumption that each half-width character is 1 byte (8 bits) of data. Let’s take a look at the basic idea of the Huffman algorithm.

Text files are composed of different kinds of characters, and different characters occur with different frequencies. For example, in a text file it would not be unusual for A to appear around 100 times while Q is used only 3 times. The key idea of the Huffman algorithm is to represent frequently occurring data with fewer than 8 bits and infrequently occurring data with more than 8 bits. If A and Q are both represented by 8 bits, that portion of the original file is 100 × 8 + 3 × 8 = 824 bits. If A is represented by 2 bits and Q by 10 bits, it is 2 × 100 + 10 × 3 = 230 bits.
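The arithmetic is easy to check:

```python
# Variable-length codes pay off when frequent symbols get short codes.
fixed    = 100 * 8 + 3 * 8   # A and Q both stored as 8-bit codes
variable = 100 * 2 + 3 * 10  # A as a 2-bit code, Q as a 10-bit code
print(fixed, variable)       # 824 bits vs 230 bits
```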

Note, however, that on disk, files are ultimately stored in units of 8-bit bytes.

The Huffman algorithm is a bit complicated, but before we get into it, let’s have some dessert and learn about Morse code. You must have seen American TV shows or war movies in which Morse code is used to transmit information during wartime, as shown below.

Now let’s talk about Morse code. Here’s an example: think of 1 as a short signal (di) and 11 as a long signal (da).

Morse code generally assigns the shortest codes to the characters that occur most frequently in text. As shown in the table, if a short signal is 1 bit and a long signal is 2 bits, then the character E (di) can be represented by the 1-bit code 1, and C (da-di-da-di) by the 9-bit code 110101101. In actual Morse code, if the length of a short signal is 1, the length of a long signal is 3, and the gap within a character is 1, where length refers to the duration of the sound. Now let’s rewrite the AAAAAABBCDDEEEEEF example above in this Morse-style code, separating characters with a symbol that represents the time interval between them; here we will use 00 as the separator.

So the text AAAAAABBCDDEEEEEF becomes A × 6 + B × 2 + C × 1 + D × 2 + E × 5 + F × 1 + 16 character separators = 4 bits × 6 + 8 bits × 2 + 9 bits × 1 + 6 bits × 2 + 1 bit × 5 + 8 bits × 1 + 2 bits × 16 = 106 bits ≈ 14 bytes (rounding up to whole bytes).

So the compression ratio using Morse code is 14/17 ≈ 82%. The efficiency is not outstanding.
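To make the bookkeeping concrete, here is a short sketch that recomputes the total under the assumptions above (the per-character bit lengths come from the text: di = 1, da = 11, a 1-bit gap inside a character, and a 2-bit 00 separator between characters):

```python
# Bit lengths of each character under the Morse-style code described above.
morse_bits = {"A": 4, "B": 8, "C": 9, "D": 6, "E": 1, "F": 8}
text = "AAAAAABBCDDEEEEEF"

total = sum(morse_bits[ch] for ch in text) + 2 * (len(text) - 1)
print(total)           # 106 bits
print(-(-total // 8))  # 14 bytes, rounding up to whole bytes
```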

Implementing the Huffman algorithm with a binary tree

As mentioned earlier, Morse code determines the length of each character’s code according to that character’s frequency in everyday text in general. For our specific text AAAAAABBCDDEEEEEF, however, that assignment is not the most efficient one.

Now let’s look at the Huffman algorithm. The Huffman algorithm constructs an optimal coding system for each file to be compressed and compresses the file on the basis of that coding system. Therefore, which codes (the Huffman codes) represent which data depends on the file. A file compressed with the Huffman algorithm stores both the Huffman coding information and the compressed data.

Next, we sort the characters A–F of AAAAAABBCDDEEEEEF according to the principle that characters with high frequency should be represented with as few bits as possible. The table below lists the characters in order of frequency, from highest to lowest, together with a tentative coding scheme.

| Character | Frequency | Code (tentative) | Bits |
|-----------|-----------|------------------|------|
| A         | 6         | 0                | 1    |
| E         | 5         | 1                | 1    |
| B         | 2         | 10               | 2    |
| D         | 2         | 11               | 2    |
| C         | 1         | 100              | 3    |
| F         | 1         | 101              | 3    |

In the coding scheme above, as the frequency of occurrence decreases, the number of bits in a character’s code gradually increases, from 1 and 2 bits at the start to 3 bits. But this code system has a problem: when you see the three-bit sequence 100, you cannot tell whether it means E, A, A (1, 0, 0), B followed by A (10, 0), or C (100). The codes are ambiguous because some codes are prefixes of others.

In the Huffman algorithm, by building the coding system as a Huffman tree, codes can be distinguished unambiguously even without separator characters. However, the Huffman tree algorithm is more complex. The following is the construction process of a Huffman tree.

In nature, a tree grows from its root out to its leaves; a Huffman tree, by contrast, is built starting from its leaves and growing toward the root.
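As a concrete sketch of this construction, the following Python code builds a Huffman tree with a priority queue: repeatedly join the two least frequent subtrees under a new branch until one tree remains. The function name huffman_codes and the tie-breaking rule are choices made for this example; they reproduce the 40-bit total computed below, though the exact codes may differ from the figure’s.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table for text using a priority queue.

    Each heap entry is (frequency, tie_breaker, tree), where a tree is
    either a single character (a leaf) or a (left, right) pair (a branch).
    The tie_breaker keeps tuple comparison away from the trees themselves.
    """
    freq = Counter(text)
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Join the two least frequent subtrees under a new branch.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # branch: 0 to the left, 1 to the right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record this character's code
            codes[node] = prefix or "0"  # lone-character edge case
    walk(heap[0][2], "")
    return codes

text = "AAAAAABBCDDEEEEEF"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes)                 # a prefix-free code table built from the tree
print(len(encoded), "bits")  # 40 bits, matching the 5 bytes computed below
```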

Huffman trees can increase the compression ratio

With a Huffman tree, data that occurs more frequently occupies fewer bits; this is the core idea of the Huffman tree. From step 2 in the figure above, we can see that when branches connect data, we start with the least frequent data. This means that less frequent data passes through more branches on the way to the root, and more branches mean more bits in its code.

Next, let’s look at the compression ratio achieved by the Huffman tree, as shown in the figure above. AAAAAABBCDDEEEEEF becomes 000000000000 100100 110 101101 0101010101 111, a total of 40 bits = 5 bytes. Before compression the data was 17 bytes, and after compression it is an astonishing 5 bytes: a compression ratio of 5/17 ≈ 29%. Such a high compression ratio is simply amazing.

For reference, Huffman trees can be used as a compression algorithm for any type of data:

| File type  | Before compression | After compression | Compression ratio |
|------------|--------------------|-------------------|-------------------|
| Text file  | 14862 bytes        | 4119 bytes        | 28%               |
| Image file | 96062 bytes        | 9456 bytes        | 10%               |
| EXE file   | 24576 bytes        | 4652 bytes        | 19%               |

Reversible compression and non-reversible compression

Finally, let’s look at the data format of image files. The purpose of an image file is to output image data to displays, printers, and other devices. Commonly used image formats include BMP, JPEG, TIFF, and GIF.

  • BMP: an image format created with the Windows Paint program
  • JPEG: an image data format commonly used by digital cameras
  • TIFF: an image format that embeds “tags” in the file so the nature of the data can be determined quickly
  • GIF: a data format developed in the United States, limited to at most 256 colors

Image files can be compressed with the RLE and Huffman algorithms introduced earlier. In addition, in most cases image files do not require the data to be restored to exactly the same state as before compression; losing some data is acceptable. We call compression that can restore the data to its pre-compression state reversible compression, and compression that cannot non-reversible compression.

Generally speaking, JPEG files use non-reversible compression, so some image information is blurred or lost after restoration; GIF uses reversible compression.
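As a rough illustration of the difference, here is a sketch assuming the third-party Pillow library is installed; the file names and the generated gradient image are arbitrary choices for this example:

```python
from PIL import Image  # assumes the Pillow library is installed

# Build a small gradient image so the example is self-contained.
img = Image.new("RGB", (64, 64))
img.putdata([(x * 4, y * 4, 128) for y in range(64) for x in range(64)])

img.save("sample.png")  # PNG: reversible (lossless) compression
img.save("sample.jpg")  # JPEG: non-reversible (lossy) compression

original = list(img.getdata())
png_back = list(Image.open("sample.png").getdata())
jpg_back = list(Image.open("sample.jpg").getdata())

print(png_back == original)  # True: every pixel restored exactly
print(jpg_back == original)  # False (typically): some pixels changed
```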