
For more blog posts, see Romantic Carriage's complete catalog of audio and video systems learning.

Previous posts in this series:

- YUV Color Encoding
- Analyzing the Principles of Video Coding — from Sun Yizhen's Films (1)
- Analyzing the Principles of Video Coding — from Sun Yizhen's Films (2)

The previous post, Analyzing the Principles of Video Coding — from Sun Yizhen's Films (1), briefly introduced the historical background of H.264 and focused on its inter-frame prediction. Today we cover the rest of H.264 encoding: transform quantization and entropy coding.

Transform and Quantization

DCT transform

First, what is image frequency? An image can be viewed as a signal defined on a two-dimensional plane, where the signal amplitude corresponds to each pixel's gray value (or, for a color image, its three RGB components). If we consider only a single row of pixels, it becomes a signal defined on a one-dimensional space, similar in form to the time-varying signals of traditional signal processing, and a time-varying signal contains certain frequency components. The frequency of an image, also called spatial frequency, reflects how the gray values of its pixels vary across space.

Images generally contain a lot of high-frequency information, but with small amplitudes. High-frequency information mainly describes the edges and details of the image. For images that change rapidly in space, such as mountains full of gullies, the high-frequency components are relatively strong and the low-frequency components weak. Low-frequency information mainly captures the overall contours of the image. Because the visual sensitivity of the human eye is limited, removing part of the high-frequency information often makes little perceptible difference. This is the removal of visual redundancy.

Therefore, we can first transform the image into the frequency domain with the DCT (Discrete Cosine Transform) and then discard some of the high-frequency information. This reduces the amount of information and thus achieves compression.

How do we get into the frequency domain? Recall the Fourier transform: any periodic signal can be represented as a sum of sinusoids in a harmonic relationship. Since this article does not dive into the algorithmic and mathematical details, one graph will give you the general idea:

The transformation formula of DCT is as follows:

Where X is the residual block, Y is the transformed block, and A is:

Without going into the mathematical details, what we need to know is: what does the DCT process actually do?

Normally, the DCT is applied to 4×4 sub-blocks. Suppose the current residual block is:

(Image from:Transformation quantization: How to reduce visual redundancy)

Applying the formula above, we get:

(Image from:Transformation quantization: How to reduce visual redundancy)
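As a sketch of what the formula computes, here is a minimal Python implementation of the 4×4 DCT, Y = A·X·Aᵀ (the residual block below is made up for illustration and is not the one from the referenced article):

```python
import math

N = 4

# DCT transform matrix A: A[i][j] = c_i * cos((2j + 1) * i * pi / (2N)),
# with c_0 = sqrt(1/N) and c_i = sqrt(2/N) for i > 0.
A = [[(math.sqrt(1 / N) if i == 0 else math.sqrt(2 / N))
      * math.cos((2 * j + 1) * i * math.pi / (2 * N))
      for j in range(N)] for i in range(N)]

def matmul(P, Q):
    """Multiply two 4x4 matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def dct4x4(X):
    """Forward DCT of a 4x4 block: Y = A * X * A^T."""
    At = [[A[j][i] for j in range(N)] for i in range(N)]
    return matmul(matmul(A, X), At)

# A made-up residual block: smooth content, so energy should
# concentrate in the DC coefficient Y[0][0] at the upper left.
X = [[5, 4, 4, 3],
     [4, 4, 3, 3],
     [4, 3, 3, 2],
     [3, 3, 2, 2]]
Y = dct4x4(X)
```

For a perfectly flat block, every coefficient except the DC one comes out zero, which is exactly the "superposition of harmonics" picture described next.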

Meaning of the Coefficients

The key point here, which is rarely made clear online, is what each coefficient of the transformed block represents.

The central idea is the superposition of harmonics, similar to the Fourier-transform idea above.

In the basic theory of the DCT, any 4×4 image block can be expressed as a superposition of the following basis images, each multiplied by its own coefficient:

From the upper left to the lower right are the DC component and harmonics of progressively higher frequency. Each coefficient in the DCT-transformed block is exactly the coefficient of the basis image at the corresponding position in the figure above.

(Image from:Transformation quantization: How to reduce visual redundancy)

In the figure above, for example, -10 is the coefficient of the DC component, 35 is the coefficient of the second basis image in the first row, and so on.

Notice that in the transformed block, the coefficients near the upper left are usually larger and those near the lower right smaller. This is because high-frequency coefficients are generally small, while low-frequency coefficients are generally large.

Quantization

The DCT transform itself does not compress anything; that requires the next step, quantization.

Quantization is essentially a division operation. Dividing shrinks the coefficients, and since high-frequency coefficients have relatively small amplitudes, they are more easily quantized to 0, achieving compression.

The quantization formula is as follows:

X is the coefficient after the DCT transform, QP is the quantization parameter, round means rounding, and FQ is the quantized coefficient.

Quantizing the 4×4 block from the example above with QP = 30 gives:

(Image from:Transformation quantization: How to reduce visual redundancy)

As you can see, most of the coefficients are now zero: quantization has removed the high frequencies, and the lower-right corner is essentially all zeros.
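A minimal sketch of the quantization step. The QP→Qstep mapping and the rounding rule below are simplified assumptions for illustration (real H.264 uses a table of step sizes and a rounding offset); the only property relied on here is that Qstep roughly doubles every time QP increases by 6:

```python
def qstep_from_qp(qp):
    """Simplified rule: Qstep doubles every 6 QP steps,
    with a base value of 0.625 at QP = 0."""
    return 0.625 * (2 ** (qp / 6))

def quantize(block, qp):
    """FQ = round(X / Qstep), rounding half away from zero
    (a simplification of the real H.264 rounding offset)."""
    qstep = qstep_from_qp(qp)
    def rnd(x):
        q = abs(x) / qstep
        return int(q + 0.5) * (1 if x >= 0 else -1)
    return [[rnd(x) for x in row] for row in block]

# With QP = 30, Qstep is 20, so a coefficient of 35 maps to 2
# while small high-frequency coefficients collapse to 0.
```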

(Two additional notes on transform and quantization in H.264:

1. Because the DCT involves floating-point arithmetic, it is relatively slow, so H.264 also introduced the Hadamard transform, which converts image blocks to the frequency domain more quickly. It is worth mentioning that an additional Hadamard transform, similar to the DCT, is applied to the luma 16×16 intra-prediction blocks and chroma 8×8 prediction blocks of H.264.

2. The floating-point arithmetic in the DCT causes results to drift between decoders on different machines due to precision differences, producing errors. To reduce this, the DCT is replaced with an integer transform, and the floating-point scaling of the DCT is merged into the quantization process, so that only one floating-point step remains.

These are mathematical details, though, so this article will not go deeper into them.)
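For reference, the 4×4 integer core transform of H.264 replaces the cosine matrix with a small integer matrix Cf, with the per-coefficient scaling folded into the quantization step; a minimal sketch:

```python
# H.264 4x4 forward core transform matrix (an integer approximation
# of the DCT; the per-coefficient scaling is merged into quantization).
Cf = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def mat4(P, Q):
    """Multiply two 4x4 matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def core_transform(X):
    """W = Cf * X * Cf^T; in real encoders this needs only
    integer additions, subtractions, and shifts."""
    CfT = [[Cf[j][i] for j in range(4)] for i in range(4)]
    return mat4(mat4(Cf, X), CfT)
```

Because every entry of Cf is an integer, the same input block yields bit-identical results on every machine, which is exactly the drift problem the integer transform solves.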

Entropy coding

The step in video coding that does the real "compression" is the removal of information-entropy redundancy. Removing the spatial, temporal, and visual redundancy described above is really preparation for this step.

Note that both the residuals from inter-frame prediction and the subsequent transform and quantization aim to turn the data into as many consecutive zeros as possible, which a suitable coding algorithm then encodes into the final bitstream.

Reordering

To give the bitstream as many '0's as possible, and exploiting the fact that coefficients near the lower-right corner are mostly '0' after transform and quantization, a "Zigzag" scan is used so that the resulting number sequence contains long runs of consecutive '0's.

For example, scanning the block in the figure above in "Zigzag" order yields the sequence "0200010000000000", which contains long runs of consecutive '0's.
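A sketch of the 4×4 Zigzag scan; the quantized block below is a made-up example chosen to reproduce the "0200010000000000" sequence from the text:

```python
# Zigzag scan order for a 4x4 block: (row, col) pairs running from
# the upper left (low frequency) to the lower right (high frequency).
ZIGZAG_4x4 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2),
              (0, 3), (1, 2), (2, 1), (3, 0), (3, 1), (2, 2),
              (1, 3), (2, 3), (3, 2), (3, 3)]

def zigzag_scan(block):
    """Flatten a 4x4 block into a list in zigzag order."""
    return [block[r][c] for r, c in ZIGZAG_4x4]

# Made-up quantized block: nonzero values in the upper left,
# zeros toward the lower right.
quantized = [[0, 2, 1, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
sequence = "".join(str(v) for v in zigzag_scan(quantized))
# sequence == "0200010000000000": long runs of consecutive zeros.
```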

Coding algorithm

The reordered number sequence is then encoded and compressed to form the final bitstream. Encoding schemes are generally divided into fixed-length and variable-length coding: UTF-16 and ASCII are common fixed-length encodings, while Huffman coding is a common variable-length encoding. Generally speaking, variable-length coding saves more space when short, high-frequency data dominates. In H.264, depending on the characteristics of each data segment, the bitstream parameters use fixed-length coding and Exp-Golomb coding, while the video content itself uses context-adaptive coding (CABAC, which is arithmetic coding, or CAVLC, which is variable-length coding). The specific algorithms are not detailed here, but the core idea is to exploit the consecutive '0's to compress as much as possible.
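As an illustration of why small values, and 0 in particular, are cheap to store, here is a sketch of the unsigned Exp-Golomb code used for many H.264 bitstream parameters:

```python
def exp_golomb_ue(k):
    """Unsigned Exp-Golomb code ue(v): write (k + 1) in binary,
    prefixed by as many leading zeros as its bit length minus one."""
    bits = bin(k + 1)[2:]              # binary representation of k + 1
    return "0" * (len(bits) - 1) + bits

# The value 0 costs a single bit; larger values cost more bits:
for v in range(5):
    print(v, exp_golomb_ue(v))
# 0 "1", 1 "010", 2 "011", 3 "00100", 4 "00101"
```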

You have probably seen the programming exercise that compresses the string "aaaabbbccccc" into "4a3b5c", shrinking it from 12 characters to 6. This is called run-length encoding. Entropy coding applies an idea similar to run-length encoding to the pixel data scanned from the image. However, run-length encoding does not always compress: if the input were "abcdabcdabcd", the encoded result would be "1a1b1c1d1a1b1c1d1a1b1c1d", growing from 12 characters to 24. To make the output smaller, the data should contain runs that are as long as possible, ideally runs of zeros, because zeros can be stored in the fewest bits (with Exp-Golomb coding, a zero can take up just one bit). The coding here uses a similar idea.
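The run-length idea from the exercise, as a minimal sketch:

```python
def run_length_encode(s):
    """Collapse each run of identical characters into count + character."""
    if not s:
        return ""
    out, count, prev = [], 1, s[0]
    for ch in s[1:]:
        if ch == prev:
            count += 1
        else:
            out.append(f"{count}{prev}")
            prev, count = ch, 1
    out.append(f"{count}{prev}")      # flush the final run
    return "".join(out)

print(run_length_encode("aaaabbbccccc"))   # "4a3b5c": 12 chars -> 6
print(run_length_encode("abcdabcdabcd"))   # grows: 12 chars -> 24
```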

Putting the Whole Process Together

The material so far has been fragmentary, so let us finally knead it together into a single process. The figure below shows the entire H.264 encoding pipeline, which splits into a forward encoding branch and a backward reconstruction (decoding) branch.

Forward encoding branch: a macroblock first goes through intra- or inter-frame prediction to remove temporal and spatial redundancy, then through transform and quantization to remove visual redundancy, and finally through the reordering and entropy coding stages. The earlier steps all try to shrink the data and produce more consecutive '0's; entropy coding is the step that actually encodes the data into a bitstream.

Backward reconstruction branch: the encoded macroblock is decoded and reconstructed again in order to provide reference frames for inter-frame prediction, because the reference macroblocks used in inter prediction are the reconstructions of previously encoded macroblocks.

The decoding process is basically the same as the encoder's backward reconstruction branch:

Conclusion

That's it for the principles of H.264-based video coding. To wrap up, here is the overall video coding flow, combining this article with the previous post, Analyzing the Principles of Video Coding — from Sun Yizhen's Films (1):

I have also made a mind map of the above for easier understanding:

I hope you can understand the video coding process after reading this article.

Video coding technology is complex and my knowledge is limited, so please point out any mistakes ~

Reference articles:

- How is video encoded and compressed?
- Intra-frame prediction: How to reduce spatial redundancy?
- Inter-frame prediction: How to reduce temporal redundancy?
- Transformation quantization: How to reduce visual redundancy?
- In-depth understanding of H264 I-frames, P-frames, and B-frames
- Video codec technology: H.264 and MPEG-4 Video Compression — video coding for next-generation multimedia

Writing original content is not easy. If this article helped you, don't forget to like and follow — it is real encouragement for the author ~