
For more posts, see the complete catalog of the series: The Romantic Carriage of Audio and Video System Learning.

YUV Color Encoding | Analyzing H264 Video Coding Principles: From a Son Ye-jin Film (1) | Analyzing H264 Video Coding Principles: From a Son Ye-jin Film (2)

There is a star on the other shore, and you know I miss you; how many loves are just like moonlight on the sea.

Every time I hear Li Jian's song, scenes from "If Love Has Providence" (The Classic) immediately float into my mind. Nowadays almost every movie we watch has been compressed by encoding, which makes video coding a key link in video development.

The previous post in this audio and video series, YUV Color Encoding, introduced the basic color encoding used for video. With color encoding covered, we can naturally move on to the core territory of video development ~

Starting today I will talk about video coding technology, planned as two posts: the first explains the basic principles of encoding, and the second explains the concrete structure of the bitstream. These two posts are among the most important in the whole series: they form the core of the theoretical knowledge of video and lay the foundation for all the video development that follows. Only with a thorough understanding of coding principles and the bitstream can you do video development well, because video development is no longer a matter of simply calling APIs; without that understanding you will have no idea where to start when problems appear, or even how to use the APIs correctly.

Encoding here can be understood as compression: turning the raw YUV data into much smaller data. This post covers only the basic principles of video coding, without the specific mathematics, looking at the whole encoding process from a macro perspective.

We already know how pixels are represented in video, so suppose we have a movie with a resolution of 1080p, a frame rate of 25 fps and a duration of 2 hours. In YUV420 format each pixel takes 1.5 bytes, so without compression its size would be 1920 × 1080 × 1.5 × 25 × 2 × 3600 ≈ 521.4 GB. How many such movies could a computer store? And for network transmission the traffic and bandwidth cost would be enormous (remember that at any given moment plenty of people in your neighborhood are also watching videos). So, to serve the masses, video has to be compressed, and video compression technology was born.
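To make the arithmetic concrete, here is a minimal Kotlin sketch of the calculation above (the format assumption of 1.5 bytes per pixel for YUV420 is as stated in the text):

```kotlin
// Quick sanity check of the figure above. Assumes YUV420 (1.5 bytes per pixel);
// the numbers are purely illustrative.
fun rawVideoSizeBytes(width: Int, height: Int, fps: Int, durationSec: Long): Long =
    width.toLong() * height * fps * durationSec * 3 / 2   // 1.5 bytes per pixel

fun main() {
    val bytes = rawVideoSizeBytes(1920, 1080, 25, 2L * 3600)
    println("Raw size: %.1f GB".format(bytes / 1024.0 / 1024.0 / 1024.0))  // about 521.4 GB
}
```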

Introduction to H.264

When talking about video coding technology, the famous H.264 has to be mentioned. Common coding standards on the market include H.264, H.265, VP8, VP9 and AV1, and among them H.264 is by far the most common.

H.264 and H.265 are coding standards developed jointly by the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU), while VP8 and VP9 were developed by Google and AV1 by the Alliance for Open Media (which Google co-founded). VP8, VP9 and AV1 are all royalty-free and were created largely to counter the high patent fees of the H.26x standards.

Since H.264 is the most commonly used coding standard, this article mainly introduces it. H.264 is the generation of digital video compression format after MPEG-4, jointly proposed by the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU). It can also be seen as the point where the ITU's H.26x series and ISO's MPEG series, after years of competing with each other, finally joined forces for the common benefit of everyone; H.264 is the product of that cooperation.

The figure above gives a rough history of how the two organizations went from rivalry to cooperation; a general impression is enough. Next is the introduction section of the H.264 paper.

Video coding fundamentals

General idea of coding

Since it is called compression, redundant information must be removed. Broadly speaking, redundant information is either data that can simply be dropped or represented in a more compact way, or information that human perception is not sensitive to, so that removing it is hard to notice. For Android developers, the most familiar compression is Bitmap compression, which commonly comes in two kinds: resolution compression, where the common image frameworks downsample the picture to the size of the view (removing redundant data), and quality compression, which removes information people barely perceive. Video has similar kinds of redundant information:

  1. Spatial redundancy: adjacent pixels within a frame tend to be very similar.
  2. Temporal redundancy: the content of adjacent frames tends to be similar.
  3. Visual redundancy: information that the human eye is not sensitive to.

H.264 tackles each of these kinds of redundancy in turn, mainly through the following steps (a toy sketch of the whole pipeline follows the list):

  1. Intra-frame prediction: removes spatial redundancy.
  2. Inter-frame prediction (motion estimation and compensation): removes temporal redundancy.
  3. Integer discrete cosine transform (DCT): converts spatially correlated data into largely uncorrelated frequency-domain data, which is then quantized, addressing visual redundancy.
  4. Entropy coding: the actual coding step, which turns the data produced by the previous three stages into the final bitstream.
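Before going into details, here is a deliberately toy Kotlin sketch of those four stages for a single block. The transform and entropy stages are crude stand-ins (real H.264 uses an integer DCT plus CAVLC/CABAC), but the flow from prediction to residual to a compact coded output is the same idea:

```kotlin
// Toy sketch of the encoder pipeline for one block of samples.
// Every stage here is a simplified stand-in, not a real codec API.
fun predictResidual(actual: IntArray, predicted: IntArray): IntArray =
    IntArray(actual.size) { actual[it] - predicted[it] }            // stages 1/2: prediction -> residual

fun toyTransformAndQuantize(residual: IntArray, step: Int): IntArray =
    IntArray(residual.size) { residual[it] / step }                 // stage 3: stand-in for DCT + quantization

fun toyEntropyCode(coeffs: IntArray): List<Pair<Int, Int>> {        // stage 4: run-length pairs (value, run)
    val out = mutableListOf<Pair<Int, Int>>()
    var i = 0
    while (i < coeffs.size) {
        var run = 1
        while (i + run < coeffs.size && coeffs[i + run] == coeffs[i]) run++
        out.add(coeffs[i] to run)
        i += run
    }
    return out
}

fun main() {
    val actual    = intArrayOf(102, 104, 103, 101)
    val predicted = intArrayOf(100, 100, 100, 100)
    val residual  = predictResidual(actual, predicted)              // [2, 4, 3, 1]: small numbers
    val quantized = toyTransformAndQuantize(residual, step = 4)     // [0, 1, 0, 0]: mostly zeros
    println(toyEntropyCode(quantized))                              // [(0, 1), (1, 1), (0, 2)]
}
```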

Read on for details:

Intra-frame prediction

Spatial redundancy

The brightness and chroma of adjacent pixels in an image are usually close to each other and change gradually rather than abruptly. In other words, an image has spatial correlation, and video compression exploits this correlation to remove spatial redundancy.

Back to my favorite Son Ye-jin scene from the beginning of the film "If Love Has Providence":

If you circle a few areas whose brightness and chroma are very close, could each of these areas be expressed with only a small amount of data instead of recording every pixel in full?

For example, in Android development we do not need to specify every pixel of a gradient image; we only record the start and end colors, plus the gradient position and direction.
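As a rough Android analogy (just an illustration of "store the rule instead of the pixels", not something H.264 itself does), a GradientDrawable describes a whole region with nothing more than two colors and a direction:

```kotlin
import android.graphics.Color
import android.graphics.drawable.GradientDrawable

// A full-screen gradient described by two colors and a direction,
// instead of by millions of individual pixel values.
val skyGradient = GradientDrawable(
    GradientDrawable.Orientation.TOP_BOTTOM,                  // gradient direction
    intArrayOf(Color.parseColor("#87CEEB"), Color.WHITE)      // start and end colors only
)
```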

Video removes this redundancy in a similar way, called intra-frame prediction: the pixel values to be encoded are predicted from the values of adjacent, already-encoded pixels, which reduces spatial redundancy.

Specific prediction method

The overall idea is to use the already-encoded part of an image to predict the not-yet-encoded part. The difference between the actual value and the predicted value is called the residual. What actually gets encoded is the residual, and because residuals are generally small, encoding them takes far fewer bits than encoding the raw data.

Birds of a feather flock together, and the idea of divide and conquer is at work again. To let the encoded part predict the unencoded part, a frame is divided into blocks according to its content, and some blocks can then predict other blocks. In H.264 the image is divided into macroblocks for intra-frame prediction: a macroblock can be predicted from its adjacent, already-encoded macroblocks, and all pixels within one block share a single prediction mode. So what is a macroblock?

For example, suppose an image looks like this. H.264 by default uses a 16×16-pixel area as a macroblock: the luma block is 16×16 and each chroma block is 8×8, with one luma (Y) block corresponding to two chroma (UV) blocks. In intra-frame prediction, luma and chroma blocks are predicted separately and independently. Take the top-left area, for example:

H.264 uses plain 16×16 macroblocks for relatively flat regions. To achieve a higher compression rate on regions with complex detail, a 16×16 macroblock can be further divided into smaller sub-blocks of size 8×16, 16×8, 8×8, 4×8, 8×4 or 4×4, which is very flexible.

Back to Son Ye-jin again: the red blocks are the 16×16 macroblocks, and the green arrows point to the further division into sub-blocks:

A 4×4 block has nine intra prediction modes in total: eight directional modes and one DC mode. Let's briefly introduce each of them:

Vertical mode: each column of the current luma block copies the pixel directly above it, i.e. the corresponding pixel in the bottom row of the already-encoded block above.

Horizontal mode: each row of the current luma block copies the pixel directly to its left, i.e. the corresponding pixel in the rightmost column of the already-encoded block on the left.

DC mode: every pixel of the current luma block is predicted as the average of the bottom row of the encoded block above and the rightmost column of the encoded block on the left, so in DC mode every pixel gets the same predicted value.

Diagonal Down-Left mode: predicted by interpolating the pixels of the block above and the block to the upper right; the mode is invalid if either of them does not exist.

Diagonal Down-Right mode: predicted by interpolating the pixels of the block above, the block to the left and the top-left corner pixel; if any of the three does not exist, the mode is invalid.

Vertical-Right mode: predicted by interpolating the pixels of the block above, the block to the left and the top-left corner pixel.

Horizontal-Down mode: predicted by interpolating the pixels of the block above, the block to the left and the top-left corner pixel; all three must exist, otherwise the mode is invalid.

Vertical-Left mode: predicted by interpolating the pixels in the bottom row of the block above and the block to the upper right.

Horizontal-Up mode: predicted by interpolating the pixels of the block to the left.

To sum up:

16×16 luma blocks and 8×8 chroma blocks share the same set of prediction modes; there are four intra prediction modes:

The first three (Vertical, Horizontal, DC) already appeared among the 4×4 modes; only the last one, Plane mode, is new.

Plane mode predicts each pixel of the block from the bottom row of the encoded block above and the rightmost column of the encoded block on the left, combined by a specific formula.

Each block can use only one prediction mode, so how is it chosen? The actual algorithm is complicated, but the general idea is: for each candidate mode, generate the prediction block, subtract it from the actual block to be encoded to get the residual block, then compute a cost from the residual (different cost functions are used in different scenarios) and choose the mode with the lowest cost as the optimal prediction mode.
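Here is a minimal Kotlin sketch of this idea for one 4×4 luma block, using only three of the nine modes (Vertical, Horizontal, DC) and plain SAD (sum of absolute differences) as the cost; real encoders evaluate all modes with much more elaborate rate-distortion costs:

```kotlin
import kotlin.math.abs

// Sum of absolute differences between an actual block and a predicted block.
fun sad(a: IntArray, b: IntArray) = a.indices.sumOf { abs(a[it] - b[it]) }

// `above` = bottom row of the encoded block above, `left` = rightmost column of the encoded block on the left.
fun predict4x4(mode: String, above: IntArray, left: IntArray): IntArray {
    val p = IntArray(16)
    for (y in 0 until 4) for (x in 0 until 4) {
        p[y * 4 + x] = when (mode) {
            "V"  -> above[x]                               // copy the pixel directly above
            "H"  -> left[y]                                // copy the pixel directly to the left
            else -> (above.sum() + left.sum() + 4) / 8     // DC: rounded average of both edges
        }
    }
    return p
}

fun main() {
    val above  = intArrayOf(100, 102, 101, 103)
    val left   = intArrayOf(101, 100, 102, 101)
    val actual = IntArray(16) { 100 + it % 3 }             // toy "real" 4x4 block
    val best = listOf("V", "H", "DC").minByOrNull { mode ->
        sad(actual, predict4x4(mode, above, left))         // residual cost for each mode
    }
    println("Chosen intra mode: $best")
}
```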

As the saying goes, to do a good job one must first sharpen one's tools. All of the above is theory and may still feel a bit hazy, so we can use H264Visa, an excellent stream analysis tool, to inspect the actual coded data of an H.264 file and get a real feel for how encoding works.

Here is a video of a Douyin streamer:

Convert it to a raw H.264 stream with FFmpeg and open it with H264Visa:

For now let's just look at frame 1 (the first frame must be an I frame, which uses intra-frame prediction only; I frames will be covered in more detail later):

Each number above corresponds to one macroblock; it can show the macroblock's quantization parameter QP (quantization will be covered later), or be switched to show the macroblock's bitstream size.

If you select a macroblock, you can view its basic information:

In the red box on the right you can see the macroblock's coordinates, size, type and QP: it is an I macroblock (using intra-frame prediction only) with 4×4 sub-blocks, located at coordinates (16, 48).

You can also see directly which prediction modes the macroblock uses: luma (Y) and chroma (UV) are predicted separately, and with different block sizes.

This is the predicted frame; you can see it still differs from the original frame and looks a bit more blurry:

Subtracting the predicted frame from the original frame gives the residual frame:

The residual frame is almost unrecognizable, with only a faint outline left. This residual data, together with the related prediction information, is sent to the next stage, transform and quantization, and after coding it goes into the bitstream.

To look at the macroblock more quantitatively, check its raw YUV data in the original frame: Then the YUV data of the residual frame: You can see that after prediction most of the pixel values become 0, and it is this data that gets passed to the next stage. Precisely because the residual contains a large number of zeros (and the later transform and quantization increase that number further), putting this data into the bitstream after encoding greatly reduces the bitstream size.

Inter-frame prediction

Temporal redundancy

In a video, the change between two adjacent frames is usually small; this is the temporal correlation of video. Video typically plays 20 to 30 frames per second, so there is a great deal of repeated image data, and hence a great deal of room for compression.

Here are 5 consecutive frames of Son Ye-jin from the film:

Obviously the differences between the 5 frames are very small, so a natural idea is: can we record only the first frame in full, and then record only the differences for the following frames?

In video coding this means predicting the pixels of the block to be encoded from a block found in an already-encoded frame, thereby reducing temporal redundancy. The official name is inter-frame prediction.

Specifically, if a closely matching block can be found in a previous frame, the current block can be represented by that block plus a motion vector (and a small residual).

For example, take two consecutive frames (figure from "Inter-frame prediction: how to reduce temporal redundancy"): the trees in the background are still and only the car moves. If we set up a coordinate system over the whole picture, the car's movement can be represented by the motion vector (-16, 0) (16 because each macroblock is 16×16):

Back to the real video example: in an earlier frame, Son Ye-jin's left hand occupies one block:

A few frames later the block has moved, but its content is basically the same (there is only a small residual):

So the later frame does not need to record the image data of the left-hand block; it only needs to record the motion vector, the identifier of the reference macroblock and the residual data, which is far smaller than encoding the left hand's image data into the bitstream again. The earlier frame is called the reference frame of the later frame.

Similar to the green arrow shown below:

Open the motion-vector analysis view in the professional analysis tool:

You can see many dense thin lines representing motion vectors; the red line on the block of Son Ye-jin's left hand (green box), which we are focusing on, points roughly in the direction of the movement.

Motion estimation

An important concept in inter-frame prediction is motion estimation: finding, in an already-encoded image, the block that best matches the block currently being encoded.

As shown in the figure, suppose P is the current frame being encoded, Pr is the reference frame and B is the current block. Motion estimation needs to find, in Pr, the block Br whose difference from B gives the smallest residual; Br is called the best matching block of B.

In the Son Ye-jin video, motion estimation means finding, in the first picture above, the block that best matches her left hand in the second picture. Human eyes can match the left hand across the two frames instantly, but this is not easy for a computer, so the difficulty of inter-frame prediction lies in how to find the best reference block for the current block. Many complex motion search algorithms exist, falling mainly into two families:

(1) Full (global) search. Every candidate block in the search area is compared with the current block one by one, and the block with the minimum matching error is taken as the match. Its advantage is that it always finds the best matching block; its disadvantage is that it is far too slow, so full search is rarely used today. (2) Fast search. Matching blocks are searched according to certain mathematical patterns; it is fast, but may only find a sub-optimal match. A toy version of the full search is sketched below.
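To make "full search" concrete, here is a toy Kotlin sketch that slides a 16×16 block over a small search window and keeps the motion vector with the lowest SAD; the frame data, block position and search range are made up for illustration, and real encoders use fast search plus sub-pixel refinement:

```kotlin
import kotlin.math.abs
import kotlin.math.min

const val BLOCK = 16  // macroblock size

// SAD between the current block at (bx, by) and the reference block displaced by (dx, dy).
fun sadAt(cur: IntArray, ref: IntArray, w: Int, bx: Int, by: Int, dx: Int, dy: Int): Int {
    var sum = 0
    for (y in 0 until BLOCK) for (x in 0 until BLOCK)
        sum += abs(cur[(by + y) * w + bx + x] - ref[(by + y + dy) * w + bx + x + dx])
    return sum
}

// Full search: try every displacement within +/- range pixels and keep the best one.
fun fullSearch(cur: IntArray, ref: IntArray, w: Int, h: Int, bx: Int, by: Int, range: Int): Pair<Int, Int> {
    var bestCost = Int.MAX_VALUE
    var bestMv = 0 to 0
    for (dy in -range..range) for (dx in -range..range) {
        // skip candidates that fall outside the reference frame
        if (bx + dx < 0 || by + dy < 0 || bx + dx + BLOCK > w || by + dy + BLOCK > h) continue
        val cost = sadAt(cur, ref, w, bx, by, dx, dy)
        if (cost < bestCost) { bestCost = cost; bestMv = dx to dy }
    }
    return bestMv
}

fun main() {
    // Synthetic reference frame, plus a current frame whose content is shifted by (2, 1).
    val w = 64; val h = 64
    val rnd = java.util.Random(42)
    val ref = IntArray(w * h) { rnd.nextInt(256) }
    val cur = IntArray(w * h)
    for (y in 0 until h) for (x in 0 until w)
        cur[y * w + x] = ref[min(y + 1, h - 1) * w + min(x + 2, w - 1)]
    println(fullSearch(cur, ref, w, h, bx = 16, by = 16, range = 4))   // expected: (2, 1)
}
```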

For details, see "Inter-frame prediction: how to reduce temporal redundancy". What matters here is understanding the concepts of I frames, P frames and B frames.

GOP

Two compression techniques, intra-frame prediction and inter-frame prediction, have now been introduced. To achieve a better compression rate, different frames can be compressed in different ways. As we know, a video consists of several scenes, i.e. similar people or objects in a similar background or space. Each scene is made up of a number of frames, and these frames are usually strongly correlated, which makes them ideal for inter-frame prediction. Such a group of frames from one scene is called a GOP (Group of Pictures).

For example, Son Ye-jin sits by the window at home, opening her box of keepsakes:

This scene lasts several seconds and plays many, many frames, then cuts to a scene where she answers the phone:

These are two GOPs from the video. Based on the strong correlation of frames within a scene and the prediction theory above, the first frame of a scene can use intra-frame prediction to remove spatial redundancy; the following frames can then reference that first frame, or reference earlier frames that were themselves predicted; and to improve the compression rate further, a frame can even use more than one reference, for example the frames both before and after it. This is how I frames, P frames and B frames arise.

After compression, frames are divided into I frames, P frames and B frames:
  1. I frame: key frame, compressed using intra-frame prediction only.
  2. P frame: forward-predicted frame; during compression it references only previously processed frames, using both inter-frame and intra-frame techniques.
  3. B frame: bidirectionally predicted frame; during compression it references the frames both before and after it, using both inter-frame and intra-frame techniques.

Since P frames and B frames involve inter-frame prediction, macroblocks are likewise divided into I, P and B macroblocks: an I macroblock uses intra-frame prediction, a P macroblock uses forward reference prediction, and a B macroblock uses bidirectional reference prediction. I frames contain only I macroblocks; P frames contain P macroblocks and I macroblocks; B frames contain B macroblocks and I macroblocks.

So what exactly is a GOP?

A GOP is a sequence of images, which can generally be understood as the frames of one scene. For example, a movie segment where the protagonist is in a park can be put into one GOP because the overall picture changes little; when it cuts to the protagonist indoors, a new GOP starts. There is only one I frame in such an image sequence, as shown below. The first frame of every GOP is the I frame: it uses intra-frame prediction and is a fully coded frame that describes the image background and the moving subject in detail, without any motion vectors. A complete image can be reconstructed from the I frame's data alone during decoding, and it serves as the reference frame for the P frames and B frames.

The following frames use inter-frame prediction, referencing the I frame or already-encoded P frames. A P frame represents the difference between itself and the previous key frame (or P frame); it carries no complete picture, only the data that differs from the previous frame. A P frame takes the I frame or the preceding P frame as its reference: for each point of the P frame, a predicted value and a motion vector are found in the reference frame, and the prediction difference plus the motion vector are transmitted. At the receiving end, the predicted value is fetched from the reference frame according to the motion vector and combined with the difference to recover the sample value, so the complete P frame can be reconstructed.

A B frame is a bidirectional difference frame. The main difference from a P frame is that it references the frames both before and after it, i.e. a B frame records the differences between itself and both of its neighbouring reference frames. It takes the preceding I or P frame and the following P frame as references; for each point, a predicted value and two motion vectors are found, and the prediction difference and motion vectors are transmitted. Although B frames improve the compression rate, because they need to reference a later P frame they also introduce encoding delay (the encoder must wait for the later frame to be encoded first).
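As a tiny illustration of the "two references" idea: in the simplest (unweighted) case a B block's prediction is essentially the rounded average of its forward and backward reference blocks, and only the residual against that prediction has to be coded. The sample values below are made up:

```kotlin
// Minimal sketch of unweighted bi-prediction for a B block: the prediction is the
// rounded average of the forward and backward reference blocks.
fun biPredict(forwardRef: IntArray, backwardRef: IntArray): IntArray =
    IntArray(forwardRef.size) { (forwardRef[it] + backwardRef[it] + 1) / 2 }

fun main() {
    val fwd = intArrayOf(100, 102, 104, 106)    // matching block from the previous I/P frame
    val bwd = intArrayOf(104, 106, 108, 110)    // matching block from the following P frame
    val actual = intArrayOf(102, 104, 107, 108)
    val pred = biPredict(fwd, bwd)               // [102, 104, 106, 108]
    val residual = IntArray(actual.size) { actual[it] - pred[it] }
    println(residual.toList())                   // [0, 0, 1, 0]: only a tiny residual to encode
}
```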

Bidirectional prediction diagram of B frame:

Since inter-frame prediction requires already-encoded reference frames, a cache queue is needed to hold reconstructed frames (frames that have been encoded and then decoded again) to serve as references for later frames. Why not simply use the original macroblocks instead of specially re-decoding the encoded ones? The key is that the reference must match what the decoder will see: an encoded-then-reconstructed macroblock is not exactly identical to the original, so if the encoder and decoder used different reference frames for inter-frame prediction, the result would be wrong.

For example, B frames require reference frames before and after, so two cache queues are required:

One phenomenon arises precisely because of B frames: the encoding (and decoding) order of frames differs from their display order. This is why the two timestamps PTS and DTS exist. It is a very important point in later development work; if you mix them up, the resulting problems can be very hard to diagnose.
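A minimal sketch of why the two timestamps diverge; the frame pattern and timestamp values are made up for illustration:

```kotlin
// Display order: I B B P  ->  decode order: I P B B, because the B frames cannot be
// decoded until the later P frame they reference is available.
// DTS follows decode order, PTS follows display order.
data class Frame(val type: Char, val pts: Int, val dts: Int)

fun main() {
    val decodeOrder = listOf(
        Frame('I', pts = 0, dts = 0),
        Frame('P', pts = 3, dts = 1),   // decoded early, displayed late
        Frame('B', pts = 1, dts = 2),
        Frame('B', pts = 2, dts = 3),
    )
    val displayOrder = decodeOrder.sortedBy { it.pts }
    println(decodeOrder.map { it.type })    // [I, P, B, B]
    println(displayOrder.map { it.type })   // [I, B, B, P]
}
```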

An analogy: imagine a row of people, each of whom must say something; what they say is very similar, but we want the total number of words to be as small as possible. The first person says a complete sentence (as concise as possible); each later person says only what differs from the person before them, or from the people before and after them. The total amount said is then minimal, yet everyone's full statement can still be derived.

Obviously I frames have the lowest compression rate, P frames come next, and B frames compress the most.

One special kind of I frame deserves a mention: the IDR frame. Because inter-frame prediction always references previously encoded frames, if one frame is encoded or received incorrectly, the error may propagate to the frames after it, just as in the analogy: if one speaker gets a word wrong, everyone deriving their words from his will also be wrong. The role of the IDR frame is to block this error propagation: it forbids the frames after it in the same GOP from referencing any frame of the previous GOP. That way, if a frame goes wrong, the damage is confined to one GOP and does not spread to the next.

The longer the GOP, the fewer I frames are encoded and the higher the compression rate, but the picture quality tends to suffer (for example, prediction errors can accumulate and propagate further). So how long should a GOP be? For local video files it is usually a compromise between compression rate and quality chosen for the scenario. For live streaming the GOP is generally a small multiple of the frame rate and is not set too long: if it were, viewers entering the room would have to wait longer for an I frame (a GOP can only start decoding from its I frame), causing a black screen or stall. For on-demand video, an overly long GOP also makes seeking inconvenient.

Continuing the analysis of the earlier Douyin video with H264Visa, now look at frame 3: you can see the selected block is a 16×16 B macroblock in a B frame, which supports both intra-frame and inter-frame prediction.

Open the prediction-information tab and you can see the motion vector data (MV, on the right of the figure) is very rich, because this macroblock covers the streamer's hand, which is moving in the video.

For comparison, look at another, static macroblock:

You can see that the motion vectors are all 0.

Notice that this macroblock is of the skip type, meaning it is not written into the bitstream at all. Why? Because it is a non-moving macroblock that is exactly the same as the corresponding macroblock in the reference frame, there is no need to store its data in the current frame.

Conclusion

Due to length, that is it for the first half of the video coding principles. This post introduced the historical background of H.264, focused on the theoretical basis and concrete methods of its intra-frame and inter-frame prediction, and examined real data through the analysis tool. The next post will cover the rest of H.264 encoding, namely transform, quantization and entropy coding: Analyzing the Principles of Video Coding: From a Son Ye-jin Film (2).

Video coding technology is complex and my level is limited; if there are mistakes, please point them out ~

Reference articles: How is video encoded and compressed? | Intra-frame prediction: how to reduce spatial redundancy? | Inter-frame prediction: how to reduce temporal redundancy? | Transform and quantization: how to reduce visual redundancy? | H264 I frames, P frames and B frames | In-depth Understanding of Video Codec Technology | H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia

Original content is not easy to produce. If you found this article helpful, don't forget to like and follow; it is a real affirmation for the author ~