I, P, B frames of video coding

Video coding is all about finding ways to compress the size of the video data.

Inter-frame coding removes redundant information in time (temporal redundancy), and involves the following parts.

  • Motion compensation: predicting and compensating the current local image from a previous local image; this is an effective way to reduce the redundancy of a frame sequence.

  • Motion representation: different areas of the image need different motion vectors to describe their motion.

  • Motion estimation: a set of techniques for extracting motion information from a video sequence. A toy sketch of these ideas follows the list.
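To make these ideas concrete, here is a minimal sketch of block-based motion estimation, assuming grayscale frames stored as NumPy arrays and a brute-force full search over a small window; real encoders use far faster search strategies, sub-pixel refinement, and rate-distortion criteria, and all names here are illustrative:

```python
import numpy as np

def motion_estimate(ref, cur, block=8, search=4):
    """For each block of the current frame, find the displacement into the
    reference frame with the minimum sum of absolute differences (SAD)."""
    h, w = cur.shape
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            target = cur[by:by + block, bx:bx + block].astype(int)
            best_sad, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        cand = ref[y:y + block, x:x + block].astype(int)
                        sad = np.abs(target - cand).sum()  # matching cost
                        if best_sad is None or sad < best_sad:
                            best_sad, best_mv = sad, (dy, dx)
            vectors[(by, bx)] = best_mv  # one motion vector per block
    return vectors

ref = np.random.randint(0, 256, (32, 32))
cur = np.roll(ref, 2, axis=1)      # the whole picture shifted right by 2 pixels
mvs = motion_estimate(ref, cur)    # interior blocks report a vector of (0, -2)
```

Once the motion vectors are known, motion compensation amounts to copying the matched reference blocks, so only the (usually small) prediction residual remains to be stored.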

The redundant information in space can be removed by intra-frame coding.

For video, ISO also developed a standard: MPEG (Moving Picture Experts Group). The MPEG algorithms are designed for dynamic video compression: besides encoding individual images, they exploit the correlations within an image sequence to remove redundancy, which greatly improves the video compression ratio. MPEG has been updated continuously; the main versions are MPEG-1 (used for VCD), MPEG-2 (used for DVD), and MPEG-4 AVC (currently the most widely used for streaming media).

Compared with the MPEG video compression standards developed by ISO, the H.261, H.262, H.263, and H.264 video coding standards developed by ITU-T form a separate family. Among them, H.264 combines the advantages of the earlier standards and absorbs their accumulated experience into a simpler design, which made it easier to promote than MPEG-4. H.264 is now the most widely used standard. It introduced new compression techniques such as multiple reference frames, more block partition types, integer transforms, and intra-frame prediction, and it uses finer sub-pixel motion vectors (1/4, 1/8 pixel) together with a new-generation in-loop deblocking filter. Together these greatly improve compression performance and make the system more complete.

The concept of coding

In H.264, there are three types of frame data: I frames, P frames, and B frames.

IPB frames

In video compression, each frame represents one still image. Actual compression applies various algorithms to reduce the amount of data, and the I/P/B frame scheme is the most common of them.

  • I frame: an intra-coded picture. An I frame is usually the first frame of each GOP (group of pictures, described below) and is only moderately compressed, so that it can serve as a reference point for random access; it can be treated as a still image. An I frame can be regarded as the compressed product of a single image, typically achieving about a 6:1 compression ratio without any noticeable blurring. I-frame compression removes the spatial redundancy of the video; the P and B frames introduced next remove the temporal redundancy.

  • P frame: a predictive frame, i.e. an encoded picture that reduces the amount of transmitted data by removing the temporal redundancy relative to the previously encoded frame in the image sequence. A P frame represents the difference between this frame and the previous I frame (or P frame); when decoding, the difference defined by this frame must be superimposed on the previously cached picture to generate the final picture. (In other words, a P frame is a difference frame: it carries not the complete picture data but only the difference from the previous frame.)

  • B frame: a bi-directional interpolated prediction frame. Its reference frames are the preceding I or P frame and the following P frame. For each point of a B frame, the encoder determines a predicted value and two motion vectors, and transmits the prediction difference together with the motion vectors. Following the motion vectors, the receiver finds (calculates) the predicted value in the two reference frames and adds the difference to it to obtain the sample value of that point, thereby reconstructing the complete B frame. In other words, to decode a B frame the decoder needs not only the cached preceding picture but also the decoded following picture; the final picture is obtained by combining the data of the frames before and after with this frame's own data. B frames achieve a high compression rate, but decoding them puts a heavy load on the CPU.

Based on the definitions above, we can understand I, P, and B frames from the decoder's perspective. An I frame can by itself be decompressed into a single complete video picture by the decompression algorithm, so an I frame removes the redundant information in the spatial dimension of a video frame. A P frame must reference a preceding I or P frame to be decoded into a complete picture, and a B frame must reference both the preceding I or P frame and the following P frame; so what P and B frames remove is the redundant information in the temporal dimension. A toy sketch of the P-frame idea follows.
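As a toy illustration of the P-frame idea, assuming frames are NumPy arrays: here a "P frame" is just the raw pixel difference from the previously reconstructed picture (a real encoder stores a motion-compensated, transformed, and quantized residual rather than a raw difference; the names are illustrative):

```python
import numpy as np

def encode_p(prev_recon, cur):
    # Store only the difference from the previously reconstructed picture.
    return cur.astype(np.int16) - prev_recon.astype(np.int16)

def decode_p(prev_recon, residual):
    # Superimpose the difference on the previously cached picture.
    return (prev_recon.astype(np.int16) + residual).clip(0, 255).astype(np.uint8)

i_frame = np.zeros((4, 4), dtype=np.uint8)   # "I frame": a complete picture
next_frame = i_frame.copy()
next_frame[1, 1] = 255                       # the picture gains one bright dot
residual = encode_p(i_frame, next_frame)     # almost all zeros
assert np.array_equal(decode_p(i_frame, residual), next_frame)
```

Because the residual is mostly zeros, it compresses far better than a full picture, which is exactly why P frames are smaller than I frames.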

It may be more intuitive to explain this with a classic example, as shown in the figure below.

The I frame records the complete information, so nothing more needs to be said about it. Let's look at how the I frame relates to the P frame. Comparing with the original picture, the P frame has one more square than the I frame, so what the P frame ultimately stores is just the information for that little square. You can think of the I frame and the P frame as combining to reproduce the original picture.

Similarly, for the B frame: combining the B frame with the I frame and the P frame reproduces the original picture.

GOP (sequence) and IDR

In H.264, images are organized into sequences: a sequence is the data stream produced by encoding a run of images. The first image in a sequence is called an IDR image (Instantaneous Decoding Refresh image), and IDR images are always I-frame images. H.264 introduces the IDR image for decoder resynchronization: as soon as the decoder reaches an IDR image, it immediately empties the reference frame queue, outputs or discards all data decoded so far, searches for the parameter sets again, and starts a new sequence. This provides an opportunity to resynchronize when a major error occurred in the previous sequence; images after an IDR image are never decoded using data from images before it.

A sequence is a stream of data produced by encoding images whose content differs little. When motion is small, a sequence can be very long, because little motion means the picture content changes very little; the encoder can then produce one I frame followed by many P and B frames. When motion is large, a sequence may be short, for example one I frame plus three or four P frames.

In a video coding sequence, GOP stands for Group of Pictures and refers to the distance between two I frames, while Reference refers to the distance between two P frames; the group of pictures between two I frames forms one GOP. A toy sketch of the IDR resynchronization behavior follows the schematic below.

【GOP Schematic Diagram 】
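A toy sketch of the IDR behavior described above, assuming a hypothetical frame representation with `num` and `type` fields (real H.264 reference-list management is far more elaborate):

```python
def decode_stream(frames):
    """Toy decode loop showing how an IDR frame resets decoder state."""
    reference_queue = []  # previously decoded pictures available for prediction
    for frame in frames:
        if frame["type"] == "IDR":
            # Resynchronize: nothing after an IDR may reference anything
            # before it, so the reference queue is emptied immediately.
            reference_queue.clear()
        # Stand-in for real decoding, which would predict from the queue.
        picture = (frame["num"], list(reference_queue))
        if frame["type"] in ("IDR", "I", "P"):  # B frames not kept as references here
            reference_queue.append(frame["num"])
        yield picture

gop = [{"num": n, "type": t} for n, t in
       enumerate(["IDR", "P", "B", "P", "IDR", "P"], start=1)]
for num, refs in decode_stream(gop):
    print(f"frame {num} may reference {refs}")
```

Note how frame 6 can only reference pictures from frame 5 onward: the IDR guarantees that nothing after it depends on anything before it.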

PTS and DTS

【Why there are PTS and DTS concepts】

As described above, a P frame needs to reference the preceding I or P frame to generate a complete picture, while a B frame needs to reference both the preceding I or P frame and the following P frame. This creates a problem: in the video stream, a B frame cannot be decoded as soon as it arrives; it has to wait until the later frame it depends on has been decoded. As a result, the decoding order and the playback order no longer match. How, then, should these frames be played? This is where two more concepts come in: DTS and PTS.

【PTS and DTS】

First, the basic concepts of PTS and DTS:

  • DTS (Decoding Time Stamp): the decoding time stamp; it tells the player when to decode this frame of data.

  • PTS (Presentation Time Stamp): the display time stamp; it tells the player when to display this frame.

Although DTS and PTS are used to guide the behavior of the playback side, they are generated by the encoder at encoding time. A small illustration follows.
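As an illustration, with hypothetical integer timestamp values (in a real container these would be expressed in a time base such as 1/90000 s):

```python
from dataclasses import dataclass

@dataclass
class Packet:
    frame_type: str
    dts: int  # when to decode this frame
    pts: int  # when to display this frame

# Without B frames, DTS and PTS coincide. A B frame pulls them apart,
# because the P frame it depends on must be decoded before it is shown.
packets = [
    Packet("I", dts=1, pts=1),
    Packet("P", dts=2, pts=4),  # decoded early, displayed after the B frames
    Packet("B", dts=3, pts=2),
    Packet("B", dts=4, pts=3),
]
display = sorted(packets, key=lambda p: p.pts)
print([p.frame_type for p in display])  # ['I', 'B', 'B', 'P']
```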

During video capture, frames are recorded and fed to the encoder one at a time, and the PTS is generated at encoding time. Pay special attention to the order in which frames are encoded: in the usual scenario, the codec first encodes an I frame, then skips ahead a few frames and encodes a P frame using that I frame as its reference, and then jumps back to the frame right after the I frame. The frames between the encoded I and P frames are encoded as B frames. After that, the encoder skips ahead a few frames again, encodes another P frame using the first P frame as its reference, and then jumps back once more to fill the gaps in the display sequence with B frames. This process repeats, with a new I frame inserted every 12 to 15 P and B frames. Because a P frame is predicted from the preceding I or P frame, while a B frame is predicted from the two surrounding P frames (or from one I frame and one P frame), the encoding/decoding order differs from the display order, as shown below:

The process looks something like this. Let's go through it step by step.

First come the frame numbers. Initially the frames are numbered sequentially; there is no doubt about that. Each of these sequential frames is then assigned a type: as mentioned above, the first frame is an I frame, the second is a B frame, and so on, which gives us the frame types. For normal playback we must play in the original order, that is, frames 1…7 in sequence, so the PTS values also follow the order 1…7.

Now comes the encoding order, and this is the key point: the codec encodes an I frame, then skips ahead a few frames and encodes a future P frame using that I frame as its base, and then jumps back to the frame right after the I frame. The frames between the encoded I and P frames are encoded as B frames. As shown below.

Following the sequence from step one to step seven, the frame numbers in coding order are 1, 4, 2, 3, 7, 5, 6.

The frame types corresponding to these numbers are I, P, B, B, P, B, B. A sketch reproducing this reordering follows.
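A sketch that reproduces this reordering, assuming the simple rule described above: each I or P anchor frame is encoded first, then the B frames buffered between it and the previous anchor (the function name is illustrative):

```python
def encode_order(display_types):
    """Reorder display-order frame types into encode/decode order."""
    order, pending_b = [], []
    for num, ftype in enumerate(display_types, start=1):
        if ftype == "B":
            pending_b.append((num, ftype))  # hold until the next anchor arrives
        else:
            order.append((num, ftype))      # encode the I/P anchor now...
            order.extend(pending_b)         # ...then the buffered B frames
            pending_b = []
    order.extend(pending_b)
    return order

print(encode_order(["I", "B", "B", "P", "B", "B", "P"]))
# [(1, 'I'), (4, 'P'), (2, 'B'), (3, 'B'), (7, 'P'), (5, 'B'), (6, 'B')]
```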

The decoding timestamps (DTS) are simply sequential, i.e. 1, 2, 3, 4, 5, 6, 7.

Do you see the problem? If we displayed the frames according to the decoding timestamps, the display order would be IPBBPBB, which is not the same as our original IBBPBBP order. That would amount to modifying our video, so the output has to be adjusted according to PTS. How is it adjusted?

First look at the received video stream (frame type) row. This row corresponds to the original order: from each frame type in this row we find the frame number that carries that type, and from the frame number we obtain the corresponding entry in the PTS row.

Finally, the frames are rearranged according to the corresponding PTS data, which yields the adjusted display order. The whole process is a bit convoluted; if it is not clear yet, go through it a few times and it will click. A decoder-side sketch of the rearrangement follows.
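To close the loop, here is a decoder-side sketch of that adjustment: frames arrive in decode (DTS) order, and a small reorder buffer keyed on PTS releases each frame only when its presentation time comes up (toy logic, using the same example as above):

```python
import heapq

def display_order(decoded):
    """Reorder (pts, frame_type) pairs arriving in decode order back into
    presentation order using a min-heap on PTS."""
    heap, next_pts = [], 1
    for pts, ftype in decoded:
        heapq.heappush(heap, (pts, ftype))
        # Emit every buffered frame whose presentation time has arrived.
        while heap and heap[0][0] == next_pts:
            yield heapq.heappop(heap)
            next_pts += 1

# Decode order from the example: frame numbers 1 4 2 3 7 5 6, whose PTS
# values are simply the original display-order frame numbers.
decoded = [(1, "I"), (4, "P"), (2, "B"), (3, "B"), (7, "P"), (5, "B"), (6, "B")]
print(list(display_order(decoded)))
# [(1, 'I'), (2, 'B'), (3, 'B'), (4, 'P'), (5, 'B'), (6, 'B'), (7, 'P')]
```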