Preface

Video playback can be understood simply as presenting frames one after another in chronological order, much like drawing a picture on every page of a book and flipping through the pages quickly.

In practice, however, not every frame is a complete picture. If every frame were a complete picture, the video would be very large and the cost of transmitting or storing it would be too high, so most of the frames are compressed (encoded). Depending on how they are compressed, video frames are divided into different types: I frames, P frames, and B frames.

I, P, and B frames

I frames (intra-coded frames): an I frame is encoded using intra-frame compression only, which means it exploits the spatial correlation within a single image but not the temporal correlation between images, and it does not use motion compensation. Because an I frame does not depend on any other frame, it serves as the entry point for random access and as a reference frame for decoding. I frames are mainly used for receiver initialization, channel acquisition, and program switching or insertion.

The compression ratio of an I frame is relatively low. I frames appear periodically in the image sequence, at a frequency chosen by the encoder.

P frames (predicted frames): P and B frames use inter-frame encoding, which means they exploit spatial and temporal correlation at the same time. A P frame uses only forward temporal prediction, which improves compression efficiency and image quality. A P frame can also contain intra-coded parts: each macroblock in a P frame can be either forward-predicted or intra-coded.

B frames (bi-directionally predicted frames): a B frame uses bi-directional temporal prediction, which greatly improves the compression ratio. Note that because a B frame uses future frames as references, the transmission order and the display order of frames in an MPEG-2 stream are different.

In other words, an I frame can be decoded into a complete image without relying on any other frame, while P frames and B frames cannot. A P frame relies on a frame that precedes it in the video stream to decode its image; a B frame relies on frames that precede and/or follow it in the video stream.

This creates a problem: in the video stream, a B frame cannot be decoded as soon as it arrives; it has to wait until the I and P frames it depends on have been decoded. As a result, the decoding order and the playback order no longer match. How, then, should these frames be played? This brings us to two other concepts: DTS and PTS (described below).

The frames between two I frames form a GOP. In x264, the number of B frames can also be set with a parameter (bf), that is, the number of B frames between an I frame and a P frame, or between two P frames.
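As an illustration, with FFmpeg's C API the GOP length and the maximum number of consecutive B frames can be set on the encoder context before it is opened. This is only a sketch; the values 50 and 3 below are example choices, not recommendations.

    #include <libavcodec/avcodec.h>

    /* Sketch: configure GOP length and B-frame count on an H.264 encoder
     * context. The values 50 and 3 are example choices, not recommendations. */
    AVCodecContext *open_h264_encoder(int width, int height)
    {
        const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
        if (!codec)
            return NULL;

        AVCodecContext *ctx = avcodec_alloc_context3(codec);
        ctx->width        = width;
        ctx->height       = height;
        ctx->time_base    = (AVRational){1, 25};  /* 25 fps */
        ctx->pix_fmt      = AV_PIX_FMT_YUV420P;
        ctx->gop_size     = 50;   /* at most 50 frames between two I frames */
        ctx->max_b_frames = 3;    /* up to 3 B frames between reference frames */

        if (avcodec_open2(ctx, codec, NULL) < 0) {
            avcodec_free_context(&ctx);
            return NULL;
        }
        return ctx;
    }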

From the above it follows that, if B frames are used, the last frame of a GOP must be a P frame.

Looking at the x264 code, a GOP appears to span from one IDR frame to the next IDR frame. In a video coding sequence, GOP (Group of Pictures) refers to the distance between two I frames, and Reference refers to the distance between two P frames. An I frame occupies more bytes than a P frame, and a P frame occupies more bytes than a B frame.

Therefore, at a constant bit rate, the larger the GOP, the more P and B frames there are, the more bytes each I, P, and B frame gets on average, and the easier it is to achieve good image quality. Similarly, the larger the Reference, the more B frames there are, and the easier it is to achieve good image quality.

Note, however, that there is a limit to how much increasing the GOP can improve image quality. When the scene changes, the H.264 encoder automatically forces an I frame to be inserted, which shortens the actual GOP. On the other hand, within a GOP, the P and B frames are predicted from the I frame: if the I frame has poor quality, the quality of all subsequent P and B frames in that GOP suffers and cannot recover until the next GOP starts, so the GOP should not be too large.

At the same time, because P and B frames are more complex to encode than I frames, too many of them reduce coding efficiency. A long GOP also hurts the responsiveness of seek operations: since P and B frames are predicted from the I frame or from preceding P frames, a seek cannot simply jump to an arbitrary P or B frame and decode it; the decoder must first decode the I frame of that GOP and all of the prediction frames before the target. The larger the GOP, the more prediction frames have to be decoded, and the longer the seek takes to respond. The sketch below makes this concrete.
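A rough sketch of a seek with FFmpeg's C API: av_seek_frame() with AVSEEK_FLAG_BACKWARD jumps to the closest preceding keyframe, and the decoder then runs forward, discarding frames, until the requested position is reached. Error handling is omitted; this is not the full logic a real player needs.

    #include <libavformat/avformat.h>

    /* Rough sketch: seek to target_sec in the given stream, then decode
     * forward until the requested position is reached (error handling omitted). */
    static int seek_and_decode(AVFormatContext *fmt, AVCodecContext *dec,
                               int stream_index, double target_sec)
    {
        AVStream *st = fmt->streams[stream_index];
        int64_t target_ts = (int64_t)(target_sec / av_q2d(st->time_base));

        /* Jump to the closest I frame at or before the target... */
        av_seek_frame(fmt, stream_index, target_ts, AVSEEK_FLAG_BACKWARD);
        avcodec_flush_buffers(dec);

        /* ...then decode every frame from that I frame up to the target. */
        AVPacket *pkt = av_packet_alloc();
        AVFrame  *frm = av_frame_alloc();
        int reached = 0;
        while (!reached && av_read_frame(fmt, pkt) >= 0) {
            if (pkt->stream_index == stream_index) {
                avcodec_send_packet(dec, pkt);
                while (avcodec_receive_frame(dec, frm) >= 0) {
                    if (frm->pts >= target_ts)
                        reached = 1;   /* this is the frame the user asked for */
                    /* frames before the target are decoded but not displayed */
                }
            }
            av_packet_unref(pkt);
        }
        av_frame_free(&frm);
        av_packet_free(&pkt);
        return reached ? 0 : -1;
    }

The longer the GOP, the more iterations of this decode loop are needed before the target frame appears.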

DTS and PTS

The concepts of DTS and PTS are described as follows:

DTS (Decoding Time Stamp): the decoding timestamp, which tells the player when to decode this frame of data. PTS (Presentation Time Stamp): the presentation timestamp, which tells the player when to display this frame. Note that although DTS and PTS guide the behavior of the player, they are generated by the encoder at encoding time.

When there are no B frames in the video stream, DTS and PTS are usually in the same order. But if there are B frames, we are back to the problem mentioned earlier: the decoding order and the playback order are inconsistent.

For example, suppose the display order of the frames in a video is I, B, B, P. Because decoding the B frames requires information from the P frame, the order of these frames in the video stream may be I, P, B, B. This is where DTS and PTS come in: DTS tells us in what order to decode the frames, and PTS tells us in what order to display them. The order is roughly as follows:

    PTS:    480  640  560  520  600  800  720  680  760  960  ...
    DTS:    400  440  480  520  560  600  640  680  720  760  ...
    Stream:   I    P    B    B    B    P    B    B    B    P  ...
    (display order: 1, 5, 3, 2, 4, 9, 7, 6, 8, 10, ...)

For every frame, PTS >= DTS.
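To see these two timestamps for yourself, a minimal sketch using FFmpeg's C API reads packets from a file and prints each packet's dts and pts. The file name "input.mp4" is only a placeholder.

    #include <inttypes.h>
    #include <stdio.h>
    #include <libavformat/avformat.h>

    /* Minimal sketch: print the DTS and PTS of every packet in a file.
     * "input.mp4" is just a placeholder file name. */
    int main(void)
    {
        AVFormatContext *fmt = NULL;
        if (avformat_open_input(&fmt, "input.mp4", NULL, NULL) < 0)
            return 1;
        avformat_find_stream_info(fmt, NULL);

        AVPacket *pkt = av_packet_alloc();
        while (av_read_frame(fmt, pkt) >= 0) {
            printf("stream=%d dts=%" PRId64 " pts=%" PRId64 "\n",
                   pkt->stream_index, pkt->dts, pkt->pts);
            av_packet_unref(pkt);
        }
        av_packet_free(&pkt);
        avformat_close_input(&fmt);
        return 0;
    }

For a stream that contains B frames, the printed pts values will not be monotonically increasing even though the dts values are.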

Audio and video synchronization

The concepts of video frames, DTS, and PTS are described above. A media stream usually contains audio in addition to video. Audio playback also has DTS and PTS, but audio has nothing like the video B frame and needs no bi-directional prediction, so the DTS and PTS of audio frames are in the same order.

When the audio and the video are multiplexed together, they form the "video" in the everyday sense. When they are played back together, we have to deal with how to keep them synchronized so that the picture and the sound match.

To synchronize audio and video, a reference clock is usually chosen whose time increases linearly. When the audio and video streams are encoded, each frame is stamped with the time of the reference clock. During playback, the timestamp of each data frame is read and playback is scheduled against the current time of the reference clock. The timestamp here is the PTS discussed earlier. In practice there are several options: synchronize the video to the audio, synchronize the audio to the video, or synchronize both to an external clock.
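The most common choice is to synchronize the video to the audio clock, because the ear notices audio glitches more easily than a slightly early or late picture. A highly simplified sketch of that idea follows; get_audio_clock() and display_frame() are hypothetical placeholders, not FFmpeg functions.

    #include <libavutil/frame.h>
    #include <libavutil/rational.h>
    #include <libavutil/time.h>

    /* Highly simplified "sync video to audio" sketch.
     * get_audio_clock() and display_frame() are hypothetical placeholders. */
    double get_audio_clock(void);            /* seconds of audio already played */
    void   display_frame(const AVFrame *frame);

    void video_refresh(const AVFrame *frame, AVRational time_base)
    {
        double frame_pts = frame->pts * av_q2d(time_base);  /* PTS in seconds */
        double audio_now = get_audio_clock();
        double diff      = frame_pts - audio_now;

        if (diff > 0)
            av_usleep((unsigned)(diff * 1e6));  /* video is early: wait */
        /* if diff < 0 the frame is late: show it immediately, or drop it */
        display_frame(frame);
    }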

Time base of PTS and DTS

What are the units of PTS and DTS?

To answer this question, we first introduce the concept of the time base in FFmpeg, written time_base, which is also a way of measuring time. If you divide one second into 25 equal parts, you can think of it as a ruler where each tick represents 1/25 of a second, and time_base = {1, 25}. If you divide one second into 90000 parts, each tick is 1/90000 of a second, and time_base = {1, 90000}. The time base is the number of seconds per tick, and the PTS value is the number of ticks: PTS is not expressed in seconds but in ticks, and only PTS together with time_base tells you the actual time. Suppose I tell you that an object is 20 ticks long on a ruler; if I do not tell you how many centimeters each tick represents, you cannot work out the length of the object. If pts = 20 ticks and time_base = {1, 10}, so that each tick is 1/10 cm, then the length of the object is pts * time_base = 20 * 1/10 = 2 cm.

In FFmpeg, av_q2d(time_base) is the number of seconds per tick, so pts * av_q2d(time_base) is the presentation timestamp of the frame in seconds.

Now let us look at time-base conversion, and why it is needed. First, different container formats have different time bases. In addition, during transcoding, data in different states has different time bases. Take a 25 fps MPEG-TS container as an example (for video only; audio is similar but slightly different). In FFmpeg, uncompressed data (YUV or similar) lives in the AVFrame structure, and its time base is the AVCodecContext's time_base, AVRational{1, 25}. Compressed data (the AVPacket structure) uses the AVStream's time_base, AVRational{1, 90000}. Because the data states differ, the time bases differ, and we have to convert between them: a duration that occupies 10 ticks at 1/25 of a second per tick occupies how many ticks at 1/90000? (10 * 1/25 = 0.4 s, and 0.4 * 90000 = 36000 ticks.) This is what PTS conversion does.

timestamp (seconds) = pts * av_q2d(st->time_base)

duration is in the same unit as pts and indicates how many ticks the current frame lasts, i.e. how many ticks there are between this frame and the next. Make sure the units are clear: pts is a number of ticks, and av_q2d(st->time_base) is the number of seconds per tick.

time (seconds) = st->duration * av_q2d(st->time_base)
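In FFmpeg C code, the two formulas above might be applied to a packet like this (a sketch; pkt is assumed to have been read from stream st):

    #include <libavformat/avformat.h>

    /* Sketch: convert a packet's pts and duration (stream time-base ticks)
     * into seconds, using the time base of the stream it was read from. */
    static void packet_times(const AVPacket *pkt, const AVStream *st,
                             double *pts_sec, double *dur_sec)
    {
        *pts_sec = pkt->pts      * av_q2d(st->time_base);
        *dur_sec = pkt->duration * av_q2d(st->time_base);
    }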

FFmpeg's internal timestamp = AV_TIME_BASE * time (seconds), and AV_TIME_BASE_Q = 1 / AV_TIME_BASE.

The function av_rescale_q(int64_t a, AVRational bq, AVRational cq) computes a * bq / cq, which converts a timestamp from one time base to another. This function should be preferred when converting time bases because it guards against overflow: it answers the question "a value of a ticks under bq corresponds to how many ticks under cq?"
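For example, rescaling a packet's timestamps from one time base to another might look like the following sketch:

    #include <libavformat/avformat.h>
    #include <libavutil/mathematics.h>   /* av_rescale_q */

    /* Sketch: rescale a packet's timestamps from one time base to another,
     * e.g. rescale_packet(pkt, in_stream->time_base, AV_TIME_BASE_Q). */
    static void rescale_packet(AVPacket *pkt, AVRational src_tb, AVRational dst_tb)
    {
        pkt->pts      = av_rescale_q(pkt->pts,      src_tb, dst_tb);
        pkt->dts      = av_rescale_q(pkt->dts,      src_tb, dst_tb);
        pkt->duration = av_rescale_q(pkt->duration, src_tb, dst_tb);
    }

FFmpeg also provides av_packet_rescale_ts(), which does exactly this and additionally skips timestamps that are AV_NOPTS_VALUE.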

For audio, sample_rate is the number of samples per second, i.e. how many sample points are captured each second. For example, 44100 Hz means 44100 samples are captured in one second, so each sample lasts 1/44100 of a second.

An audio AVFrame carries nb_samples samples, so one AVFrame lasts nb_samples * (1/44100) seconds of real time, that is, duration_s = nb_samples * (1/44100) seconds. In time-base ticks, duration = duration_s / av_q2d(st->time_base). The denominator of st->time_base is usually equal to the sample rate, so duration = nb_samples, and the pts of the n-th frame is pts = n * duration = n * nb_samples.
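A small sketch of this relationship, assuming the audio stream's time base really is {1, sample_rate}, which is the common case:

    #include <stdint.h>

    /* Sketch: with an audio time base of {1, sample_rate}, one tick equals one
     * sample, so a frame of nb_samples lasts nb_samples ticks and the n-th
     * frame starts at n * nb_samples ticks. (1024 samples per frame, as used
     * by AAC, is just an example value.) */
    static int64_t audio_frame_duration(int nb_samples)
    {
        return nb_samples;                   /* ticks == samples */
    }

    static int64_t audio_frame_pts(int64_t frame_index, int nb_samples)
    {
        return frame_index * nb_samples;     /* n * duration */
    }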