Learn from LearnVideo AndroidMediaCodecDemo

Pixel

**A pixel is the basic unit of an image; an image is made up of pixels.** You can think of a pixel as a point in an image. In the image below, you can see small squares: those are pixels.

Resolution

The resolution of an image (or video) refers to the size or dimensions of the image. We usually use the number of pixels to indicate the size of the image. For example, in a 1920×1080 image, 1920 refers to 1920 pixels in the width direction, while 1080 refers to 1080 pixels in the height direction.

RGB

Generally speaking, the color images we see have three channels: R, G, and B (sometimes there is also an alpha value representing transparency). Usually R, G, and B each occupy 8 bits, and we call such an image an 8-bit image.
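To make that concrete, here is a minimal Kotlin sketch (the function names are my own, not from the demo) showing how an 8-bit-per-channel ARGB pixel fits into a single 32-bit integer, which is also how Android's ARGB_8888 bitmaps store pixels:

```kotlin
// Minimal sketch: packing/unpacking an 8-bit-per-channel ARGB pixel.
// Each channel occupies 8 bits, so one pixel fits in a 32-bit Int.
fun packArgb(a: Int, r: Int, g: Int, b: Int): Int =
    (a and 0xFF shl 24) or (r and 0xFF shl 16) or (g and 0xFF shl 8) or (b and 0xFF)

fun unpackArgb(pixel: Int): IntArray = intArrayOf(
    pixel ushr 24 and 0xFF,  // A
    pixel ushr 16 and 0xFF,  // R
    pixel ushr 8 and 0xFF,   // G
    pixel and 0xFF           // B
)

fun main() {
    val red = packArgb(255, 255, 0, 0)
    println(unpackArgb(red).joinToString())  // 255, 255, 0, 0
}
```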

YUV

For display, images are presented using the RGB model. The YUV model is used when transmitting image data, because it saves bandwidth. Therefore, RGB is converted to YUV when the image is captured, and YUV is converted back to RGB when the image is displayed.
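To give a feel for the conversion, here is a minimal Kotlin sketch using a common full-range BT.601-style approximation; the coefficients are my own assumption (BT.709 and limited-range variants use different values), not something taken from the demo:

```kotlin
// Minimal sketch: full-range BT.601-style RGB -> YUV conversion for one pixel.
// The coefficients below are a widely used approximation; other standards
// and ranges use different values.
fun rgbToYuv(r: Int, g: Int, b: Int): IntArray {
    val y = (0.299 * r + 0.587 * g + 0.114 * b).toInt()
    val u = (-0.169 * r - 0.331 * g + 0.500 * b + 128).toInt()
    val v = (0.500 * r - 0.419 * g - 0.081 * b + 128).toInt()
    // Clamp to the valid 0..255 range of an 8-bit sample.
    return intArrayOf(y.coerceIn(0, 255), u.coerceIn(0, 255), v.coerceIn(0, 255))
}

fun main() {
    println(rgbToYuv(255, 0, 0).joinToString())  // pure red -> roughly Y=76, U=84, V=255
}
```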

From the perspective of video capture and processing, the output of a typical video capture chip is a YUV data stream, and video codecs (such as H.264 and MPEG encoders/decoders) also work on raw YUV streams. If the captured data is RGB, it likewise has to be converted to YUV.

There are two reasons to use YUV instead of RGB: 1) YUV separates out the Y luminance signal, which black-and-white TVs can use directly, so it is backward compatible with black-and-white TV. 2) Human eyes are less sensitive to chrominance (UV) than to luminance, so the amount of UV data can be reduced to compress the video. That is why sampling schemes such as 4:2:0, 4:2:2, and 4:4:4 exist.

YUV color coding uses luminance and chrominance to specify the color of a pixel. Y represents luminance (luma), and U and V represent chrominance (chroma). YUV is mainly divided into YUV 4:4:4, YUV 4:2:2, and YUV 4:2:0.

YUV formats come in two broad categories: planar and packed. In a planar YUV format, the Y values of all pixels are stored first, followed by the U values of all pixels, and then the V values of all pixels. In a packed YUV format, the Y, U, and V of each pixel are stored together, interleaved pixel by pixel.
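As a toy illustration (the helper names are mine, and a 4:4:4 layout is assumed so every pixel has its own U and V), the two layouts differ only in the order the bytes are written:

```kotlin
// Minimal sketch: planar vs. packed layout for a tiny image in a 4:4:4 format.
// Planar: all Y samples, then all U samples, then all V samples.
// Packed: Y, U, V of each pixel stored together, pixel after pixel.
fun planarLayout(y: ByteArray, u: ByteArray, v: ByteArray): ByteArray = y + u + v

fun packedLayout(y: ByteArray, u: ByteArray, v: ByteArray): ByteArray {
    val out = ByteArray(y.size * 3)
    for (i in y.indices) {
        out[i * 3] = y[i]
        out[i * 3 + 1] = u[i]
        out[i * 3 + 2] = v[i]
    }
    return out
}

fun main() {
    val y = byteArrayOf(1, 2, 3, 4)
    val u = byteArrayOf(5, 6, 7, 8)
    val v = byteArrayOf(9, 10, 11, 12)
    println(planarLayout(y, u, v).joinToString())  // 1..4, then 5..8, then 9..12
    println(packedLayout(y, u, v).joinToString())  // 1, 5, 9, 2, 6, 10, ...
}
```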

1. YUV 4:4:4: each Y has its own set of UV.
2. YUV 4:2:2: every two Ys share one set of UV.
3. YUV 4:2:0: every four Ys share one set of UV.

YUV 4:4:4 sampling means that the Y, U, and V components are sampled at the same ratio, so in the resulting image each pixel carries 8 bits for each of the three components.

For example, if the image pixels are [Y0 U0 V0], [Y1 U1 V1], [Y2 U2 V2], [Y3 U3 V3], then the sampled stream is Y0 U0 V0 Y1 U1 V1 Y2 U2 V2 Y3 U3 V3, and the mapped pixels are still [Y0 U0 V0], [Y1 U1 V1], [Y2 U2 V2], [Y3 U3 V3].

With this sampling scheme the image is exactly the same size as under the RGB color model, so no bandwidth is saved.

In YUV 4:2:2 sampling, the UV components are sampled at half the rate of the Y component, i.e. Y and UV are sampled at a ratio of 2:1. If there are 10 pixels in a row, 10 Y samples are taken but only 5 U and 5 V samples.

For example, if the image pixels are [Y0 U0 V0], [Y1 U1 V1], [Y2 U2 V2], [Y3 U3 V3], then the sampled stream is Y0 U0 Y1 V1 Y2 U2 Y3 V3. Every sampled pixel keeps its Y component, while the U and V components are taken from every other pixel. The mapped pixels are [Y0 U0 V1], [Y1 U0 V1], [Y2 U2 V3], [Y3 U2 V3].

From this example you can see that the first and second pixels share the [U0, V1] components and the third and fourth pixels share the [U2, V3] components, which saves space. For example, storing a 1280×720 image in RGB takes:

(1280 × 720 × 8 + 1280 × 720 × 8 + 1280 × 720 × 8) / 8 / 1024 / 1024 = 2.637 MB, where 1280 × 720 is the number of pixels. But if the YUV 4:2:2 sampling format is used:

(1280 × 720 × 8 + 1280 × 720 × 0.5 × 8 × 2) / 8 / 1024 / 1024 = 1.76 MB, which saves one third of the storage space and is therefore well suited to image transmission.
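The arithmetic above is easy to double-check with a few lines of Kotlin; this is only a worked example of the formulas in this section:

```kotlin
// Minimal sketch reproducing the storage calculation above for a 1280x720 frame.
fun frameSizeMb(width: Int, height: Int, bitsPerPixel: Int): Double =
    width.toLong() * height * bitsPerPixel / 8.0 / 1024 / 1024

fun main() {
    println("RGB, 24 bits per pixel:       %.3f MB".format(frameSizeMb(1280, 720, 24)))  // 2.637 MB
    println("YUV 4:2:2, 16 bits per pixel: %.3f MB".format(frameSizeMb(1280, 720, 16)))  // 1.758 MB
}
```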

YUV 4:2:0 sampling does not mean that only the U component is sampled and the V component is not. Rather, each scan line samples only one chrominance component (U or V), at a 2:1 ratio against the Y component. For example, the first row samples Y and U at 2:1, and the second row samples Y and V at 2:1. For each chrominance component, both its horizontal and vertical sampling rates are therefore 2:1 relative to Y.

For example, suppose the image pixels are: [Y0 U0 V0], [Y1 U1 V1], [Y2 U2 V2], [Y3 U3 V3], [Y4 U4 V4], [Y5 U5 V5], [Y6 U6 V6], [Y7 U7 V7]. Then the sampled stream is Y0 U0 Y1 Y2 U2 Y3 Y4 V4 Y5 Y6 V6 Y7. Every sampled pixel keeps its Y component, while the U and V components are each taken at 2:1 on alternating rows. The mapped pixels are: [Y0 U0 V4], [Y1 U0 V4], [Y2 U2 V6], [Y3 U2 V6], [Y4 U0 V4], [Y5 U0 V4], [Y6 U2 V6], [Y7 U2 V6].

The size of an image sampled with YUV 4:2:0 is: (1280 × 720 × 8 + 1280 × 720 × 0.25 × 8 × 2) / 8 / 1024 / 1024 = 1.32 MB. Images sampled this way take half the storage of RGB images, which is why this is the mainstream sampling scheme.
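In a planar 4:2:0 layout such as I420 (a format often used with Android's MediaCodec and camera APIs), the 2:1 horizontal and vertical chroma subsampling maps directly onto plane sizes and offsets. A minimal sketch, with helper names of my own:

```kotlin
// Minimal sketch: plane sizes and offsets for a planar YUV 4:2:0 (I420) buffer.
// Y has one sample per pixel; U and V each have one sample per 2x2 block.
data class I420Layout(val ySize: Int, val uSize: Int, val vSize: Int) {
    val totalSize get() = ySize + uSize + vSize  // width * height * 3 / 2
    val uOffset get() = ySize                    // U plane follows the Y plane
    val vOffset get() = ySize + uSize            // V plane follows the U plane
}

fun i420Layout(width: Int, height: Int): I420Layout {
    val ySize = width * height
    val chromaSize = (width / 2) * (height / 2)
    return I420Layout(ySize, chromaSize, chromaSize)
}

fun main() {
    println(i420Layout(1280, 720).totalSize)  // 1382400 bytes ~= 1.32 MB
}
```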

Video, images, and their relationship

When many images are played in succession, you get video. What metrics are used to measure video? The most important one is the frame rate. In video, a frame is a single still picture, and the frame rate is the number of frames per second (FPS) the video contains.

Why does video data need to be encoded?

Video brings two problems: one is storage, the other is transmission.

Unencoded video is huge. A 1920×1080 frame has 1920 × 1080 = 2,073,600 pixels, and each pixel takes 24 bits (8 bits per R, G, B channel). That is, each image takes 2,073,600 × 24 = 49,766,400 bits; at 8 bits = 1 byte, that is 49,766,400 bits = 6,220,800 bytes ≈ 6.22 MB. This is the raw size of a single 1920×1080 image; now multiply it by a frame rate of 30.

In other words: one second of video is about 186.6 MB, one minute is about 11 GB, and a 90-minute movie is about 1,000 GB.
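A few lines of Kotlin reproduce the numbers above (this is just the arithmetic from this section, nothing more):

```kotlin
// Minimal sketch: size of uncompressed 1920x1080, 24-bit, 30 fps video.
fun main() {
    val bitsPerFrame = 1920L * 1080 * 24      // 49,766,400 bits
    val bytesPerFrame = bitsPerFrame / 8      // 6,220,800 bytes ~= 6.22 MB
    val bytesPerSecond = bytesPerFrame * 30   // ~186.6 MB per second
    val bytesPerMinute = bytesPerSecond * 60  // ~11 GB per minute
    val bytesPerMovie = bytesPerMinute * 90   // ~1,000 GB for a 90-minute movie
    println("per frame:  ${bytesPerFrame / 1e6} MB")
    println("per second: ${bytesPerSecond / 1e6} MB")
    println("per minute: ${bytesPerMinute / 1e9} GB")
    println("90 minutes: ${bytesPerMovie / 1e9} GB")
}
```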

Obviously, data of this size has to be compressed, and that is why video coding was created.

What is coding?

Encoding: To convert information from one form (format) to another in a specified way. Video coding: Converting one video format to another.

The ultimate purpose of coding is compression. All kinds of video coding methods are designed to make video smaller, which is conducive to storage and transmission. To achieve compression, it is necessary to design various algorithms to remove redundant information in video data.

If you were faced with an image, or a video, how would you compress it? The first thing that comes to mind is looking for patterns: correlations between pixels within a frame, and correlations between frames at different points in time.

For example: if a picture (1920×1080 resolution) is entirely red, do I need to say [255,0,0] 2,073,600 times? I only need to say [255,0,0] once and then say that it repeats another 2,073,599 times.
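That "say it once, then say how many times it repeats" idea is run-length encoding. Here is a toy Kotlin sketch of it; real codecs use far more sophisticated tools, so this only illustrates the redundancy argument, not what H.264 actually does:

```kotlin
// Minimal sketch: run-length encoding of pixel values, the simplest way to
// exploit the "all red" redundancy described above.
fun runLengthEncode(pixels: IntArray): List<Pair<Int, Int>> {
    val runs = mutableListOf<Pair<Int, Int>>()  // (value, count)
    for (p in pixels) {
        val last = runs.lastOrNull()
        if (last != null && last.first == p) {
            runs[runs.size - 1] = last.copy(second = last.second + 1)
        } else {
            runs.add(p to 1)
        }
    }
    return runs
}

fun main() {
    val allRed = IntArray(1920 * 1080) { 0xFF0000 }
    println(runLengthEncode(allRed))  // [(16711680, 2073600)] - one entry instead of 2,073,600
}
```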

Likewise, if a one-minute video holds still for a dozen seconds, or if 80% of the image area never changes for the whole clip, isn't that storage overhead you can save?

Images generally contain redundant data, mainly of the following four types:

**Spatial redundancy.** If a frame is divided into 16×16 blocks, adjacent blocks often show obvious similarity; this is called spatial redundancy.

**Temporal redundancy.** In a video with a frame rate of 25 fps, consecutive frames are only 40 ms apart, so the two images change very little and are highly similar; this is called temporal redundancy.

**Visual redundancy.** Human vision has a property called visual acuity: our eyes are less sensitive to the high-frequency information in an image than to the low-frequency information. Sometimes an image with its high-frequency information removed looks the same to the human eye as the original; this is called visual redundancy.

**Information (entropy) redundancy.** We usually use tools such as Zip to compress a file and reduce its size, and the same can be done for images; this kind of redundancy is called information (entropy) redundancy.

Various video compression algorithms are designed to reduce these kinds of redundancy, and the primary targets of video coding technology are spatial redundancy and temporal redundancy.

Macroblocks

Each frame is divided into blocks for encoding. Such a block is called a macroblock in H264 and a superblock in VP9 and AV1; the concept is the same. Block sizes are generally 16×16 (H264, VP8), 32×32 (H265, VP9), 64×64 (H265, VP9, AV1), and 128×128 (AV1). H264, H265, VP8, VP9, and AV1 are the coding standards commonly seen on the market.
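As a rough illustration (my own helper, not part of any codec API), dividing a frame into 16×16 macroblocks is just integer arithmetic over the frame dimensions:

```kotlin
// Minimal sketch: how many 16x16 macroblocks a frame is divided into.
// Dimensions that are not multiples of the block size are padded up.
fun macroblockCount(width: Int, height: Int, blockSize: Int = 16): Int {
    val cols = (width + blockSize - 1) / blockSize
    val rows = (height + blockSize - 1) / blockSize
    return cols * rows
}

fun main() {
    println(macroblockCount(1920, 1080))  // 120 x 68 = 8160 macroblocks
}
```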

Intra-frame prediction and inter-frame prediction

Intra-frame prediction: based on already-encoded blocks in the same frame, a prediction block is constructed, the residual between it and the current block is calculated, and the residual, the prediction mode, and other information are encoded. It mainly removes spatial redundancy.

Inter-frame prediction: based on one or more already-encoded frames, a prediction block is constructed, the residual between it and the current block is calculated, and the residual, prediction mode, motion vector residual, reference image index, and other information are encoded. It mainly removes temporal redundancy.
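Both kinds of prediction end in the same step: subtract the prediction block from the current block and encode the residual. A minimal Kotlin sketch of that step alone (the actual prediction-mode search and entropy coding are omitted):

```kotlin
// Minimal sketch: the residual step shared by intra- and inter-prediction.
// The encoder transmits the residual (plus prediction info); the decoder adds
// the residual back onto the same prediction to reconstruct the block.
fun residual(current: IntArray, prediction: IntArray): IntArray =
    IntArray(current.size) { i -> current[i] - prediction[i] }

fun reconstruct(prediction: IntArray, residual: IntArray): IntArray =
    IntArray(prediction.size) { i -> prediction[i] + residual[i] }

fun main() {
    val current = intArrayOf(100, 102, 101, 99)
    val prediction = intArrayOf(100, 100, 100, 100)
    val r = residual(current, prediction)                   // [0, 2, 1, -1] - small values compress well
    println(reconstruct(prediction, r).contentToString())   // [100, 102, 101, 99]
}
```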

Frame types

Inter-frame prediction references already-encoded frames. Predicted frames are divided into forward-predicted frames, which reference only previous frames, and bidirectionally predicted frames, which can reference both previous and subsequent frames.

I frame: an independent frame that carries all of its own information. It is the most complete picture (and takes up the most space) and can be decoded on its own without referencing other images. The first frame in a video sequence is always an I frame.

P frame: an "inter-frame predictively coded frame". It is encoded by referencing parts of a previous I frame and/or P frame, so it depends on the I and P reference frames before it. In exchange, P frames have a relatively high compression ratio and occupy less space.

B frame: a "bidirectionally predictively coded frame" that uses both the previous frame and the following frame as reference frames. Because it references frames on both sides, it has the highest compression ratio, which can reach 200:1. (In the figure, the arrows point from the reference frame to the frame being encoded.)

GOP (sequence) and IDR

In H264, images are organized into sequences; a sequence is the data stream produced by encoding a group of images. The first image in a sequence is called an IDR image (Instantaneous Decoding Refresh image), and an IDR image is always an I-frame image. H.264 introduces IDR images for decoder resynchronization: when the decoder reaches an IDR image, it immediately empties the reference frame queue, outputs or discards all decoded data, looks up the parameter sets again, and starts a new sequence. This provides an opportunity to resynchronize if a serious error occurred in the previous sequence. Images after an IDR image are never decoded using data from images before the IDR.

A sequence is the data stream generated by encoding a run of images whose content does not differ much. When there is little motion, a sequence can be very long, because little motion means the picture content changes very little, so one I frame can be encoded followed by many P and B frames. When there is a lot of motion, a sequence may be short, for example one I frame followed by three or four P frames. In video coding, GOP (Group of Pictures) refers to the distance between two I frames, and Reference refers to the distance between two P frames; the group of pictures between two I frames forms one GOP.
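On Android, the GOP length is controlled through MediaFormat.KEY_I_FRAME_INTERVAL when configuring a MediaCodec encoder. The sketch below is a minimal example; the bitrate, frame rate, and interval values are arbitrary choices of mine, not taken from the demo:

```kotlin
import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Minimal sketch: configuring an H.264 encoder so that an I frame (the start
// of a new GOP) is requested every 2 seconds. Values here are example choices.
fun createAvcEncoder(width: Int, height: Int): MediaCodec {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height).apply {
        setInteger(MediaFormat.KEY_COLOR_FORMAT,
            MediaCodecInfo.CodecCapabilities.COLOR_FormatYUV420Flexible)
        setInteger(MediaFormat.KEY_BIT_RATE, 2_000_000)  // 2 Mbps
        setInteger(MediaFormat.KEY_FRAME_RATE, 30)       // 30 fps
        setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 2)  // one I frame every 2 seconds
    }
    return MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC).apply {
        configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    }
}
```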

PTS and DTS

Why do we need the concepts of PTS and DTS?

A P frame needs to reference a preceding I frame or P frame to produce a complete picture, while a B frame needs to reference both a preceding I or P frame and a following P frame. This creates a problem: in the video stream, the first B frame cannot be decoded immediately; it has to wait for the later I and P frames it depends on to be decoded first. As a result, playback order and decoding order differ. How, then, should these frames be played back? This is where two other concepts come in: DTS and PTS.

DTS (Decoding Time Stamp): tells the player when to decode this frame of data. PTS (Presentation Time Stamp): tells the player when this frame should be displayed.

During capture, frames are recorded and sent to the encoder one by one, and the PTS is generated at encoding time. What needs special attention here is the encoding order of the frames: in a typical scenario, the codec first encodes an I frame, then skips ahead a few frames and encodes a P frame using that I frame as a reference, and then jumps back to the frame right after the I frame; the frames between the encoded I and P frames are encoded as B frames. After that, the encoder skips a few frames again, encodes another P frame using the first P frame as a reference, and then jumps back once more to fill the gaps in the display sequence with B frames. This process continues, with a new I frame inserted every 12 to 15 P and B frames. A P frame is predicted from the preceding I or P frame, while a B frame is predicted from the P frames before and after it (or from one I frame and one P frame), so the encoding order and the display order of the frames differ, as shown below:

Suppose the frames captured by the encoder look like this:

 I B B P B B P 

Then its display order, i.e. the PTS order, is:

1, 2, 3, 4, 5, 6, 7

The encoding order of the encoder is:

1, 4, 2, 3, 7, 5, 6

The pushed (transmitted) stream also follows the encoding order, i.e.:

I P B B P B B

The video stream received at the other end is therefore:

I P B B P B B

Decoding then proceeds frame by frame in the order the stream is received: each frame is decoded as soon as it arrives, because the encoder has already arranged the frames according to their I/B/P dependencies, so the received data can be decoded directly. The decoding order is therefore:

     I  P  B  B  P  B  B
DTS: 1  2  3  4  5  6  7
PTS: 1  4  2  3  7  5  6

As you can see, the PTS values of the decoded frames are not sequential. To display the video stream correctly, the decoded frames must be reordered according to PTS, i.e.:

     I  B  B  P  B  B  P
DTS: 1  3  4  2  6  7  5
PTS: 1  2  3  4  5  6  7
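To close the loop, here is a minimal Kotlin sketch (the Frame type is my own illustration, not a MediaCodec API) of that reordering: frames are decoded in DTS order and then sorted by PTS before display:

```kotlin
// Minimal sketch: frames are decoded in DTS order but displayed in PTS order.
data class Frame(val type: Char, val dts: Int, val pts: Int)

fun main() {
    // Decode order from the example above (DTS ascending).
    val decoded = listOf(
        Frame('I', 1, 1), Frame('P', 2, 4), Frame('B', 3, 2), Frame('B', 4, 3),
        Frame('P', 5, 7), Frame('B', 6, 5), Frame('B', 7, 6),
    )
    // Reorder by PTS before presenting: I B B P B B P.
    val displayOrder = decoded.sortedBy { it.pts }
    println(displayOrder.joinToString(" ") { it.type.toString() })  // I B B P B B P
}
```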