Before looking at H.264 coding, let us first work out how much data would have to be transmitted per second during a live video stream if the YUV images captured by the camera were sent without any processing. Camera images are usually captured as YUV420. In the YUV420 sampling structure, the Y, U and V components of a pixel are present in the ratio 1 : 1/4 : 1/4, and one Y sample takes 1 byte, so a pixel occupies on average (1 + 1/4 + 1/4) × 1 byte = 3/2 bytes. At a frame rate of 30 FPS and a resolution of 1280×720, the amount of data to transfer per second is 1280 × 720 (pixels) × 30 (frames) × 3/2 (bytes) = 41,472,000 bytes, or about 39.5 MB. At a resolution of 1920×720, it is close to 60 MB per second, which is difficult to sustain on real networks. Therefore, the video data must be compressed and encoded before transmission.
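The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name `raw_yuv420_bytes_per_second` is made up for illustration, and it assumes 8-bit samples, so that a YUV420 pixel averages 3/2 bytes.

```python
from fractions import Fraction

# Average bytes per pixel for 8-bit YUV420: 1 (Y) + 1/4 (U) + 1/4 (V) = 3/2.
YUV420_BYTES_PER_PIXEL = Fraction(3, 2)


def raw_yuv420_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Return the uncompressed YUV420 data rate in bytes per second."""
    return int(width * height * fps * YUV420_BYTES_PER_PIXEL)


if __name__ == "__main__":
    for width, height in ((1280, 720), (1920, 720)):
        rate = raw_yuv420_bytes_per_second(width, height, 30)
        # Report both raw bytes and MB, counting 1 MB as 1024 * 1024 bytes.
        print(f"{width}x{height} @ 30 FPS: {rate} bytes/s, {rate / 2**20:.2f} MB/s")
```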

1. H.264

H.264 is Part 10 of the MPEG-4 standard (also known as AVC, Advanced Video Coding). It is a high-compression digital video coding standard developed jointly by VCEG and MPEG, and it is widely used in multimedia development and applications. H.264 is characterized by low bit rate, high image quality, strong error resilience and good network adaptability. Its biggest advantage is its high compression ratio: at the same image quality, H.264 compresses more than twice as well as MPEG-2.

1.1 H.264 coding principle

H.264 defines three types of frames: a frame that is encoded completely is called an I frame (key frame); a frame that is encoded with reference to a preceding I or P frame and contains only the differences is called a P frame; and a frame that is encoded with reference to both preceding and following frames is called a B frame. The core algorithms used in H.264 coding are intra-frame compression and inter-frame compression. Intra-frame compression is the algorithm that generates I frames: when compressing a frame, only the data of that frame is considered, and redundancy between adjacent frames is ignored. Because intra-frame compression encodes a complete image, an I frame can be decoded and displayed independently. Inter-frame compression is the algorithm that generates P and B frames: it exploits the redundancy between adjacent frames to reduce the amount of data further and so raise the compression ratio. Put simply, for a run of images that change little, H.264 first encodes a complete frame A; the next frame B is not encoded as a whole image, and only its difference from frame A is written, so that frame B is 1/10 the size of the complete frame or smaller. If frame C after B does not change much either, it can in turn be encoded with reference to B, and so on. (Here A, B and C simply name consecutive frames, not frame types.)
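The idea of writing only the difference from a reference frame can be illustrated with a toy example. This is a deliberately simplified sketch, not how H.264 actually works: a real encoder uses motion-compensated prediction on macroblocks followed by transform and entropy coding. The helper names `encode_delta` and `decode_delta` and the tiny 8-sample "frames" are invented for illustration.

```python
def encode_delta(reference: list[int], frame: list[int]) -> dict[int, int]:
    """Keep only the positions whose value differs from the reference frame."""
    return {i: v for i, (r, v) in enumerate(zip(reference, frame)) if r != v}


def decode_delta(reference: list[int], delta: dict[int, int]) -> list[int]:
    """Rebuild a frame by applying the stored differences to the reference."""
    frame = list(reference)
    for index, value in delta.items():
        frame[index] = value
    return frame


# Frame A is stored completely; B and C are stored as differences only.
frame_a = [10, 10, 10, 10, 20, 20, 20, 20]
frame_b = [10, 10, 10, 10, 20, 25, 20, 20]
frame_c = [10, 10, 11, 10, 20, 25, 20, 20]

delta_b = encode_delta(frame_a, frame_b)  # B encoded with reference to A
delta_c = encode_delta(frame_b, frame_c)  # C encoded with reference to B

if __name__ == "__main__":
    print("stored values:", len(frame_a), len(delta_b), len(delta_c))
```

Note that decoding frame C requires frame B, which in turn requires frame A. This dependency chain is why a stream has to restart from a complete frame from time to time.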

The H.264 coding framework is divided into two layers:

  • VCL (Video Coding Layer): responsible for the efficient representation of the video content;
  • NAL (Network Abstraction Layer): packages and transmits the data in the form required by the network.

1.2 IDR (I frame)

IDR stands for Instantaneous Decoding Refresh. The first image in a sequence is called an IDR image (an image that triggers an immediate refresh), and every IDR image is an I frame (key frame); the reverse does not hold, since an I frame need not be an IDR image. H.264 introduces the IDR image for decoder resynchronization. When the decoder reaches an IDR image, it immediately empties the reference frame queue, outputs or discards all decoded data, looks up the parameter sets again and starts a new sequence. This gives the decoder a chance to resynchronize if a serious error occurred in the previous sequence. Images after an IDR image are never decoded using data from images before it.
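The effect of an IDR image on the reference frame queue can be sketched as a small simulation. This is only an illustration under simplifying assumptions: every slice is treated as a reference frame and the queue is unbounded, whereas a real decoder limits the queue according to the parameter sets. The function name `simulate_reference_queue` is hypothetical; the values 5 (IDR slice) and 1 (non-IDR slice) are the standard NALU types.

```python
NAL_NON_IDR_SLICE = 1  # slice of a non-IDR picture
NAL_IDR_SLICE = 5      # slice of an IDR picture


def simulate_reference_queue(nal_types: list[int]) -> list[int]:
    """Return the reference queue length after each NALU is processed."""
    references: list[int] = []
    lengths = []
    for index, nal_type in enumerate(nal_types):
        if nal_type == NAL_IDR_SLICE:
            # Nothing before an IDR image may be referenced again.
            references.clear()
        if nal_type in (NAL_IDR_SLICE, NAL_NON_IDR_SLICE):
            references.append(index)
        lengths.append(len(references))
    return lengths


if __name__ == "__main__":
    print(simulate_reference_queue([5, 1, 1, 5, 1]))
```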

  • SPS (Sequence Parameter Set): parameters that apply to a whole coded video sequence, that is, a series of consecutive coded images.
  • PPS (Picture Parameter Set): parameters that apply to one or more individual images within the coded video sequence.
  • SEI (Supplemental Enhancement Information): additional information such as video timing, usually placed before the main coded image data. In some applications it can be omitted.
  • P frame: forward-predicted frame. A P frame encodes the difference between the current frame and a preceding key frame (or P frame). It typically follows an I frame at a distance of 1 to 2 frames. A P frame carries no complete picture, only the data that differs from its reference frame.
  • B frame: bidirectionally predicted frame. A B frame encodes the difference between the current frame and both a preceding and a following frame: it is predicted from the I or P frame before it and the P frame after it.

1.3 Differences between H.264 and x264

H.264 is the coding standard, whereas x264 is a free, open-source encoder that implements it. x264 does not support some of the standard's advanced features, but it is a very good encoder, no worse than commercial H.264 encoders. The core algorithms of H.264 are intra-frame compression and inter-frame compression: intra-frame compression generates I frames, and inter-frame compression generates B and P frames.

2. H.264 data organization

In general, the data is organized hierarchically, from the largest unit to the smallest: sequence, frame/field picture, slice group, slice, macroblock, block, sub-block, pixel. In an H.264 code stream, images are organized into sequences. A sequence is the data stream produced by encoding multiple frames; it starts with an I frame and runs until the next I frame. A frame can be divided into one or more slices; a slice is composed of macroblocks, which are the basic unit of encoding. After a slice is encoded, it is packed into a NALU, so a frame encoded as a single slice corresponds to one NALU. The NALU is the basic unit in which H.264 encoded data is stored or transmitted. Besides slices, a NALU can carry other data, such as SPS, PPS and SEI.

According to the principle of H.264 coding, a sequence is the data stream generated by encoding a run of images whose content differs little. When there is little motion, a sequence can be very long: little motion means the picture content changes little, so one I frame can be followed by many P and B frames. When there is a lot of motion, a sequence may be relatively short, because the picture content changes greatly and fewer P and B frames can follow each I frame. In short, a sequence always starts with an I frame and ends at the next I frame, and the number of frames it contains depends on how much the picture changes.
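The rule that a sequence starts with an I frame and runs until the next I frame can be sketched as follows. The frame types are written as a string of the letters I, P and B, which is a notation chosen for this example only, and `split_into_sequences` is a hypothetical helper name.

```python
def split_into_sequences(frame_types: str) -> list[str]:
    """Group a run of frame types into sequences that each start at an I frame."""
    sequences: list[str] = []
    for frame_type in frame_types:
        # Every I frame opens a new sequence; frames that precede the first
        # I frame are collected into a leading group of their own.
        if frame_type == "I" or not sequences:
            sequences.append("")
        sequences[-1] += frame_type
    return sequences


if __name__ == "__main__":
    # Little motion: one long sequence. Much motion: several short ones.
    print(split_into_sequences("IPBPPBPP"))
    print(split_into_sequences("IPPIPIPP"))
```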

3. NAL in H.264

As introduced above, NAL is the Network Abstraction Layer of the H.264/AVC coding framework. It is mainly responsible for formatting the data and providing header information so that the data is suitable for transmission over a variety of channels and storage media. NAL provides a friendly interface between the video encoder and the transport system, so that the encoded video data can be transmitted effectively in many different network environments.

In the NAL layer, the NALU (Network Abstraction Layer Unit) is the basic unit in which H.264 encoded data is stored or transmitted. In an H.264 code stream, the data of every frame is carried in NALUs (note: SPS, PPS and SEI are also NALUs, but they are not frames). Each NALU begins with a 1-byte (8-bit) header that carries three fields, from the most significant bit to the least: a 1-bit forbidden bit, a 2-bit importance indicator and a 5-bit NALU type.

Among them:

  • Forbidden bit (forbidden_zero_bit): normally 0. When the network detects an error in the NALU, it can set this bit to 1 so that the receiver can discard the NALU.
  • Importance indicator (nal_ref_idc): indicates how important the NALU is for reconstruction. The larger the value, the more important the NALU.
  • NALU type (nal_unit_type): identifies whether the NALU carries a PPS, an SPS, an I (key)/P/B frame, and so on.

In a typical H.264 bit stream, the first two NALUs are the SPS and PPS, and the third NALU is an IDR (I frame). The NALU type is the key to determining the type of each frame. How to use it to detect SPS, PPS and I/P/B frames is shown with detailed examples in section 4. The mapping between the values and the NALU types is as follows:

According to this mapping, NALU type 1 indicates a slice of a non-key frame; type 5 indicates a key frame (an IDR I frame); type 6 indicates supplemental enhancement information (SEI); type 7 indicates a sequence parameter set (SPS); type 8 indicates a picture parameter set (PPS); and so on.
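The three header fields can be pulled out of the header byte with shifts and masks. The sketch below assumes the standard layout of 1 bit, 2 bits and 5 bits from the most significant bit down, and uses the field names from the H.264 syntax (`forbidden_zero_bit`, `nal_ref_idc`, `nal_unit_type`); the `NalHeader` type and `parse_nal_header` function are names invented for this example.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # 1 bit, expected to be 0
    nal_ref_idc: int         # 2 bits, importance for reconstruction (0 to 3)
    nal_unit_type: int       # 5 bits, kind of payload (0 to 31)


def parse_nal_header(header_byte: int) -> NalHeader:
    """Split the 1-byte NALU header into its three fields."""
    return NalHeader(
        forbidden_zero_bit=(header_byte >> 7) & 0x01,
        nal_ref_idc=(header_byte >> 5) & 0x03,
        nal_unit_type=header_byte & 0x1F,
    )


if __name__ == "__main__":
    for byte in (0x67, 0x68, 0x65, 0x41):
        print(f"0x{byte:02X} -> {parse_nal_header(byte)}")
```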

4. SPS, PPS, I/P/B frame detection and parsing in H.264

4.1 H.264 bit stream hierarchy

Before detecting SPS, PPS and I/P/B frames, let us first look at the layered structure of the H.264 bit stream. Viewed from the outside in, an H.264 stream is a series of sequences (shown as the first layer), each composed of multiple NALUs and each starting at one I frame and ending at the next. The NALU is the basic unit in which H.264 encoded data is stored or transmitted. A NALU consists of a NALU header and a NALU body (shown as the second layer). The NALU header takes up 1 byte, and SPS, PPS and I/P/B frames are detected through the NALU type field in that header.

The hierarchical structure of H.264 bit stream is shown in the figure below:

4.2 H.264 File Parsing

Generally speaking, the first data an encoder outputs is the SPS and PPS, followed by an I frame (key frame), followed by P frames, B frames and so on. In an H.264 code stream, frames are delimited by the start code 0x00000001 or 0x000001, which occupies 4 or 3 bytes respectively. The byte immediately after the start code is the NALU header, and through this byte we can easily find the SPS, PPS and I/P/B frames we need. To illustrate, we open a test.264 file with the H.264 Video ES Viewer tool (how to generate the H.264 file will be covered in detail in the next blog post). The structure of the H.264 bit stream is as follows:
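Splitting a raw stream at its start codes can be sketched as below. This is a minimal illustration, not a full parser: `split_nalus` is a hypothetical helper, and the sample bytes are made up for the demonstration. It relies on two properties of the byte stream format: the pattern 00 00 01 does not occur inside a NALU, and a NALU never ends with a 0x00 byte, so trailing zeros belong to the next 4-byte start code or to padding.

```python
START_CODE = b"\x00\x00\x01"  # a 4-byte start code is this pattern after one extra 0x00


def split_nalus(stream: bytes) -> list[bytes]:
    """Split a byte stream into NALUs, with the start codes removed."""
    nalus = []
    pos = stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        pos = stream.find(START_CODE, start)
        end = len(stream) if pos == -1 else pos
        # Drop trailing zeros: they are the leading byte of the next
        # 4-byte start code (or padding), not part of this NALU.
        nalu = stream[start:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
    return nalus


if __name__ == "__main__":
    sample = (
        b"\x00\x00\x00\x01\x67\x42\x80\x1f"   # 4-byte start code
        b"\x00\x00\x00\x01\x68\xce\x06\xe2"   # 4-byte start code
        b"\x00\x00\x01\x65\xb8\x40"           # 3-byte start code
    )
    for nalu in split_nalus(sample):
        print(f"header 0x{nalu[0]:02X}, {len(nalu)} bytes")
```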

As can be seen from the figure above, each row represents a frame (except the SPS and PPS rows), and each row has four columns. The first column is the logical address of the frame. The second column is the byte length of the frame data; because of how H.264 encodes, a frame in the code stream is not a standalone picture but encoded data that may depend on several frames. The third column is the start code of the frame, which is 0x00000001. The fourth column is the NAL type. The figure shows that the H.264 encoder outputs the SPS and PPS first, followed by an I frame (key frame), followed by P or B frames (non-I frames), and so on.

4.3 SPS, PPS, I/P/B frame detection

With the theory and analysis above, it is easy to identify SPS, PPS and I/P/B frames in the H.264 bit stream. As we know, each frame of data in the H.264 bit stream starts with 0x00000001 or 0x000001, and the byte right after the start code is the NALU header. For example, the NALU header of the first frame is 0x67. We take the first few frames of data from the bit stream for analysis:

    Frame 1: 00 00 00 01 67 42 80 1F DA 02 D0 28 68...
    Frame 2: 00 00 00 01 68 CE 06 E2 (8 bytes)
    Frame 3: 00 00 00 01 65 B8 40 F7 8F FC EB 04...
    Frame 4: 00 00 00 01 41 E2 01 10 EA 4E 9F...
    Frame 5: 00 00 00 01 41 E4 01 10 EC 7B DF 13... (2,096 bytes)

The NALU type is held in the low five bits of the NALU header (positions 3 to 7 when the bits of the byte are numbered from 0 starting at the most significant bit). We therefore only need the decimal value of these five bits, which we then look up in the NAL type table to know whether the frame is an SPS, a PPS or an I frame (key frame). In code, we get the type of each frame by ANDing the byte that follows the start code with 0x1F, which keeps the low five bits of the NAL header, for example:

    0x67 & 0x1F = (0110 0111) & (0001 1111) = (0000 0111) = 7 (decimal) --> SPS
    0x68 & 0x1F = (0110 1000) & (0001 1111) = (0000 1000) = 8 (decimal) --> PPS
    0x65 & 0x1F = (0110 0101) & (0001 1111) = (0000 0101) = 5 (decimal) --> key frame (I frame)
    0x41 & 0x1F = (0100 0001) & (0001 1111) = (0000 0001) = 1 (decimal) --> non-key frame (non-I frame)
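The same masking can be wrapped in a small lookup. This is a sketch rather than a complete classifier: `classify_nalu` and `NAL_TYPE_LABELS` are names invented here, and the table lists only the types discussed in this post.

```python
# Labels for the NALU types discussed in this post; other types exist.
NAL_TYPE_LABELS = {
    1: "non-key frame (non-IDR slice)",
    5: "key frame (IDR slice)",
    6: "SEI",
    7: "SPS",
    8: "PPS",
}


def classify_nalu(header_byte: int) -> str:
    """Label a NALU from the byte that follows its start code."""
    nal_type = header_byte & 0x1F  # keep the low five bits
    return NAL_TYPE_LABELS.get(nal_type, f"other (type {nal_type})")


if __name__ == "__main__":
    for header in (0x67, 0x68, 0x65, 0x41):
        print(f"0x{header:02X} & 0x1F = {header & 0x1F} -> {classify_nalu(header)}")
```

The NALU type alone separates key frames from non-key frames. Telling a P frame from a B frame requires reading the slice type inside the slice header, which this sketch does not attempt.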