Preface

H264 is the standard for the video coding layer, and its purpose is to compress video data. To see why compression is necessary, consider the size of completely uncompressed video. Suppose the video is HD (1280 × 720) at 30 frames per second, stored as YUV 4:2:0 (12 bits, i.e. 1.5 bytes, per pixel). The data per second is

1280 × 720 × 1.5 (bytes/pixel) × 30 (frames) / 1024 (KB) / 1024 (MB) ≈ 39.5 MB

That's roughly 208 GB for a 90-minute movie, which is obviously unrealistic to transmit over today's networks.
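As a sanity check, the raw data rate can be computed for any resolution and frame rate. Here is a minimal Python sketch; the YUV 4:2:0 assumption (12 bits per pixel) is mine, and other pixel formats would change the figures:

```python
def raw_rate_mb_per_sec(width, height, fps, bits_per_pixel=12):
    """Raw (uncompressed) video data rate in MB per second."""
    bytes_per_second = width * height * bits_per_pixel / 8 * fps
    return bytes_per_second / (1024 ** 2)

def movie_size_gb(width, height, fps, minutes, bits_per_pixel=12):
    """Total raw size of a movie in GB."""
    return raw_rate_mb_per_sec(width, height, fps, bits_per_pixel) * minutes * 60 / 1024

# 720p at 30 fps: about 39.6 MB/s, about 208 GB for a 90-minute movie
rate = raw_rate_mb_per_sec(1280, 720, 30)
size = movie_size_gb(1280, 720, 30, 90)
```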

The principle of video compression is to remove the redundant parts of the video. The main kinds of redundancy are listed below.

1. Temporal redundancy. Temporal redundancy is common in sequential images (television, animation) and audio data. Two adjacent images in a sequence are strongly correlated: the later image largely resembles the previous one, and this correlation appears as temporal redundancy. Similarly, speech has temporal redundancy, because speech sounds change gradually and continuously rather than being independent over time.

2. Spatial redundancy. Spatial redundancy is a kind of redundancy that commonly exists in image data. Within a single image, the surfaces of regular objects and regular backgrounds ("regular" meaning that the surface color distribution is orderly rather than chaotic) are correlated, and these correlated optical structures appear as data redundancy in the digital image.

3. Knowledge redundancy. Understanding many images relies on prior knowledge with which the image content is strongly correlated. For example, a face image has a fixed structure: the nose is above the mouth, the eyes are above the nose, the nose lies on the center line of the face, and so on. Such regular structure can be derived from prior and background knowledge; we call this kind of redundancy knowledge redundancy.

4. Structural redundancy. Some images have a very strong texture structure over large areas, such as images of cloth or straw mats; we say they have structural redundancy.

5. Visual redundancy. The human visual system cannot perceive every change in an image. For example, during encoding and decoding the image may change because of noise introduced by compression or bit-rate truncation; if these changes are imperceptible, the image is still considered good enough. In fact, the human visual system can resolve only about 2^6 = 64 gray levels, while images are generally quantized with 2^8 = 256 gray levels; this kind of redundancy is called visual redundancy. In general, the human visual system is sensitive to luminance changes but relatively insensitive to chroma changes; in high-luminance regions, the eye's sensitivity to luminance changes decreases. It is sensitive to object edges but relatively insensitive to interior regions, and sensitive to overall structure but relatively insensitive to internal details.

6. Information entropy redundancy. Information entropy measures the amount of information carried by a set of data. It is defined as H = -∑ p_i × log2(p_i), where N is the number of symbols and p_i is the probability that symbol y_i occurs (i = 0, …, N - 1). By definition, to make the per-symbol data rate d approach H, we should set d = ∑ p_i × b(y_i), where b(y_i) is the number of bits assigned to symbol y_i, theoretically -log2(p_i). In practice it is difficult to estimate {p_0, p_1, …, p_{N-1}}, so equal-length codes are usually used instead: b(y_0) = b(y_1) = … = b(y_{N-1}). For example, alphanumeric characters are coded with 7 bits each, i.e. b(y_0) = b(y_1) = … = b(y_{N-1}) = 7. Then d is necessarily greater than H, and the resulting redundancy is called information entropy redundancy, or coding redundancy.
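The definition above can be checked with a small Python sketch: compute the entropy H of a symbol sequence and compare it with the cost of an equal-length code. The example string is arbitrary:

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy H = -sum(p_i * log2(p_i)) in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

data = "aaaabbc"                               # arbitrary example sequence
h = entropy(data)                              # about 1.38 bits per symbol
fixed = math.ceil(math.log2(len(set(data))))   # equal-length code: 2 bits per symbol
redundancy = fixed - h                         # coding redundancy per symbol
```

Since the symbol probabilities are unequal, the fixed-length cost (2 bits) exceeds H, which is exactly the coding redundancy described above.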

H264 original code stream structure

Composition: the H264 architecture is divided into two layers, the VCL (Video Coding Layer) and the NAL (Network Abstraction Layer).

  1. VCL: contains the core compression engine and the syntax-level definitions for blocks, macroblocks, and slices. It is designed to encode as efficiently as possible, independently of the network.
  2. NAL: is responsible for adapting the bit strings generated by the VCL to a wide variety of networks and multiplexing environments, and covers all syntax levels at the slice level and above.

VCL data is mapped into NALUs before transmission or storage, so H264 data consists of one NALU after another, as shown in the figure below.

One NALU = a NALU header describing the encoded video data + a Raw Byte Sequence Payload (RBSP).

A raw NALU unit consists of [StartCode][NALU Header][NALU Payload].

The StartCode marks the beginning of a NALU and must be 00 00 00 01 or 00 00 01.

1. NAL Header

The header format is shown in the figure above.

For example:

  1. 00 00 00 01 06: SEI information
  2. 00 00 00 01 67: 0x67 & 0x1F = 0x07: SPS
  3. 00 00 00 01 68: 0x68 & 0x1F = 0x08: PPS
  4. 00 00 00 01 65: 0x65 & 0x1F = 0x05: IDR slice
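The examples above can be reproduced with a short Python sketch that scans an Annex B byte stream for start codes and reads the fields of the NALU header byte. The sample bytes below are illustrative fragments, not a complete stream, and the sketch ignores the 00 00 03 emulation-prevention bytes that real streams contain:

```python
NAL_TYPES = {5: "IDR slice", 6: "SEI", 7: "SPS", 8: "PPS"}

def parse_nalus(data: bytes):
    """Split an Annex B byte stream at 00 00 01 / 00 00 00 01 start codes
    and return (nal_ref_idc, nal_unit_type) for each NALU header byte."""
    nalus = []
    i = 0
    while i + 3 < len(data):
        # 3-byte start code, or 4-byte variant with one extra leading zero
        if data[i] == 0 and data[i + 1] == 0 and (
            data[i + 2] == 1 or (data[i + 2] == 0 and data[i + 3] == 1)
        ):
            i += 3 if data[i + 2] == 1 else 4
            header = data[i]
            nalus.append((header >> 5 & 0x03, header & 0x1F))
            i += 1
        else:
            i += 1
    return nalus

stream = bytes.fromhex("000000016764" "000000016868" "0000000165aa")
types = [t for _, t in parse_nalus(stream)]  # nal_unit_type 7 (SPS), 8 (PPS), 5 (IDR slice)
```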

  2. RBSP

The following is a description of the RBSP sequence

H264 code stream hierarchy

Let’s look at the structure of each layer one by one.

  1. Slice

You can see that the body of a NALU is a slice. The slice is a new concept introduced by H264: an image consists of one or more slices, and slices are carried by NALUs for network transport.

The purpose of slices is to limit the spread of bit errors. Slices are coded independently of one another: prediction within one slice cannot use macroblocks in other slices as references. This ensures that a prediction error in one slice does not propagate to other slices.

A slice consists of a slice header + slice data.

Slices have the following five types:

(1) I-slice: all macroblocks (MBs) of the slice are coded with intra prediction.
(2) P-slice: MBs in the slice are coded with intra prediction or inter prediction, but each inter-predicted block can use at most one motion vector.
(3) B-slice: similar to a P-slice, but each inter-predicted block can use two motion vectors. The 'B' stands for bi-predictive: inter prediction can reference slices of both the preceding and the following images, or slices of two different preceding images.
(4) SP-slice: the so-called switching P slice, a special type of P-slice used to switch between two bitstreams with different bit rates.
(5) SI-slice: the so-called switching I slice, a special type of I-slice, used not only to switch between two bitstreams with different content, but also for random access, to implement network VCR functionality.

  2. From the structure diagram above, you can see that a slice contains macroblocks. So what is a macroblock?

The macroblock is the main carrier of video information. An encoded image is usually divided into multiple macroblocks, each containing the luma and chroma information of its pixels. The main task of video decoding is to provide an efficient way to recover the pixel arrays from the code stream.

One macroblock = one 16×16 block of luma samples + one 8×8 Cb block + one 8×8 Cr block of chroma samples. (YCbCr is a member of the YUV family: Y is the luma component, Cb the blue-difference chroma component, and Cr the red-difference chroma component.)

Macroblock classification:

I macroblock: coded with intra-frame prediction.

P macroblock: coded with inter-frame prediction, using a previous frame as reference; a P macroblock can be further partitioned.

B macroblock: coded with bidirectional inter-frame prediction, referencing both a preceding and a following frame.

1 frame = 1 or n slices
1 slice = n macroblocks
1 macroblock = one 16×16 block of YUV data
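These relationships are easy to verify numerically. A small sketch, assuming the 16×16 luma + two 8×8 chroma (4:2:0) layout described above:

```python
def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks needed to cover a frame (rounded up)."""
    return ((width + 15) // 16) * ((height + 15) // 16)

def macroblock_raw_bytes():
    """Raw samples per macroblock in YCbCr 4:2:0: 16x16 Y + 8x8 Cb + 8x8 Cr."""
    return 16 * 16 + 8 * 8 + 8 * 8   # 384 bytes

mbs = macroblocks_per_frame(1280, 720)   # 80 x 45 = 3600 macroblocks per 720p frame
```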

As shown in the figure below

The structure of a macro block is shown below

mb_type: determines whether the MB is intra-coded or inter-coded (P or B) and determines the MB partition size.
mb_pred: for intra macroblocks, determines the intra prediction modes; for inter macroblocks, determines the list 0 and/or list 1 reference pictures and the differentially coded motion vector of each macroblock partition.
sub_mb_pred: (only for inter MBs with 8×8 partitions) determines the sub-macroblock partition of each sub-macroblock, the list 0 and/or list 1 reference picture of each sub-macroblock partition, and the differentially coded motion vector of each sub-macroblock partition.
coded_block_pattern: indicates which 8×8 blocks (luma and chroma) contain coded transform coefficients.
mb_qp_delta: the change in quantization parameter.
residual: the coded transform coefficients of the residual image samples.

I, P, B, and IDR frames; DTS and PTS; GOP

I frame: intra-coded frame. An I frame is usually the first frame of each GOP. It is moderately compressed, using a principle similar to JPEG image compression, and achieves a compression ratio of roughly 6:1.

P frame: forward-predictive coded frame. It is compressed by removing the temporal redundancy relative to previously coded frames in the image sequence, and achieves a compression ratio of roughly 20:1. It is also called a predicted frame.

B frame: bidirectionally predictive interpolated frame. It is compressed using the temporal redundancy of both the preceding and the following frames, and is also called a bidirectionally predicted frame. It achieves a compression ratio of roughly 50:1.

IDR frame: a special kind of I frame. The first image of a sequence is called an IDR image (Instantaneous Decoding Refresh).

When the decoder decodes an IDR image, it immediately empties the reference frame queue, outputs or discards all previously decoded data, searches for the parameter sets again, and starts a new sequence. This prevents serious errors in the previous sequence from propagating forward.

DTS (Decoding Time Stamp): the order in which video frames are decoded.
PTS (Presentation Time Stamp): the order in which video frames are displayed.

Because of bidirectionally predicted frames such as B frames, the order in which frames are decoded differs from the order in which they are displayed, as shown in the figure below.
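As a toy illustration (not a real decoder), the reordering can be sketched in Python: a B frame cannot be decoded until the following anchor frame (I or P) has been decoded, so the anchor moves ahead of the B frames in decode order:

```python
def decode_order(display_frames):
    """Reorder frames from display (PTS) order to decode (DTS) order.
    Each B frame depends on the next I/P anchor, so the anchor must be
    decoded before the B frames that precede it in display order."""
    out, pending_b = [], []
    for frame in display_frames:
        if frame[0] == "B":
            pending_b.append(frame)       # hold until the next anchor arrives
        else:
            out.append(frame)             # decode the anchor first...
            out.extend(pending_b)         # ...then the deferred B frames
            pending_b = []
    out.extend(pending_b)
    return out

# Display order: I0 B1 B2 P3 B4 B5 P6  ->  decode order: I0 P3 B1 B2 P6 B4 B5
frames = [("I", 0), ("B", 1), ("B", 2), ("P", 3), ("B", 4), ("B", 5), ("P", 6)]
reordered = decode_order(frames)
```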

GOP (Group of Pictures): the group of pictures between two I frames. When configuring an encoder, the gop_size value is usually set to specify the number of frames between two I frames. Generally speaking, the smaller the gop_size, the better the picture quality, but the larger the resulting bitstream.

Since decoding must start from an I frame before the first image can be produced, the "instant start" of live streaming works by caching a full GOP at the CDN, so the player can decode the first frame quickly.

References:

  1. Learn the H264 structure from zero

  2. H.264 Study Notes

There are plenty of references on H264 online; this article is mainly my study notes.