Hardware codec knowledge (H.264, H.265)

Intended audience: iOS developers researching hardware encoder/decoder applications, starting from zero knowledge of hardware codecs and of the data structures used for bitstream parsing

Covers the background of H.264/H.265, the relevant data structures, and how the codecs are used in iOS development




I. Background and overview

1. Starting with iOS 11, photos taken on iPhone 7 and newer devices are no longer stored as JPEG. They use a new image format, HEIF (pronounced "heef"), with the .heic extension on iOS, and are encoded with HEVC, also known as H.265 (think of it as the next generation of H.264). Video likewise uses HEVC as the encoder, and the file extension remains .mov.
2. HEIF is the image format and HEVC is the encoding format (comparable to H.264 or VP8). HEIF is the image container (similar to MKV or MP4 for video). A HEIF image encoded with HEVC has the .heic extension, which is the main format Apple uses.
3. HEIF stands for High Efficiency Image File Format, a format for storing images and image sequences developed by the Moving Picture Experts Group (MPEG). As the slogan about HEIF goes: JPEG is big, but HEIF is small.

4. Advantages
  • High compression ratio, about twice that of JPEG at the same image quality
  • Can carry auxiliary data such as depth maps and alpha (transparency) channels
  • Supports storing multiple pictures, similar to albums or collections (enabling multiple-exposure effects)
  • Supports multiple images for GIF-like and Live Photo animation effects
  • No maximum pixel limit, unlike JPEG
  • Transparent pixel support
  • Tiled (block-based) loading
  • Thumbnail support
5. File composition
  • In a video file, the container and the codec are separate. For example, MP4 and MKV are container formats, while H.264 and VP8 are codecs.
  • Image files such as JPEG mix the two together, which is less flexible. HEIF separates the container from the codec, and a HEIF container can hold one or more images.
6. Compatibility

In general, users are not aware of this format: only newer iOS devices with hardware decoding support store photos and videos internally as HEIF & HEVC, and when transferring to an incompatible device via AirDrop or a data cable the files are automatically converted to JPEG. So it does not affect your use of WeChat, Weibo, and other apps.

II. Video codec

1. Soft encoding and hard encoding
  • Soft encoding: encoding on the CPU.
  • Hard encoding: encoding on dedicated hardware such as the GPU, DSP, FPGA, or an ASIC chip instead of the CPU.
  • Comparison
    • Soft encoding: direct and simple, parameters are easy to tune, easy to upgrade, but it loads the CPU and its performance is lower than hard encoding; at low bit rates, quality is usually better than hard encoding.
    • Hard encoding: high performance and low bit rate, but quality is usually lower than a soft encoder. Some products have ported excellent soft-encoding algorithms (such as x264) to GPU hardware platforms, with quality essentially on par with soft encoding.
    • Before iOS 8.0, Apple did not open up the system's hardware codec capability, although macOS had long had a framework called Video Toolbox for hardware encoding and decoding. With iOS 8.0, Apple finally brought this framework to iOS.
2. Principles of H.264 coding

H.264 is a newer-generation coding standard, known for high compression, high quality, and support for streaming over a variety of networks. Its theoretical basis, as I understand it, is statistical: within a short period of time, the pixel difference between adjacent images is generally under 10%, the change in luminance is under 2%, and the change in chroma is under 1%. So when the picture changes only slightly, we can first encode one complete image frame, A; the next frame, B, does not encode the whole image but only the differences from A, so B can be 1/10 the size of a complete frame or even smaller. If frame C after B also changes little, we can continue to encode C with reference to B, and so on. We call such a run of images a sequence (a segment of data with the same characteristics). When an image differs greatly from the previous one and can no longer be generated by referring to earlier frames, we end the current sequence and start the next one: that image is encoded as a new complete frame A1, and subsequent images are generated with reference to A1, writing only their differences from A1.

The H.264 protocol defines three types of frames. A fully encoded frame is called an I frame; a frame generated by referring to the previous I frame and encoding only the differences is called a P frame; and a frame encoded with reference to both the preceding and the following frames is called a B frame.

The core algorithms H.264 uses are intra-frame compression and inter-frame compression: intra-frame compression generates I frames, and inter-frame compression generates P and B frames.

3. Description of the sequence

In H.264, images are organized in units of sequences. A sequence is a segment of the encoded data stream that begins with one I frame and runs until the next.

The first image in a sequence is called an IDR image (Instantaneous Decoding Refresh image), and IDR images are always I frames. H.264 introduced IDR images for decoder resynchronization: when the decoder reaches an IDR image, it immediately empties the reference frame queue, outputs or discards all decoded data, searches for the parameter sets again, and starts a new sequence. This provides an opportunity to resynchronize if a major error occurred in the previous sequence. Images after an IDR image are never decoded using data from images before it.

A sequence is a stream of data generated by encoding a run of images whose content does not differ much. When motion is small, a sequence can be very long, because little motion means the picture content changes little, so one I frame can be followed by many P and B frames. When motion is large, a sequence may be short, for example one I frame followed by only three or four P frames.

4. Introduction to the three frames
  • I frame

    • The I frame is the key frame. Think of it as a complete preservation of the picture: decoding needs only this frame's data, because it contains the whole image.
    • Characteristics
      • It is a fully intra-compressed frame: the whole frame's image information is compressed (JPEG-like) and transmitted
      • The whole image can be reconstructed from the I frame data alone when decoding
      • The I frame describes the details of the background and the moving subject
      • I frames are generated without reference to other frames
      • The I frame is the reference frame for P and B frames (its quality directly affects the quality of subsequent frames in the same group)
      • The I frame is the base frame (first frame) of a GOP, and there is only one I frame per group
      • I frames do not need motion vectors
      • An I frame carries a large amount of data
  • P frame

    • Forward-predicted coded frame. A P frame represents the difference between this frame and the previous key frame (or P frame). When decoding, the difference defined in this frame is superimposed on the previously cached picture to produce the final picture. In other words, a P frame is a difference frame: it has no complete picture data, only the data that differs from the previous frame. By fully reducing the temporal redundancy with previously coded frames in the image sequence, the amount of transmitted data is compressed; it is also called a predicted frame.
    • Prediction and reconstruction of a P frame: the P frame uses the I frame as its reference. The predicted value and motion vector of "a given point" of the P frame are found in the I frame, and the prediction difference and the motion vector are transmitted together. At the receiving end, the predicted value of that point is found in the I frame according to the motion vector and combined with the difference to obtain the sample value, and thus the complete P frame.
    • Features:
      • A P frame is a coded frame that follows an I frame, typically separated from it by one or two frames
      • A P frame uses motion compensation to transmit its difference from the preceding I or P frame along with the motion vectors (prediction error)
      • During decoding, the predicted value from the I frame and the prediction error must be summed to reconstruct the complete P frame image
      • P frames use forward-predicted inter-frame coding; they reference only the nearest preceding I or P frame
      • A P frame can be the reference frame for the P frame after it, or for the B frames before and after it
      • Because a P frame is a reference frame, errors in it can propagate through subsequent decoding
      • Because only differences are transmitted, the compression of P frames is relatively high
  • B frame

    • Bidirectionally predicted (interpolated) coded frame. A B frame is a two-way difference frame: it records the differences between this frame and both the preceding and the following frames (the details are more complex, with four cases, but this is the simple version). In other words, to decode a B frame you need not only the previously cached picture but also the decoded picture that follows; the final picture is obtained by combining the preceding and following pictures with this frame's difference data. B frames compress very well, but decoding them puts more load on the CPU.
    • Prediction and reconstruction of a B frame: the preceding I or P frame and the following P frame serve as reference frames. The predicted values and two motion vectors of "a given point" of the B frame are found, and the prediction difference and motion vectors are transmitted. The receiver finds (computes) the predicted value in the two reference frames according to the motion vectors and sums it with the difference to obtain the sample value of that point, and thus the complete B frame.
    • Features:
      • B frames are predicted from the preceding I or P frame and the following P frame
      • A B frame transmits the prediction error and motion vectors between it and the preceding I or P frame and the following P frame
      • B frames are bidirectionally predicted coded frames
      • B frames have the highest compression ratio, because they only reflect the changes of the moving subject relative to the reference frames, so the prediction is more accurate
      • B frames are not reference frames, so they do not cause decoding errors to propagate

I, P, and B frames are defined artificially according to the needs of the compression algorithm; they are all real physical frames. Generally speaking, the compression ratio of an I frame is about 7 (similar to JPEG), a P frame about 20, and a B frame can reach 50. Using B frames therefore saves a great deal of space, and the saved space can be spent on more I frames, which provides better picture quality at the same bit rate.
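As a rough, illustrative calculation (my own numbers, using only the ratios above): an uncompressed 1080p frame in 4:2:0 is about 1920 × 1080 × 1.5 ≈ 3.1 MB, so an I frame would be on the order of 3.1 MB / 7 ≈ 440 KB, a P frame around 155 KB, and a B frame around 62 KB. This is why a GOP dominated by P and B frames is so much smaller than a run of I frames alone.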

5. The compression algorithm explained

H.264 compression method:

  • Grouping: several frames are grouped into a GOP (group of pictures, i.e. a sequence). To keep motion changes manageable, the number of frames should not be too large.

  • Frame definition: each frame in the group is classified as one of three types: I frame, B frame, or P frame;

  • Frame prediction: the I frame is used as the base frame; the P frame is predicted from the I frame, and the B frame is then predicted from the I and P frames;

  • Data transmission: finally, the I frame data and the prediction differences are stored and transmitted.

  • Intra-frame compression is also known as spatial compression.

    • When compressing a frame, only the data within that frame is considered, without accounting for redundancy between adjacent frames; this is essentially similar to still-image compression. Intra-frame compression generally uses a lossy algorithm. Because an intra-coded frame encodes a complete image, it can be decoded and displayed on its own. Intra-frame compression generally does not reach very high ratios, similar to encoding a JPEG.
  • Inter-frame compression

    • The data of adjacent frames is highly correlated; put another way, little information changes between two consecutive frames, so continuous video carries redundant information between adjacent frames. Compressing away this redundancy further increases the amount of compression and reduces the data size. Inter-frame compression is also known as temporal compression, and works by comparing data between frames along the timeline. Inter-frame compression is generally lossless. Frame differencing is a typical temporal compression method: it compares a frame with its neighbors and records only the differences, which greatly reduces the amount of data (a conceptual sketch follows this list).
  • Lossy compression and lossless compression

    • Lossless compression means the data after decompression is exactly the same as before compression. Most lossless compression uses run-length encoding (RLE).
    • Lossy compression means the data after decompression differs from the data before compression: some image or audio information that the human eye or ear is insensitive to is discarded during compression, and the lost information cannot be recovered. Almost all high-compression algorithms use lossy compression in order to reach low data rates. The amount of data lost is related to the compression ratio: the more aggressive the compression, the more data is lost and generally the worse the decompressed result. In addition, some lossy algorithms apply compression repeatedly, which causes additional data loss.
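To make the frame-differencing idea concrete, here is a small conceptual Swift sketch (my own illustration, not how H.264 actually operates internally; real codecs work on motion-compensated macroblocks and transform coefficients, not raw per-pixel deltas). It shows why storing only the changed samples of a "P frame" against a cached reference is much cheaper than storing the whole frame.

```swift
/// One changed sample: where it is and how much it differs from the reference.
struct SampleDelta {
    let index: Int
    let difference: Int16
}

/// "Encode" a P-frame-like delta: record only the samples that changed
/// relative to the reference (e.g. the previous I frame).
func encodeDelta(reference: [UInt8], current: [UInt8]) -> [SampleDelta] {
    precondition(reference.count == current.count)
    var deltas: [SampleDelta] = []
    for i in 0..<current.count where current[i] != reference[i] {
        deltas.append(SampleDelta(index: i,
                                  difference: Int16(current[i]) - Int16(reference[i])))
    }
    return deltas
}

/// "Decode": rebuild the current frame by applying the recorded differences
/// to the cached reference picture.
func applyDelta(reference: [UInt8], deltas: [SampleDelta]) -> [UInt8] {
    var frame = reference
    for delta in deltas {
        frame[delta.index] = UInt8(clamping: Int16(frame[delta.index]) + delta.difference)
    }
    return frame
}

// If only ~10% of samples change between adjacent frames, the recorded delta
// is roughly an order of magnitude smaller than the full frame.
```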
6. Difference between DTS and PTS

DTS (decoding timestamp) is mainly used for video decoding, i.e. during the decoding phase. PTS (presentation timestamp) is mainly used for video synchronization and output, i.e. during display. In the absence of B frames, the output order of DTS and PTS is the same.

EX: here is an example of how, within a GOP, the decoding order differs from the display order when B frames are present:
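As a minimal illustrative example (my own, standing in for the original figure, and using a short group for brevity), take a GOP whose display order is I B B P B B P:

Display (PTS) order: I1  B2  B3  P4  B5  B6  P7
Decode (DTS) order:  I1  P4  B2  B3  P7  B5  B6

B2 and B3 reference both I1 and P4, so P4 must be decoded before them. Whenever B frames are present, the decode order differs from the display order, which is exactly why DTS and PTS diverge.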

III. H.264 hardware encoding and decoding on iOS

1. Introduction to VideoToolbox

In iOS there are five video-related frameworks. From the top layer down they are AVKit, AVFoundation, VideoToolbox, Core Media, and Core Video.

VideoToolbox exposes video data as CVPixelBuffer and CMSampleBuffer objects.

Of these five, hardware encoding and decoding mainly involves AVKit, AVFoundation, and VideoToolbox; only VideoToolbox is introduced here.

2. Objects in VideoToolbox
  • CVPixelBuffer: the image data structure before encoding and after decoding (an uncompressed raster image buffer)

  • CVPixelBufferPool: as the name suggests, a reuse pool of CVPixelBuffer objects

  • PixelBufferAttributes: a CFDictionary that may contain the video's width and height, pixel format type (32RGBA, YCbCr420), whether it can be used with OpenGL ES, and more
  • CMTime: timestamp related. Time is represented as a 64-bit value over a 32-bit timescale: the numerator is the 64-bit time value and the denominator is the 32-bit timescale.
  • CMClock: timestamp related, using the same 64-bit value / 32-bit timescale representation. It encapsulates a time source; CMClockGetHostTimeClock() wraps mach_absolute_time().
  • CMTimebase: timestamp related, same representation. A control view on a CMClock that provides a time mapping: CMTimebaseSetTime(timebase, kCMTimeZero); CMTimebaseSetRate(timebase, 1.0); (a short sketch follows this list)

  • CMBlockBuffer: the data structure holding the image data after encoding
  • CMSampleBuffer: the container data structure that stores video images both before and after codec
  • Video images before and after codec are wrapped in a CMSampleBuffer: an encoded image is stored as a CMBlockBuffer, while a decoded image is stored as a CVPixelBuffer. The CMSampleBuffer also carries timing information (CMTime) and a video format description (CMVideoFormatDesc) covering the storage mode and codec format.
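A minimal Swift sketch of the timestamp objects above (my own illustration, not code from the post): CMTime as a 64-bit value over a 32-bit timescale, the host clock that wraps mach_absolute_time(), and the CMTimebase calls quoted in the list. CMTimebaseCreateWithMasterClock is used for creation here; newer SDKs rename it to CMTimebaseCreateWithSourceClock.

```swift
import CoreMedia

// CMTime: a 64-bit value over a 32-bit timescale. 900 / 600 = 1.5 seconds.
let pts = CMTime(value: 900, timescale: 600)
print(CMTimeGetSeconds(pts))                          // 1.5

// CMClock: CMClockGetHostTimeClock() encapsulates mach_absolute_time().
let hostClock = CMClockGetHostTimeClock()

// CMTimebase: a controllable view on a CMClock, providing a time mapping.
var timebase: CMTimebase?
let status = CMTimebaseCreateWithMasterClock(allocator: kCFAllocatorDefault,
                                             masterClock: hostClock,
                                             timebaseOut: &timebase)
if status == noErr, let timebase = timebase {
    CMTimebaseSetTime(timebase, time: .zero)          // CMTimebaseSetTime(timebase, kCMTimeZero)
    CMTimebaseSetRate(timebase, rate: 1.0)            // CMTimebaseSetRate(timebase, 1.0)
}
```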
3. Hard decoding

The use of the hardware decoding interface is illustrated through a typical application: an H.264-encoded video stream is received over the network and displayed on the phone screen.

1> Convert the H.264 stream to CMSampleBuffer

CMSampleBuffer = CMTime + FormatDesc + CMBlockBuffer. These three pieces of information need to be extracted from the H.264 stream and then combined into a CMSampleBuffer, which is handed to the hard-decoding interface for decoding.

In H.264 syntax there is a base layer called the Network Abstraction Layer, NAL for short. H.264 stream data consists of a series of NAL units (NALUs).

The H.264 bitstream is composed of NALUs. A NALU may contain:

  • Video frame data: a slice of video, specifically I-frame, P-frame, or B-frame data

  • The H.264 parameter sets, which become the FormatDesc (including SPS and PPS)

In the stream data, the parameter sets appear as SPS and PPS NALUs. After processing they become the Format Description: to convert the SPS and PPS extracted from the stream into a FormatDesc, call the CMVideoFormatDescriptionCreateFromH264ParameterSets() method.

  • NALU header

In the stream data, a NALU's header is preceded by either 0x00 00 01 or 0x00 00 00 01 (both occur; 0x00 00 01 is used in the examples below). This prefix is therefore called the start code.

To summarize: the H.264 bitstream is made up of NALUs, which carry the video image data and the H.264 parameter information. The video image data forms the CMBlockBuffer, and the H.264 parameter information is combined into the FormatDesc. Specifically, the parameter information consists of the SPS (Sequence Parameter Set) and the PPS (Picture Parameter Set).

  • Extract SPS and PPS to generate FormatDesc

    • Each NALU begins with the start code 0x00 00 01; locate NALUs by searching for the start code
    • SPS and PPS are identified by their type information and extracted: the NALU type is the last 5 bits of the first byte after the start code, where 7 means SPS and 8 means PPS
    • Use the CMVideoFormatDescriptionCreateFromH264ParameterSets function to construct a CMVideoFormatDescriptionRef
  • Extract video image data to generate CMBlockBuffer

    • Locate the NALU by its start code
    • Once the type is determined to be video data, replace the start code with the NALU's length (4 bytes, big-endian)
    • Use the CMBlockBufferCreateWithMemoryBlock interface to construct a CMBlockBufferRef
  • Generate the CMTime information as required. (In actual testing, adding the time information produced unstable images, while leaving it out did not; this needs further study, so the suggestion here is not to add time information.)

With the CMVideoFormatDescriptionRef, the CMBlockBufferRef, and the optional time information obtained above, call the CMSampleBufferCreate interface to produce the CMSampleBuffer that will be handed to the decoder. Sketches of the NALU parsing and of this assembly follow.
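To make the parsing steps concrete, here is a hedged Swift sketch (my own illustration, not the original code) that scans an Annex-B stream for start codes and classifies each NALU by the last 5 bits of the byte that follows (7 = SPS, 8 = PPS).

```swift
import Foundation

/// Splits an Annex-B H.264 elementary stream into NALUs by scanning for
/// start codes (0x00 00 01 or 0x00 00 00 01) and reporting each NALU's type.
func splitNALUs(in stream: [UInt8]) -> [(type: UInt8, payload: ArraySlice<UInt8>)] {
    // Collect the position right after every start code.
    var starts: [(codeLength: Int, payloadStart: Int)] = []
    var i = 0
    while i + 3 <= stream.count {
        if stream[i] == 0, stream[i + 1] == 0, stream[i + 2] == 1 {
            starts.append((codeLength: 3, payloadStart: i + 3)); i += 3
        } else if i + 4 <= stream.count, stream[i] == 0, stream[i + 1] == 0,
                  stream[i + 2] == 0, stream[i + 3] == 1 {
            starts.append((codeLength: 4, payloadStart: i + 4)); i += 4
        } else {
            i += 1
        }
    }
    // Each NALU runs from its payload start to the next start code (or the end).
    var nalus: [(type: UInt8, payload: ArraySlice<UInt8>)] = []
    for (index, start) in starts.enumerated() {
        let end = index + 1 < starts.count
            ? starts[index + 1].payloadStart - starts[index + 1].codeLength
            : stream.count
        let payload = stream[start.payloadStart..<end]
        guard let first = payload.first else { continue }
        nalus.append((type: first & 0x1F, payload: payload))   // nal_unit_type = last 5 bits
    }
    return nalus
}

// Usage: SPS (type 7) and PPS (type 8) go to the format description; coded
// slices (e.g. type 5 = IDR, type 1 = non-IDR) become the CMBlockBuffer.
```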
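And a hedged sketch of the assembly itself (again my own reconstruction, not the demo code): the SPS and PPS become the CMVideoFormatDescription, the AVCC-framed payload (start codes already replaced with 4-byte big-endian lengths) becomes the CMBlockBuffer, and the two are combined with CMSampleBufferCreate. Timing information is omitted, as the post suggests; `sps`, `pps`, and `avccData` are assumed inputs produced by the parsing above.

```swift
import CoreMedia
import Foundation

/// Builds a CMSampleBuffer from SPS/PPS data and an AVCC-framed frame payload.
func makeSampleBuffer(sps: Data, pps: Data, avccData: Data) -> CMSampleBuffer? {
    // 1. SPS + PPS -> CMVideoFormatDescription (FormatDesc)
    var formatDesc: CMVideoFormatDescription?
    let descStatus = sps.withUnsafeBytes { (spsRaw: UnsafeRawBufferPointer) -> OSStatus in
        pps.withUnsafeBytes { (ppsRaw: UnsafeRawBufferPointer) -> OSStatus in
            let pointers: [UnsafePointer<UInt8>] = [
                spsRaw.bindMemory(to: UInt8.self).baseAddress!,
                ppsRaw.bindMemory(to: UInt8.self).baseAddress!
            ]
            let sizes: [Int] = [sps.count, pps.count]
            return CMVideoFormatDescriptionCreateFromH264ParameterSets(
                allocator: kCFAllocatorDefault,
                parameterSetCount: 2,
                parameterSetPointers: pointers,
                parameterSetSizes: sizes,
                nalUnitHeaderLength: 4,            // 4-byte length prefix (AVCC framing)
                formatDescriptionOut: &formatDesc)
        }
    }
    guard descStatus == noErr, let formatDescription = formatDesc else { return nil }

    // 2. Frame payload -> CMBlockBuffer (let the block buffer own its memory, then copy in)
    var blockBuffer: CMBlockBuffer?
    guard CMBlockBufferCreateWithMemoryBlock(
        allocator: kCFAllocatorDefault,
        memoryBlock: nil,
        blockLength: avccData.count,
        blockAllocator: nil,
        customBlockSource: nil,
        offsetToData: 0,
        dataLength: avccData.count,
        flags: 0,
        blockBufferOut: &blockBuffer) == noErr,
        let block = blockBuffer else { return nil }
    guard CMBlockBufferAssureBlockMemory(block) == noErr else { return nil }
    let copyStatus = avccData.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> OSStatus in
        CMBlockBufferReplaceDataBytes(with: raw.baseAddress!,
                                      blockBuffer: block,
                                      offsetIntoDestination: 0,
                                      dataLength: avccData.count)
    }
    guard copyStatus == noErr else { return nil }

    // 3. FormatDesc + CMBlockBuffer (timing omitted) -> CMSampleBuffer
    var sampleBuffer: CMSampleBuffer?
    var sampleSize = avccData.count
    let sampleStatus = CMSampleBufferCreate(allocator: kCFAllocatorDefault,
                                            dataBuffer: block,
                                            dataReady: true,
                                            makeDataReadyCallback: nil,
                                            refcon: nil,
                                            formatDescription: formatDescription,
                                            sampleCount: 1,
                                            sampleTimingEntryCount: 0,
                                            sampleTimingArray: nil,
                                            sampleSizeEntryCount: 1,
                                            sampleSizeArray: &sampleSize,
                                            sampleBufferOut: &sampleBuffer)
    return sampleStatus == noErr ? sampleBuffer : nil
}
```

The resulting CMSampleBuffer can then either be enqueued on an AVSampleBufferDisplayLayer or passed to VTDecompressionSessionDecodeFrame, as described in the next section.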

2> Display CMSampleBuffer

There are two ways to display:

  • Hand the CMSampleBuffer to the system's AVSampleBufferDisplayLayer for direct display
    • It is used much like any other CALayer. The layer has built-in hardware decoding: it decodes the original CMSampleBuffer and displays the resulting image on screen directly, which is very simple and convenient.
  • Use the VTDecompressionSession interface to decode the CMSampleBuffer into images yourself, then display the images in a UIImageView or render them with OpenGL.
    • Initialize a VTDecompressionSession and configure the decoder. The initialization needs the FormatDescription from the CMSampleBuffer, plus a specification of how the decoded images should be stored (the demo's CGBitmap mode stores them as RGB). After an encoded frame is decoded, a callback function is invoked and the decoded image is handed to it for further processing; inside this callback the decoded image is sent to the Control for display. The callback pointer is passed as a parameter to the create interface during initialization, and the session is finally created with the create interface. (See the sketch after this list.)
    • The callback function mentioned above converts the CGBitmap image into a UIImage and dispatches it through a queue to the Control for display.
    • Call the VTDecompressionSessionDecodeFrame interface to perform the decoding; the decoded image is delivered to the callback configured in the previous two steps for further processing.
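Below is a hedged Swift sketch of the second path (my own reconstruction, not the original demo): create a VTDecompressionSession from the stream's FormatDescription, register an output callback, and decode a CMSampleBuffer. The BGRA pixel format and the print placeholder stand in for the CGBitmap/UIImage handling described above; the first path is shown as a one-line comment, where `displayLayer` is an assumed AVSampleBufferDisplayLayer instance.

```swift
import VideoToolbox
import CoreMedia
import CoreVideo

// Path 1 (simplest): hand the buffer to the system layer, which decodes and displays it:
//     displayLayer.enqueue(sampleBuffer)        // displayLayer: AVSampleBufferDisplayLayer

// Path 2: decode manually with a VTDecompressionSession.

// C-style output callback, invoked once per decoded frame.
let decompressionOutputCallback: VTDecompressionOutputCallback = {
    _, _, status, _, imageBuffer, presentationTimeStamp, _ in
    guard status == noErr, let pixelBuffer = imageBuffer else { return }
    // Hand the decoded CVPixelBuffer to the UI (e.g. convert to UIImage on a queue)
    // or to OpenGL for rendering -- application-specific.
    print("decoded frame @ \(CMTimeGetSeconds(presentationTimeStamp))s: \(pixelBuffer)")
}

func makeDecompressionSession(formatDescription: CMVideoFormatDescription) -> VTDecompressionSession? {
    var callbackRecord = VTDecompressionOutputCallbackRecord(
        decompressionOutputCallback: decompressionOutputCallback,
        decompressionOutputRefCon: nil)

    // How decoded images should be stored (BGRA here, standing in for the demo's RGB/CGBitmap choice).
    let imageBufferAttributes: [String: Any] = [
        kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
    ]

    var session: VTDecompressionSession?
    let status = VTDecompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        formatDescription: formatDescription,     // from the SPS/PPS, as built above
        decoderSpecification: nil,
        imageBufferAttributes: imageBufferAttributes as CFDictionary,
        outputCallback: &callbackRecord,
        decompressionSessionOut: &session)
    return status == noErr ? session : nil
}

func decode(_ sampleBuffer: CMSampleBuffer, with session: VTDecompressionSession) {
    var infoFlags = VTDecodeInfoFlags()
    // Synchronous decode here; decode flags can enable asynchronous decompression.
    let status = VTDecompressionSessionDecodeFrame(session,
                                                   sampleBuffer: sampleBuffer,
                                                   flags: [],
                                                   frameRefcon: nil,
                                                   infoFlagsOut: &infoFlags)
    if status != noErr { print("decode failed: \(status)") }
}
```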
4. Hard encoding

The use of hard encoding is also described through a typical application scenario: images are captured by the camera, the captured images are hard-encoded, and the encoded data is assembled into an H.264 stream and transmitted over the network.

  • Camera data collection

    For camera capture, iOS provides AVCaptureSession to collect image data from the camera. Set the session's capture resolution, then configure the input and output. When configuring the output, set a delegate and an output queue, and process the captured images in the delegate method.

    The images are output as unencoded CMSampleBuffers.

  • Use VTCompressionSession for hard coding

    • Initialize VTCompressionSession

    The VTCompressionSession is initialized with the width, height, and the encoder type kCMVideoCodecType_H264. Then call the VTSessionSetProperty interface to set the frame rate and other properties; the demo provides some reference settings, and testing found they had almost no impact, so further tuning may be needed. Finally, set a callback function that is invoked after a video frame has been encoded successfully. Once everything is ready, create the session with VTCompressionSessionCreate. (See the sketches after this section.)

    • Extract the raw image data captured by the camera and hand it to the VTCompressionSession for hard encoding

    The image obtained from the camera is an unencoded CMSampleBuffer. Use the provided interface function CMSampleBufferGetImageBuffer to extract the CVPixelBufferRef, then use the hard-encoding interface VTCompressionSessionEncodeFrame to encode the frame. After the frame is encoded successfully, the callback function set at session initialization is invoked automatically.

    • In the callback function, convert the successfully encoded CMSampleBuffer into an H.264 bitstream and send it over the network

This is essentially the reverse of hard decoding. Parse the parameter sets SPS and PPS, prepend start codes, and assemble them into NALUs; extract the video data, convert the length prefixes into start codes, and form NALUs; then send the NALUs out. Sketches of both the session setup and this callback conversion follow.
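A hedged Swift sketch of the steps above (my own reconstruction, not the demo code the post refers to): create the VTCompressionSession with width, height, and kCMVideoCodecType_H264, set a few properties with VTSessionSetProperty, then feed camera pixel buffers to VTCompressionSessionEncodeFrame. The class name and the bit rate and frame rate values are illustrative assumptions; the output callback body is kept minimal here and expanded in the next sketch.

```swift
import VideoToolbox
import CoreMedia
import Foundation

// Output callback, invoked after each frame is encoded successfully.
// The conversion to an Annex-B H.264 stream is sketched separately below.
private let compressionOutputCallback: VTCompressionOutputCallback = { _, _, status, _, sampleBuffer in
    guard status == noErr, let encoded = sampleBuffer else { return }
    print("encoded frame: \(CMSampleBufferGetTotalSampleSize(encoded)) bytes")
}

final class H264HardwareEncoder {
    private var session: VTCompressionSession?

    /// Create the session, set reference properties, and prepare to encode.
    func setUp(width: Int32, height: Int32) {
        var newSession: VTCompressionSession?
        let status = VTCompressionSessionCreate(
            allocator: kCFAllocatorDefault,
            width: width,
            height: height,
            codecType: kCMVideoCodecType_H264,
            encoderSpecification: nil,
            imageBufferAttributes: nil,
            compressedDataAllocator: nil,
            outputCallback: compressionOutputCallback,
            refcon: nil,                  // pass self via Unmanaged if the callback needs it
            compressionSessionOut: &newSession)
        guard status == noErr, let session = newSession else { return }

        // Reference settings (illustrative values only).
        VTSessionSetProperty(session, key: kVTCompressionPropertyKey_RealTime, value: kCFBooleanTrue)
        VTSessionSetProperty(session, key: kVTCompressionPropertyKey_ProfileLevel, value: kVTProfileLevel_H264_Baseline_AutoLevel)
        VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AverageBitRate, value: NSNumber(value: 1_000_000))
        VTSessionSetProperty(session, key: kVTCompressionPropertyKey_ExpectedFrameRate, value: NSNumber(value: 30))
        VTSessionSetProperty(session, key: kVTCompressionPropertyKey_MaxKeyFrameInterval, value: NSNumber(value: 30))
        VTCompressionSessionPrepareToEncodeFrames(session)
        self.session = session
    }

    /// Call with each captured CMSampleBuffer from the camera.
    func encode(_ sampleBuffer: CMSampleBuffer) {
        guard let session = session,
              let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        let pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
        VTCompressionSessionEncodeFrame(session,
                                        imageBuffer: pixelBuffer,
                                        presentationTimeStamp: pts,
                                        duration: .invalid,
                                        frameProperties: nil,
                                        sourceFrameRefcon: nil,
                                        infoFlagsOut: nil)
        // On success, the output callback set at creation time is invoked automatically.
    }
}
```

In the AVCaptureVideoDataOutputSampleBufferDelegate's captureOutput(_:didOutput:from:) method you would call encode(_:) with each camera sample buffer.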
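And a hedged sketch of the conversion performed inside the callback (again my own illustration; the function name annexBData(from:) is an assumption): the SPS and PPS are read from the format description and prefixed with start codes, then the CMBlockBuffer is walked and every 4-byte big-endian length prefix is replaced with a start code. For simplicity the parameter sets are emitted before every frame; real code usually checks kCMSampleAttachmentKey_NotSync and emits them only before keyframes.

```swift
import CoreMedia
import Foundation

/// Converts an encoded CMSampleBuffer into an Annex-B H.264 byte stream:
/// [start code][SPS][start code][PPS][start code][NALU]...
func annexBData(from sampleBuffer: CMSampleBuffer) -> Data? {
    let startCode: [UInt8] = [0x00, 0x00, 0x00, 0x01]
    var stream = Data()

    // 1. SPS and PPS live in the format description; prepend each with a start code.
    guard let format = CMSampleBufferGetFormatDescription(sampleBuffer) else { return nil }
    for index in 0..<2 {                                     // 0 = SPS, 1 = PPS
        var pointer: UnsafePointer<UInt8>?
        var size = 0
        let status = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(
            format,
            parameterSetIndex: index,
            parameterSetPointerOut: &pointer,
            parameterSetSizeOut: &size,
            parameterSetCountOut: nil,
            nalUnitHeaderLengthOut: nil)
        guard status == noErr, let parameterSet = pointer else { return nil }
        stream.append(contentsOf: startCode)
        stream.append(parameterSet, count: size)
    }

    // 2. Walk the AVCC payload: every NALU carries a 4-byte big-endian length prefix.
    //    Replace each length prefix with a start code.
    guard let dataBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) else { return nil }
    var totalLength = 0
    var dataPointer: UnsafeMutablePointer<Int8>?
    guard CMBlockBufferGetDataPointer(dataBuffer,
                                      atOffset: 0,
                                      lengthAtOffsetOut: nil,
                                      totalLengthOut: &totalLength,
                                      dataPointerOut: &dataPointer) == noErr,
          let base = dataPointer else { return nil }

    var offset = 0
    while offset + 4 <= totalLength {
        var naluLength: UInt32 = 0
        memcpy(&naluLength, base + offset, 4)
        naluLength = UInt32(bigEndian: naluLength)           // the length prefix is big-endian
        stream.append(contentsOf: startCode)                 // replace the length with a start code
        let naluStart = UnsafeRawPointer(base + offset + 4).assumingMemoryBound(to: UInt8.self)
        stream.append(naluStart, count: Int(naluLength))
        offset += 4 + Int(naluLength)
    }
    return stream                                            // ready to send over the network
}
```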

References: Introduction to H.264 · How much do you know about HEIF & HEVC?