Background

Audio and video development on mobile is different from ordinary UI and business-logic work: it requires a grasp of some basic audio/video concepts. For codecs in particular, you need to understand in advance the characteristics of the codec, the structure of the bitstream, important pieces of information in the bitstream such as SPS, PPS, VPS and the start code, and the basic working principles. Most developers only know these things superficially, so they can read the relevant code without really understanding what it means. This article therefore summarizes the principles behind the current mainstream H.264 and H.265 coding standards for study.


Prerequisites for reading:

  • Basic knowledge of audio and video
  • The VideoToolbox framework in iOS

1. Overview

1.1. Why encode video

As we all know, raw video data is huge. Take 720P video at 30 fps as an example: at roughly 3 bytes per pixel it produces about 79 MB of data per second, as calculated below, so one minute produces more than 4.6 GB.

Data volume per second = 1280 × 720 × 30 × 3 / 1024 / 1024 ≈ 79 MB

A video of this size clearly cannot be transmitted directly over the network, and video coding technology was born precisely to make transmission possible. The principles of video coding are covered in my other articles, so they are not described in detail here.
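For reference, a quick Swift check of the arithmetic above (720P, 30 fps, 3 bytes per pixel); the values are the same as in the formula:

```swift
// Raw data rate of uncompressed 720P video at 30 fps, 3 bytes per pixel
let bytesPerSecond = 1280 * 720 * 30 * 3
let mbPerSecond = Double(bytesPerSecond) / 1024 / 1024   // ≈ 79 MB per second
let gbPerMinute = mbPerSecond * 60 / 1024                // ≈ 4.6 GB per minute
print(mbPerSecond, gbPerMinute)
```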

1.2. Coding techniques

After many years of development and iteration, many experts have implemented video coding technology. The most mainstream is H.264, together with the new-generation H.265; Google has also developed the VP8 and VP9 coding technologies. On mobile, Apple has implemented H.264 and H.265 encoding internally, and we access it through the VideoToolbox framework that Apple provides.

1.3. Coding classification

  • Software encoding (soft encoding for short): encoding performed on the CPU.
  • Hardware encoding (hard encoding for short): encoding performed on a GPU, dedicated DSP, FPGA, or ASIC chip instead of the CPU.

Advantages and disadvantages

  • Soft encoding: straightforward and simple, parameters are easy to adjust, and upgrades are easy; but it loads the CPU, its performance is lower than hard encoding, and at low bit rates its quality is usually better than hard encoding.

  • Hard encoding: high performance; at low bit rates the quality is usually lower than that of a soft encoder, but some products have ported excellent soft-encoding algorithms (such as x264) to GPU hardware platforms, where the quality is basically the same as soft encoding.

Before iOS 8.0, Apple did not open up the system's hardware encoding and decoding capability, although macOS had long had a framework called Video ToolBox for hardware codec work. With iOS 8.0, Apple finally brought that framework to iOS.

1.4. Coding principle

After encoding, the original video data is compressed into three different types of video frames: I-frames, P-frames and B-frames.

  • I-frame: key frame, a fully coded frame. It can be thought of as a complete picture that does not depend on any other frame.
  • P-frame: references the previous I-frame or P-frame; a complete picture is formed from that previous frame plus the differences this frame records. A single P-frame therefore cannot form a picture on its own.
  • B-frame: references both the preceding I-frame or P-frame and the following P-frame.

Note: the compression ratio of an I-frame is about 7 (similar to JPEG), a P-frame about 20, and a B-frame can reach 50. However, B-frames are generally not enabled on iOS, because their presence complicates timestamp synchronization.
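As a hedged illustration of this point, the sketch below shows the VTCompressionSession properties one would typically set on iOS to disable frame reordering (i.e. B-frames) and to bound the keyframe interval. The session is assumed to have been created elsewhere with VTCompressionSessionCreate, and the interval of 60 frames is only an example value.

```swift
import VideoToolbox

// Sketch: configure an already-created VTCompressionSession (assumption: `session` exists).
func configure(session: VTCompressionSession) {
    // Disable B-frames so that decode order equals display order (PTS == DTS).
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AllowFrameReordering,
                         value: kCFBooleanFalse)
    // Request a keyframe (I-frame) at least every 60 frames (example value).
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_MaxKeyFrameInterval,
                         value: 60 as CFNumber)
}
```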

Two core algorithms

  • Intra-frame compression

When a single frame is compressed, only the data within that frame is considered and the redundancy between adjacent frames is ignored, which is essentially the same as compressing a still image. Intra-frame compression is generally lossy. Because it encodes a complete image, the frame can be decoded and displayed independently. Intra-frame compression usually does not achieve a very high compression ratio, similar to encoding a JPEG.

As shown below: the encoding of block 6 can be inferred and calculated from the encodings of blocks 1, 2, 3, 4 and 5, so block 6 does not need to be encoded separately, which compresses it away and saves space.

  • Inter-frame compression: the compression algorithm for P-frames and B-frames

The data of adjacent frames is highly correlated; put differently, very little information changes from one frame to the next, so continuous video contains redundancy between adjacent frames. Exploiting this, the redundancy between adjacent frames can be removed to further raise the compression ratio. Inter-frame compression is also known as temporal compression, because it works by comparing data between frames along the time axis. Frame differencing is a typical temporal compression method: it compares a frame with its adjacent frames and records only the differences, which greatly reduces the amount of data.

As you can see, the difference between the two frames is very small, so using interframe compression makes a lot of sense.

Lossy compression and lossless compression

  • Lossy compression: the decompressed data is not identical to the original. During compression, some image or audio information that human eyes and ears are insensitive to is discarded, and the lost information cannot be recovered.
  • Lossless compression: the data before and after compression is exactly identical; the gain comes from optimizing how the data is arranged, etc.

DTS and PTS

  • DTS (Decoding Time Stamp): mainly used for video decoding; it is used during the decoding phase.
  • PTS (Presentation Time Stamp): used for video synchronization and output; it is used during rendering. When there are no B-frames, the output order of DTS and PTS is the same.

As shown above: decoding an I-frame does not depend on any other frame; decoding a P-frame depends on the I-frame or P-frame that precedes it; decoding a B-frame depends on the most recent I-frame or P-frame before it and the most recent P-frame after it.
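The following small illustration (hypothetical frame labels) shows why the two timestamps diverge only when B-frames are present:

```swift
// Display (PTS) order: I0  B1  B2  P3   (B1 and B2 reference both I0 and P3)
// Decode  (DTS) order: I0  P3  B1  B2   (P3 must be decoded before the B-frames)
let displayOrder = ["I0", "B1", "B2", "P3"]
let decodeOrder  = ["I0", "P3", "B1", "B2"]
// Without B-frames the two orders coincide, which is why DTS and PTS are then identical.
```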

2. Code stream structure of coded data

In everyday thinking, a picture is just an image, and a video is a collection of many pictures. But for audio and video programming we need a deeper understanding of what a video really is.

2.1. Revisiting the concept of an image

In the encoded bitstream, "image" is a collective concept: a frame, a top field, or a bottom field can all be called an image. A frame is usually a complete image.

  • Progressive scanning: each scan produces one image, that is, one frame. Progressive scanning is suitable for moving images.
  • Interlaced scanning: each scanned frame is divided into two parts, each called a "field", named the "top field" and "bottom field" according to their order. It is suitable for relatively static images.

2.2. Important parameters

  • Video Parameter Set (VPS)

The VPS is mainly used to carry video layering information, which helps with extending compliant standards in scalable video coding or multi-view video.

(1) It describes the overall structure of the coded video sequence, including temporal sublayer dependencies and so on. The main purpose of adding this structure to HEVC is to accommodate extensions of the standard across multiple sublayers of a system, handling cases such as future scalable or multi-view bitstreams that an original decoder can decode but which carry information the decoder may ignore.

(2) All sublayers of a given video sequence share one VPS, regardless of whether their SPSs differ. It mainly contains: syntax elements shared by multiple sublayers or operation points; key session information such as profile and level; and other operation-point-specific information that is not part of the SPS.

(3) In the bitstream produced by the encoder, the first NAL unit carries the VPS information.

  • SPS (Sequence Parameter Set)

Contains encoding parameters shared by all coded images in a CVS (coded video sequence).

(1) An HEVC bitstream may contain one or more coded video sequences, each starting at a random access point, namely an IDR/BLA/CRA picture. The sequence parameter set (SPS) contains the information required by all slices in that video sequence.

(2) The SPS can be roughly divided into the following parts: 1. a self-referencing ID; 2. decoding-related information, such as profile/level, resolution and number of sublayers; 3. function switches for a given profile and the parameters of those functions; 4. restrictions on the flexibility of coding structures and transform coefficients; 5. temporal scalability information; 6. VUI.

  • PPS (Picture Parameter Set)

Contains parameters shared by one image, that is, all slice segments (SS) in an image reference the same PPS.

(1) The PPS contains settings that may differ from frame to frame; its content is roughly similar to that of H.264, mainly including: 1. self-referencing information; 2. initial image control information, such as the initial QP; 3. block partitioning information.

(2) At the start of decoding, all PPSs are inactive, and at any moment during decoding at most one PPS can be active. When a part of the bitstream references a PPS, that PPS is activated and is called the active PPS, until another PPS is activated.

A parameter set carries information about the corresponding coded images. The SPS contains parameters for a continuous coded video sequence (the identifier seq_parameter_set_id, constraints on frame numbers and POC, the number of reference frames, the decoded image size, the frame/field coding mode selection flag, and so on). A PPS corresponds to one image or several images in a sequence, and its parameters include the identifier pic_parameter_set_id, an optional seq_parameter_set_id, the entropy coding mode selection flag, the number of slice groups, initial quantization parameters, and the deblocking filter coefficient adjustment flag.

Generally, the SPS and PPS are sent to the decoder before the slice headers and slice data are decoded. Each slice header references a pic_parameter_set_id; once a PPS is activated it remains valid until the next PPS is activated. Similarly, each PPS references a seq_parameter_set_id; once an SPS is activated it remains valid until the next SPS is activated. The parameter-set mechanism separates important, rarely changing sequence and picture parameters from the coded slices, and delivers them to the decoder ahead of the coded slices, or through some other mechanism.

Profile, Tier, Level

  • Profile: mainly specifies which coding tools or algorithms the encoder may use.

  • Level: limits key parameters (maximum sample rate, maximum image size, resolution, minimum compression ratio, maximum bit rate, decoded picture buffer (DPB) size, etc.) according to the load and storage capacity of the decoding terminal.

Considering that applications can also be distinguished by maximum bit rate and CPB size, some levels define two tiers: the Main tier for most applications and the High tier for the most demanding applications.
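On iOS the profile and level are usually not assembled by hand. As a sketch (assuming an existing VTCompressionSession named session), they can be selected like this:

```swift
import VideoToolbox

// Sketch: pick a profile/level pair for an existing compression session.
func selectProfileLevel(for session: VTCompressionSession) {
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_ProfileLevel,
                         value: kVTProfileLevel_H264_High_AutoLevel)
    // For HEVC, kVTProfileLevel_HEVC_Main_AutoLevel plays the analogous role.
}
```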

2.3. Raw codestream

  • IDR

The first image of a sequence is called an IDR image (Instantaneous Decoding Refresh image), and IDR images are always I-frames. IDR images exist so that decoding can be resynchronized: when the decoder reaches an IDR image, it immediately empties the reference frame queue, outputs or discards all data already decoded, searches the parameter sets again, and starts a new sequence. This provides an opportunity to resynchronize if a serious error occurred in the previous sequence. Images after an IDR image are never decoded using data from images before it.

  • Structure

The stream consists of one NALU after another, and functionally it is divided into two layers: the VCL (video coding layer) and the NAL (network abstraction layer).

The following figure takes the H.264 bitstream structure as an example; for H.265 a VPS appears before the SPS.

  • Composition

NALU (NAL Unit) = NALU Header + RBSP

Before transmission or storage, the encoded VCL data is mapped or encapsulated into NAL units (NALU, NAL Unit). Each NALU consists of a raw byte sequence payload (RBSP) and a NALU header that describes that payload. The basic structure of an RBSP is the raw encoded data followed by trailing bits: one bit "1" and several "0" bits for byte alignment.

2.3.1. H.264 stream

A raw H.264 NALU usually consists of [StartCode] [NALU Header] [NALU Payload].

  • StartCode: marks the start of a NALU and must be "00 00 00 01" or "00 00 01".

  • NALU Header: the following table shows the NAL header types.

For example, the following figure represents the specific bitstream information of IDR and non-IDR frames respectively:

In a NALU, the first byte (the NALU header) indicates the type of data it contains along with other information. Take a header byte of 0x67 as an example:

hexadecimal | binary
0x67        | 0 11 00111

As shown in the table, the header byte can be parsed into three parts:

1>. forbidden_zero_bit = 0: 1 bit, used to check whether errors occurred during transmission. 0 indicates normal; 1 indicates a syntax violation.

2>. nal_ref_idc = 3: 2 bits, indicating the priority of the NAL unit. Non-zero values mark reference fields/frames/pictures, while less important data is given zero. Among non-zero values, the larger the value, the more important the NALU.

3>. nal_unit_type = 7: the last five bits give the NALU type, as defined in the table above.

From the table we can see that NALU types 1-5 are video frames and the rest are non-video frames. During decoding, we only need to take the last 5 bits of the NALU header byte, that is, AND the header byte with 0x1F, to obtain the NALU type:

NALU type = NALU header byte & 0x1F

The header can be thought of as a key describing the type, and the payload as the value for that key.
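A minimal Swift sketch of this parsing, using the 0x67 example above (the function name is ours, not part of any framework):

```swift
// Split the one-byte H.264 NALU header into its three fields.
func parseH264NALUHeader(_ header: UInt8) -> (forbiddenZeroBit: UInt8, nalRefIdc: UInt8, nalUnitType: UInt8) {
    let forbiddenZeroBit = (header & 0x80) >> 7   // bit 7
    let nalRefIdc        = (header & 0x60) >> 5   // bits 6-5
    let nalUnitType      =  header & 0x1F         // bits 4-0
    return (forbiddenZeroBit, nalRefIdc, nalUnitType)
}

// Example: parseH264NALUHeader(0x67) yields (0, 3, 7), i.e. an SPS NALU.
```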

Streaming format

The H.264 standard specifies how video is encoded into individual packets, but not how those packets are stored or transported. Although the standard includes an annex (Annex B) describing one possible format, it is not mandatory. Two packaging approaches have emerged to address different storage and transport requirements: the Annex B format and the AVCC format.

  • Annex B

As noted above, the data inside a NALU does not contain its own size (length), so we cannot simply concatenate NALUs into a stream: the receiver would not know where one NALU ends and the next begins. The Annex B format solves this with a start code, adding a three- or four-byte start code, 0x000001 or 0x00000001, at the beginning of each NALU. By locating the start codes, the decoder can easily identify NALU boundaries. Of course, delimiting NALUs with a start code raises the problem that the NALU payload may contain the same byte sequence as the start code. To prevent this, when a NALU is built, an emulation prevention byte 0x03 is inserted into the sequences 0x000000, 0x000001, 0x000002 and 0x000003, transforming them as follows:

0x000000 → 0x00 00 03 00
0x000001 → 0x00 00 03 01
0x000002 → 0x00 00 03 02
0x000003 → 0x00 00 03 03

When the decoder detects 0x000003, it discards the 0x03 and restores the original data.

Because every NALU carries a start code, the decoder can start decoding from a random point in the stream, so this format is often used for real-time streaming. In this format the SPS and PPS are usually repeated periodically, often before each keyframe.
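A sketch of the corresponding de-escaping step on the decoder side (a simplified version that assumes the input is a single NALU payload):

```swift
// Remove emulation prevention bytes: a 0x03 that follows two zero bytes is dropped.
func stripEmulationPrevention(_ ebsp: [UInt8]) -> [UInt8] {
    var rbsp: [UInt8] = []
    var zeroCount = 0
    var i = 0
    while i < ebsp.count {
        let byte = ebsp[i]
        if zeroCount >= 2 && byte == 0x03 && i + 1 < ebsp.count && ebsp[i + 1] <= 0x03 {
            zeroCount = 0          // skip the emulation prevention byte itself
            i += 1
            continue
        }
        rbsp.append(byte)
        zeroCount = (byte == 0x00) ? zeroCount + 1 : 0
        i += 1
    }
    return rbsp
}
```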

  • AVCC

The AVCC format does not use a start code to delimit NALUs. Instead, it prefixes each NALU with its length in big-endian representation. The prefix can be 1, 2 or 4 bytes, so when parsing the AVCC format the prefix length must be stored in a header object, often called extradata or the sequence header. The SPS and PPS data are also stored in the extradata. The H.264 extradata layout is as follows:

bits  field                     remark
8     version                   always
8     avc profile               sps[0][1]
8     avc compatibility         sps[0][2]
8     avc level                 sps[0][3]
6     reserved                  all bits on
2     NALULengthSizeMinusOne
3     reserved                  all bits on
5     number of SPS NALUs       usually 1
16    SPS size
N     SPS NALU data             variable
8     number of PPS NALUs       usually 1
16    PPS size
N     PPS NALU data             variable

The last two bits of the fifth byte indicate the number of bytes used for the NAL size. Note that NALULengthSizeMinusOne is the NALU prefix length minus one; if the prefix length is 4, this value is 3. Another thing to note is that although the AVCC format does not use start codes, emulation prevention bytes are still present.

One advantage of the AVCC format is that the decoder configuration parameters are available right at the beginning, so the system can easily identify NALU boundaries without extra start codes, wasting no resources, and playback can jump to the middle of the video. This format is commonly used for multimedia data that must be randomly accessible, such as files stored on a hard disk.
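As a rough sketch of how the table above is consumed (assuming exactly one SPS and one PPS and doing no bounds checking), the extradata can be parsed like this:

```swift
import Foundation

// Pull the NALU length-prefix size and the first SPS/PPS out of H.264 avcC extradata.
func parseAVCCExtradata(_ extradata: Data) -> (naluLengthSize: Int, sps: [UInt8], pps: [UInt8]) {
    let bytes = [UInt8](extradata)
    let naluLengthSize = Int(bytes[4] & 0x03) + 1              // NALULengthSizeMinusOne + 1
    var offset = 6                                             // skip version .. number of SPS NALUs
    let spsLength = Int(bytes[offset]) << 8 | Int(bytes[offset + 1])
    offset += 2
    let sps = Array(bytes[offset ..< offset + spsLength])
    offset += spsLength + 1                                    // +1 skips "number of PPS NALUs"
    let ppsLength = Int(bytes[offset]) << 8 | Int(bytes[offset + 1])
    offset += 2
    let pps = Array(bytes[offset ..< offset + ppsLength])
    return (naluLengthSize, sps, pps)
}
```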

2.3.2. H.265 stream

HEVC stands for High Efficiency Video Coding (also known as H.265) and is a better video compression standard than H.264. HEVC excels at low-bit-rate video compression, improving quality, reducing size and saving bandwidth. The H.265 standard is built around H.264: it keeps some of the original techniques and improves others, and its coding structure is broadly similar to the H.264 architecture. Here we emphasize the differences between the two formats. Like H.264, H.265 is organized into NALUs. At the header level, H.264's NALU header is one byte while H.265's is two bytes. Again take a header of 0x4001 as an example:

hexadecimal | binary
0x4001      | 0 100000 000000 001

As shown in the table, the header information can be parsed into four parts:

  • forbidden_zero_bit = 0: 1 bit, same as in H.264, used to check whether errors occurred during transmission. 0 indicates normal; 1 indicates a syntax violation.
  • nal_unit_type = 32: 6 bits, specifying the NALU type
  • nuh_reserved_zero_6bits = 0: 6 bits, reserved for future extensions or 3D video coding
  • nuh_temporal_id_plus1 = 1: 3 bits, indicating the ID of the temporal layer this NAL unit belongs to

Compared with the H.264 header, H.265 removes nal_ref_idc, which is merged into nal_unit_type. The H.265 NALU types are as follows:

nal_unit_type  NALU type                        note
0              NAL_UNIT_CODED_SLICE_TRAIL_N     non-key frame
1              NAL_UNIT_CODED_SLICE_TRAIL_R
2              NAL_UNIT_CODED_SLICE_TSA_N
3              NAL_UNIT_CODED_SLICE_TSA_R
4              NAL_UNIT_CODED_SLICE_STSA_N
5              NAL_UNIT_CODED_SLICE_STSA_R
6              NAL_UNIT_CODED_SLICE_RADL_N
7              NAL_UNIT_CODED_SLICE_RADL_R
8              NAL_UNIT_CODED_SLICE_RASL_N
9              NAL_UNIT_CODED_SLICE_RASL_R
10 ~ 15        NAL_UNIT_RESERVED_X              reserved
16             NAL_UNIT_CODED_SLICE_BLA_W_LP    key frames
17             NAL_UNIT_CODED_SLICE_BLA_W_RADL
18             NAL_UNIT_CODED_SLICE_BLA_N_LP
19             NAL_UNIT_CODED_SLICE_IDR_W_RADL
20             NAL_UNIT_CODED_SLICE_IDR_N_LP
21             NAL_UNIT_CODED_SLICE_CRA
22 ~ 31        NAL_UNIT_RESERVED_X              reserved
32             NAL_UNIT_VPS                     VPS (Video Parameter Set)
33             NAL_UNIT_SPS                     SPS
34             NAL_UNIT_PPS                     PPS
35             NAL_UNIT_ACCESS_UNIT_DELIMITER
36             NAL_UNIT_EOS
37             NAL_UNIT_EOB
38             NAL_UNIT_FILLER_DATA
39             NAL_UNIT_SEI                     prefix SEI
40             NAL_UNIT_SEI_SUFFIX              suffix SEI
41 ~ 47        NAL_UNIT_RESERVED_X              reserved
48 ~ 63        NAL_UNIT_UNSPECIFIED_X           unspecified
64             NAL_UNIT_INVALID

The H.265 NALU type occupies bits 2 through 7 of the first header byte, so to determine the type of an H.265 NALU we AND the first byte with 0x7E and shift the result right by one bit, that is:

NALU type = (first byte of NALU header & 0x7E) >> 1
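The same operation as a small Swift helper (hypothetical name), with a few header bytes worked out:

```swift
// The H.265 NALU type lives in bits 2-7 of the first header byte.
func hevcNALUType(firstHeaderByte: UInt8) -> UInt8 {
    return (firstHeaderByte & 0x7E) >> 1
}

// Examples: 0x40 -> 32 (VPS), 0x42 -> 33 (SPS), 0x44 -> 34 (PPS), 0x26 -> 19 (IDR_W_RADL)
```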

Similar to H.264, H.265 bitstreams are packaged in two formats: the Annex B format, which uses start codes as boundaries, and the HVCC format, which prefixes each NALU with its length. In HVCC an extradata block is also required to hold the stream's codec parameters, with the following layout:

bits  field                                  remark
8     configurationVersion                   always 0x01
2     general_profile_space
1     general_tier_flag
5     general_profile_idc
32    general_profile_compatibility_flags
48    general_constraint_indicator_flags
8     general_level_idc
4     reserved                               '1111'b
12    min_spatial_segmentation_idc
6     reserved                               '111111'b
2     parallelismType
6     reserved                               '111111'b
2     chromaFormat
5     reserved                               '11111'b
3     bitDepthLumaMinus8
5     reserved                               '11111'b
3     bitDepthChromaMinus8
16    avgFrameRate
2     constantFrameRate
3     numTemporalLayers
1     temporalIdNested
2     lengthSizeMinusOne
8     numOfArrays

Repeated numOfArrays times (one array each for VPS/SPS/PPS):

bits  field                remark
1     array_completeness
1     reserved             '0'b
6     NAL_unit_type
16    numNalus
      repeated numNalus times:
16    nalUnitLength
N     NALU data

As the table above shows, the second half of the H.265 extradata is an array of entries with the same format. In addition to the SPS and PPS that H.264 also carries, an extra VPS is required.

VPS (Video Parameter Set), type 32 in H.265. The VPS describes the overall structure of the encoded video, including temporal sublayer dependencies and so on. Its main purpose is to remain compatible with extensions of the H.265 standard across multiple sublayers of a system.

3. Video codec in iOS

Layered video processing frameworks in iOS

3.1. Choice of framework

Across AVKit, AVFoundation and VideoToolbox, the functionality and customizability become progressively more powerful, but so does the difficulty of use, so you should choose the level of interface that matches your actual needs. In fact, even if you use AVFoundation or AVKit you still get hardware acceleration; what you lose is direct access to the hardware codec. With VideoToolbox you can access the hardware codec directly: convert H.264 files or transport streams into CMSampleBuffers and decode them into CVPixelBuffers, or encode uncompressed CVPixelBuffers into CMSampleBuffers:

3.2. Container of video data

When you run an AVCaptureSession, each captured frame is wrapped into a CMSampleBuffer object, and from this CMSampleBuffer you can get the uncompressed CVPixelBuffer object. If instead you read an H.264 file, you can take its data, generate a compressed CMBlockBuffer object and create a CMSampleBuffer for VideoToolbox to decode.

In other words, a CMSampleBuffer can serve as a container for either a CVPixelBuffer or a CMBlockBuffer: the CVPixelBuffer is the container for uncompressed image data, and the CMBlockBuffer is the container for compressed image data.
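A brief sketch of what that means in code (both accessors are standard Core Media calls; which one returns non-nil depends on whether the sample buffer holds uncompressed or compressed data):

```swift
import CoreMedia
import CoreVideo

// Inspect a CMSampleBuffer: it may wrap an uncompressed CVPixelBuffer or a compressed CMBlockBuffer.
func inspect(_ sampleBuffer: CMSampleBuffer) {
    if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
        // Uncompressed frame, e.g. straight from AVCaptureSession.
        print("uncompressed:", CVPixelBufferGetWidth(pixelBuffer), "x", CVPixelBufferGetHeight(pixelBuffer))
    }
    if let blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) {
        // Compressed frame, e.g. produced by VTCompressionSession or read from an H.264 file.
        print("compressed bytes:", CMBlockBufferGetDataLength(blockBuffer))
    }
}
```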

3.3. Raw stream of encoded data

NALU. An H.264 raw stream or file is composed of network abstraction layer units (NALUs), which carry both the image data and the parameters needed to process the image. It comes in two main packaging formats, Annex B and AVCC, also known as the elementary stream and MPEG-4 formats: Annex B NALUs begin with 0x000001 or 0x00000001, while AVCC NALUs begin with the length of the NALU.

3.4. Commonly used data structures in VideoToolBox

  • CMSampleBuffer: container data structure that holds video images both before and after codec operations
  • CVPixelBuffer: image data structure before encoding and after decoding
  • CMBlockBuffer: data structure for the encoded image
  • CMVideoFormatDescription: describes how the image is stored, the codec format, and so on
  • PixelBufferAttributes: a CFDictionary object that may contain the video's width and height, pixel format type (32RGBA, YCbCr420), whether it can be used with OpenGL ES, and more
  • CMTime: timestamp-related. Time is expressed as a 64-bit/32-bit rational: the numerator is a 64-bit time value and the denominator is a 32-bit timescale.

3.5. Adding a start code to the encoded data

Data encoded with a VTCompressionSession on iOS does not contain start codes; we must add them ourselves. Why add the start code? As explained above, it is needed to distinguish each NALU and to extract key information such as the VPS, SPS and PPS.

As shown in the figure below, we receive the encoded CMSampleBuffer in the encoding callback. (For H.265, the CMVideoFormatDesc also contains a VPS.) Before pushing the stream we must locate the key data (VPS, SPS, PPS) and insert start codes. Since a CMBlockBufferRef may hold one or more NALUs, we need to walk through its memory and replace the data at each NALU boundary with a start code. See the practical section for the concrete code.

Next, we need to process this frame's image data. Through CMSampleBufferGetDataBuffer and CMBlockBufferGetDataPointer we can obtain the memory address of the video data. VTCompressionSession encodes video frames in AVCC format, so we can read the first 4 bytes to get the current NALU's length. One thing to note is that AVCC uses big-endian byte order, which may differ from the host byte order; iOS is in fact little-endian, so we first convert this length to the little-endian order that iOS uses:

NALUnitLength = CFSwapInt32BigToHost(NALUnitLength)
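Putting these steps together, here is a hedged sketch of converting the AVCC output of the encoder callback into Annex B data (a 4-byte length prefix is assumed, and the parameter sets are not handled here):

```swift
import Foundation
import CoreMedia

// Walk the AVCC-formatted CMBlockBuffer and re-emit its NALUs with Annex B start codes.
func annexBData(from sampleBuffer: CMSampleBuffer) -> Data? {
    guard let blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) else { return nil }
    var totalLength = 0
    var dataPointer: UnsafeMutablePointer<Int8>?
    guard CMBlockBufferGetDataPointer(blockBuffer, atOffset: 0, lengthAtOffsetOut: nil,
                                      totalLengthOut: &totalLength,
                                      dataPointerOut: &dataPointer) == kCMBlockBufferNoErr,
          let base = dataPointer else { return nil }

    let startCode: [UInt8] = [0x00, 0x00, 0x00, 0x01]
    var output = Data()
    var offset = 0
    while offset + 4 < totalLength {
        // Read the 4-byte big-endian length prefix of the current NALU.
        var naluLength: UInt32 = 0
        memcpy(&naluLength, base + offset, 4)
        naluLength = CFSwapInt32BigToHost(naluLength)
        // Emit start code + NALU payload instead of the length prefix.
        output.append(contentsOf: startCode)
        let payload = UnsafeRawPointer(base + offset + 4).assumingMemoryBound(to: UInt8.self)
        output.append(UnsafeBufferPointer(start: payload, count: Int(naluLength)))
        offset += 4 + Int(naluLength)
    }
    return output
}
```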

Hard encoding is basically the reverse of hard decoding: parse the parameter sets SPS and PPS, prepend start codes and assemble them into NALUs; extract the video data, convert the length prefix into a start code, and form a NALU; then send the NALUs out.

3.6. Decode H.264 raw stream to CMSampleBuffer

CMSampleBuffer = CMTime + FormatDesc + CMBlockBuffer. These three pieces of information must be extracted from the H.264 stream and then combined into a CMSampleBuffer to feed the hard-decoding interface.

In the syntax of H.264 and H.265 there is a base layer called the network abstraction layer, NAL for short. Encoded data consists of a series of NAL units (NALUs).

The bitstream consists of NALUs. A NALU may contain:

  • A video frame, i.e. coded slice data, specifically an I-frame, P-frame or B-frame

  • The set of coding properties, FormatDesc (containing the VPS, SPS and PPS)

In the stream data, the set of attributes might look like this (for an H.265 bitstream, the VPS precedes the SPS):

After processing, in Format Description:

To convert the SPS and PPS from elementary stream data into a FormatDesc, you call the CMVideoFormatDescriptionCreateFromH264ParameterSets method (for H.265, the corresponding CMVideoFormatDescriptionCreateFromHEVCParameterSets method).

  • NALU header

For stream data, a NALU's header may be preceded by 0x00 00 01 or 0x00 00 00 01 (both occur; 0x00 00 01 is used as the example below). 0x00 00 01 is therefore called the start code.

Summarizing the above: the encoded bitstream consists of NALUs, which carry the video image data and the encoder's parameter information. The video image data becomes the CMBlockBuffer, and the encoder parameter information can be combined into the FormatDesc. Specifically, the parameter information comprises the VPS, the SPS (Sequence Parameter Set) and the PPS (Picture Parameter Set). The following figure shows an H.264 stream structure:

  • Extract SPS and PPS to generate FormatDesc

    • The start code of each NALU is 0x00 00 01. Locate the NALU according to the start code
    • SPS and PPS are identified by their type information and extracted: in the last 5 bits of the first byte after the start code, 7 indicates SPS and 8 indicates PPS
    • Use the CMVideoFormatDescriptionCreateFromH264ParameterSets function to construct the CMVideoFormatDescriptionRef
  • Extract video image data to generate CMBlockBuffer

    • Locate to NALU by start code
    • Once the type is determined to be image data, replace the start code with the NALU's length information (4 bytes)
    • Use the CMBlockBufferCreateWithMemoryBlock interface to construct a CMBlockBufferRef
  • Generate CMTime information as required. (In actual testing, adding the time information produced unstable images, while omitting it did not; this requires further study, so it is suggested not to add time information here.)

Based on the above, using the CMBlockBufferRef, the CMVideoFormatDescriptionRef and optional time information, call the CMSampleBufferCreate interface to obtain the CMSampleBuffer needed for decoding. The diagram of the H.264 data conversion is shown below.
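A condensed sketch of that assembly for H.264 (raw SPS/PPS bytes plus one NALU whose start code has already been replaced by a 4-byte length; timing is omitted, as suggested above, and error handling is minimal):

```swift
import CoreMedia

func makeSampleBuffer(sps: [UInt8], pps: [UInt8], lengthPrefixedNALU: [UInt8]) -> CMSampleBuffer? {
    // 1. FormatDesc from the parameter sets.
    var formatDesc: CMFormatDescription?
    let status = sps.withUnsafeBufferPointer { spsBuf in
        pps.withUnsafeBufferPointer { ppsBuf -> OSStatus in
            let pointers: [UnsafePointer<UInt8>] = [spsBuf.baseAddress!, ppsBuf.baseAddress!]
            let sizes = [sps.count, pps.count]
            return CMVideoFormatDescriptionCreateFromH264ParameterSets(
                allocator: kCFAllocatorDefault, parameterSetCount: 2,
                parameterSetPointers: pointers, parameterSetSizes: sizes,
                nalUnitHeaderLength: 4, formatDescriptionOut: &formatDesc)
        }
    }
    guard status == noErr, let desc = formatDesc else { return nil }

    // 2. CMBlockBuffer holding the compressed data (length prefix already in place).
    var blockBuffer: CMBlockBuffer?
    guard CMBlockBufferCreateWithMemoryBlock(
            allocator: kCFAllocatorDefault, memoryBlock: nil,
            blockLength: lengthPrefixedNALU.count, blockAllocator: nil,
            customBlockSource: nil, offsetToData: 0,
            dataLength: lengthPrefixedNALU.count, flags: 0,
            blockBufferOut: &blockBuffer) == kCMBlockBufferNoErr,
          let block = blockBuffer,
          CMBlockBufferReplaceDataBytes(with: lengthPrefixedNALU, blockBuffer: block,
                                        offsetIntoDestination: 0,
                                        dataLength: lengthPrefixedNALU.count) == kCMBlockBufferNoErr
    else { return nil }

    // 3. CMSampleBuffer = FormatDesc + CMBlockBuffer (+ optional timing, omitted here).
    var sampleBuffer: CMSampleBuffer?
    var sampleSize = lengthPrefixedNALU.count
    guard CMSampleBufferCreate(
            allocator: kCFAllocatorDefault, dataBuffer: block, dataReady: true,
            makeDataReadyCallback: nil, refcon: nil, formatDescription: desc,
            sampleCount: 1, sampleTimingEntryCount: 0, sampleTimingArray: nil,
            sampleSizeEntryCount: 1, sampleSizeArray: &sampleSize,
            sampleBufferOut: &sampleBuffer) == noErr else { return nil }
    return sampleBuffer
}
```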

3.7. Render CMSampleBuffer to the interface

There are two ways to display:

  • Hand the CMSampleBuffers to the system's AVSampleBufferDisplayLayer for direct display
    • It is used much like any other CALayer. The layer has built-in hardware decoding and displays the original (still compressed) CMSampleBuffer directly on the screen, which is very simple and convenient.
  • Use the VTDecompression interface to decode the CMSampleBuffer into an image yourself, then render the image onto a UIImageView or with OpenGL.
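A minimal sketch of the first approach (layer setup and attachment to a view are application-specific and only hinted at here):

```swift
import AVFoundation
import CoreMedia

let displayLayer = AVSampleBufferDisplayLayer()
displayLayer.videoGravity = .resizeAspect
// view.layer.addSublayer(displayLayer)   // attach it to your view hierarchy

func show(_ sampleBuffer: CMSampleBuffer) {
    // The layer decodes and displays the compressed CMSampleBuffer itself.
    if displayLayer.isReadyForMoreMediaData {
        displayLayer.enqueue(sampleBuffer)
    }
}
```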

Reference articles

Video stream format parsing

HM source code Analysis

VPS SPS PPS