This article is a translation of a session presented by Peikang, from Apple's Video Coding and Processing team, at WWDC 2021. The translator, Tao Jinliang, is a senior audio and video development engineer at NetEase Yunxin with years of end-to-end audio and video experience.

Support for low-latency encoding has become an important part of video application development and is widely used in low-latency live streaming and RTC. This session explains how VideoToolbox, a low-level framework that provides direct access to hardware encoders and decoders, offers video compression and decompression services, and converts between raster image formats stored in CoreVideo pixel buffers, supports low-latency H.264 hardware encoding to minimize end-to-end latency and reach new levels of performance, ultimately enabling optimal real-time communication and high-quality video playback.

Session video: developer.apple.com/videos/play…

Low-latency encoding is very important for many video applications, especially real-time video communication applications. In this talk, I will introduce a new low-latency encoding mode in VideoToolbox. The goal of this new mode is to optimize the existing encoder pipeline for real-time video communication applications. So what do real-time video communication applications need? First, we need to minimize end-to-end latency in communication.

We also want to improve interoperability by enabling video applications to communicate with more devices. The encoder pipeline should remain efficient when there are multiple receivers in a call, and the application needs to render the video with the best possible visual quality. Finally, we need a reliable mechanism to recover communication from errors introduced by network loss.

The low-latency video encoding I'm going to talk about today is optimized in all of these areas. With low-latency encoding, our real-time applications can reach new levels of performance. In this talk, I'll start with an overview of low-latency video encoding, so that we have a basic understanding of how low latency is achieved in the pipeline. Then I'll show how to use the VTCompressionSession API to build the pipeline and encode in low-latency mode. Finally, I'll discuss several features we've introduced in low-latency mode.

Low-latency video coding

First, let me give you an overview of low-latency video coding. This is a schematic of the video encoder pipeline on Apple platforms. VideoToolbox takes a CVImageBuffer as the input image and asks the video encoder to run a compression algorithm such as H.264 to reduce the size of the raw data. The compressed output is wrapped in a CMSampleBuffer and can be transmitted over the network for video communication. Note from the figure above that end-to-end latency is affected by two factors: processing time and network transmission time.

To minimize processing time, **low-latency mode eliminates frame reordering and follows a one-in, one-out encoding pattern.** In addition, the rate controller in this mode adapts to network changes faster, minimizing the delay caused by network congestion. With these two optimizations, we can already see a significant performance improvement over the default mode. **For 720p 30fps video, low-latency encoding can reduce latency by up to 100 ms.** These savings are critical for video conferencing.

By reducing latency in this way, we get a more efficient encoding pipeline for real-time communication such as video conferencing and live streaming.

In addition, low-latency mode always uses a hardware-accelerated video encoder to save battery life. Note that this mode supports the H.264 video codec and is available on both iOS and macOS.

Using low-latency mode in VideoToolbox

Next, I want to talk about how to use low-latency mode in VideoToolbox. I'll first review the use of VTCompressionSession and then show the steps required to enable low-latency encoding.

Using VTCompressionSession

When using VTCompressionSession, we first create a session with the VTCompressionSessionCreate API and then configure it with the VTSessionSetProperty API, for example to set the target bit rate. If no configuration is provided, the encoder runs with its default behavior.

Once the session has been created and configured correctly, we can pass a CVImageBuffer to the session by calling VTCompressionSessionEncodeFrame, and retrieve the encoding results from the output handler provided when the session was created.
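As a rough illustration, here is a minimal Swift sketch of that flow. The dimensions are placeholders, `pixelBuffer`, `pts`, and `duration` are assumed to come from your capture pipeline, and error handling is reduced to the bare minimum.

```swift
import VideoToolbox
import CoreMedia

// Minimal flow: create a session, then encode frames with a per-frame output handler.
func makeSession() -> VTCompressionSession? {
    var sessionOut: VTCompressionSession?
    let status = VTCompressionSessionCreate(
        allocator: kCFAllocatorDefault,
        width: 1280,                    // placeholder dimensions
        height: 720,
        codecType: kCMVideoCodecType_H264,
        encoderSpecification: nil,      // default behavior, no encoder specification
        imageBufferAttributes: nil,
        compressedDataAllocator: nil,
        outputCallback: nil,            // nil: we use the per-frame output handler instead
        refcon: nil,
        compressionSessionOut: &sessionOut)
    return status == noErr ? sessionOut : nil
}

// Encode one frame; the output handler receives the compressed CMSampleBuffer.
func encode(_ pixelBuffer: CVPixelBuffer, pts: CMTime, duration: CMTime,
            into session: VTCompressionSession) {
    let status = VTCompressionSessionEncodeFrame(
        session,
        imageBuffer: pixelBuffer,
        presentationTimeStamp: pts,
        duration: duration,
        frameProperties: nil,
        infoFlagsOut: nil) { encodeStatus, _, sampleBuffer in
            guard encodeStatus == noErr, let sampleBuffer = sampleBuffer else { return }
            // Hand the compressed sample buffer to the packetizer / network layer here.
            _ = sampleBuffer
        }
    assert(status == noErr)
}
```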

Enabling low-latency encoding in a compression session is easy; the only thing we need to change happens at session creation, as shown in the following steps and the code sketch after the list:

  • First, we need a CFMutableDictionary for the encoder specification, which is used to specify a particular video encoder that the session must use.
  • Then we need to set the EnableLowLatencyRateControl flag in the encoderSpecification.
  • Finally, we pass the encoderSpecification to VTCompressionSessionCreate, and the session will run in low-latency mode.
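Here is a minimal Swift sketch of those three steps. The dimensions are placeholders, and a Swift dictionary bridged to CFDictionary stands in for the CFMutableDictionary mentioned above.

```swift
import VideoToolbox

// 1. Encoder specification that requests the low-latency rate controller.
let encoderSpecification: [CFString: Any] = [
    // 2. The EnableLowLatencyRateControl flag.
    kVTVideoEncoderSpecification_EnableLowLatencyRateControl: true
]

// 3. Pass the specification to VTCompressionSessionCreate.
var sessionOut: VTCompressionSession?
let status = VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 1280,                        // placeholder dimensions
    height: 720,
    codecType: kCMVideoCodecType_H264,
    encoderSpecification: encoderSpecification as CFDictionary,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: nil,
    refcon: nil,
    compressionSessionOut: &sessionOut)

guard status == noErr, let session = sessionOut else {
    fatalError("VTCompressionSessionCreate failed: \(status)")
}
// `session` now runs in low-latency mode; the later snippets assume this variable.
```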

The configuration steps are as usual. For example, we can use the AverageBitRate property to set the target bitrate.
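That could look like this in Swift (the bit-rate value is a placeholder, and `session` is the low-latency session from the sketch above):

```swift
// Configure the target bit rate as usual.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_AverageBitRate,
                     value: NSNumber(value: 2_000_000))
```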

Ok, we've covered the basics of the VideoToolbox low-latency mode. Next, I'd like to continue with the new features in this mode that can further help us develop real-time video applications.

New features in VideoToolbox low-latency mode

So far, we've discussed the latency advantages of using low-latency mode; the remaining benefits come from the features I'll cover next.

The first feature is **new profiles**: we enhance interoperability by adding two new profiles to the pipeline. We'll also talk about **temporal layered SVC**, which is very useful in video conferencing. You can also use the **maximum frame quantization parameter (max QP)** for fine-grained control over image quality. Finally, we improve error resilience by adding support for **long-term reference frames (LTR)**.

New Profiles support

Let's talk about the new profile support. A profile defines a set of coding algorithms that the decoder must be able to support. It determines which tools are used for inter-frame compression during encoding (for example, whether B-frames, CABAC, or certain color spaces are supported). The higher the profile, the more advanced the compression features, and the higher the corresponding requirements on the codec hardware. In order to communicate with the receiver, the encoded bitstream should conform to a specific profile that the decoder supports.

With VideoToolbox, we support a range of profiles, such as Baseline Profile, Main Profile, and High Profile. Today, we add two new profiles to that list: Constrained Baseline Profile (CBP) and Constrained High Profile (CHP).

CBP is primarily intended for low-cost applications, while CHP has more advanced algorithms that achieve a better compression ratio. We should first check the decoder's capabilities to determine which profile to use.

To use CBP, simply set the ProfileLevel session property to ConstrainedBaseline_AutoLevel. Similarly, we can set the profile level to ConstrainedHigh_AutoLevel to use CHP.
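In Swift, this maps to the following property calls (a sketch continuing from the low-latency `session` created earlier; you would set one or the other, not both):

```swift
// Constrained Baseline Profile (CBP)
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_ProfileLevel,
                     value: kVTProfileLevel_H264_ConstrainedBaseline_AutoLevel)

// Constrained High Profile (CHP)
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_ProfileLevel,
                     value: kVTProfileLevel_H264_ConstrainedHigh_AutoLevel)
```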

Temporal layered SVC

Now let's talk about temporal layered SVC. We can use temporal layering to improve the efficiency of multi-party video calls.

Consider a simple three-way video conference as an example. In this scenario, receiver A has a low downlink bandwidth of 600 kbps, while receiver B has a higher downlink bandwidth of 1,000 kbps. Typically, the sender would need to encode two bitstreams to match each receiver's downlink bandwidth, which is not optimal.

This scenario can be handled more efficiently with temporal layered SVC: the sender only needs to encode one bitstream, whose output can then be divided into two layers.

Let’s see how this process works. This is a sequence of encoded video frames in which each frame uses the previous frame as a predictive reference.

We can pull half of the frames into another layer and change the references so that only frames in the original layer are used for prediction. The original layer is called the base layer, and the newly constructed layer is called the enhancement layer. The enhancement layer serves as a supplement to the base layer to improve the frame rate.

For receiver A, we can send only base layer frames, because the base layer is decodable by itself. More importantly, since the base layer contains only half of the frames, the transmitted data rate is low.

Receiver B, on the other hand, can enjoy smoother video because it has enough bandwidth to receive both base layer and enhancement layer frames.

Let's look at a video encoded with temporal layered SVC. I'm going to play two videos: one from the base layer only, and one from the base layer plus the enhancement layer. The base layer on its own plays fine, but we may notice that the video is not very smooth. Playing the second video, we can see the difference immediately: the video on the right has a higher frame rate than the one on the left because it contains both the base layer and the enhancement layer.

The video on the left uses 50% of the input frame rate and 60% of the target bit rate. These two videos only require the encoder to encode a single bitstream at a time, which is much more power efficient in a multi-party video conference.

Another benefit of temporal layering is error resilience. Since the frames in the enhancement layer are not used for prediction, no other frames depend on them. This means that if one or more enhancement layer frames are lost during network transmission, the other frames are unaffected, which makes the entire session more robust.

Enabling temporal layering is simple. In low-latency mode we have added a new session property called BaseLayerFrameRateFraction. We only need to set this property to 0.5, which means that half of the input frames are assigned to the base layer and the rest to the enhancement layer.
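A sketch of that single property call, continuing from the low-latency `session` created earlier:

```swift
// Assign half of the input frames to the base layer, the rest to the enhancement layer.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_BaseLayerFrameRateFraction,
                     value: NSNumber(value: 0.5))
```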

We can check the layer information in the sample buffer attachments. For base layer frames, CMSampleAttachmentKey_IsDependedOnByOthers is true; for enhancement layer frames it is false.
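For example, a small helper that inspects an encoded sample buffer could look like this (a sketch; only the attachment key comes from the text above, the helper itself is hypothetical):

```swift
import CoreMedia

// Returns true for base layer frames (other frames depend on them),
// false for enhancement layer frames.
func isBaseLayerFrame(_ sampleBuffer: CMSampleBuffer) -> Bool {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer,
                                                                    createIfNecessary: false)
            as? [[CFString: Any]],
          let first = attachments.first,
          let isDependedOn = first[kCMSampleAttachmentKey_IsDependedOnByOthers] as? Bool
    else { return false }
    return isDependedOn
}
```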

We can also choose to set the target bit rate for each layer. Remember that we use the session property AverageBitRate to configure the overall target bit rate. Once the target bit rate is configured, we can set the new BaseLayerBitRateFraction property to control the percentage of the target bit rate allocated to the base layer. If this property is not set, a default value of 0.6 is used. We recommend a base layer bit rate fraction in the range of 0.6 to 0.8.
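Sketched as another property call (0.7 is just an example value within the suggested range):

```swift
// Give the base layer 70% of the overall target bit rate.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_BaseLayerBitRateFraction,
                     value: NSNumber(value: 0.7))
```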

Maximum frame QP

Now, let's look at the maximum frame quantization parameter, or maximum frame QP. The frame QP is used to adjust image quality and data rate.

We can use a low frame QP to produce a high-quality image, but in that case the frame size will be large.

On the other hand, we can use a high frame QP to produce a low-quality but small frame.

In low-delay mode, the encoder adjusts the frame QP using factors such as image complexity, input frame rate, and video motion to produce the best visual quality under the current target bit rate constraint. So we encourage you to rely on the default behavior of the encoder to adjust the frame QP.

However, when a client has specific requirements for video quality, we can control the maximum frame QP the encoder is allowed to use. With a maximum frame QP set, the encoder always chooses a frame QP below this limit, so the client gains fine-grained control over image quality.

It is worth mentioning that normal rate control remains in effect even if the maximum frame QP is specified. If the encoder reaches the maximum frame QP limit but runs out of bit rate budget, it will start discarding frames to preserve the target bit rate.

One example of using this feature is delivering screen-content video over a poor network. The trade-off is to sacrifice frame rate in order to send sharp images of the screen content, which can be achieved by setting a maximum frame QP.

We can pass the maximum frame QP through the new session property MaxAllowedFrameQP. According to the standard, the maximum frame QP must be between 1 and 51.
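A sketch of the call (40 is an arbitrary example within the 1 to 51 range, and `session` is the low-latency session from earlier):

```swift
// Cap the frame QP at 40; the encoder will never pick a QP above this value.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_MaxAllowedFrameQP,
                     value: NSNumber(value: 40))
```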

Long Term Reference Frame (LTR)

Let's talk about the last feature we developed for low-latency mode: long-term reference frames. A long-term reference frame, or LTR, can be used for error resilience. Take a look at the figure showing the encoder, the sender client, and the receiver client in the pipeline.

Assume the video travels over a network with poor connectivity, and frames may be lost due to transmission errors. When the receiver client detects a frame loss, it can request a frame refresh to reset the session. When the encoder receives such a request, it usually encodes a key frame for refresh purposes, but key frames are typically quite large. A large key frame takes longer to reach the receiver, and since network conditions are already poor, it may exacerbate the congestion. So, can we refresh with a predicted frame instead of a key frame? The answer is yes, if we have frame acknowledgement. Let's see how it works.

First, we need to decide which frames to acknowledge. We call these frames long-term reference frames, or LTRs, and the encoder decides which frames become LTRs. When the sender client transmits an LTR frame, it also requests an acknowledgement from the receiver client. If the LTR frame is received successfully, an acknowledgement is returned. Once the sender client receives the acknowledgement and passes this information to the encoder, the encoder knows which LTR frames have been received by the other side.

Now consider the bad-network case again: when the encoder receives a refresh request, because it now has a set of acknowledged LTRs, it can encode a frame predicted from one of those acknowledged LTRs. A frame encoded in this way is called an LTR-P. An LTR-P is typically much smaller than a key frame and therefore easier to transmit.

Now, let's talk about the LTR APIs. Note that frame acknowledgement needs to be handled by the application layer, for example through mechanisms such as RPSI messages in the RTP Control Protocol (RTCP). Here we will focus only on how the encoder and the sender client communicate during this process. With low-latency encoding enabled, we can enable this feature by setting the EnableLTR session property.
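Enabling it is a single property call (a sketch, continuing from the low-latency `session`):

```swift
// Turn on long-term reference frame support.
VTSessionSetProperty(session,
                     key: kVTCompressionPropertyKey_EnableLTR,
                     value: kCFBooleanTrue)
```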

When an LTR frame is encoded, the encoder signals a unique frame token in the RequireLTRAcknowledgementToken sample attachment.
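For example, the token could be pulled out of the encoded sample buffer like this (a sketch; only the attachment key comes from the text above, the helper itself is hypothetical):

```swift
import CoreMedia
import VideoToolbox

// Extract the LTR acknowledgement token, if this encoded frame carries one.
func ltrToken(from sampleBuffer: CMSampleBuffer) -> NSNumber? {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer,
                                                                    createIfNecessary: false)
            as? [[CFString: Any]],
          let first = attachments.first
    else { return nil }
    return first[kVTSampleAttachmentKey_RequireLTRAcknowledgementToken] as? NSNumber
}
```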

The sender client is responsible for reporting acknowledged LTR frames to the encoder through the AcknowledgedLTRTokens frame property. Since multiple acknowledgements can arrive at once, we use an array to hold these frame tokens.
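A sketch of how the tokens might be passed along with the next encoded frame; the function and its parameters (pixel buffer, timing, output handler) are hypothetical stand-ins for your capture and encode path:

```swift
import VideoToolbox
import CoreMedia

// Encode the next frame while reporting acknowledged LTR tokens to the encoder.
func encodeReportingAcks(session: VTCompressionSession,
                         pixelBuffer: CVPixelBuffer,
                         pts: CMTime,
                         duration: CMTime,
                         acknowledgedTokens: [NSNumber],
                         outputHandler: @escaping VTCompressionOutputHandler) {
    let frameProperties: [CFString: Any] = [
        kVTEncodeFrameOptionKey_AcknowledgedLTRTokens: acknowledgedTokens
    ]
    VTCompressionSessionEncodeFrame(
        session,
        imageBuffer: pixelBuffer,
        presentationTimeStamp: pts,
        duration: duration,
        frameProperties: frameProperties as CFDictionary,
        infoFlagsOut: nil,
        outputHandler: outputHandler)
}
```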

We can request a frame refresh at any time through the ForceLTRRefresh frame property. Once the encoder receives this request, it encodes an LTR-P. If no acknowledged LTR is available, the encoder generates a key frame instead.
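Requesting the refresh follows the same pattern with a different frame property (same hypothetical parameters as in the previous sketch):

```swift
import VideoToolbox
import CoreMedia

// Ask the encoder to refresh from an acknowledged LTR
// (it falls back to a key frame if no acknowledged LTR is available).
func encodeWithLTRRefresh(session: VTCompressionSession,
                          pixelBuffer: CVPixelBuffer,
                          pts: CMTime,
                          duration: CMTime,
                          outputHandler: @escaping VTCompressionOutputHandler) {
    let frameProperties: [CFString: Any] = [
        kVTEncodeFrameOptionKey_ForceLTRRefresh: true
    ]
    VTCompressionSessionEncodeFrame(
        session,
        imageBuffer: pixelBuffer,
        presentationTimeStamp: pts,
        duration: duration,
        frameProperties: frameProperties as CFDictionary,
        infoFlagsOut: nil,
        outputHandler: outputHandler)
}
```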

Conclusion

The above is a translation of everything Peikang shared at WWDC 2021. If any part of the translation is inaccurate, please let us know and we will correct it.

At present, NetEase Yunxin has implemented SVC and long-term reference frame schemes for client-side software encoding, and an SVC scheme has also been implemented for server-side forwarding. SVC gives the streaming server an additional means of controlling the forwarding rate; combined with streams of different sizes and bit rates, and with client-side downlink bandwidth detection and congestion control, it helps deliver a better viewing experience. NetEase Yunxin continues to polish its products in pursuit of the best viewing experience, and we believe the techniques shared here will soon be put to good use in Yunxin's products.

Session video: developer.apple.com/videos/play…

For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.