Real-time video systems have strict latency requirements, so the video encoder must run in real time. Compared with the mainstream H.264, the new-generation standard AV1 improves rate-distortion performance at the cost of much higher complexity. Meanwhile, application devices are heavily fragmented and their computing capacity varies enormously. All of this challenges the adoption of new technologies in real-time systems. This sharing gives an in-depth analysis and explanation of some technical practices in the design of the Pano Venus real-time AV1 communication system. We hope to explore the future of real-time video technology together with you.

Article | Zhang Qi

Edited by | LiveVideoStack

My name is Zhang Qi, and my English name is Volvet. I have been engaged in video development for 20 years. I currently work at Pano, and previously worked at Cisco WebEx and NetEase. We are committed to providing PaaS cloud services for audio and video, and our core team comes mainly from Cisco WebEx.

Today’s talk consists of six parts. The first part briefly introduces the background of AV1, including the history of video codecs and the most basic encoder concepts. Part two introduces the Pano Venus product and why we built a real-time communication system on AV1. The third part analyzes the complexity of AV1; its complexity is very high, which makes bringing AV1 into a real-time system very challenging, so complexity analysis is the benchmark and prerequisite for designing the AV1 system. The fourth part introduces the design of the AV1 real-time system based on that complexity analysis. The fifth part covers AV1 scalability: scalable coding has long been a popular topic but has never been widely deployed, and with AV1 I think it will finally get its chance. In the last part, I will share Pano's follow-up plans around AV1.

1. Introduction

Video coding standards have nearly 40 years of history, and throughout that history two organizations have held a pivotal position: the International Telecommunication Union (ITU) and the Moving Picture Experts Group (MPEG).

The earliest video coding standards, H.120 and H.261, were developed by the ITU in the 1980s. The Moving Picture Experts Group then developed MPEG-1, best known as the coding standard behind VCD.

The two groups then joined forces, and the joint effort defined the video part of MPEG-2, also known as H.262. MPEG-2, best known as the DVD standard, is arguably one of the most successful video standards to date, no less so than H.264/AVC is today.

H.263 was later developed by the ITU and MPEG-4 by MPEG, and neither achieved the success of MPEG-2. The two organizations then came together again, as the Joint Video Team (JVT), to develop the H.264/AVC standard. H.264 was completed at the beginning of the 21st century, and nearly 20 years later it still holds an absolutely dominant position in industry applications.

H.265/HEVC was developed by the Joint Collaborative Team on Video Coding (JCT-VC) and finalized in 2013. Its technology is commendable, but confusing and costly patent licensing has prevented H.265 from shaking H.264's position in the industry. The Joint Video Experts Team (JVET) has since developed H.266/VVC, whose patent licensing also remains a concern; if H.266 can offer a friendlier licensing scheme, its future will be full of hope.

Another important standards body is the Alliance for Open Media (AOM), with Google as a leading member. The AV1 standard developed by AOM shows coding performance exceeding H.265/HEVC and offers a rich set of coding tools, which can greatly improve video compression and save a lot of bandwidth. At the same time, as the first-generation standard from AOM, AV1 not only enjoys very good ecosystem support but also comes with a royalty-free patent policy. Compared with H.265/H.266 and other standards whose intellectual-property policies remain unclear, this is a huge advantage; a clear patent policy is one of AV1's strengths in the industry.

Besides the three standards bodies mentioned above, AVS is also worth mentioning. AVS is China's own standards organization, and its standards have now evolved to AVS3. According to the data shared by AVS, AVS3 has excellent coding performance, but building up an AVS ecosystem remains very difficult.

Although coding standards have gone through several generations and the array of new coding tools can be dazzling, the basic framework of the encoder has not changed much.

From H.261 to H.266 and AV1, the design is still a hybrid coding framework whose core modules are block partitioning (16×16 up to 128×128), block-based prediction (intra and inter), block-based transform, quantization, and entropy coding. Developers of video applications do not need to understand the internal details of encoders and decoders, but they do need to understand a few of the most basic concepts, such as the key frame (I-frame).
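To make the hybrid framework concrete, here is a toy, purely illustrative sketch of the per-block pipeline (predict, transform, quantize, and the inverse path). It uses a floating-point DCT in place of the standards' integer transforms and omits entropy coding entirely; it does not reflect how any real encoder is implemented.

```python
import numpy as np
from scipy.fft import dctn, idctn  # 2-D DCT as a stand-in for the codecs' integer transforms

def encode_block(block, left_col, top_row, qstep=16):
    """Toy hybrid-coding pipeline for one block: predict, transform, quantize."""
    # 1. Prediction: simple DC intra prediction from neighbouring samples.
    pred = np.full_like(block, (left_col.mean() + top_row.mean()) / 2)
    residual = block - pred
    # 2. Transform the residual (real codecs use integer DCT/ADST approximations).
    coeffs = dctn(residual, norm="ortho")
    # 3. Quantize: this is where the lossy rate/distortion trade-off happens.
    q = np.round(coeffs / qstep).astype(int)
    # 4. Entropy coding of q (arithmetic coding in AV1) is omitted here.
    return pred, q

def decode_block(pred, q, qstep=16):
    """Inverse path: dequantize, inverse transform, add the prediction back."""
    return pred + idctn(q * qstep, norm="ortho")

rng = np.random.default_rng(0)
block = rng.integers(0, 255, (8, 8)).astype(float)
left_col = rng.integers(0, 255, (8, 1)).astype(float)   # reconstructed left neighbours
top_row = rng.integers(0, 255, (1, 8)).astype(float)    # reconstructed top neighbours
pred, q = encode_block(block, left_col, top_row)
recon = decode_block(pred, q)
print("max reconstruction error:", np.abs(recon - block).max())
```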

A so-called key frame uses only intra prediction for its blocks, which means it does not depend on any frame before or after it: as long as its own data is complete, it can be decoded correctly (this description ignores parameter sets such as SPS/PPS, which can be considered as important as key frames). This is why real-time systems use key-frame requests to recover when decoding errors occur.

Another concept to understand is the P-frame (B-frames are generally not used in real-time systems, so they are not discussed in this article). A P-frame uses forward prediction, so decoding it depends on one or more earlier frames. An encoder implementation can adopt a very flexible forward prediction structure: depending on the situation it can reference the most recent frame, pick a slightly older frame, or, with long-term reference techniques, even a very early frame. Temporal layering in an encoder is essentially a choice of reference structure, and a flexible forward reference structure can be combined with the actual usage scenario and the congestion control algorithm to derive many rich variations.
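As an illustration of the reference-structure point, below is a minimal sketch (a hypothetical helper, not Pano's implementation) of one common two-layer temporal pattern:

```python
def l1t2_references(num_frames=8):
    """One common L1T2 reference pattern (one spatial layer, two temporal layers):
    even frames are T0 and reference the previous T0 frame; odd frames are T1
    and reference the nearest preceding T0 frame. Frame 0 is the key frame.
    Real encoders choose references adaptively, so treat this as one example."""
    plan = []
    for i in range(num_frames):
        if i == 0:
            plan.append((i, "T0", None))      # key frame, no reference
        elif i % 2 == 0:
            plan.append((i, "T0", i - 2))     # base layer: previous T0 frame
        else:
            plan.append((i, "T1", i - 1))     # enhancement layer: nearest T0 frame
    return plan

for idx, layer, ref in l1t2_references():
    print(f"frame {idx}: {layer}, ref -> {ref}")
```

Dropping every T1 frame still leaves a decodable stream at half the frame rate, which is exactly the kind of flexibility congestion control can exploit.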

2. Pano Venus

Pano Venus is a real-time communication video engine based on AV1. Compared with H.264 it uses 40%~70% less bit rate on average, and it can run in real time on mainstream mobile phones. It is the first deployment of AV1 in a real-time system in China.

It is well known that although AV1 performs well, its complexity is very high, and applying it poses great technical challenges. So why did Pano choose AV1? I ran into the same kind of question when I was at Cisco pushing the HEVC project forward. A colleague asked me whether it was really necessary to do HEVC: the H.264-based system already worked very well, HEVC was very complex and might not run on many devices, and staying on the H.264 codebase while introducing more sophisticated algorithms and more advanced tool sets would deliver visible gains more easily than doing HEVC. I did not have a good answer at the time; I simply felt it was natural to embrace new standards and technologies. Now I would say that bringing a new standard to production is a challenging, long-term process. A new standard raises the performance ceiling of the encoder, and all of that performance headroom is an investment in the future.

Because of its high complexity, AV1's use in the real-time domain has been severely limited. In the early days of the AOM codec, encoding a single high-resolution frame could take several seconds, which was far from production-ready. In recent years, however, Google's AOM codec has gone through several iterations and improved performance across the board, which has made deploying AV1 much more feasible.

The development of Pano Venus faced two major challenges:

1. Low-end devices: there are many low-end devices on the market. IoT products, for example, are designed with cheap CPUs and limited memory for cost reasons.

2. Device overheating: although mobile-phone CPUs have become much more powerful, overheating and thermal throttling remain serious problems. When a user runs the codec on a phone, beautification and other advanced algorithms usually run alongside it and must share the same budget. Even when a phone is capable of running AV1, it may still heat up and clock down, hurting the final AV1 experience. We must therefore account for overheating during long sessions when deciding the conditions under which AV1 is enabled.

3. Complexity Analysis

The complexity increase of a new codec such as AV1 over an older one such as H.264 comes from at least two sources. On the one hand, AV1's coding tools are clearly more complex than H.264's at the level of the standard itself, though the exact degree needs to be tested and analyzed; for example, for intra coding H.264 has 4 prediction modes for 16×16 blocks and 9 for 4×4 blocks, while AV1 has 48 prediction modes, so you can imagine how much more expensive the optimal choice becomes. On the other hand, the largest block in H.264 is 16×16, while in AV1 it is 128×128, and AV1's partition tree is more complex than H.264's. Overall, the complexity of AV1 is far higher than that of H.264, so if a device already struggles to run H.264, AV1 will only be harder to run.

AV1 has several good open-source implementations that can serve as our benchmark, such as the AOM encoder and decoder and VideoLAN's dav1d.

In the figure, the AOM encoder is used for the analysis. The encoding parameters follow the real-time AV1 settings shared by AV1 team engineer Jerome at LVS: speed is set to 9, the fastest preset the AOM encoder offers at present, and the codebase is libaom 3.1.
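For reference, an encode along these lines can be reproduced from the command line; the sketch below simply wraps such a call in Python. The file names are placeholders, and the flag names follow the aomenc builds I am familiar with, so treat the exact options as an assumption rather than the precise test configuration.

```python
import subprocess

# Hypothetical input/output names; flags follow the libaom 3.x aomenc tool.
cmd = [
    "aomenc", "meeting_720p.y4m",
    "--rt",                  # realtime usage preset
    "--cpu-used=9",          # speed 9, the fastest realtime preset in libaom 3.1
    "--end-usage=cbr",
    "--target-bitrate=800",  # kbps, a conference-style rate
    "--lag-in-frames=0",     # no look-ahead, required for low delay
    "--threads=4",
    "-o", "meeting_720p_av1.ivf",
]
subprocess.run(cmd, check=True)
```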

First look at the decoder complexity test, which uses dav1d. My assumption is that the cost of the coding tool set is reflected in decoding complexity, and the two should be of the same order of magnitude. The test streams come from a meeting scenario, with the AV1 bit rate at about 60% of the H.264 bit rate; the H.264 streams are encoded with OpenH264 and the AV1 streams with the AOM encoder. At 720p, dav1d decodes AV1 at 230.94 fps, while OpenH264 decodes the same scene sequence at 644.57 fps. At 1080p, dav1d reaches 63.61 fps and OpenH264 160.54 fps. The data shows that AV1 decoding is nearly 3 times slower than H.264, so the coding tool set can be roughly estimated to be about 3 times as complex as H.264's. It also means that poorly performing devices cannot even support AV1 decoding.
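A rough way to reproduce such decode-speed numbers is to time the standalone decoders over a clip with a known frame count. The sketch below uses dav1d's command-line tool; the file name and frame count are placeholders, and the flags reflect the dav1d builds I am familiar with.

```python
import subprocess
import time

def wallclock_decode_fps(cmd, frames):
    """Return frames decoded per second of wall-clock time."""
    start = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    return frames / (time.monotonic() - start)

# dav1d decoding the AV1 meeting stream; the null muxer discards decoded output
# so disk I/O does not distort the measurement. (H.264 can be measured the same
# way with the OpenH264 console decoder.)
fps = wallclock_decode_fps(
    ["dav1d", "-i", "meeting_720p_av1.ivf", "--muxer", "null"], frames=300)
print(f"AV1 720p decode: {fps:.2f} fps")
```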

AOM and OpenH264 were also used to compare encoding complexity. At 720p, the AOM encoder runs at 64.59 fps on a PC versus 344.74 fps for OpenH264; at 1080p it is 19.75 fps versus 82.19 fps, so AV1 encodes roughly five times slower than H.264. The results show that the AOM encoder has already been optimized quite well in all respects, and on some devices it could be used for real-time encoding even without modification. But since the test was run on a PC and the AOM encoder is still about five times slower than H.264, getting it onto mobile would be very difficult. We therefore had to do better when developing our own Pano Venus coding engine.

These are the goals we set for ourselves when developing the AV1 encoder. Strictly speaking, they are three questions:

Beyond engineering optimization, the main way to gain speed is to prune the algorithms and the coding tool set, which inevitably sacrifices some RD performance while improving speed. The first question, then: is it possible to build a codec based on AV1 semantics with the same speed and RD performance as H.264?

If a speed requirement is imposed on the encoder, for example that its speed must stay within a certain ratio of H.264's, how good can the RD performance be under that constraint?

Can the RD performance be made scalable, for example ranging from 0 to 70% gain or higher, and how fast can the encoder be under each of those constraints?

If you are familiar with optimization theory, you can see that this looks very much like a constrained optimization problem: can an optimal solution be found under certain constraints? However, the choice of algorithms and coding tools is hard to capture with a simple mathematical model.

In the end we achieved satisfactory results: AV1 runs on mainstream devices, including iOS and Android.

4. System Design

The analysis above shows that bringing AV1 to production is very challenging. We have made many innovative performance improvements in Venus, but we have not yet brought it to the same level as H.264; AV1 simply carries a large complexity increase over H.264. Meanwhile, there is a huge number of devices on the market whose performance varies greatly, and some of them struggle even with H.264, let alone AV1. We therefore classified devices into tiers.

  • Lowest: the weakest devices, which cannot run video at all and fall back to audio only.

  • Old SDK (without AV1): devices running early SDK versions released before AV1 support, which only had H.264.

  • Low: performance similar to the tier above, but too weak even to decode AV1, so they stay on H.264.

  • Old SDK (with AV1 decoder): we shipped the AV1 decoder before releasing AV1 encoding, so these earlier SDK versions can already receive and decode AV1 sent by the new SDK; their device performance is adequate for AV1 decoding.

  • Medium: devices of medium capability that can encode and decode H.264 and additionally decode AV1.

  • High: high-performance devices that can both encode and decode AV1, so the whole codec path can use AV1.
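As a rough illustration (hypothetical type and function names, not Pano's actual SDK logic), such a tiering can be thought of as a capability table the engine consults when choosing a send codec:

```python
from enum import Enum

class Tier(Enum):
    LOWEST = "audio only"
    OLD_SDK = "old SDK, H.264 only"
    LOW = "H.264 only, cannot decode AV1"
    OLD_SDK_AV1_DEC = "old SDK with AV1 decoder"
    MEDIUM = "H.264 encode/decode + AV1 decode"
    HIGH = "AV1 encode/decode"

# Which tiers can decode AV1 and which can encode it.
CAN_DECODE_AV1 = {Tier.OLD_SDK_AV1_DEC, Tier.MEDIUM, Tier.HIGH}
CAN_ENCODE_AV1 = {Tier.HIGH}

def choose_send_codec(sender: Tier, receivers: list[Tier]) -> str:
    """Send AV1 only when the sender can encode it and every receiver can decode it;
    otherwise fall back to H.264 (or audio only for the weakest senders)."""
    if sender is Tier.LOWEST:
        return "audio only"
    if sender in CAN_ENCODE_AV1 and all(r in CAN_DECODE_AV1 for r in receivers):
        return "AV1"
    return "H.264"

print(choose_send_codec(Tier.HIGH, [Tier.MEDIUM, Tier.HIGH]))   # AV1
print(choose_send_codec(Tier.HIGH, [Tier.LOW, Tier.HIGH]))      # H.264
```

In practice an SFU can do better than this all-or-nothing choice, for example by combining simulcast or scalable streams so that AV1-capable receivers still benefit, but the sketch captures the basic tier-driven decision.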

With devices differentiated in this way, and on top of Venus's aggressive performance optimizations, Pano Venus supports AV1 on mainstream devices across the widest possible range of scenarios.

5. AV1 Scalability

AV1 describes a scalability structure with a name of the form [L|S][Number]T[Number][h][KEY][SHIFT].

[L|S]: either "L" or "S", indicating the type of spatial scalability. "L" means inter-layer prediction is used between spatial layers, i.e. an enhancement layer may predict from a lower spatial layer; "S" means there is no prediction between the spatial layers. Simulcast is a subset of the "S" structure, in which the spatial layers are completely independent.

[Number] is the number of spatial layers; for example, L2 and S2 both indicate two spatial layers.

[T] stands for temporal scalability.

[Number] indicates the number of temporal layers.

[h]: in common spatial scalability the resolution ratio between two adjacent layers is 1:2, for example a 180p base layer and a 360p enhancement layer. [h] changes this ratio from 1:2 to 1:1.5, for example a 180p base layer and a 270p enhancement layer.

[KEY]: controls how inter-layer prediction interacts with key frames. When different spatial layers use inter-layer prediction, the advantage is that the enhancement layer can reference information from the base layer, which helps compression; the disadvantage is that the base layer must be available at decoding time. [KEY] strikes a balance between the two: inter-layer prediction is used only in key frames. Key-frame coding therefore still gains, while for non-key frames a receiver of the enhancement layer does not need the base layer; only key frames require it.

[SHIFT]: relates to temporal layering and usually appears together with spatial layering. Take L2T2, which has two spatial layers and two temporal layers; normally the temporal layers of the two spatial layers line up, so the base layer's T0 coincides with the enhancement layer's T0, and likewise for T1. [SHIFT] changes this by interleaving the temporal layers of the different spatial layers. It may look strange, but there must be a reason behind a standardized technique, and it must pay off in some scenario. I personally think the gain is on the SFU server side. Suppose a source sends L2T2 and the subscribers split evenly: 25% subscribe to L0T0, 25% to L0T1, 25% to L1T0 and 25% to L1T1. Without [SHIFT], data must be forwarded to every subscriber at T0 instants, while at T1 instants half of the subscribers need nothing, so the server load is uneven: busy at some instants and idle at others. With [SHIFT] the temporal layers are interleaved, which evens out the amount of data the server has to forward at each instant.
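To make the load-balancing argument concrete, here is a small worked sketch. It is my own simplification: each subscriber group simply takes frames of its spatial layer up to its temporal layer, and base-layer dependencies of inter-layer prediction are ignored.

```python
# Equal subscriber shares (percent) for each (spatial layer, max temporal layer) target.
subs = {("S0", "T0"): 25, ("S0", "T1"): 25, ("S1", "T0"): 25, ("S1", "T1"): 25}

def frames_in_slot(slot, shift):
    """Frames produced in one time slot of an L2T2 stream."""
    base_t = "T0" if slot % 2 == 0 else "T1"
    enh_t = base_t if not shift else ("T1" if base_t == "T0" else "T0")
    return [("S0", base_t), ("S1", enh_t)]

def forwarded_share(slot, shift):
    """Percent of subscribers that receive at least one frame in this slot."""
    produced = frames_in_slot(slot, shift)
    return sum(share for (s, tmax), share in subs.items()
               if any(ps == s and pt <= tmax for ps, pt in produced))

for shift in (False, True):
    print("SHIFT" if shift else "no SHIFT",
          [forwarded_share(slot, shift) for slot in range(4)])
# no SHIFT -> [100, 50, 100, 50]  (every subscriber is served at T0 instants, half at T1)
# SHIFT    -> [75, 75, 75, 75]    (the forwarding load is spread evenly over time)
```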

As the introduction above shows, simulcast is only the simplest form of scalability. There is plenty of room to exploit the added flexibility of scalable coding to design schemes better matched to specific application scenarios, and WebRTC is implementing this technology as well.
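The naming scheme above can be decomposed mechanically. Below is a minimal sketch; the helper name is mine, and the mode strings follow the convention just described.

```python
import re

def parse_scalability_mode(mode: str) -> dict:
    """Decompose a scalability-mode name such as 'L2T2', 'S2T2', 'L2T2h',
    'L2T2_KEY' or 'L2T2_KEY_SHIFT' into its components."""
    m = re.fullmatch(r"([LS])(\d)T(\d)(h?)(_KEY)?(_SHIFT)?", mode)
    if m is None:
        raise ValueError(f"unrecognised scalability mode: {mode}")
    dep, spatial, temporal, half, key, shift = m.groups()
    return {
        "inter_layer_prediction": dep == "L",     # 'S' means independent spatial layers
        "spatial_layers": int(spatial),
        "temporal_layers": int(temporal),
        "adjacent_layer_ratio": "1:1.5" if half else "1:2",
        "key_frames_only_prediction": bool(key),  # _KEY: inter-layer prediction only at key frames
        "shifted_temporal_layers": bool(shift),   # _SHIFT: temporal layers interleaved across spatial layers
    }

print(parse_scalability_mode("L2T2_KEY_SHIFT"))
```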

Here are a few examples.

This is an example of L1T2: one spatial layer and two temporal layers. Frames marked 0 on the horizontal axis belong to T0, and frames marked 1 belong to T1.

This is an example of L2T2 (the L2T2h example looks exactly the same). Arrows point upward between the two spatial layers, indicating that the upper spatial enhancement layer references the base layer below it.

This is an example of S2T2, which has no arrows compared to L2T2, indicating that the two spatial layers are completely independent.

This is an example of L2T2 KEY: the inter-layer arrows exist only at key frames, and inter-layer prediction is not used for non-key frames.

This is also an example of an L2T2 Key, similar to the previous example.

This is an example of L2T2 KEY SHIFT, which is more complex: temporal-layer interleaving is introduced so that the enhancement layer's T0 frames and the base layer's T1 frames fall on the same time instants.

The following describes a few basic concepts used to signal and transport AV1 scalability.

The most important concept is the chain. Video coding uses inter-frame prediction, so decoding the current frame may depend on one or several earlier frames, which in turn depend on frames before them; a chain describes such a sequence of decoding dependencies.

A DTI (decode target) corresponds to one decodable output of the scalable stream, such as a particular resolution and frame rate. A single scalable stream can therefore expose several DTIs covering different resolutions or frame rates.

Switch indication: a frame can be marked as a switch point for a DTI, so a receiver can switch to that DTI at that frame rather than waiting for a key frame; T0 frames, for example, are typically part of the chain and serve as such switch points.

Discardable indication: some frames are not part of any chain and are not referenced by other frames, so they can be discarded without affecting anything else.
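Putting these concepts together, a frame of a scalable stream can be annotated roughly as follows. This is a loose, simplified model in the spirit of the AV1 RTP dependency descriptor, with hypothetical field names and an L1T2 example.

```python
from dataclasses import dataclass

# Two decode targets for an L1T2 stream: half frame rate and full frame rate.
DECODE_TARGETS = ("30fps", "60fps")

@dataclass
class FrameInfo:
    frame_id: int
    temporal_id: int
    dti: dict          # per-decode-target indication: "switch", "discardable",
                       # "required", or "not_present"
    in_chain: bool     # part of the chain that must arrive for decoding to continue

frames = [
    FrameInfo(0, 0, {"30fps": "switch", "60fps": "switch"}, in_chain=True),
    FrameInfo(1, 1, {"30fps": "not_present", "60fps": "discardable"}, in_chain=False),
    FrameInfo(2, 0, {"30fps": "switch", "60fps": "switch"}, in_chain=True),
    FrameInfo(3, 1, {"30fps": "not_present", "60fps": "discardable"}, in_chain=False),
]

def forward(frame: FrameInfo, subscribed: str) -> bool:
    """An SFU forwards a frame only if it is present in the subscribed decode target."""
    return frame.dti[subscribed] != "not_present"

print([f.frame_id for f in frames if forward(f, "30fps")])   # [0, 2]
print([f.frame_id for f in frames if forward(f, "60fps")])   # [0, 1, 2, 3]
```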

This is a 60 fps, two-layer sequence, which can be exposed as a variety of DTIs.

The first DTI needs only HD at 15 fps.

The second DTI requires SD at 30 fps.

The third DTI requires SD at 60 fps.

The fourth DTI uses all the frames: HD at 60 fps.
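Collecting the four decode targets just described into one table; the layer labels are my own, chosen to match the narration (S0 = SD, S1 = HD, and T0/T1/T2 for 15/30/60 fps).

```python
# Each decode target selects a spatial layer and a highest temporal layer.
decode_targets = {
    "HD 15fps": ("S1", "T0"),
    "SD 30fps": ("S0", "T1"),
    "SD 60fps": ("S0", "T2"),
    "HD 60fps": ("S1", "T2"),
}
# With inter-layer ("L") prediction, the HD targets would additionally require the
# SD frames they predict from; with "S"/simulcast layers they would not.
```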

6. Future Work

Pano Venus is officially available, but we expect more from it in the future:

  • A more flexible coding-complexity scaling scheme that covers more devices

  • Perceptually driven (subjective-quality) coding to further improve coding performance

  • Support for the desktop (screen-content) coding tool set, to better encode desktop sharing

  • A parallel coding framework that makes full use of multi-core processors to improve encoding efficiency

Although AV1 support was added to WebRTC quite some time ago, its algorithmic complexity is many times that of H.264, so real-time performance has remained a particular concern. We believe the deployment of Pano Venus can promote the adoption of AV1 in the RTC field and further advance the AV1 ecosystem. We firmly believe that technological innovation can open greater possibilities for the multimedia ecosystem.

These are the references, and most of the content of this article comes from these three references.

The above is the content of this sharing, thank you!

