The growing use of video conferencing in people's daily lives, and in particular the rapid growth of the video conferencing market driven by the COVID-19 pandemic, has led to continuous updates to Cisco Webex video technology. We are joined by Thomas Davies, Principal Engineer in Cisco's Collaboration Technology Group, who shares the evolution of AV1, the challenges Cisco faced in developing it, and the future of AV2 and its role in real-time communications.

Text / Thomas Davies

Organized by / LiveVideoStack

Hello, I’m Thomas Davies. I’m a principal engineer in Cisco’s Collaboration Technology group, and today I’d like to talk to you about AV1, Cisco Webex, and the next generation of video conferencing.

My talk today covers several topics. First, I want to talk about the recent explosion of video conferencing caused by COVID-19. Video conferencing applications have been commonplace for some time, but COVID-19 created a tipping point that changed the landscape of real-time communication. Then I want to give some historical background on the Alliance for Open Media and real-time communications: how we got to where AV1 is today, and which real-time communication factors we considered when we developed AV1. Then I want to talk about our AV1 codec on Cisco Webex and what we are doing for our rollout. Finally, I want to talk about the role of AV2 in real-time communication: can we do more in this area?

# 1. Development of video conferencing

The first topic is the explosive growth of video conferencing.

I think we've all had an extraordinary year, and from a conferencing point of view it opened a new chapter for us. From February of last year, the number of meetings on our platform increased dramatically through the end of the month, and by March we had a huge increase in traffic — 10-fold, 20-fold, 30-fold — to over 500 million meetings and over 25 billion meeting minutes per month. Another interesting factor is that we started having more team meetings, more educational meetings and so on, so meeting size increased by 33 percent, which changed our use case somewhat. The huge increase in traffic obviously affected us, because we had to support it, but it also underscored the need to scale. The feedback we get from our customers is that we need to take the technology to the next level. People now know it works, but we need to keep moving forward and keep improving the tools and technologies that shape their experience. In some cases we are adding artificial intelligence and new features such as background noise suppression or real-time voice-to-text transcription and translation, but when it comes to improving the quality of the user experience we cannot get around basic video and audio. That means reviewing our video processing pipelines, but it also means adopting new codecs — and we have been using the H264 codec for a very long time.

But the question we have to ask about our recent experience is: "Is this the new normal?" In one sense, clearly not, because COVID-19 is likely to be a once-in-a-lifetime event. But even before COVID-19, remote work was growing fairly steadily, up about 30% over the past decade. What we are seeing is people increasingly using video apps and video calling where they might have used voice before. When the outbreak began, many companies started looking at the way their teams work and considering what would happen once it was over. Seventy-four percent of U.S. CFOs predict that a large amount of remote work will continue after COVID-19 is over. Whether that prediction is correct remains to be seen, but I think many companies will fundamentally change the way they work. The video conferencing market is expected to grow by 11% annually, more than doubling over the next seven years, and free use will also increase. In many cases video is replacing audio: people often make video calls simply because their devices come configured with video calling apps, which changes the way people interact with and use those devices. COVID-19 will not last, but I think it will have a lasting impact on technology and will change the way people work for a long time to come.

# 2. AOM and real-time communication

We have seen this before. At the founding of the Alliance for Open Media there was also a sort of turning point for video conferencing: usage was growing, and software-based platforms were increasing, which demanded better technical capabilities. We wanted to move forward with other solutions, and the Alliance for Open Media gave us an opportunity to move forward on codecs.

Cisco is a founding member of the Alliance for Open Media. We felt that the existing standards did not serve open media well and that licensing fees were a barrier — especially for H265, which was developed as the successor but whose licensing model does not fit the software platforms we run, which may have millions of users. At the same time we needed a next-generation video codec, since H264 had been in use for almost 20 years, so we developed the "Thor" codec for RTC. Thor shows how careful we are about balancing complexity against compression performance, and it was integrated into the first AV1 test model, which was based on VP9. From the beginning we were focused on the real-time communication aspects of the new standard, to understand the impact of each tool on our use cases.

But what does that really mean? We identified three main requirements for a new video codec for real-time communication. The first is lower complexity than other use cases require, because we particularly want to run in software on commodity PC platforms — not because we don't use hardware, but because it takes years for hardware support to appear, and then more years to develop good hardware encoders. The second is network resilience: we need standards-based tools to detect errors, repair them, and help us recover from them. The third, which may be more controversial, is that we want to limit the number of profiles in the standard, at least from the point of view of tools. Every new profile is effectively a new codec we must interoperate with, and we would rather interoperate with one new codec than with several. We have seen relatively limited adoption of H264 High Profile for this reason, and limited use of scalability because it sits in separate profiles in H264 and H265.

In terms of complexity, compared with some other use cases we pay a price in quality to get faster running speed. The red circle on the quality-versus-speed curve marks where we want the codec to operate: we have to be fast, and our complexity budget is hard, which means it is not enough merely to average 30 frames per second — we have to meet the time budget for every frame, all the time. One of our goals, and I think this is a feature of a well-designed codec standard, is to achieve real gains even at similar complexity, which requires faster operating points. In a VoD scenario you might instead move to a higher-complexity operating point to obtain those gains — in VoD you might tolerate 5 or 10 times more complexity to reduce the bit rate by 40 percent — but because we have to replace the previous standard in software, the envelope we have for added complexity is limited. So we need to achieve real gains at similar complexity: we can tolerate a modest increase in complexity, but not a very large one.
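
To make the per-frame constraint concrete, here is a minimal sketch — hypothetical, in no way the Webex implementation — of a real-time pacing loop: every frame must finish inside its frame interval, and a single overrun immediately moves the encoder to a faster, lower-complexity operating point. The `encode_frame` call and the 0–10 speed-preset scale are illustrative assumptions.

```cpp
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;
    const auto frame_budget = std::chrono::milliseconds(33);  // ~30 fps
    int speed_preset = 5;  // hypothetical scale: 0 (slowest/best) .. 10 (fastest)

    for (int frame = 0; frame < 300; ++frame) {
        const auto start = clock::now();
        // encode_frame(frame, speed_preset);  // placeholder for the real encode
        const auto elapsed = clock::now() - start;

        // Real time means *every* frame makes its deadline, not the average:
        // a single overrun forces a faster (lower-complexity) operating point.
        if (elapsed > frame_budget && speed_preset < 10) {
            ++speed_preset;
            std::printf("frame %d overran, speed preset -> %d\n",
                        frame, speed_preset);
        }
    }
    return 0;
}
```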

AV1 meets these needs to a large extent. First, AV1's tools can deliver the huge gains required of any new standard, such as the screen content tools and powerful loop filtering, while keeping complexity moderate for the core tools we need for good video coding. The loop filters have a certain complexity, but the multi-symbol arithmetic coding is less complex than similar techniques in other standards, the interpolation filters are very simple, and the transforms have fast decompositions. In terms of network resilience, we can detect errors through frame numbers, so we can tell whether we are in sync with the reference frames. Even when a frame is lost we can still parse the following frames, and we can conceal the missing frame, because we can recover information such as motion vectors even without the reference frame and use it for interpolation. We also have scalability in the standard. This relates to the third point: there is only one main profile. There are profiles based on chroma sampling, such as 4:4:4, but all the tools, including scalability, are in the main profile. That is very useful if you want to build an encoder, because it gives you the complete toolkit to explore, and you are not constrained by decisions people made when the standard was profiled, where a simpler tool might have been assumed to be the best choice when it actually is not.
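
As a simplified illustration of the frame-number mechanism described above — a hypothetical receiver-side check, not the actual AV1 reference-management rules, which are richer than a simple counter — a receiver can compare each arriving frame number against the last one decoded to detect that it has fallen out of sync:

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>

struct ReceiverState {
    std::optional<uint32_t> last_frame_number;

    // Returns false when a gap in the frame numbers shows we lost a frame
    // and are no longer in sync with the sender's reference state.
    bool on_frame(uint32_t frame_number) {
        const bool in_sync =
            !last_frame_number || frame_number == *last_frame_number + 1;
        last_frame_number = frame_number;
        return in_sync;
    }
};

int main() {
    ReceiverState rx;
    for (uint32_t n : {0u, 1u, 2u, 4u}) {  // frame 3 never arrives
        if (!rx.on_frame(n))
            std::printf("gap before frame %u: conceal it and request recovery\n", n);
    }
    return 0;
}
```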

# 3. AV1 development for Cisco Webex

During AV1's development, we were also developing our own encoder for Cisco Webex, built specifically for the standard and implemented as software running on PC hardware.

We presented the world's first live AV1 HD video encoding: 720p for camera video and 1080p for screen content, demonstrated in New York in the summer of 2019. Since then our encoder speed has increased by around 60%, and we have been working hard to give AV1 all the integration and system support needed for an end-to-end solution.

So what were our concerns? We had to choose between shared video content and camera input, and we decided to start with shared content, because it represents some of the most challenging video we have to code. Some of it is very simple, like this slide, but people increasingly share all kinds of material: charts, slides, YouTube videos, or mixed content such as video playing inside a browser — content that was never designed for this kind of use. So the fidelity requirements vary widely: very colorful material that may have a very low frame rate but a very high resolution, and high-speed motion scenes, and we have an adaptive system to handle those kinds of motion and content.

We needed to integrate AV1 in phases. In our first phase, we introduced AV1 to cover high-motion content sharing — high-motion video is the most difficult, because by nature it can be anything — and we launched this mode in our product in February. Our next phase covers the high-resolution mode and automatic adaptation, which we aim to complete in the first half of this year. Future phases will include camera video and transcoding, which matters for interworking with H264 attendees in a meeting. We currently run a backwards-compatibility mode: if a meeting contains both AV1 attendees and H264-only attendees, everyone falls back to H264. That is obviously very inefficient, so we hope to add selective transcoding for this case. It does not necessarily mean we will keep that practice forever, and it may not be the most efficient approach, but it will increase adoption and provide the greatest benefit to the greatest number of attendees.
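
A toy sketch of the fallback rule just described — illustrative names and logic, not the actual Webex control plane: if any attendee supports only H264 and no transcoding is available, every sender falls back to H264.

```cpp
#include <cstdio>
#include <vector>

enum class Codec { H264, AV1 };

struct Attendee { bool supports_av1; };

// If any attendee supports only H264 and the service cannot transcode for
// them, every sender falls back to H264; otherwise senders can use AV1.
Codec negotiate(const std::vector<Attendee>& attendees, bool can_transcode) {
    for (const Attendee& a : attendees)
        if (!a.supports_av1 && !can_transcode)
            return Codec::H264;
    return Codec::AV1;
}

int main() {
    const std::vector<Attendee> meeting = {{true}, {true}, {false}};
    const Codec c = negotiate(meeting, /*can_transcode=*/false);
    std::printf("senders use %s\n", c == Codec::AV1 ? "AV1" : "H264");
    return 0;
}
```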

What challenges did we encounter in encoder development? I think the biggest is achieving AV1's gains while keeping its CPU impact very small compared with H264. That does not mean no impact at all, and it does not mean we cannot use more CPU when more is available, but it does mean that in some cases we need to use the least possible CPU and still achieve a gain, which is very challenging from an encoder optimization point of view. The second challenge is more of a solution problem: how to balance quality and bit rate. I think COVID-19 has shifted the focus somewhat: we do need to deliver better quality, and bit rate is not always the most important thing, but there is usually some tradeoff between the two — and if you want to hold quality up even when the bit rate drops to very low levels, you can use AV1. As I mentioned earlier, we also have to support backwards-compatible behavior in interworking scenarios. More generally, we have a multidimensional problem: based on the CPU power of whatever device we are on, we adjust the encoder complexity settings, the resolution, and the bit rate. We can reduce or increase complexity by changing encoder settings, with more or less quality loss; we can change the encoding resolution; or we can change the bit rate — and both of the latter also change complexity. There are different tradeoffs involved, and we need a good engine to make these decisions.
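
A minimal sketch of that kind of decision engine — the thresholds, preset scale, and ladder below are hypothetical placeholders, not Webex's actual policy: measured CPU load drives a joint choice of speed preset, resolution, and bit rate.

```cpp
#include <cstdio>

struct EncoderConfig {
    int speed_preset;   // hypothetical 0 (slow/best) .. 10 (fast/worst)
    int height;         // encoded resolution in lines
    int bitrate_kbps;   // target bit rate
};

// Map measured CPU load (0..1) to a joint operating point. The ladder is a
// placeholder: a real engine would also weigh content type and network state.
EncoderConfig adapt(double cpu_load) {
    if (cpu_load > 0.9) return {10, 360, 400};   // survival mode
    if (cpu_load > 0.7) return {8, 540, 800};    // shed complexity first
    if (cpu_load > 0.5) return {6, 720, 1500};
    return {4, 1080, 2500};                      // plenty of headroom
}

int main() {
    for (double load : {0.3, 0.6, 0.8, 0.95}) {
        const EncoderConfig c = adapt(load);
        std::printf("load %.2f -> preset %d, %dp, %d kbps\n",
                    load, c.speed_preset, c.height, c.bitrate_kbps);
    }
    return 0;
}
```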

As we move forward, we are starting to offer more hybrid conferencing, and as I mentioned earlier, combining multi-stream conferencing with multiple codecs is a problem. If you send multiple layers at different qualities, where should the new codec sit? You can put it at the lowest bit-rate layer and really make sure that layer works, or you can aim for the highest-quality layer and make it even better. There is a similar problem on the decoding side: you may have multiple decoders running different codec standards, and you have to integrate and manage them within the CPU envelope. These are quite difficult technical challenges in providing a solution.
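
The two placement options just mentioned can be written down as simple policies. This is purely illustrative — hypothetical types and policy names, not a Webex design: assign the new codec either to the lowest layer, to protect very low bit rates, or to the top layer, to maximize peak quality.

```cpp
#include <cstdio>

enum class Codec { H264, AV1 };

// Policy A (protect_low_end): AV1 takes the lowest layer, where its
// efficiency keeps the stream usable at very low bit rates.
// Policy B: AV1 takes the top layer, maximizing peak quality.
Codec codec_for_layer(int layer, int num_layers, bool protect_low_end) {
    if (protect_low_end)
        return layer == 0 ? Codec::AV1 : Codec::H264;
    return layer == num_layers - 1 ? Codec::AV1 : Codec::H264;
}

int main() {
    for (int l = 0; l < 3; ++l)
        std::printf("layer %d -> %s\n", l,
                    codec_for_layer(l, 3, /*protect_low_end=*/true) == Codec::AV1
                        ? "AV1" : "H264");
    return 0;
}
```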

# 4. AV2 and RTC

What about AV2 and next-generation codecs? Where do we see real-time communication going?

In a sense, our requirements have not changed since AV1, so consider the same quality-versus-speed tradeoffs. We want to operate at the same or only slightly higher complexity than with AV1 and still achieve real gains, and one of our goals in testing AV2 was to show evidence that we will achieve those gains. That is very difficult, because nobody develops a complete real-time encoder for the standard as it is being developed. You can examine these questions as you go along and try to understand them, but you will not have a fully optimized solution at every operating point. AV2 is unique, though, in having a software implementation working group, which will hopefully give us insight into implementation issues — perhaps not at real-time communication speeds, but certainly faster than the maximum-compression settings encoders can offer. Even for video on demand, I do not think codecs can keep absorbing the large complexity increases of the past. Ideally AV2 would seek only a modest increase, perhaps beyond what we would tolerate for real-time communication — five to ten times is perhaps a reasonable goal — but still much lower than previous standards. If the AV1 and AV2 curves do not overlap, then I think we have done a good job; but how do we make sure they do not overlap, so that we gain in both speed and quality? (As the curves show, if we can reduce complexity, we can spend that budget on quality: if the blue curve moves right or up, you get more room to improve quality at the same speed, or speed at the same quality.)

I think we need to keep some principles in mind. The first is that a new software encoder cannot rely on having more CPU. That may seem surprising, because we expect the next generation of codecs to ride Moore's law, and in part that is right; but people keep their computers longer and longer, and single-core performance has not improved the way it used to, even though core counts keep increasing. It is also important to remember that other programs on our devices are using the CPU too. One of the challenges of an application like Webex is that we share the CPU with other running applications that may use a lot of computing power, and we have to accommodate that without causing them problems, so I think we need to be very careful about how much CPU we assume is available. At the same time we want to achieve huge gains, ideally another 50% bit-rate reduction. To encode AV2 in software at low complexity on these ordinary computers, I think we need something called "scalable complexity": we want a path through the standard that still allows simple encoders, ideally even simpler than for previous standards, which means AV1's core tools should maintain or reduce their complexity over time. That is difficult, because the way these tools get improved can make it hard to predict how they will be optimized in real encoders, but the payoff is very large, because reducing the complexity of all the core tools translates into real quality improvements at real-time communication speeds.

Now, as you add more modes, you increase the number of choices throughout the encoding process, so the reference implementation slows down, which is obviously not good for any encoder. But it is not a complete disaster, because a smart encoder can avoid part of that complexity. What you cannot avoid is the increased complexity of the tools themselves — and you do not want to avoid it, because the tools are very useful. It also means that pre-analysis and machine learning will become more and more important for managing encoder complexity, because we do not have time to evaluate every mode exhaustively. We have to save time and effort, so more and more work will go into pruning the full search.
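
As a minimal illustration of that kind of pruning — a made-up cost model and threshold, not any particular encoder's logic — a mode search can stop early once a candidate's rate-distortion cost drops below a confidence threshold supplied by pre-analysis or a learned model:

```cpp
#include <cstdio>
#include <vector>

struct Mode {
    int id;
    double rd_cost;  // rate-distortion cost; lower is better
};

// Scan candidates in order and stop as soon as the best cost so far beats
// the early-exit threshold, instead of exhaustively evaluating every mode.
int pick_mode(const std::vector<Mode>& candidates, double early_exit_cost) {
    int best = candidates.front().id;
    double best_cost = candidates.front().rd_cost;
    for (const Mode& m : candidates) {
        if (m.rd_cost < best_cost) {
            best = m.id;
            best_cost = m.rd_cost;
        }
        if (best_cost < early_exit_cost)
            return best;  // good enough: skip the remaining candidates
    }
    return best;
}

int main() {
    const std::vector<Mode> modes = {{0, 9.1}, {1, 4.2}, {2, 3.9}, {3, 5.5}};
    std::printf("chosen mode: %d\n", pick_mode(modes, /*early_exit_cost=*/4.5));
    return 0;
}
```

With the threshold of 4.5, the search returns mode 1 (cost 4.2) and never evaluates mode 2 (cost 3.9): that small quality loss is the price of skipping the rest of the search.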

# Summary

In short, the Alliance for Open Media ecosystem has helped us take the next step in video conferencing technology by helping us move beyond the old H264 video codec, and we are now shipping AV1 in real time at a complexity similar to that of H264/AVC while realizing significant gains. I think that shows AV1 is a well-designed standard. It is not perfect, but its core is very useful for real-time communication applications, and that is also our expectation for AV2: with the same good design principles and intelligent encoders to match, AV2 will be able to achieve further gains. Finally, I want to say thank you very much, and I welcome your questions.

Thank you!