Video engines are mostly applied in real-time communication scenarios that require low latency, such as video conferencing and interactive entertainment live streaming. For this session, we invited Han Qingrui, an engineer at NetEase Yunxin, to share the characteristics and deployment practices of NetEase Yunxin's video engine technology.

By Han Qingrui

Organized by LiveVideoStack

Hello everyone, I am Han Qingrui, an engineer at NetEase Yunxin. The topic I am sharing today is an introduction to NetEase Yunxin's video engine technology.

This talk is divided into three parts. The first part introduces the NetEase Yunxin video engine, the second part covers the key technologies of the video engine, and the third part presents deployment cases from Yunxin's customers.

1. Introduction to Yunxin Video Engine

Let's start with the first part, an introduction to the NetEase Yunxin video engine.

First I want to introduce the application scenarios of the video engine. Yunxin's video engine is mainly used in real-time video communication scenarios such as video conferencing, interactive live streaming, interactive teaching, and online medical consultation. The common requirement across these scenarios is low latency.

This is a network deployment diagram of the whole Yunxin video engine service. The video engine sits mainly on the terminal side, covering all kinds of PCs and Macs as well as a wide variety of phones and tablets.

This is the architecture of the video engine. As the figure shows, Yunxin's video engine has six modules: video pre-processing, video encoding and compression, video QoE for coping with varying network conditions, video decoding, and video post-processing. To keep the Yunxin video service running efficiently, there is also a video control module. I will walk through the key technologies of each module one by one.
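To make the data flow concrete, here is a minimal sketch of how such a six-module pipeline could be wired together; the module names, call signatures, and default parameters are illustrative assumptions, not Yunxin's actual interfaces.

```python
# A minimal sketch of the six-module layout described above; everything here is
# an illustrative assumption, not Yunxin's real API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VideoEngine:
    # Sender path: capture -> pre-processing -> encoding, steered by QoE feedback.
    preprocess: Callable = lambda frame: frame          # denoise, AI enhance, saliency
    encode: Callable = lambda frame, params: frame      # NE264 / NE265 / NEVC
    # Receiver path: decoding -> post-processing (super-resolution, text sharpening).
    decode: Callable = lambda packets: packets
    postprocess: Callable = lambda frame: frame
    # QoE adapts bitrate/resolution to the network; control picks algorithms per device.
    qoe_params: dict = field(default_factory=lambda: {"bitrate_kbps": 800,
                                                      "resolution": (1280, 720)})

    def send(self, frame):
        return self.encode(self.preprocess(frame), self.qoe_params)

    def receive(self, packets):
        return self.postprocess(self.decode(packets))

engine = VideoEngine()
print(engine.receive(engine.send("frame-0")))           # the frame passes through the chain
```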

2. Key technologies of Yunxin video engine

Now let's move on to the second part, where I will introduce in more detail some key technologies in each module of the Yunxin video engine.

After video capture, we first perform video pre-processing. Yunxin's pre-processing serves two purposes: improving the end-to-end video quality in real-time communication, and guiding the encoder to compress more efficiently. Our pre-processing currently has three main technical components: video noise reduction, video AI enhancement, and video saliency detection.

Let's first look at AI enhancement for video. Video enhancement has been studied and used for many years, but applying it in RTC scenarios brings real difficulties. First, RTC endpoints are mostly mobile devices, and the many phone models vary widely in performance and are sensitive to power consumption. So overly complex algorithms are out of the question, and for deep learning that essentially means small models; but small models have weaker learning capacity and are prone to overfitting during training. The second drawback is that enhancement at the sending side can hurt the end-to-end result: an enhanced image contains more high-frequency components, which puts more pressure on the video encoder behind it. If this is not handled well, video quality suffers, and the loss is larger at low bitrates. For example, look at the two images on the right: the left one was not enhanced and the right one was, and both were then compressed and decoded. You can see that with enhancement the blocking artifacts are actually a bit more pronounced than without it.

In view of these difficulties, Yunxin's solution is a hierarchical network. In front of it sits a scene recognition module that identifies the type of video and selects a different small model for each type, with either a different model structure or different parameters. This also addresses the second difficulty above, where the enhanced result ends up worse than the unenhanced one after encoding: the scene recognition module can identify such scenes, and for those videos we skip enhancement and encode directly. Each of our small models is a lightweight model with roughly 1K to 2K parameters and eight network layers. That may not count as lightweight by industry standards, where lightweight models can have only a few hundred parameters. So why don't we use smaller models? Because smaller models come with a drop in quality. The downside of a model with 1K to 2K parameters is that it is considerably more expensive to run. The reason we can afford 1K to 2K parameters without worrying about the overhead is our self-developed, highly efficient AI inference engine, which I will cover later.
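As an illustration of the scale being discussed, here is a minimal PyTorch sketch of a scene-routed enhancer built from roughly 1K-parameter, 8-layer models, with one extra class that means "skip enhancement". The layer widths, the number of scene types, and the routing head are assumptions chosen for illustration; they are not Yunxin's actual network.

```python
# Sketch only: a tiny scene classifier routes each frame to one of several
# ~1.1K-parameter, 8-layer enhancement models, or bypasses enhancement.
import torch
import torch.nn as nn

class TinyEnhancer(nn.Module):
    """Eight conv layers, roughly 1.1K parameters with 4 channels per layer."""
    def __init__(self, ch=4):
        super().__init__()
        layers = [nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(6):
            layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(ch, 3, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)           # residual: predict only the enhancement delta

class SceneRoutedEnhancer(nn.Module):
    """Scene recognizer picks a per-scene model, or skips enhancement entirely."""
    def __init__(self, num_scenes=3):
        super().__init__()
        self.classifier = nn.Sequential(  # very small scene recognition head
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, num_scenes + 1))   # last class = "do not enhance"
        self.experts = nn.ModuleList([TinyEnhancer() for _ in range(num_scenes)])

    def forward(self, x):                 # expects a single frame (batch size 1)
        scene = self.classifier(x).argmax(dim=1).item()
        if scene == len(self.experts):    # enhancement would add coding pressure
            return x
        return self.experts[scene](x)

router = SceneRoutedEnhancer()
frame = torch.rand(1, 3, 360, 640)        # one video frame in [0, 1]
out = router(frame)
print(sum(p.numel() for p in router.experts[0].parameters()))   # ~1.1K parameters
```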

This is a demonstration of our AI enhancement. The right side is enhanced and the left side is not, and you can see from the image that the right side looks much better. What is shown here is the final result after the video was enhanced, encoded at the sender, and decoded at the receiver.

The second technology I want to introduce is video noise reduction. Denoising is a very common component in RTC scenarios. Its main purpose is to remove the high-frequency noise introduced during capture, which would otherwise consume bits unnecessarily; by removing it, those bits can be spent by the encoder on useful information, improving end-to-end video quality. However, denoising in RTC also has its difficulties.

The first, again, is universality: most RTC endpoints are mobile devices, and the many tiers of phones and tablets vary widely in performance and are sensitive to power consumption, so complex algorithms cannot be used, while some fast algorithms do not denoise well enough. The second difficulty is inappropriate denoising, which can filter out useful edge information along with the noise and degrade the overall end-to-end quality.

For these two difficulties, Yunxin's solution splits denoising into three modules. The first is noise estimation, which is fairly standard: we estimate the noise level of the image. Our noise estimation is deliberately very light and fast, and the drawback of light estimation is that it can be inaccurate. That is where the second module, saliency detection, comes in. Saliency detection finds the regions the human eye is most sensitive to and classifies the image accordingly. For eye-sensitive regions we apply a smaller denoising strength, and for insensitive regions a larger one; even if strong denoising blurs some useful information there, the eye barely notices, so the overall subjective quality is not much affected. Finally, we take the noise estimate and the saliency map as two weights, feed them into an anisotropic filter, and obtain the denoised image.
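Below is a small NumPy sketch of the weighting idea: a global noise estimate and a per-pixel saliency map jointly control how strongly a smoothing filter is blended back in. The box blur, the noise proxy, and the gradient-based saliency proxy are stand-ins chosen for brevity, not Yunxin's actual estimators or anisotropic filter.

```python
# Sketch only: saliency-weighted denoising with placeholder estimators.
import numpy as np

def box_blur(img, k=3):
    """Simple box blur used here as a stand-in denoising filter."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def estimate_noise(img):
    """Very light noise proxy: mean absolute residual after blurring, mapped to [0, 1]."""
    return float(np.clip(np.mean(np.abs(img - box_blur(img))) / 32.0, 0.0, 1.0))

def estimate_saliency(img):
    """Crude saliency proxy: local gradient magnitude, normalized to [0, 1]."""
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-6)

def denoise(img):
    noise = estimate_noise(img)             # global denoising strength
    saliency = estimate_saliency(img)       # per-pixel protection of visible detail
    weight = noise * (1.0 - saliency)       # smooth less where the eye is sensitive
    return (1.0 - weight) * img + weight * box_blur(img)

frame = np.random.rand(360, 640) * 255      # a noisy grayscale frame
clean = denoise(frame)
```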

That covers the first module, video pre-processing; next comes the second module, video encoding. The main job of video encoding in RTC is to compress the video efficiently so it can be transmitted over the network. Yunxin's video encoding is characterized by very fast encoding speed and very high compression quality. Yunxin has four encoders. The main one is NE264, our self-developed H.264 encoder. Then there are NE265, our self-developed H.265 encoder, and NEVC, Yunxin's proprietary codec. Finally, for the screen sharing scenario, we separately optimized the H.264 encoder for screen content; that variant is called NE264-CC.

Let's introduce NE264 first. NE264 is the H.264 encoder developed by Yunxin. In it we developed several efficient compression algorithms, such as fast mode decision, efficient sub-pixel search, adaptive reference frames, and CBR rate control. The table below compares NE264 with several well-known encoders: OpenH264, WebRTC's built-in encoder; x264 in its superfast preset; and the encoder of Apple's iPhone 11. The table shows that our encoder is better than these encoders in both encoding quality and speed, and our bitrate stability and rate-control accuracy are also better. This matters because in real-time transmission, especially over weak networks, bitrate fluctuation has a large impact on overall video quality.

We also compared against x264's ultrafast preset, the fastest preset in x264. In that comparison our speed is only about 25 percent lower: ultrafast reaches 468 fps while we reach 350 fps. But our compression efficiency is nearly 50 percent higher (49.85 percent, to be exact).

That is NE264. For screen content scenarios, we optimized NE264 separately. The industry has several solutions for screen content coding, such as AV1 and the H.265 SCC extension, and some vendors port SCC-style coding tools into H.264 as proprietary protocols. These are all good solutions. Yunxin's view is that, even at the encoder side, there is still a lot of optimization headroom for screen content, and good results can be achieved when combined with text post-processing. At the same time, H.264 is the most widely used protocol, which guarantees interoperability across devices, and it is fast and does not consume too many resources. So we optimized for the screen scenario within the standard H.264 protocol. Below are the results: compared with NE264 without screen content optimization, compression efficiency improves by 36.72% on average, with a maximum gain of 62%, while speed drops by only 3%, which is essentially negligible. Compared with the screen content coding of WebRTC's built-in OpenH264, our compression efficiency is about 41% higher at roughly the same speed. In other words, without adding any new coding tools, we improved compression by around 40% at the same speed purely by optimizing within the H.264 protocol.

Next let me introduce our self-developed NE265. NE265 is the H.265 encoder developed by Yunxin. Given that H.265 offers a high compression rate but is a complex protocol, we worked in three directions. First, we designed an efficient encoder architecture; NE265's software architecture is very cache-friendly. Second, we designed and implemented more than twenty fast algorithms. Third, we did low-level assembly optimization, using SIMD instructions to finely optimize the computation-intensive modules. The test results below reflect these efforts. The first is a comparison against the x265 veryslow preset, the preset with the best coding quality in x265. Relative to that preset, under RTC settings, our BD-rate in both PSNR and SSIM is essentially the same. The test set consisted of 25 official standard sequences plus sequences from Yunxin's own business and from social entertainment scenarios, 55 videos in all, all tied to real business. As you can see, we are basically equal on BD-rate, less than one percent apart, while we are 40 times faster than the x265 veryslow preset. We also compared against H.264: relative to the x264 faster preset, our average compression gain is nearly 35% and we are 10% faster.

Of course, Yunxin's efforts do not stop there. We also developed a proprietary codec called NEVC. Its core is the multi-scale video compression technology developed by Yunxin, and we also adopt some of AV1's coding tools. Compared with NE265, the average compression gain is 12%, rising to nearly 18% on the official sequences, while the speed is similar to NE265.

After video encoding we come to video QoE. Because the video engine serves RTC, the encoded data must travel over the network before it reaches the decoder, and network conditions vary widely: some networks are good, some are poor, some have low bandwidth, some drop packets. The video QoE module ensures that users get the best possible video experience under any network condition. The video experience here is a composite rating of the content's smoothness and sharpness. The second role of video QoE is to reduce the stuttering caused by data loss.

I would like to introduce Yunxin's self-developed intelligent network adaptation technology. Before that, let me mention Netflix's per-title technology. Per-title encoding observes that it is not the case that higher bitrate and resolution are always better for video: once the bitrate or resolution passes a certain threshold, increasing it further no longer improves subjective quality, and the threshold differs from scene to scene. Netflix finds the best bitrate and resolution for each video through offline learning. Yunxin extended this idea to make it usable in RTC scenarios. We use deep learning to classify videos. First we build a dataset: each video in it is encoded at different bitrates and resolutions, the results are scored with a video quality assessment algorithm to find the best bitrate and resolution, and those become the labels used to train a classifier. In actual use, each input frame is fed into this classifier, which selects the best resolution; the bitrate and resolution are then passed to the encoder together with the current video. The encoded stream goes over the network, is decoded, and is then upsampled at the receiver. In this way we can find the optimal resolution in real time as the network changes, and deliver the best possible video experience.
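The following sketch illustrates the offline labeling rule behind this approach with a toy quality model: for each candidate bitrate, score every rung of a resolution ladder and keep the best one. The ladder, the closed-form quality_score(), and all constants are assumptions for illustration; the real system relies on a learned classifier and a proper video quality metric rather than this placeholder.

```python
# Sketch only: choose the encode resolution that maximizes a toy quality model.
from typing import List, Tuple

LADDER: List[Tuple[int, int]] = [(1920, 1080), (1280, 720), (960, 540), (640, 360)]
FPS = 30

def quality_score(resolution: Tuple[int, int], bitrate_kbps: float) -> float:
    """Toy stand-in for a perceptual quality model: the detail ceiling grows with
    resolution, but quality collapses when bits-per-pixel gets too low."""
    w, h = resolution
    bpp = bitrate_kbps * 1000 / (w * h * FPS)
    ceiling = 100.0 * (w * h / (1920 * 1080)) ** 0.3
    return ceiling * bpp / (bpp + 0.03)

def best_rung(bitrate_kbps: float) -> Tuple[int, int]:
    """Offline labeling step: the rung with the best score becomes the label;
    online, a trained classifier reproduces this choice from the frame itself."""
    return max(LADDER, key=lambda res: quality_score(res, bitrate_kbps))

for rate in (400, 1200, 3000):
    print(f"{rate} kbps -> encode at {best_rung(rate)}")
# Low rates favor encoding small and upsampling after decode; high rates favor full resolution.
```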

Our video QoE also uses long-term reference frames and temporal SVC. You are probably familiar with both. In Yunxin, long-term reference frames are mainly used for protection in the 1-to-1 scenario, while temporal SVC is mainly used for data protection and graceful degradation in the one-to-many case.
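For readers less familiar with these two tools, here is a generic sketch of how they are commonly structured: a dyadic temporal-layer assignment that lets receivers drop the top layer, and a long-term-reference rule that predicts from the last acknowledged frame instead of sending a keyframe. This reflects the general technique, not Yunxin's specific implementation.

```python
# Sketch only: generic temporal SVC layering and a long-term reference (LTR) rule.
def temporal_layer(frame_index, num_layers=3):
    """Dyadic temporal SVC: layer 0 every 4th frame, layer 1 every 2nd, and so on.
    Dropping the highest layer halves the frame rate without breaking references,
    which suits one-to-many distribution under bandwidth pressure."""
    for layer in range(num_layers):
        if frame_index % (1 << (num_layers - 1 - layer)) == 0:
            return layer
    return num_layers - 1

def choose_reference(frame_index, last_acked_ltr):
    """LTR idea for 1-to-1 calls: when recent frames may have been lost, predict from
    the last frame the receiver has acknowledged instead of sending a full keyframe."""
    return last_acked_ltr if last_acked_ltr is not None else "keyframe"

print([temporal_layer(i) for i in range(8)])   # [0, 2, 1, 2, 0, 2, 1, 2]
print(choose_reference(frame_index=42, last_acked_ltr=36))
```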

After the QoE module comes video decoding. Yunxin's video decoding can handle almost all video and image formats on the market, ensuring decoding compatibility and interoperability.

After the video is decoded, we proceed to video post-processing. Yunxin's business scenarios fall mainly into two categories: camera scenarios such as video communication, and screen content sharing. Accordingly, our post-processing consists of super-resolution for camera images on the one hand and screen optimization for shared screen content on the other. The goal of post-processing in the Yunxin video engine is to restore or improve video quality.

Let's start with video super-resolution. Here, too, we use a self-developed lightweight network. "Lightweight" belongs in quotes, because the model has 1K to 2K parameters and eight network layers, which may not count as lightweight by industry standards. But we have our own AI inference engine, and we have our own dataset processing technology, including self-developed pre-processing for the training data; with the same network, this dataset pre-processing makes the super-resolution effect noticeably better. The figure on the right shows CPU usage with super-resolution enabled: on Android, Mac, and Windows alike, the CPU increase relative to running without super-resolution is small, within 5%. The objective quality numbers also show that video quality is clearly better with AI super-resolution than without it.
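To give a sense of what a network of this size looks like, here is a minimal PyTorch sketch of a 2x super-resolution model with eight convolution layers and roughly 1.1K parameters, operating on the luma channel. The exact architecture is an assumption for illustration; Yunxin's model and the NENN engine it runs on are not public.

```python
# Sketch only: a ~1.1K-parameter, 8-layer 2x super-resolution model on the Y channel.
import torch
import torch.nn as nn

class TinySR(nn.Module):
    def __init__(self, ch=4, scale=2):
        super().__init__()
        layers = [nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(6):
            layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(ch, scale * scale, 3, padding=1)]
        self.body = nn.Sequential(*layers)
        self.shuffle = nn.PixelShuffle(scale)     # rearranges channels into 2x spatial detail
        self.upsample = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)

    def forward(self, y):
        # Residual learning: predict only the detail missing from plain upsampling.
        return self.upsample(y) + self.shuffle(self.body(y))

model = TinySR()
print(sum(p.numel() for p in model.parameters()))       # ~1.1K parameters
low_res = torch.rand(1, 1, 360, 640)
high_res = model(low_res)                               # -> shape (1, 1, 720, 1280)
```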

This is a comparison of our video super-resolution. The right side has super-resolution applied and the left side does not, and you can see that the video quality on the right is clearly better than on the left.

Moving on to the next item, desktop sharing optimization. The main challenge in RTC is that shared desktops are mostly large 1920×1080 frames, which are too expensive for deep learning to process in full. Our approach is to first run text detection and then apply AI optimization only on the regions containing text, which greatly reduces the resolution the model has to handle. The second point is that this is again a "lightweight" network in quotes, made feasible by NENN, which I will talk about later. Here is one example of our desktop sharing optimization: the text on the right is much clearer than the text on the left.
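Here is a rough NumPy sketch of the region-based idea: flag likely text blocks on the 1920×1080 frame and enhance only those crops, so the model never processes the full screen. The block-energy text detector and the unsharp-mask stand-in for the AI model are illustrative assumptions, not Yunxin's detector or network.

```python
# Sketch only: detect text-heavy blocks and enhance just those regions.
import numpy as np

def find_text_blocks(gray, block=64, edge_thresh=12.0):
    """Mark blocks whose mean gradient magnitude suggests dense text or edges."""
    gy, gx = np.gradient(gray.astype(np.float64))
    energy = np.hypot(gx, gy)
    h, w = gray.shape
    blocks = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            if energy[y:y + block, x:x + block].mean() > edge_thresh:
                blocks.append((y, x, block, block))
    return blocks

def sharpen(crop):
    """Stand-in for the text enhancement model: simple unsharp masking."""
    blur = (crop[:-2, 1:-1] + crop[2:, 1:-1] + crop[1:-1, :-2] + crop[1:-1, 2:]) / 4.0
    out = crop.astype(np.float64)
    out[1:-1, 1:-1] += 0.5 * (crop[1:-1, 1:-1] - blur)
    return np.clip(out, 0, 255)

screen = np.random.rand(1080, 1920) * 255        # captured desktop frame (grayscale)
for y, x, h, w in find_text_blocks(screen):
    screen[y:y + h, x:x + w] = sharpen(screen[y:y + h, x:x + w])
```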

Let's move on to the last module, video control. It consists of two parts: the video strategy and the AI inference engine.

Let's start with the AI inference engine. It is the inference engine developed by Yunxin, which we call NENN. There are, of course, many well-known open-source inference engines in the industry. What distinguishes NENN is, first, fine-grained optimization for Yunxin's own models and support for more operators, and second, optimized data copying and layout, which makes sparse-matrix computation more efficient. For an inference framework, ease of use matters as much as performance: NENN has more than 30 built-in image processing algorithms, so it can be used directly without adding a third-party image processing library. On the right is a comparison of inference speed, with the Mac platform on top and mobile phones below. NENN is faster than the other engines on both standard models and our self-developed models.
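NENN's API is not public, so the snippet below is purely hypothetical; it only illustrates the ease-of-use point that when resize, normalization, and layout conversion are built into the inference engine, the caller needs no third-party image processing library.

```python
# Hypothetical interface, NOT NENN's real API: preprocessing ops live inside the
# engine session, so callers hand in raw frames and get results back.
import numpy as np

class InferenceSession:
    """Illustrative stand-in for an on-device inference engine session."""
    def __init__(self, model_path: str):
        self.model_path = model_path          # a real engine would parse the model here

    def preprocess(self, frame: np.ndarray) -> np.ndarray:
        # Built-in image ops (normalize, layout conversion) run inside the engine,
        # ideally fused with inference to avoid extra data copies.
        chw = frame.astype(np.float32).transpose(2, 0, 1) / 255.0
        return chw[None, ...]                 # NCHW layout expected by the model

    def run(self, frame: np.ndarray) -> np.ndarray:
        x = self.preprocess(frame)
        return x                              # placeholder: a real engine executes the graph here

session = InferenceSession("enhance_small.model")     # hypothetical model file name
frame = (np.random.rand(360, 640, 3) * 255).astype(np.uint8)
out = session.run(frame)
```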

Then there is the video strategy. It exists because the video engine faces a wide variety of terminal platforms: high-performance and low-performance desktops, laptops, and many different phones and tablets. Since these device types differ so much in performance, we configure different algorithms, algorithm grades, parameters, and hardware or software codecs for each platform type at device startup. For capable devices we may enable more algorithms, and more complex ones; for weaker devices we use simpler algorithms. And because Yunxin's business spans the globe, we also set different algorithm types and parameters for the network characteristics of different regions. Second, at run time, Yunxin's Control Engine module monitors the device's current CPU and network conditions in real time and dynamically adjusts the strategy accordingly: it adapts the encoding parameters and the choice of hardware or software codec so that the user experience remains optimal at all times. That concludes the key technologies of the video engine.
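The sketch below illustrates this two-level strategy in its simplest form: a static profile selected from the device tier at startup, then a runtime step that degrades gracefully when CPU load or packet loss rises. The tier names, thresholds, and profile fields are assumptions for illustration, not Yunxin's Control Engine configuration.

```python
# Sketch only: startup profile by device tier, plus one runtime adjustment step.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VideoProfile:
    resolution: tuple
    fps: int
    hw_encode: bool
    ai_enhance: bool
    ai_superres: bool

STATIC_PROFILES = {
    "high_end":  VideoProfile((1280, 720), 30, hw_encode=False, ai_enhance=True,  ai_superres=True),
    "mid_range": VideoProfile((960, 540),  30, hw_encode=True,  ai_enhance=True,  ai_superres=False),
    "low_end":   VideoProfile((640, 360),  15, hw_encode=True,  ai_enhance=False, ai_superres=False),
}

def runtime_adjust(profile: VideoProfile, cpu_load: float, loss_rate: float) -> VideoProfile:
    """One control-loop step: degrade gracefully under CPU or network pressure."""
    if cpu_load > 0.85:                      # device struggling: drop AI processing first
        profile = replace(profile, ai_enhance=False, ai_superres=False)
    if loss_rate > 0.10:                     # weak network: lower frame rate before resolution
        profile = replace(profile, fps=max(10, profile.fps // 2))
    return profile

print(runtime_adjust(STATIC_PROFILES["mid_range"], cpu_load=0.9, loss_rate=0.15))
```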

3. Video engine customer cases

Now let’s move on to the third part, some customer cases related to the video engine.

At present, the Yunxin video engine serves more than 1,000 enterprises worldwide. Representative examples are NetEase Cloud Music and its Look Live, Heart Encounter, and online KTV features, all well-known apps that use the Yunxin video engine. Then there are conferencing products such as NetEase Meeting, the communication and collaboration tool NetEase Popo, and a number of third-party products that also use the video engine. That's all for this talk, thank you!