On October 21, 2021, the QCon Global Software Development Conference was held in Shanghai. Chen Gong, technical VP of NetEase Intelligent Enterprise, produced the special track “Converged Communication Technology in the Era of AI”, inviting a number of technical experts to share relevant technical topics.

We will introduce and share the four topics one by one. This is the second installment: the exploration and practice of key technologies in video communication.

Guest introduction: Han Qingrui, senior technical expert at the NetEase Yunxin Audio and Video Lab.

Preface

Video has become one of the most important modes of interaction in social entertainment, online learning, remote banking, and other everyday scenarios, and users’ expectations for video keep rising. Low latency, resilience to weak networks, clear picture quality, and similar requirements pose serious technical challenges for enterprises.

As a converged-communication cloud service expert, NetEase Yunxin covers all the major video scenarios, including low-delay real-time audio and video, live broadcast scenarios that tolerate some delay, and on-demand scenarios where delay is not a priority. This article introduces NetEase Yunxin’s key technologies and application attempts in each scenario.

NetEase Yunxin video technology deployment

The following figure is the main network diagram of NetEase Yunxin’s converged communication. The devices on the left and right can be anything: mobile phone, pad, PC, or Web. The central part is the server side, including the forwarding servers and MCU servers. When some delay can be tolerated, the stream is relayed to the interactive live broadcast servers. In the RTC scenario, Yunxin’s video technology is mainly deployed on the device side; for live broadcast and on-demand business, Yunxin mainly provides video transcoding services deployed on the live broadcast and on-demand side.

Video technology for RTC scenarios

The following describes the video technology of NetEase Yunxin in the RTC scenario, which is divided into three aspects.

The new generation of audio and video SDK architecture

The following figure is the audio and video SDK architecture diagram of NetEase Yunxin. At the end of last year, NetEase Yunxin released a new generation of audio and video SDK — G2. The SDK architecture is divided into five layers, with the media engine layer at the core; it consists of three engines: the video engine, the audio engine, and the communication engine.

The video engine architecture and application scenarios of NetEase Yunxin

The following figure shows the architecture of Yunxin’s video engine. It is divided into five modules: video pre-processing, video encoding, video QoE, video decoding, and video post-processing. On the capture side, the services Yunxin supports take two kinds of input: real pictures captured from the camera and pictures captured from screen sharing. Captured frames are first sent to video pre-processing. Because Yunxin’s business is distributed globally, all kinds of devices connect, including low-end and entry-level devices whose cameras capture poor pictures; pre-processing improves and restores the image quality. After pre-processing, frames go through encoding and compression and are then transmitted to the network.

Because the network varies widely, a video QoE module ensures that Yunxin users still get a good video experience. After network transmission, the receiving end decodes and then applies post-processing. Post-processing mainly reduces the picture-quality loss caused by compression and network transmission.

The following figure shows the application scenarios of the video engine. The video scenarios of cloud communication fall into four types: real-time communication scenarios (video-conference related), low-delay application scenarios, interactive live broadcast scenarios, and interactive teaching low-delay live broadcast scenarios.

Key technologies of the video engine

Video pre-processing

Video pre-processing is mainly used to improve the end-to-end quality of real-time video. In NetEase Yunxin’s global business, all kinds of devices connect, so video pre-processing is needed to improve the picture quality.

Video AI enhancement

Video enhancement is a relatively old technique that has been studied for many years. With advances in AI and deep learning, it has improved greatly. However, deep learning requires a lot of computation, while Yunxin’s business spans the globe and all kinds of devices connect, especially on mobile; the Indian and Southeast Asian markets, for example, have many entry-level devices.

These mobile devices are very sensitive to power consumption and performance: a little extra computation drains the battery quickly. As a result, a good but large deep learning model is hard to deploy in these scenarios, while a small model cannot guarantee the enhancement quality.

Ours is also a communication business: the picture has to be transmitted to the peer. Enhancement that looks good locally is not necessarily good at the other end. Enhancement increases the high-frequency content of the image, which makes compression harder and the loss larger, so the decoded image can show worse blocking artifacts than the unenhanced one, even though it looked better before encoding.

Yunxin solves these two problems — that a small model is hard to train to fit well, and that subjective quality may become worse after enhancement and compression — through two methods.

First, there is a scene-recognition module that identifies content such as text regions, motion scenes, and game scenes. Each scene type gets its own model: the game scene uses one model and the text scene another, or sometimes the same model with different parameters. This keeps the required computing power affordable while still producing good results.
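As a rough illustration of this per-scene dispatch (not Yunxin’s actual implementation), the idea can be reduced to mapping a recognized scene label to a model and a parameter set; the labels, file paths, and strength values below are hypothetical.

```python
# Hypothetical sketch of scene-aware model selection for video enhancement.
# Scene labels, model paths, and parameter values are illustrative only.
from dataclasses import dataclass

@dataclass
class EnhanceConfig:
    model_path: str          # weights chosen for this scene type
    sharpen_strength: float  # per-scene parameter; the same net may reuse weights

SCENE_CONFIGS = {
    "game":   EnhanceConfig("models/enhance_game.bin",   sharpen_strength=0.6),
    "text":   EnhanceConfig("models/enhance_text.bin",   sharpen_strength=0.9),
    "motion": EnhanceConfig("models/enhance_motion.bin", sharpen_strength=0.4),
}
DEFAULT = EnhanceConfig("models/enhance_generic.bin", sharpen_strength=0.5)

def select_enhancer(scene_label: str) -> EnhanceConfig:
    """Pick the enhancement model/parameters for the scene recognized upstream."""
    return SCENE_CONFIGS.get(scene_label, DEFAULT)

if __name__ == "__main__":
    print(select_enhancer("text"))    # text scene gets the text-tuned model
    print(select_enhancer("sports"))  # unknown scenes fall back to the default
```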

Second, our model is small, but as noted above it must not be too small or its expressive power suffers. Our “lightweight model” therefore has 1–2K parameters. Many small models in the industry cannot reach the same effect at that size: they have fewer than 1K parameters, sometimes only a few hundred, and only three or four layers. We can afford the larger model because we developed our own efficient inference framework, NENN, which applies optimizations specific to small models and runs them much faster than open-source inference frameworks.
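To make the parameter budget concrete, here is a minimal sketch of what a 1–2K parameter enhancement network could look like, written in PyTorch purely for illustration; Yunxin’s actual model and the NENN inference framework are proprietary, so the layer sizes here are assumptions.

```python
# Minimal sketch of a "lightweight" enhancement network in the 1-2K parameter
# range. Written in PyTorch for illustration; Yunxin's real model and its NENN
# inference framework are proprietary and not shown here.
import torch
import torch.nn as nn

class TinyEnhancer(nn.Module):
    """3-layer residual CNN on the luma channel; predicts a detail residual."""
    def __init__(self, channels: int = 12):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return torch.clamp(y + self.body(y), 0.0, 1.0)  # enhanced = input + residual

if __name__ == "__main__":
    net = TinyEnhancer()
    print(sum(p.numel() for p in net.parameters()))  # ~1.5K parameters
    frame = torch.rand(1, 1, 360, 640)                # normalized luma plane
    print(net(frame).shape)                           # same resolution as input
```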

Video noise reduction

Some devices and cameras produce a lot of noise in dark scenes, and this high-frequency noise wastes bits, so it should be removed. Noise reduction benefits encoding and transmission and improves subjective quality.

Noise reduction in the RTC scenario faces the same constraint: mobile devices make up the majority of the business, and in many regions they are entry-level devices that are very sensitive to performance and power consumption. Computationally heavy algorithms cannot be used, while fast algorithms perform poorly. Improper noise reduction removes not only the noise but also useful high-frequency detail, which hurts overall video quality.

NetEase Yunxin approaches this problem from the subjective perception of the human eye. The eye behaves differently in different scenes: in some scenes it has high resolution and can distinguish many high-frequency coefficients; in others its resolution drops dramatically.

NetEase Yunxin therefore uses a human-eye sensitivity analysis that extracts the eye-sensitive regions of the image at pixel level. In those regions we would rather lower the denoising strength and let some noise through than sacrifice high-frequency detail; elsewhere, detail the eye cannot see can be smoothed away. Combined with a very simple but efficient noise-estimation algorithm, these two signals produce a per-pixel weight, so the denoiser is both very fast and very effective.
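A minimal sketch of this idea, assuming a gradient-based proxy for eye sensitivity and a crude noise estimate (the real Yunxin algorithm is not public): detail the eye is sensitive to is kept, while flatter regions receive stronger smoothing.

```python
# Sketch of per-pixel blending between a denoised frame and the original,
# weighted by an eye-sensitivity proxy (local detail) and a global noise
# estimate. Illustrative only; not Yunxin's actual denoiser.
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def estimate_noise_sigma(y: np.ndarray) -> float:
    """Very rough noise estimate: median absolute deviation of a high-pass residual."""
    residual = y - uniform_filter(y, size=3)
    return float(np.median(np.abs(residual)) / 0.6745)

def sensitivity_map(y: np.ndarray) -> np.ndarray:
    """Proxy for eye sensitivity: strong local structure -> protect detail."""
    gy, gx = np.gradient(y)
    grad = np.hypot(gx, gy)
    return np.clip(grad / (grad.max() + 1e-6), 0.0, 1.0)

def denoise(y: np.ndarray) -> np.ndarray:
    sigma = estimate_noise_sigma(y)
    smooth = gaussian_filter(y, sigma=1.0 + 2.0 * sigma)  # stronger blur if noisier
    protect = sensitivity_map(y)                          # 1 = keep original detail
    return protect * y + (1.0 - protect) * smooth

if __name__ == "__main__":
    frame = np.random.rand(360, 640).astype(np.float32)   # stand-in luma plane
    print(denoise(frame).shape)
```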

Video codec

Yunxin’s video coding supports the mainstream encoders, including the most widely used H.264 and H.265, and, based on a deep understanding of RTC, Yunxin has also developed its own encoder, called NE264CC.

The Yunxin encoder is very fast: quality can be improved by 50%, and compared with H.265 it can encode 60 times faster. H.264 is an excellent standard that has been around for 20 years and remains the most widely used for real-time communication. On top of H.264, Yunxin developed the NE264 encoder, featuring fast mode decision, efficient sub-pixel search, adaptive reference frames, and CBR rate control.

As the figure below shows, compared with OpenH264, x264, and the iPhone’s encoder, Yunxin leads in both encoding speed and encoding quality. Bit-rate volatility is easy to overlook: for RTC, besides quality and speed, bit-rate volatility matters a great deal, because in strict low-delay scenarios a fluctuating bit rate causes picture jitter and resolution drops. NE264’s bit-rate volatility is also the smallest.
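Bit-rate volatility can be quantified in several ways; one simple definition, used here only for illustration and not necessarily the metric behind Yunxin’s figures, is the coefficient of variation of the per-second encoded bit rate.

```python
# Illustrative bitrate-volatility metric: coefficient of variation of the
# per-second output bitrate. Lower means a steadier stream, which matters for
# strict low-delay RTC. This is a generic definition, not Yunxin's exact metric.
from statistics import mean, pstdev

def bitrate_volatility(frame_sizes_bits, fps: int) -> float:
    """frame_sizes_bits: encoded size of each frame, in bits, in display order."""
    per_second = [
        sum(frame_sizes_bits[i:i + fps])
        for i in range(0, len(frame_sizes_bits) - fps + 1, fps)
    ]
    return pstdev(per_second) / mean(per_second)

if __name__ == "__main__":
    steady = [20_000] * 300                                           # 10 s at 30 fps, constant frames
    bursty = [60_000 if i % 45 == 0 else 18_000 for i in range(300)]  # large I-frame every 1.5 s
    print(bitrate_volatility(steady, 30))  # 0.0 -> perfectly steady
    print(bitrate_volatility(bursty, 30))  # > 0  -> less steady stream
```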

Below is a comparison against x264-ultrafast, one of x264’s fastest presets. Our speed is about 25% slower than ultrafast, but our compression rate is almost 50% better: where x264 needs about 1 Mbps of bandwidth for a given quality, we need only about 500 kbps.

For screen-sharing compression, the industry’s main ideas include H.265 with SCC extensions, AV1, and H.264 with SCC.

When weighing these options, we concluded that H.264 is still the most widely used standard in RTC scenarios and, as a lightweight standard, carries very little overhead; its cost is minimal.

On the other hand, even without changing the standard or adding coding tools, screen sharing itself leaves plenty of room for optimization on the encoder side. Our approach stays within H.264 and improves performance by exploiting the characteristics of screen content. Here are some of our results: with the screen-sharing coding algorithm enabled, our compression rate in screen-sharing scenarios improved by 36.72% while we were only 3% to 4% slower. Compared with OpenH264, our compression rate is 41% better at essentially the same speed.

Next, the self-developed NE265, which is still under continuous iteration. NE265 features an efficient architectural design, and the computationally heavy algorithms are finely optimized. Anyone who knows encoders will recognize what this means: NE265 is 64 times faster than the veryslow preset, and that is not even its fastest gear; the fastest gear is over 200 times faster.

NE265 is also compared against H.264, using x264’s faster preset, since H.265’s main disadvantage is that it is slow. You can see that NE265 is almost 30% faster than x264 faster, with an average compression improvement of 34.78%. The test sequences include the official standard test sequences as well as sequences from Yunxin’s RTC and social-entertainment business.

Based on a deep understanding of RTC and audio-video communication, we also developed NEVC, a multi-scale video compression technology. Compared with NE265 the speed is basically the same, but the compression rate improves: in the comparison image, the texture on the right is noticeably clearer while the texture on the left is almost blurred away.

Once video encoding is done, the compressed bitstream is sent to the network. The network is the most complex part of an RTC system, especially in a global business with many different kinds of network. To guarantee the best video quality across this complex, heterogeneous environment, we rely on a video QoE module, which safeguards video along five dimensions: smoothness, clarity, quality smoothness, delay, and performance/power consumption.

Video QoE

Video quality control module

Video quality control covers three of those dimensions: video smoothness, clarity, and quality smoothness. After capture, pre-processing, and encoding, the bitstream is finally sent over the network, which may take many forms: low-bandwidth links, links with sustained packet loss, or links with jitter.

A single fixed resolution, frame rate, and bit rate cannot serve all of these networks; the results would vary widely and often be poor. Our video quality control module, called VQC, first receives the effective network bandwidth estimated by network QoS, and allocates an appropriate video resolution, frame rate, and encoder configuration for that bandwidth to achieve the best video quality. The captured content also varies across networks and devices: some sources are noisy, some are dark scenes. VQC collects this information and decides which video algorithms to switch on or off, and how to adjust video parameters, enhancement, noise reduction, and certain coding algorithms.
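A hypothetical sketch of this kind of VQC decision is shown below: the bandwidth estimated by network QoS is mapped onto a resolution/frame-rate/bit-rate ladder, with some headroom reserved when the source is noisy. The thresholds and ladder values are invented for illustration; Yunxin’s actual VQC logic is not public.

```python
# Hypothetical sketch of a VQC-style decision: map the bandwidth estimated by
# network QoS to an encoding configuration. Thresholds and ladder are invented.
from dataclasses import dataclass

@dataclass
class VideoConfig:
    width: int
    height: int
    fps: int
    bitrate_kbps: int

# (min_bandwidth_kbps, config) in descending order
LADDER = [
    (2500, VideoConfig(1280, 720, 30, 2000)),
    (1200, VideoConfig(960, 540, 30, 1000)),
    (600,  VideoConfig(640, 360, 20, 500)),
    (0,    VideoConfig(320, 180, 15, 250)),
]

def decide_config(estimated_kbps: float, noisy_source: bool = False) -> VideoConfig:
    """Pick resolution/frame rate/bitrate for the estimated effective bandwidth.
    A noisy capture wastes bits on noise, so reserve some headroom for it."""
    budget = estimated_kbps * (0.8 if noisy_source else 1.0)
    for threshold, cfg in LADDER:
        if budget >= threshold:
            return cfg
    return LADDER[-1][1]

if __name__ == "__main__":
    print(decide_config(1400))                     # mid-tier network -> 540p
    print(decide_config(1400, noisy_source=True))  # same network, noisy camera -> 360p
```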

Device control module

Yunxin’s business covers the whole world, with all kinds of networks: extremely poor networks in parts of Asia, Africa, and Latin America, including India and Southeast Asia, and better networks in Europe, America, and at home. There are also many types of terminal platforms, including high-end phones, low-end phones, PCs, and tablets. Yunxin’s device control module configures the video algorithms according to the network characteristics of each region and according to the device’s platform type.

For example, weaker devices get low resolution and low frame rate, while better devices get high frame rates and more advanced algorithms.

In practice, because the network is not static and factors such as device status and GPU utilization also change, the device control module adjusts the algorithms in real time based on real-time monitoring data.
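As a hedged sketch of such real-time adjustment (tier names and thresholds are hypothetical, not Yunxin’s), the module could step the algorithm tier down when monitored load or achieved frame rate degrades, and step it back up when there is headroom:

```python
# Hypothetical real-time adjustment loop for a device-control module:
# drop to a cheaper algorithm tier when overloaded, climb back when comfortable.
TIERS = ["low", "medium", "high"]  # e.g. which enhancement/denoise algorithms run

def adjust_tier(current: str, cpu_util: float, gpu_util: float,
                achieved_fps: float, target_fps: float) -> str:
    idx = TIERS.index(current)
    overloaded = cpu_util > 0.85 or gpu_util > 0.85 or achieved_fps < 0.8 * target_fps
    comfortable = cpu_util < 0.50 and gpu_util < 0.50 and achieved_fps >= target_fps
    if overloaded and idx > 0:
        return TIERS[idx - 1]   # drop advanced algorithms to recover frame rate
    if comfortable and idx < len(TIERS) - 1:
        return TIERS[idx + 1]   # headroom available, re-enable features
    return current

if __name__ == "__main__":
    print(adjust_tier("high", cpu_util=0.92, gpu_util=0.40, achieved_fps=21, target_fps=30))
    print(adjust_tier("low",  cpu_util=0.30, gpu_util=0.20, achieved_fps=30, target_fps=30))
```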

Video decoding

After QoE, the bitstream reaches the receiver for video decoding. Yunxin’s video decoding is very efficient, supports almost all video formats, and has no interoperability problems.

Video post-processing

Video post-processing restores and improves video quality through screen-content optimization and video super-resolution. Yunxin’s super-resolution network has only 2K to 4K parameters and fewer than 8 layers, and our self-developed AI inference engine applies dedicated optimizations to it, so it runs very fast. Alongside the speed work, we paid attention to the training data: real data captured with phone cameras at different focal lengths, plus data pre-processing and augmentation, secure the visual effect. The main advantages are high efficiency and speed.
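For illustration only, a network in the described regime (2K–4K parameters, fewer than 8 layers) might look like the PyTorch sketch below; Yunxin’s actual super-resolution model and its inference engine are proprietary, so the layer sizes are assumptions.

```python
# Illustrative lightweight super-resolution network in the 2K-4K parameter,
# under-8-layer regime described above. Layer sizes are guesses, not Yunxin's model.
import torch
import torch.nn as nn

class TinySR(nn.Module):
    """2x super-resolution on the luma plane via three convs + PixelShuffle."""
    def __init__(self, channels: int = 16, scale: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # rearrange channels into a 2x larger plane
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.net(y)

if __name__ == "__main__":
    net = TinySR()
    print(sum(p.numel() for p in net.parameters()))  # ~3K parameters
    print(net(torch.rand(1, 1, 180, 320)).shape)      # -> (1, 1, 360, 640)
```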

In the table below, the first three rows are traditional methods, followed by our self-developed super-resolution and a well-known lightweight network. In processing time, Yunxin’s AI super-resolution is more than 30 times faster than the well-known lightweight network. In visual quality, Yunxin’s AI super-resolution far exceeds the non-AI methods, and the gap to the classic lightweight network is very small, essentially invisible.

Second is desktop-sharing optimization: post-processing applied after H.264 decoding of screen-sharing streams, aimed at text scenes. For deep learning, the biggest difficulty of screen sharing is that its resolution is usually very large. Yunxin has a high-precision text-region recognition function to enhance the legibility of text, and our self-developed inference framework NENN keeps this fast. The figure shows the effect of text enhancement.

Video technology for live broadcast and on-demand services

The architecture of live broadcast and on-demand

The forwarding servers introduced earlier form the low-delay RTC path. When live broadcasting is needed, the stream can be pushed out to the live broadcast and on-demand streaming servers.

The live broadcast and on-demand link runs from the client to the edge media server, then to live transcoding, and then to the CDN.

This link has two problems. First, picture quality is lost when the device uploads the stream, because it is compressed, and the camera capture itself may introduce further loss. Second, after transcoding, the cost of distributing the stream through the CDN is very high.

To solve these two problems, Yunxin proposed its smart code ultra-clear technology. First, deep-learning video restoration repairs or enhances the video before transcoding; then coding based on human visual perception saves bit rate without lowering the subjective quality of the video.

The image is first repaired, enhanced, or beautified by the video restoration module; perceptual coding follows, preceded by a video analysis module that analyzes the video content.

Smart code ultra – clear technology architecture

Video repair technology

Video restoration is a difficult technology across the industry because of the diversity of degradation models: video degrades for many reasons, such as camera noise, compression loss, over- or under-exposure caused by a poor camera, or improper focus.

Yunxin has a picture-quality assessment algorithm that uses deep learning to determine the degradation model of a given video, and applies a different restoration method for each: video denoising for noise, deblurring for blur, texture enhancement when the texture is poor, as well as picture correction. By assessing first and then repairing, it can beautify or enhance the subjective quality of the video.

Video perceptual coding technology

After restoration comes coding. Yunxin’s perceptual coding adopts JND technology, which measures the human eye’s sensitivity to distortion in different regions of the image by the smallest error the eye can perceive.

JND is one of the most widely proposed perceptual technologies. As the figure below shows, objective distortion is a continuous curve while human perception follows a staircase; the gap between them is redundancy that can be exploited to save bit rate without a subjective quality drop.

JND is a relatively traditional method, but traditional JND coding is based only on low-level image features such as texture, edges, brightness, and color.

Yunxin’s JND differs in adding video content analysis. As shown above, we analyze the image for foreground, human faces, text, and other information, and construct a separate JND for each kind of region in order to save bit rate. Through this process the foreground, text, and face regions each get their own JND coefficients, which then drive the coding.
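One hypothetical way such content-aware JND information could feed the encoder is as per-block quantization offsets, sketched below; the region classes and offset values are illustrative, not Yunxin’s actual JND model.

```python
# Hypothetical sketch of turning content analysis into coding decisions: each
# region class gets its own JND-derived tolerance, expressed here as a QP
# offset per 16x16 block. Labels and values are illustrative only.
import numpy as np

# Smaller offset = finer quantization = more bits where the eye is sensitive.
QP_OFFSET = {"face": -4, "text": -3, "foreground": -1, "background": +3}

def qp_offset_map(region_labels: np.ndarray) -> np.ndarray:
    """region_labels: (H/16, W/16) array of strings from the analysis module."""
    offsets = np.zeros(region_labels.shape, dtype=np.int8)
    for label, off in QP_OFFSET.items():
        offsets[region_labels == label] = off
    return offsets

if __name__ == "__main__":
    labels = np.full((4, 6), "background", dtype=object)
    labels[1:3, 2:4] = "face"   # a face occupies the centre blocks
    labels[3, :] = "text"       # subtitles along the bottom row
    print(qp_offset_map(labels))
```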

The following figure shows the test results of smart code ultra-clear. Blue represents Yunxin, and the other colors are industry peers. On the left is the subjective human-eye score, where higher is better, while our compressed file size is clearly smaller.

NetEase Yunxin video technology for the entertainment and social industry line

This is an area of key investment and output for NetEase Yunxin.

Beauty technology

Yunxin’s beauty technology provides 26 functions including skin smoothing, whitening, and eye enlargement, more than 50 filters, recognition and tracking of age, gender, and gaze, and support for 2D and 3D stickers. The rest of the industry offers these too; our distinguishing feature is achieving this quality with highly efficient processing speed, which is our core competitive advantage.

For 720p video with skin smoothing, whitening, face slimming, and other beauty effects enabled, Yunxin can reach 30 fps on a Snapdragon processor. For our overseas markets, especially India and Southeast Asia, where entry-level models are everywhere, this is very competitive and makes the entire video experience completely different.

Background segmentation technology

Yunxin’s background segmentation technology is trained on large datasets. Its accuracy is relatively high, with an IoU of 0.93, good robustness, and fast inference of under 10 milliseconds. The figure below compares our accuracy with industry peers; higher is better.

Deployment practice

Having covered the technology, let us look at how NetEase Yunxin’s work lands in practice. The NetEase Yunxin video engine has served more than 10,000 users worldwide.

Users of the SDK and the video engine include LOOK Live, NetEase Cloud Music’s online KTV, NetEase Conference, and NetEase’s internal POPO, as well as some third-party vendors building on the conference components.

In live broadcast and on-demand applications, NetEase Cloud Music’s large online concerts use NetEase Yunxin’s live broadcast and on-demand capabilities; last year’s famous record-breaking concert also ran on the Yunxin video engine. We will continue to dig deeper in this technical field and bring more and better products.