Consumer demand for audio and video in new scenarios keeps giving rise to new technology: from live streaming and VOD today, to RTC, and on to XR and the metaverse in the future, the technical support audio and video provide for new scenarios is becoming ever more comprehensive. AI algorithms have advanced rapidly in recent years, but good algorithm results often consume large amounts of computing resources, which makes commercializing these algorithms very challenging. How can we give full play to combined software and hardware capabilities? How can we effectively balance algorithm effect and performance?

At the LiveVideoStackCon 2021 Beijing summit, Yang Fenghai, senior algorithm expert at Alibaba Cloud Intelligent Video Cloud, started from Alibaba Cloud Video Cloud's latest scenario explorations and shared its best innovative practices in the virtual background and video super-resolution directions.

Article | Yang Fenghai

Edited | LiveVideoStack

This talk is divided into five parts: introduction, innovation and optimization at the algorithm level, in-depth optimization at the software and hardware level, future outlook, and Q&A.

1. Introduction

In terms of business forms, audio and video services include live streaming, VOD, RTC, media production and processing, cloud gaming and cloud desktop. In terms of the technology chain, they cover capture, encoding, transmission, cloud transcoding/processing/rendering, network distribution, and receiver-side decoding and rendering; the parts involving algorithms include pre-encoding processing, post-decoding processing and cloud video processing.

In terms of computing power, the server side is the strongest and the device side relatively weak, so the distribution can be roughly described as bell-shaped. Following that curve, most algorithms today are deployed in the cloud, and a smaller number are deployed on the device side.

Looking at the current state of audio/video and algorithm deployment, the entire software and hardware system is heterogeneous, basically covering cloud servers and edge servers, various terminal devices and IoT devices.

At the hardware level, this includes CPU+GPU, CPU+FPGA, CPU+DSP and CPU+ASIC combinations. Many chip manufacturers are involved, including Intel, AMD, ARM, Nvidia, Qualcomm, Apple and others, and operating system coverage is also quite broad. In addition, software standards, development and compilation environments, and deep-learning training and inference frameworks vary widely.

Such a complex heterogeneous software and hardware environment presents unprecedented challenges to algorithm deployment. How to give full play to combined software and hardware capabilities? How to effectively balance algorithm effect and performance? These are two problems that have to be solved.

2. Innovation and optimization of algorithms

Next, I will use two algorithms, virtual background and super-resolution, to introduce how we balance effect and performance at the algorithm level to create value for the business.

Virtual background

Let’s look at the algorithm's background first. Since the outbreak of the pandemic last year, everyone has experienced online video conferences, online classes and the like, and many have also watched live streams, short videos and online social scenarios. In these scenarios, users often want to place themselves in a virtual environment: on the one hand to protect personal privacy, and on the other hand to make the experience more fun and immersive.

With that algorithmic background, let’s look at how we land it in the business.

First, the scene may be very complicated: lighting, noise, environment (indoor, outdoor, single/multi-person, conference room, office area, venue, home, etc.), unclear foreground/background boundaries, hand-held objects, and highly diverse clothing and accessories;

Second, very little data is available for training: the labeling standards of open-source data sets differ, their accuracy does not meet commercial requirements, and manual labeling is time-consuming and laborious.

Finally, in terms of computing performance, the cloud requires high concurrency and low latency, while the device requires low latency, low power consumption and low resource usage.

This requires us to design a very robust algorithm that can also meet the performance requirements of different deployment targets. As we know, pixel-level portrait extraction algorithms fall into two categories: segmentation and matting. Related subfields include semantic segmentation, instance segmentation, video object segmentation (VOS), panoptic segmentation, blue/green-screen matting and natural-scene matting.

Which one should we choose for landing? First, note that our landing scenarios are currently education, conferencing and pan-entertainment. After comprehensively evaluating effect and performance, we believe that semantic segmentation of portraits can meet the business demands.

After determining the direction of the algorithm, the first thing to do is to innovate while standing on the shoulders of giants, which requires understanding how algorithms in this field have developed over the years.

From the initial FCN to the later SegNet, the UNet family, the DeepLab family, HRNet and so on, algorithm design basically follows an encoder-decoder structure, then tries different backbones and blocks to balance effect and performance, followed by multi-branch designs, multi-resolution fusion, coarse-to-fine structures and various attention mechanisms.

From the perspective of publishing papers, many models are designed to be deeper (more layers) and wider (more parallel branches, more channels, larger feature maps), with denser connections, larger receptive fields and more global information. From the perspective of business landing, however, such complex algorithms are difficult to run in real time on end-side devices.

Simplicity first: our algorithm is designed from the outset for deployment on different heterogeneous platforms. We therefore adopt a UNet framework and integrate various lightweight block design ideas, drawing on SqueezeNet, MobileNet, ShuffleNet, EfficientNet, GhostNet and so on.

In addition, we make full use of attention structures along the spatial and channel dimensions and fully fuse multi-resolution features while keeping computation in check. Specific structures and loss functions are designed for different hardware platforms and business scenarios, including a dedicated edge loss, online hard example mining, and so on.
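
As an illustration of these design ideas only (not the actual production model), a minimal PyTorch sketch of a lightweight block combining a MobileNet-style depthwise-separable convolution with SE-style channel attention might look like this; the class name and hyperparameters are hypothetical:

```python
import torch
import torch.nn as nn

class LightweightBlock(nn.Module):
    """Illustrative lightweight block: depthwise-separable conv + channel attention.

    A sketch of the design ideas mentioned above (MobileNet-style separable
    convolutions, SE-style attention); it is not the production model.
    """
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        # Squeeze-and-excitation style channel attention
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.act(self.bn(self.pointwise(self.depthwise(x))))
        return x * self.se(x)  # channel-wise reweighting
```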

Improving a neural network model cannot be separated from scenes and data. So before designing the algorithm, we first define the current business scenario, then build a data set and iterate the algorithm by training on it; the algorithm in turn collects bad cases from online business practice, which are used to clean and expand the data set and fine-tune the model again. In the end, scene, data and algorithm are organically combined into a continuously improving iteration loop.

Because the distribution of the data set itself is limited, data augmentation is essential. Traditional portrait composites often have large color-temperature differences between foreground and background, so the synthesized result looks unrealistic; adding such data to training yields little benefit overall and, if the composition is poor, can even hurt. We therefore use dynamic white balance and pyramid fusion to make foreground-background composition more realistic.
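
A minimal sketch of the pyramid-fusion idea, blending a foreground portrait onto a new background through Laplacian pyramids guided by the segmentation mask (the production pipeline, including dynamic white balance, is more involved):

```python
import cv2
import numpy as np

def pyramid_blend(fg, bg, mask, levels=4):
    """Blend a foreground onto a background with Laplacian pyramids.

    fg, bg: float32 3-channel images in [0, 1] of the same size;
    mask: float32 alpha matte in [0, 1]. Illustrative sketch only.
    """
    # Gaussian pyramid of the mask
    gm = [mask]
    for _ in range(levels):
        gm.append(cv2.pyrDown(gm[-1]))

    def laplacian_pyramid(img):
        gp = [img]
        for _ in range(levels):
            gp.append(cv2.pyrDown(gp[-1]))
        lp = []
        for i in range(levels):
            up = cv2.pyrUp(gp[i + 1], dstsize=(gp[i].shape[1], gp[i].shape[0]))
            lp.append(gp[i] - up)
        lp.append(gp[-1])
        return lp

    lf, lb = laplacian_pyramid(fg), laplacian_pyramid(bg)
    # Blend each pyramid level with the corresponding mask level
    blended = [m[..., None] * f + (1 - m[..., None]) * b
               for f, b, m in zip(lf, lb, gm)]
    # Collapse the pyramid back to a full-resolution image
    out = blended[-1]
    for i in range(levels - 1, -1, -1):
        out = cv2.pyrUp(out, dstsize=(blended[i].shape[1], blended[i].shape[0])) + blended[i]
    return np.clip(out, 0, 1)
```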

Because manual data acquisition is expensive and it is hard to cover all portrait movements, poses, environments, clothing and so on, we expand the data with 3D-rendered animations of specific portraits, actions and scenes, as shown in the lower-left figure. The right side shows the improvement in robustness to lighting, noise and motion blur: we compute a consistency loss between the network outputs on the original data and on the augmented data to improve the robustness of the model.
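
A hypothetical sketch of such a consistency term in PyTorch, assuming `model` returns segmentation logits and `augment_fn` applies photometric perturbations (light, noise, motion blur) that should not change the mask:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, images, augment_fn):
    """Consistency regularization between clean and perturbed inputs.

    Illustrative sketch only; the production loss may be weighted or combined
    with the supervised segmentation loss differently.
    """
    with torch.no_grad():
        ref = torch.sigmoid(model(images))            # prediction on clean input
    aug = torch.sigmoid(model(augment_fn(images)))    # prediction on perturbed input
    return F.mse_loss(aug, ref)                       # encourage identical outputs
```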

No matter how well an algorithm is designed, bad cases inevitably appear in real business scenarios. For example, when a person sits at a desk, the arm may appear disconnected from the body in the image; with the earlier model alone, the arm could be mistaken for background. We therefore developed a multi-task joint learning scheme that trains the model jointly on portrait segmentation, human keypoints, human parsing and other tasks.

At inference time the other tasks are not executed; they are only used during training to help the segmentation branch extract and learn relevant information. This improves the model without increasing its inference complexity.

In addition, most of us have experienced virtual background applications, and no matter which vendor's, the edge handling is rarely good. We therefore designed a dedicated edge loss constraint, which significantly improves edge accuracy; one possible form is sketched below.
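
As one possible form of such an edge constraint (a sketch, not necessarily the production loss), the boundary band of the ground-truth mask can be extracted with pooling-based dilation/erosion and the loss up-weighted there:

```python
import torch
import torch.nn.functional as F

def edge_weighted_bce(logits, target, edge_width=5, edge_weight=5.0):
    """BCE loss that up-weights pixels near the mask boundary.

    logits, target: (N, 1, H, W); target is a float mask in {0, 1}.
    Illustrative sketch of an edge loss; the production loss may differ.
    """
    pad = edge_width // 2
    dilated = F.max_pool2d(target, edge_width, stride=1, padding=pad)
    eroded = 1.0 - F.max_pool2d(1.0 - target, edge_width, stride=1, padding=pad)
    edge = (dilated - eroded).clamp(0, 1)          # 1 inside the boundary band
    weights = 1.0 + edge_weight * edge             # heavier weight on edges
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
    return (weights * loss).mean()
```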

For the final landing, the model needs to be lightweight. Common methods include pruning, compression, distillation, quantization and NAS. Here we take distillation as an example of how we make the model lightweight.

First, a quick look at the development of knowledge distillation. In 2014 Hinton published the pioneering work “Distilling the Knowledge in a Neural Network”, which, simply put, trains a complex teacher network and a lightweight student network, then uses KL divergence to pull the student’s output toward the teacher’s. Later improvements also transfer the spatial attention of a classification network to the student network.

Distillation works relatively well on classification tasks, but is much harder to tune for pixel-level tasks, whether segmentation, matting or super-resolution. Papers have addressed this over the years; for example, Microsoft’s “Structured Knowledge Distillation for Semantic Segmentation” proposes distilling the pairwise similarity between pixels in the feature map, combined with pixel-wise KL-divergence distillation and GAN-based adversarial distillation.

There are also papers that apply the focal-loss idea to distillation: first compute the student network's loss, assign larger weights to positions with large loss, and then compute the distillation loss weighted by these values.

We combined pixel-wise KL-divergence distillation with GAN-based distillation to distill our model.
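
A minimal sketch of the pixel-wise KL-divergence part (the adversarial/GAN branch is omitted), treating every pixel as a small classification problem:

```python
import torch
import torch.nn.functional as F

def pixelwise_kd_loss(student_logits, teacher_logits, T=2.0):
    """Pixel-wise knowledge distillation for segmentation.

    Matches the student's softened per-pixel class distribution to the
    teacher's via KL divergence. Sketch only; the GAN branch is not shown.
    """
    # logits: (N, C, H, W) -> per-pixel class distributions along dim=1
    s = F.log_softmax(student_logits / T, dim=1)
    t = F.softmax(teacher_logits.detach() / T, dim=1)
    return F.kl_div(s, t, reduction='batchmean') * (T * T)
```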

These are the actual results: single-person background replacement, virtual classrooms, virtual conference rooms and so on, which protect privacy while adding some fun to otherwise dull meetings and classes.

Video super-resolution algorithm

At present, video super-resolution consumes a lot of computing power, so it is mainly applied on the server for ultra-HD scenarios, including 2K-to-4K, 4K-to-8K upscaling and so on.

How do we run super-resolution on the device? Our main landing scenario is RTC, which imposes extreme requirements on latency, package size, power consumption and resource usage. Based on RTC service characteristics, Alibaba Cloud chose to apply super-resolution on the device under weak-network conditions.

Under a weak network, the QoS policy can lower resolution, bit rate and frame rate to keep transmission smooth; on the player side, the picture is then reconstructed to high definition by the super-resolution algorithm. This preserves transmission quality under weak-network conditions while still giving users a good viewing experience.

Looking back at the development of super-resolution models in recent years, they fall into traditional algorithms and deep-learning algorithms. The traditional ones are the familiar interpolation algorithms. From SRCNN in 2014 to now, new deep-learning papers keep appearing, but few of them can really be run in production.

From these network structures, some design ideas can be summarized: they mostly adopt residual, recursive or dense structures, and another line is GAN-based, though GANs are harder to deploy on the device. To extract inter-frame correlation, 3D convolution, deformable convolution, optical flow or motion estimation can be used, but these also consume considerable computing resources.

Super-resolution is inherently an ill-posed problem: there is no deterministic mapping from low resolution to high resolution (or vice versa), so it is hard to learn. For video, algorithms that align frames based on inter-frame information can hardly meet the real-time and power-consumption requirements on the device side.

In the RTC scenario, the algorithm must keep power consumption low and has strict requirements on package size, CPU usage, GPU usage and power draw. Moreover, even if mid- and high-end devices are covered, leaving a large share of mid- and low-end devices uncovered would lose customers from a business and commercialization perspective. Our goal is therefore to cover all device tiers.

The diagram outlines the structure of our super-resolution network. Algorithms cannot be separated from real scenes, and only scenario-specific algorithms deliver real business value. In RTC scenarios, the primary consideration is the loss caused by encoding compression and downsampling.

When designing the model, we took the encoding compression and downsampling losses into account: the first half of the model contains a distortion-repair module, and the second half enhances the upsampled image to approximate the ground truth. Both parts use attention structures to assist feature extraction.
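
A compact sketch of this two-stage idea, using a hypothetical `TinySR` module with a low-resolution restoration stage followed by PixelShuffle upsampling and an enhancement stage (attention blocks omitted for brevity; this is not the production network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySR(nn.Module):
    """Illustrative two-stage super-resolution sketch.

    Stage 1 repairs compression/downsampling distortion at low resolution;
    stage 2 upsamples (PixelShuffle) and enhances the result.
    """
    def __init__(self, channels=16, scale=2):
        super().__init__()
        self.scale = scale
        self.restore = nn.Sequential(              # distortion-repair module
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.upsample = nn.Sequential(             # learned upsampling
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.enhance = nn.Sequential(              # post-upsample enhancement
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, x):
        base = F.interpolate(x, scale_factor=self.scale,
                             mode='bilinear', align_corners=False)
        up = self.upsample(self.restore(x))
        return base + self.enhance(up)             # residual over bilinear upsample

y = TinySR()(torch.randn(1, 3, 360, 640))          # e.g. 640x360 -> 1280x720
```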

Portraits and ordinary pictures are relatively easy to super-resolve, but scenes with text and subtitles are prone to problems: high-frequency text information is easily lost during downsampling and easily damaged during encoding, so bad cases occur readily. We carried out a series of optimizations for this.

First, we heavily augment text data, varying font, color, angle and so on. In addition, introducing EdgeLoss for edge optimization effectively improves super-resolution of text.

Lightweighting is always a consideration for deployment. We design the network with structural reparameterization, whose essence is to train several feature-extraction branches in parallel.

For example, where inference keeps only a single 3×3 feature-extraction branch, several other convolutions can run in parallel during training and then be merged into it through the reparameterization formula for final inference. Although this adds a large amount of computation during training, it has no effect at inference time, while extracting more information and features. Through this structure we strike a good balance between algorithm effect and power consumption.
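
A RepVGG-style sketch of the merge for a 3×3 branch plus a parallel 1×1 branch (batch-norm branches, which would normally be folded first, are omitted); the helper name is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_3x3_1x1(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Fold a parallel 1x1 branch into a 3x3 convolution (RepVGG-style).

    During training, y = conv3x3(x) + conv1x1(x); for inference the 1x1 kernel
    is zero-padded to 3x3 and added to the 3x3 kernel, so a single convolution
    produces the same output. Illustrative sketch only.
    """
    merged = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, 3,
                       padding=1, bias=True)
    k1 = F.pad(conv1x1.weight, [1, 1, 1, 1])       # (O, I, 1, 1) -> (O, I, 3, 3)
    merged.weight.data = conv3x3.weight.data + k1
    b3 = conv3x3.bias.data if conv3x3.bias is not None else 0
    b1 = conv1x1.bias.data if conv1x1.bias is not None else 0
    merged.bias.data = torch.zeros_like(merged.bias.data) + b3 + b1
    return merged

# Quick check that the merged conv matches the two-branch output
x = torch.randn(1, 8, 32, 32)
c3, c1 = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
m = merge_3x3_1x1(c3, c1)
assert torch.allclose(c3(x) + c1(x), m(x), atol=1e-5)
```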

Lightweighting can be done not only with structural reparameterization but also with sparsification and pruning. If connections are pruned in a purely unstructured way, computation on CPU and GPU may not actually get faster: GPUs prefer highly parallel, regular data. Unstructured sparsity appears to zero out some connections and reduce parameters and computation, but in practice the latency may not drop because of channel alignment or non-contiguous memory access.

Therefore, the industry currently favors structured sparsity. In the two figures on the left, by plotting the absolute values of the parameters associated with a convolution kernel and how they change over time, kernels whose parameters gradually tend toward zero are found to be very sparse. Connections with very small values can be pruned, but pruning must also account for the connectivity of the preceding and following layers and be carried out from the perspective of the overall structure.

The two graphs here compare the super-resolution algorithm against baselines. The left figure shows that the super-resolution algorithm outperforms traditional algorithms across the different quality tiers, clearly not on the same level. In the right figure we measured the approximate PSNR distribution of the super-resolution algorithm under different bit rates and frame rates; with this distribution, the QoS strategy can in turn be guided to reduce bit rate and frame rate sensibly under different bandwidths.

This is the effect of the super-resolution algorithm in a live-streaming scene.

This is the effect of the super-resolution algorithm in a text scene.

3. In-depth optimization of software and hardware

As mentioned at the beginning, there is a great deal of heterogeneous hardware, but in actual business scenarios CPU and GPU optimization accounts for more than 90% of the work, so this part mainly takes CPU and GPU as examples to introduce optimization strategies. By comparison, the CPU is better suited to complex, serial work with control logic, while the GPU, with its large number of ALUs, is better suited to parallel computing.

This figure briefly describes the hardware and software architecture of the CPU. In overall design, CPU architectures are divided into complex instruction set (CISC), represented by x86, and reduced instruction set (RISC), represented by MIPS, ARM and RISC-V. An important feature of the CPU is its multi-level cache. To choose a suitable optimization method, we must first understand this hardware and software structure.

To complete a calculation, a virtual address is sent to the memory management unit and translated via a TLB lookup. On a hit, the data can be fetched directly from the cache, which is very efficient; on a miss, main memory or even disk must be accessed over the bus to load the data, which is far less efficient. The impact on power consumption, performance and latency is significant and must be considered during optimization.

This leads to the CPU optimization methods, roughly divided into the code level, loop level, memory level, instruction level and thread level. At the code level: minimize memory reads and writes, use the smallest adequate data types, align structures, and optimize branches.

Common methods at the loop level are loop unrolling, loop fusion and loop splitting. The memory level follows the principles of temporal and spatial locality, ensuring that data loaded once is used by as many computation instructions as possible.

In addition, frequent memory allocation and release should be minimized, and contiguous access, aligned access, coalesced access, aligned loads, cache prefetching and memory reuse should be ensured. At the instruction level: reduce data dependencies, optimize multiplication and division, make full use of register resources, use instruction pipelining to hide memory and instruction latency, apply SIMD optimization, and so on.

For SIMD optimization, ARM has NEON, x86 has SSE/AVX, and so on. So-called vectorization means that, for example, a matrix computation A×B+C=D requires many separate memory accesses and operations if computed with scalars, but with vector computation the number of accesses and instructions can be greatly reduced.
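
The idea can be illustrated in NumPy, where the vectorized expression maps onto SIMD/BLAS under the hood (matrix sizes are arbitrary):

```python
import numpy as np

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
C = np.random.rand(64, 64).astype(np.float32)

# Scalar version: one load / multiply / add at a time, many memory accesses
D_scalar = np.zeros_like(C)
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        D_scalar[i, j] = acc + C[i, j]

# Vectorized version: the same A x B + C as whole-array operations, which the
# underlying BLAS/SIMD implementation executes with far fewer instructions
D_vector = A @ B + C

assert np.allclose(D_scalar, D_vector, atol=1e-3)
```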

In the CPU instruction pipeline, executing an instruction goes through fetch, decode, execute and write-back. The fetch of the next instruction can start while the previous instruction is being decoded, which both maximizes CPU throughput and hides fetch and computation latency. Thread-level optimization includes multithreading over data blocks, multithreading over computational branches, and asynchronous processing with multiple threads.

For compilation and assembly, one can use automatic compiler optimization, inline assembly or handwritten assembly. As for CPU affinity, frequent switching between CPU cores causes frequent context switches, which greatly hurts performance, so cores can be selectively pinned to improve performance. Mobile SoCs have big and little cores, and different cores should be bound as needed to obtain optimal performance.

Let’s take a look at GPU hardware and software. The main server GPU vendors are Nvidia and AMD, while PC GPUs mainly include the Intel HD series and AMD's APU and Radeon series. Mainstream mobile GPUs are Qualcomm’s Adreno series, ARM’s Mali series and Apple’s A series, while Imagination’s PowerVR series is rarely used now.

GPU software standards mainly include Microsoft’s DirectX series; OpenGL, Vulkan, OpenGL ES and OpenCL maintained by the Khronos Group; Apple’s Metal; AMD’s Mantle; Intel’s oneAPI; and so on.

This figure shows the hardware and software architecture of a GPU. At the hardware level it is divided into a host and a device: the host is generally the CPU side and the device the GPU side. Here OpenCL and CUDA are taken as examples to introduce the architecture at the hardware and software level.

At the hardware level, CUDA exposes streaming multiprocessors (SMs), each containing many streaming processors (SPs) and registers; OpenCL's counterparts are compute units (CUs) and processing elements (PEs). From the memory point of view, the CPU side has main memory and the GPU side has device memory, which is further divided into global memory, constant memory, texture memory, local and shared memory, private memory and so on. From the thread-execution point of view, CUDA is organized into grids, blocks and threads, while OpenCL is organized into work-groups and work-items.

Having covered the basic structure of the GPU, let's now introduce GPU optimization methods.

At the code level, use the CPU for serial computation and the GPU for massively parallel computation. Use asynchronous modes and minimize direct interaction between the CPU and GPU; otherwise performance suffers from memory-access and I/O restrictions. Large sequential reads and writes, single-load multi-instruction computation, vectorized loads and stores, fewer divisions, low-bit quantization and the use of built-in functions also count as code-level optimizations.

Kernel-level optimization includes adjusting thread grouping sensibly per kernel, hiding instruction latency with a large number of threads, and finding the optimal kernel configuration through repeated experiments.

For thread-level grouping in OpenCL, you can let the system decide the work-group size automatically; as a rule of thumb, you can also set the work-group size to a factor of the NDRange size or a power of two. The fallback is auto-tuning to find the optimal grouping.

In CUDA, the SM configuration determines the number of thread blocks and concurrent warps it supports; in addition, since a warp is usually 32 threads, block sizes should be set to multiples of 32.

Memory-level optimization includes using linearly accessed, highly local data structures; optimizing data layout to maximize cache utilization and spatial/temporal locality; coalesced access; zero copy; and differentiated use of images and buffers.

In theory, image objects perform better on the Qualcomm Adreno series while buffer objects have the advantage on the ARM Mali series, but this is not absolute; a trial run in advance can determine which memory mode gives better inference performance. In addition, avoid bank conflicts, reduce off-chip memory access, and make full use of local memory and data-tile reuse.

Data-block reuse on the GPU is similar to that on the CPU: appropriately enlarging data blocks can effectively reduce the number of memory accesses while keeping computational complexity unchanged, which helps performance significantly.

Based on these hardware and software characteristics, I have listed six points on how to design more lightweight operators:

First, the model with small computation and low parallelism is suitable for CPU calculation.

Second, models with a large amount of parallelism are suitable for GPU computing.

Third, reduce the use of operators that are light on computation (or compute nothing) but heavy on memory access; they can be fused with other operators to reduce the number of memory accesses.

Fourth, avoid bad parallel operators.

Fifth, convolution kernels should not be too large; if necessary, replace a large kernel with several small ones. For example, one 5×5 can be replaced by two 3×3s, reducing the parameter count to 18/25 of the original, and one 7×7 can be replaced by three 3×3s, reducing it to 27/49 of the original (a quick numeric check follows this list).

Sixth, channel alignment should match the hardware and inference framework. MNN generally organizes data as NC4HW4, and Huawei HiAI data is usually aligned to 16; not knowing the underlying layout of these frameworks can waste computing resources.
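
A quick check of the parameter-count claim in point five, assuming equal channel counts and ignoring bias terms:

```python
# Parameters per (input-channel, output-channel) pair, ignoring bias terms
p_5x5 = 5 * 5                     # one 5x5 kernel
p_two_3x3 = 2 * 3 * 3             # two stacked 3x3 kernels, same receptive field
print(p_two_3x3, "/", p_5x5)      # 18 / 25

p_7x7 = 7 * 7                     # one 7x7 kernel
p_three_3x3 = 3 * 3 * 3           # three stacked 3x3 kernels
print(p_three_3x3, "/", p_7x7)    # 27 / 49
```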

For deep learning, computation-graph optimization is a core step. It includes node replacement, branch optimization, subgraph transformation and operator fusion, of which operator fusion is used most often. On the right is an example of fusing convolution and BN: mathematically, BN can be fully folded into the convolution and simplified into a single convolution, reducing memory access and computation at inference time.
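
A sketch of the standard conv+BN folding in PyTorch (inference-time only); the helper name is hypothetical:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution.

    BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta is a per-channel affine
    transform, so it can be absorbed into the conv weights and bias.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)   # gamma / std
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Verify: the fused conv matches conv followed by BN in eval mode
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8).eval()
bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2.0)
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```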

In the following example, multiple 1×1 convolutions can be merged before the 3×3 and 5×5 convolutions are carried out. And since the final concat is a pure memory operation, it can be fused with the convolutions so that the concat completes as their outputs are written, before the memory is released.

Convolution and matrix multiplication are the two most frequently used operators. There are many ways to implement convolution: it can be implemented directly with sliding windows and optimized with loop unrolling, data tiling, multithreading and instruction-level parallelism; it can also be turned into matrix multiplication, which requires an im2col transform first and then a GEMM.
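
An illustrative NumPy sketch of the im2col-then-GEMM strategy (valid padding, stride 1), checked against a direct sliding-window implementation:

```python
import numpy as np

def conv2d_im2col(x, w):
    """2D convolution (valid padding, stride 1) via im2col + matrix multiply.

    x: (C, H, W) input, w: (O, C, KH, KW) kernels. Illustrative sketch only.
    """
    C, H, W = x.shape
    O, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    # Unfold every KHxKW patch into a column: (C*KH*KW, OH*OW)
    cols = np.empty((C * KH * KW, OH * OW), dtype=x.dtype)
    idx = 0
    for c in range(C):
        for i in range(KH):
            for j in range(KW):
                cols[idx] = x[c, i:i + OH, j:j + OW].reshape(-1)
                idx += 1
    # Convolution becomes one big matrix multiplication (GEMM)
    out = w.reshape(O, -1) @ cols                  # (O, OH*OW)
    return out.reshape(O, OH, OW)

# Sanity check against a direct sliding-window implementation
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
ref = np.zeros((4, 6, 6), dtype=np.float32)
for o in range(4):
    for i in range(6):
        for j in range(6):
            ref[o, i, j] = np.sum(x[:, i:i + 3, j:j + 3] * w[o])
assert np.allclose(conv2d_im2col(x, w), ref, atol=1e-4)
```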

FFT is rarely used now: it suits convolutions with large kernels, while a direct implementation is faster for small kernels. Winograd is the widely used optimization today. Its idea is simple: exploit the fact that addition takes fewer instruction cycles than multiplication, precompute some constant sub-expressions, and replace multiplications with additions to reduce the total amount of memory access and computation.

On the right is the Strassen algorithm for optimizing matrix multiplication: the matrices are partitioned into blocks, some multiplications are eliminated, and intermediate matrices are introduced for auxiliary computation, further reducing the amount of work. At each level of 2×2 blocking, it replaces the eight block multiplications of the naive algorithm with seven multiplications and eighteen additions and subtractions.
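
A sketch of one level of Strassen on 2×2 block-partitioned, even-sized square matrices, verified against NumPy's matmul:

```python
import numpy as np

def strassen_once(A, B):
    """One level of Strassen's algorithm: 7 block multiplications instead of 8,
    at the cost of 18 block additions/subtractions."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M3 + M4 + M6
    return np.block([[C11, C12], [C21, C22]])

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(strassen_once(A, B), A @ B)
```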

That covers the algorithm-level optimizations. The final engineering optimization also needs to address the following: loop optimization, data tiling and rearrangement, increasing data reuse, reducing cache misses, vectorization, converting floating point to fixed point, low-bit quantization, multithreading, and so on.

The overall optimization process is roughly as shown in the figure: first design a lightweight algorithm model, then do software optimization, including optimizing the CV filters used in pre- and post-processing and the inference engine, then adapt to the various platforms and operating systems, and finally apply hardware acceleration.

Under the OpenCL framework, many kernels need to be created to distribute computing tasks to the hardware. Threads and instructions are managed in queues, which have their own optimization strategies, so you can rely on OpenCL's scheduling; you can also use the flush interfaces provided by OpenCL to implement your own dynamic queue-flushing mechanism on top of the schedule to improve performance.

Optimize the business pipeline to reduce host-device data copies. Determine whether the performance bottleneck is memory access or computation, then choose the optimization strategy accordingly. Watch the impact of big/little core switching and frequency throttling, and watch for resource contention from business processes or other algorithms. Pay attention to system resource usage, power consumption, and the audio/video stall rate. For heterogeneous devices, consider pre-tuning computational performance to select the best execution path; open-source profiling tools can also help with performance optimization.

That concludes my sharing of algorithm and software/hardware optimization methods. Now let's look to the future together.

4. Future outlook

With the development of audio and video technology, the way people interact is gradually evolving: online and offline, virtual and real will be ever more closely integrated. New interaction modes are bound to give rise to a new generation of real-time audio and video processing architectures that integrate cloud and device as well as software and hardware. These will in turn pose the ultimate challenges to software/hardware development and algorithm optimization: lower latency, greater computing power, lower power consumption.

This is Alibaba Cloud Video Cloud's "cloud-device integrated" solution. Considering the limited computing resources of end devices, algorithms with high computing-power and latency requirements can be processed and rendered by the cloud GRTP engine; after processing, the device only needs to do ordinary rendering, truly achieving "zero processing" on the device. On the right is a video demo of our cloud-device integrated real-time background replacement + virtual anchor + bel canto voice change.

In the future, AI, AR, VR, XR, the metaverse and so on will demand ever more computing power, and relying solely on algorithm and software optimization has inherent limits. We must therefore jointly push the rapid development of hardware to truly lift the ceiling on computing power and performance. Our judgment is that deeper cloud-device integration and further cost reduction and efficiency gains through software-hardware co-design will be needed so that algorithms can truly empower thousands of industries.

That is all for my sharing this time. Thank you!
