On July 31, Alibaba Cloud Video Cloud was invited to attend GOTC 2021, the global Open Source Technology Summit co-hosted by the OpenAtom Foundation, the Linux Foundation Asia-Pacific, and Open Source China. At the special session on audio and video performance optimization, it shared its experience with performance acceleration in open-source FFmpeg and with building and optimizing an end + cloud integrated media system.

As we all know, FFmpeg, the Swiss Army knife of open-source audio and video processing, is extremely popular for being open source and free, powerful, and easy to use. The high computational complexity of audio and video processing makes performance acceleration a constant theme in FFmpeg development. The Alibaba Cloud Video Cloud media processing system draws extensively on FFmpeg's experience in performance acceleration and, based on its own products and architecture, has designed and optimized an end + cloud integrated media system to deliver real-time media services with high performance, high picture quality, and low latency.

Li Zhong, a senior technical expert at Alibaba Cloud Intelligent Video Cloud, is currently responsible for Alibaba Cloud Video Cloud's RTC cloud media processing service and for media processing performance optimization across the integrated end + cloud system. He is a maintainer and technical committee member of the official FFmpeg codebase and has contributed to many audio and video open-source projects.

The topic of this talk is “From FFmpeg Performance Acceleration to End + Cloud Integrated Media System Optimization”, and it covers three areas:

1. Common performance acceleration methods in FFmpeg
2. Cloud media processing system optimization
3. End + cloud collaborative media processing system

Common performance acceleration methods in FFmpeg

One of the major challenges audio and video developers face is the heavy demand for computing power. It is not just a matter of optimizing a single-task algorithm; it involves hardware, software, scheduling, the business layer, and many different business scenarios. The combined complexity of the client, the cloud, different devices, and different networks requires developers to do systematic architecture design and optimization.

FFmpeg is a very powerful piece of software that includes audio and video decoding and encoding, a wide variety of audio and video filters, and support for many protocols. As open-source software, FFmpeg is licensed mainly under the GPL or LGPL, and it is written in C and assembly. It also provides many ready-to-use command-line tools, such as ffmpeg for transcoding, ffprobe for analyzing audio and video, and ffplay for playback. Its core libraries include libavcodec for encoding and decoding, libavfilter for audio and video filtering, and libavformat for protocol and container support.
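For example, a typical transcode with ffmpeg and a stream inspection with ffprobe might look like this (file names are placeholders):

```
# Transcode to 720p H.264 with AAC audio using the libx264 encoder
ffmpeg -i input.mp4 -vf scale=1280:720 -c:v libx264 -preset fast -c:a aac output.mp4

# Inspect the streams of a media file
ffprobe -show_streams input.mp4
```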

In the FFmpeg development community, audio and video performance optimization is an eternal theme, and the open-source code provides many classic performance optimization techniques, mainly general acceleration, CPU instruction acceleration, and GPU hardware acceleration.

General acceleration

General acceleration mainly covers algorithm optimization, I/O read/write optimization, and multithreading optimization. The goal of algorithm optimization is to improve performance without increasing CPU usage. The most typical examples are the various fast motion search algorithms in codecs, which greatly speed up encoding at the cost of only a small loss of accuracy. Similar methods exist for the various pre- and post-processing filter algorithms, which can also be combined to eliminate redundant computation and improve performance.

Below is a typical pair of denoising and sharpening convolution templates, each requiring a 3x3 matrix convolution. Note that the sharpening template is similar to the smoothing template, so when denoising and sharpening are applied together, a simple subtraction from the smoothing result reproduces the sharpening result, which removes redundant computation and improves performance. Of course, algorithm optimization has its limits, such as the accuracy loss of codec fast-search algorithms, or having to trade space complexity for time complexity.
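A minimal C sketch of this idea, assuming 8-bit grayscale input and a 3x3 box blur as the smoothing template; one common form of the relationship is unsharp masking, where sharp = 2*src − blur:

```c
#include <stdint.h>

// Clamp a value to the 8-bit pixel range.
static inline uint8_t clamp_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

// Compute the 3x3 box blur once per pixel and derive the sharpened pixel
// from it, instead of running two separate 3x3 convolutions.
void smooth_and_sharpen(const uint8_t *src, uint8_t *smooth, uint8_t *sharp,
                        int w, int h, int stride)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += src[(y + dy) * stride + (x + dx)];
            int blur = sum / 9;                      // smoothing template
            smooth[y * stride + x] = (uint8_t)blur;
            // Sharpening reuses the blur: 2*src - blur, no second convolution.
            sharp[y * stride + x] = clamp_u8(2 * src[y * stride + x] - blur);
        }
    }
}
```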

The second optimization method is I/O read/write optimization. A common approach is to exploit CPU prefetching to reduce cache misses. In the figure above, the two traversal orders produce the same result, but the row-wise version is faster than the column-wise one, mainly because the CPU prefetches the upcoming pixels while the current row is being processed. In addition, memory copies should be minimized: YUV video frames are very large, so copying every frame carries a relatively high performance cost.
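A minimal illustration of the two traversal orders, summing the pixels of a frame stored in row-major order:

```c
#include <stdint.h>

// Row-wise traversal: accesses are sequential in memory, so the hardware
// prefetcher keeps the cache warm.
uint64_t sum_rows(const uint8_t *buf, int w, int h) {
    uint64_t sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += buf[y * w + x];
    return sum;
}

// Column-wise traversal: each access jumps a whole row ahead, defeating the
// prefetcher and causing many more cache misses on large frames.
uint64_t sum_cols(const uint8_t *buf, int w, int h) {
    uint64_t sum = 0;
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            sum += buf[y * w + x];
    return sum;
}
```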

Multithreading optimization in general acceleration mainly uses the CPU's multiple cores for parallel execution, which can improve performance dramatically. The graph in the lower left shows performance improving up to 8x as the number of threads increases. This raises a common question about multithreaded acceleration: are more threads always better?

The answer is No.

First, multithreading optimization runs into performance bottlenecks because of CPU core limits and the cost of thread waiting and scheduling. For example, in the graph in the lower left, the speedup for 10 threads is very close to that for 11 threads, which shows the diminishing marginal returns of multithreading, along with increased latency and memory consumption (especially for the commonly used inter-frame multithreading).

Second, inter-frame (frame-level) multithreading requires building a frame buffer pool for parallelism, which means buffering many frames; for latency-sensitive media processing such as low-latency live streaming and RTC, this has a significant negative impact. In contrast, FFmpeg also supports slice-level multithreading, which divides a frame into multiple slices and processes them in parallel, effectively avoiding the inter-frame buffering delay.
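A minimal sketch of configuring slice-level threading with the libavcodec API (the codec must itself support slice threading, as H.264 decoding does):

```c
#include <libavcodec/avcodec.h>

// Prefer slice-level threading for latency-sensitive decoding: frame-level
// threading (FF_THREAD_FRAME) buffers several frames and adds delay, while
// slice-level threading parallelizes within one frame.
AVCodecContext *open_low_latency_decoder(const AVCodec *codec)
{
    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    if (!ctx)
        return NULL;
    ctx->thread_count = 4;                // worker threads (0 = auto-detect)
    ctx->thread_type  = FF_THREAD_SLICE;  // avoid inter-frame buffering delay
    if (avcodec_open2(ctx, codec, NULL) < 0) {
        avcodec_free_context(&ctx);
        return NULL;
    }
    return ctx;
}
```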

Third, as the number of threads increases, the cost of thread synchronization and scheduling also grows. Taking the acceleration of one FFmpeg filter in the lower-right figure as an example, CPU cost rises noticeably with the thread count, and the curve bends upward at the end, showing that the added cost becomes more and more pronounced.

CPU instruction acceleration

CPU instruction acceleration means SIMD (single instruction, multiple data) acceleration. Traditional general-purpose registers and instructions process one element per instruction, whereas a single SIMD instruction processes multiple elements of an array at once, yielding significant speedups.

Today's mainstream CPU architectures all have corresponding SIMD instruction sets. x86 SIMD instruction sets include MMX, SSE, AVX2, and AVX-512. A single AVX-512 instruction can process 512 bits, so the acceleration effect is very pronounced.

The SIMD coding styles accepted in the FFmpeg community are inline assembly and handwritten assembly. The FFmpeg community disallows intrinsics because of their dependence on compiler versions and the inconsistent code generation and acceleration across compilers.
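To illustrate the SIMD idea itself (written with AVX2 intrinsics here purely for readability; as noted, FFmpeg's own code uses handwritten assembly instead):

```c
#include <stdint.h>
#include <immintrin.h>  // AVX2 intrinsics

// Scalar: one 8-bit addition per iteration.
void add_u8_scalar(const uint8_t *a, const uint8_t *b, uint8_t *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

// AVX2: 32 8-bit additions per instruction. Assumes n is a multiple of 32;
// a real implementation would handle the tail with scalar code.
void add_u8_avx2(const uint8_t *a, const uint8_t *b, uint8_t *dst, int n) {
    for (int i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi8(va, vb));
    }
}
```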

Although SIMD instructions deliver good acceleration, they also have limitations.

First, many algorithms are not parallelizable and cannot be optimized with SIMD instructions.

Second, programming is hard. Assembly programming is harder in itself, and SIMD instructions come with special programming requirements such as memory alignment. AVX-512 requires 64-byte memory alignment; failure to align can cause performance loss or even a crash.
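A minimal sketch of allocating a 64-byte-aligned buffer with C11 aligned_alloc (inside FFmpeg itself, av_malloc serves a similar purpose by returning suitably aligned memory):

```c
#include <stdlib.h>
#include <stdint.h>

// AVX-512 loads/stores perform best on (and the aligned variants require)
// 64-byte-aligned buffers. C11 aligned_alloc guarantees the alignment; the
// requested size must be a multiple of the alignment.
uint8_t *alloc_frame_buffer(size_t size)
{
    size_t padded = (size + 63) & ~(size_t)63;  // round up to 64 bytes
    return aligned_alloc(64, padded);           // release with free()
}
```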

We are seeing an instruction-set race among CPU vendors, with ever wider vectors: from SSE to AVX2 to AVX-512, the bit width has grown substantially. But is wider always better? The figure above shows the speedup of AVX-512 over AVX2 in x265 encoding. AVX-512 has twice the bit width of AVX2, yet the performance gain is usually far less than double, often only around 10%, and in some cases AVX-512 is even slower than AVX2.

The reason?

First, a single batch of input data may not be 512 bits wide; it may be only 128 or 256 bits.

Second, many complex operations (such as those in encoders) cannot be parallelized at the instruction level.

Third, the high power consumption of AVX-512 causes the CPU to lower its frequency, slowing down overall CPU processing. Looking at the x265 ultrafast preset above, AVX-512 actually encodes more slowly than AVX2. (See: networkbuilders.intel.com/docs/accele…)

Hardware acceleration

The mainstream hardware acceleration in FFmpeg is GPU acceleration. Hardware acceleration interfaces fall into two groups. First, hardware vendors provide their own acceleration interfaces: Intel mainly provides QSV and VAAPI, NVIDIA provides NVENC, CUVID, NVDEC, and VDPAU, and AMD provides AMF and VAAPI. Second, OS vendors provide their own acceleration interfaces and frameworks, such as DXVA2 on Windows, MediaCodec on Android, and VideoToolbox on Apple platforms.
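With FFmpeg's libavutil hardware-context API, a device context for one of these backends can be created as sketched below; the device type chosen here (CUDA) is just one example, and availability depends on platform and build options:

```c
#include <libavutil/hwcontext.h>

// Minimal sketch: create a hardware device context, e.g. for NVIDIA CUDA.
// Other backends plug in the same way: AV_HWDEVICE_TYPE_QSV, _VAAPI,
// _VIDEOTOOLBOX, _MEDIACODEC, ... depending on the platform.
AVBufferRef *create_hw_device(void)
{
    AVBufferRef *hw_device_ctx = NULL;
    int err = av_hwdevice_ctx_create(&hw_device_ctx,
                                     AV_HWDEVICE_TYPE_CUDA,
                                     NULL,   // default device
                                     NULL, 0);
    return err < 0 ? NULL : hw_device_ctx;
}
```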

Hardware acceleration can significantly improve media processing performance, but it also brings problems.

First, hardware encoding quality is constrained by the hardware's design and cost, and is often worse than software encoding quality; on the other hand, hardware encoding has a significant performance advantage that can be traded back for quality. As the example below shows, a hardware encoder's motion search window is relatively small, which degrades encoding quality. The remedy is the HME (hierarchical motion estimation) algorithm: given a small search window, the large picture is first scaled down to a very small one, motion search is performed there, and the search is then refined step by step at larger scales. This significantly extends the effective motion search range so that the best-matching block can be found, improving encoding quality.
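A self-contained conceptual sketch of HME for one 8x8 block, using a two-level pyramid; this is an illustration of the idea, not FFmpeg or hardware-encoder code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

typedef struct { int x, y; } MV;

// SAD of the 8x8 block at (bx,by) in cur vs the block displaced by mv in ref.
static int sad8x8(const uint8_t *ref, const uint8_t *cur,
                  int w, int h, int bx, int by, MV mv) {
    int rx = bx + mv.x, ry = by + mv.y;
    if (rx < 0 || ry < 0 || rx + 8 > w || ry + 8 > h)
        return INT_MAX;  // candidate outside the frame
    int sad = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sad += abs(cur[(by + y) * w + bx + x] - ref[(ry + y) * w + rx + x]);
    return sad;
}

// Full search in a small +/-range window around a predicted vector.
static MV small_window_search(const uint8_t *ref, const uint8_t *cur,
                              int w, int h, int bx, int by, MV pred, int range) {
    MV best = pred;
    int best_sad = INT_MAX;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            MV mv = { pred.x + dx, pred.y + dy };
            int sad = sad8x8(ref, cur, w, h, bx, by, mv);
            if (sad < best_sad) { best_sad = sad; best = mv; }
        }
    return best;
}

// 2x box downscale; caller frees the result.
static uint8_t *downscale2x(const uint8_t *src, int w, int h) {
    int dw = w / 2, dh = h / 2;
    uint8_t *dst = malloc((size_t)dw * dh);
    for (int y = 0; y < dh; y++)
        for (int x = 0; x < dw; x++)
            dst[y * dw + x] = (src[2*y*w + 2*x]     + src[2*y*w + 2*x + 1] +
                               src[(2*y+1)*w + 2*x] + src[(2*y+1)*w + 2*x + 1] + 2) / 4;
    return dst;
}

// HME: search on the half-resolution level first, then double the vector and
// refine at full resolution. With a +/-4 window per level this covers an
// effective +/-12 range at full resolution, far beyond a single +/-4 window.
MV hme_search(const uint8_t *ref, const uint8_t *cur, int w, int h,
              int bx, int by) {
    uint8_t *ref2 = downscale2x(ref, w, h);
    uint8_t *cur2 = downscale2x(cur, w, h);
    MV zero = {0, 0};
    MV coarse = small_window_search(ref2, cur2, w / 2, h / 2,
                                    bx / 2, by / 2, zero, 4);
    MV pred = { coarse.x * 2, coarse.y * 2 };  // scale vector up one level
    MV fine = small_window_search(ref, cur, w, h, bx, by, pred, 4);
    free(ref2);
    free(cur2);
    return fine;
}
```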

Second, the memory copies between CPU and GPU in a hardware-accelerated pipeline can cause performance degradation.

The interaction between CPU and GPU is not just a data transfer; it also involves pixel format conversion. For example, the CPU side uses the linear I420 format, while the GPU, which excels at matrix computation, uses the tiled NV12 format.

This memory format conversion brings a significant performance loss. CPU/GPU memory interaction can be avoided altogether by building a pure hardware pipeline; when the CPU and GPU must interact, GPU-side memory copies can be used to speed up the transfer.
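For example, a pure-GPU pipeline on an NVIDIA card might look like this with the ffmpeg CLI (assuming a build with CUDA/NVENC support); decoded frames stay in GPU memory through scaling and encoding, with no CPU/GPU copies or CPU-side format conversion:

```
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
       -vf scale_cuda=1280:720 -c:v h264_nvenc output.mp4
```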

Below is a summary of performance optimization. Beyond the methods above, media processing on the client side has its own peculiarities. For example, mobile CPUs use a big.LITTLE core architecture, and a thread scheduled onto a little core performs noticeably worse than on a big core, making performance unstable.

In addition, some algorithms simply cannot run on certain device models no matter how much they are optimized. In such cases optimization moves to the business-strategy level, for example maintaining a whitelist/blacklist and not enabling the algorithm on unsupported devices.

Cloud media processing system optimization

Media processing faces two major challenges: cost optimization in the cloud, and device adaptation and compatibility on the client. The figure below shows a typical cloud media processing system, comprising the single-machine layer, the cluster scheduling layer, and the business layer.

The single-machine layer includes the FFmpeg pipeline processing framework, the codecs, and the hardware layer. Taking the cloud transcoding system as an example, its core technical metrics are picture quality, processing speed, latency, and cost. For picture quality, Alibaba Cloud Video Cloud has developed narrowband HD technology and the S265 encoder, which significantly improve encoded picture quality. For processing speed and latency, it draws extensively on FFmpeg's performance acceleration methods, such as SIMD instructions, multithreading, and support for heterogeneous computing. Cost is a complex system-level concern spanning the scheduling layer, the single-machine layer, and the business layer: it requires fast elastic scaling, accurate mapping of single-machine resources, and reducing the compute cost of each task.

Cloud cost optimization

The core of cloud cost optimization is optimizing three curves: the actual resource consumption of a single task, the estimated resource allocation for a single task, and the total resource pool. Four practical problems must be faced during optimization:

First, in the video cloud business the trend toward service diversity is becoming more and more pronounced, spanning VOD, live streaming, RTC, AI editing, and cloud editing. The challenge of service diversity is how to mix and schedule multiple services within a single resource pool.

Second, a few years ago mainstream video was 480p; now 720p and 1080p processing tasks are mainstream. Looking ahead, 4K, 8K, VR, and other media processing will keep growing, putting ever greater pressure on single-machine performance.

Third, the scheduling layer needs to estimate the resource consumption of each task, but the actual consumption of a single task is affected by many factors: the complexity of the video content, different algorithm parameters, and multi-process switching all influence it.

Fourth, there will be more and more pre-processing before encoding, and a transcoding task often needs multi-bitrate or multi-resolution output. The various kinds of pre-processing (image enhancement, ROI detection, frame-rate up-conversion, super-resolution) can increase processing cost significantly.

So let’s think about how we can optimize these three curves in general.

The actual resource consumption of a task is optimized mainly through per-task performance optimization, algorithm performance optimization, and pipeline architecture optimization.

The core goal of resource allocation is to bring the yellow curve in the figure above ever closer to the black curve, reducing wasted allocation. When allocated resources are insufficient, the algorithm can be automatically upgraded or downgraded to keep online tasks from stalling, for example lowering the encoder preset from medium to fast.
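A hypothetical sketch of such a degradation policy; the names here (TaskStats, adapt_preset, the thresholds) are illustrative, not an actual Alibaba Cloud or FFmpeg API:

```c
// If a live transcode falls behind real time, step the encoder down to a
// cheaper preset instead of letting the task stall; step back up when there
// is headroom again.
typedef struct { double speed; int preset; } TaskStats;  // speed: 1.0 = real time

enum { PRESET_MEDIUM, PRESET_FAST, PRESET_VERYFAST };

void adapt_preset(TaskStats *t)
{
    if (t->speed < 0.95 && t->preset < PRESET_VERYFAST)
        t->preset++;   // degrade: e.g. medium -> fast
    else if (t->speed > 1.2 && t->preset > PRESET_MEDIUM)
        t->preset--;   // headroom: upgrade quality again
}
```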

For the total resource pool, first note that the yellow curve has peaks and troughs. If the VOD tasks are currently in a trough, live streaming tasks can be shifted in, so that more tasks run while the peak of the whole pool stays unchanged. Second, the total pool should scale elastically and release resources quickly within a given time window to reduce pool consumption; this is also a core cost concern for the scheduler.

Let’s expand on some optimization methods.

CPU model optimization

The main goals of CPU model optimization are to increase the throughput of a single CPU and to reduce CPU fragmentation. The figure above illustrates the advantages of multi-core CPUs. However, on multi-core machines, memory accesses that cross NUMA nodes can cause performance degradation.
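On Linux, for example, a transcoding worker can be pinned to a single NUMA node with numactl so that its memory accesses stay node-local (a sketch, assuming node 0 and placeholder file names):

```
numactl --cpunodebind=0 --membind=0 ffmpeg -i input.mp4 -c:v libx264 output.mp4
```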

Accurate single-machine resource profiling

The main goal of accurate single-machine resource profiling is knowing exactly how many resources each task needs. It is a systematic tool chain: it needs quality assessment tools and compute resource statistics tools; it needs a complex video set covering a wide range of scenarios; and it needs to feed the measured resource consumption back iteratively, fixing the resource cost of each task and guiding adaptive algorithm downgrading at runtime.

1-N architecture optimization

A transcoding task may output multiple resolutions and bitrates. The traditional approach is to launch N independent 1-to-1 processes. Such an architecture clearly has problems, such as redundant decoding and pre-encoding processing. One optimization is to merge these tasks, turning N 1-to-1 transcodes into a single 1-to-N transcode, so that video decoding and pre-encoding processing only have to be done once, achieving the goal of cost optimization.
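With the ffmpeg CLI, such a 1-to-N transcode can be sketched with the split filter, decoding once and encoding two renditions in one process (file names and bitrates are placeholders):

```
ffmpeg -i input.mp4 -filter_complex \
  "[0:v]split=2[v1][v2];[v1]scale=1280:720[out1];[v2]scale=640:360[out2]" \
  -map "[out1]" -map 0:a -c:v libx264 -b:v 2500k -c:a aac out_720p.mp4 \
  -map "[out2]" -map 0:a -c:v libx264 -b:v 800k  -c:a aac out_360p.mp4
```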

1-to-N transcoding also presents new challenges. First, although the FFmpeg transcoding tool supports 1-to-N transcoding, its modules run serially, so a single 1-to-N task will be slower than a single 1-to-1 task. Second, scheduling becomes harder: the resource granularity of a single task is larger, and the required allocation is harder to estimate. Third, algorithm effects can differ, because some video pre-processing that used to run after scaling is moved before scaling in the 1-to-N architecture. Changing the media processing flow can change algorithm results (usually this is not a big problem, since processing before scaling incurs no scaling loss of image quality and generally works better).

End + cloud collaborative media processing system

The advantage of on-device media processing is exploiting the computing power already present on mobile devices to reduce cost. Ideally, every kind of on-device compute would be fully utilized, every algorithm would be carefully performance-optimized and adapted to every device, and everything would land on every client at zero cost.

But while the ideal is rich, the reality is harsh. There are four practical problems:

First, device adaptation is hard: it requires adapting to a huge range of OS versions and hardware models.

Second, algorithm rollout is hard. In reality it is impossible to optimize every algorithm for every device, so device-side performance bottlenecks make it hard for algorithms to land.

Third, experience optimization is hard. Customers use different SDKs, and even the Alibaba Cloud SDK exists in multiple versions. This SDK fragmentation makes some solutions hard to ship, for example non-standard H.264 encoding; even rolling out the H.265 codec faces challenges, since some devices do not support H.265.

Fourth, reaching users is hard: the cycle for upgrading or replacing an SDK is relatively long.

Facing this reality, we propose a media processing solution based on cloud-device collaboration. The main idea is to achieve a better user experience through cloud processing plus on-device rendering.

The main types of cloud-device collaborative media processing are transcoding and previewing. Transcoding is a one-way data flow. Previewing requires first pushing the streams to the cloud, applying the various effects there, then pulling the result back so the host can check whether the effect matches expectations, and finally pushing it to the audience through the CDN.

This approach has its challenges. First, cloud processing adds computing cost (which, of course, can be optimized in many ways, and the processing is transparent to the client). Second, latency increases, because cloud processing adds link delay.

As RTC technology matures, cloud media processing can be supported at low cost and low latency through RTC's low-latency transport protocol plus the various cost optimizations in the cloud, creating "RTC+" real-time media processing services in the cloud. RMS, the RTC real-time media processing service built by Alibaba Cloud Video Cloud, delivers a high-performance, low-cost, high-quality, low-latency, and more intelligent cloud-device collaborative media processing solution.

The left figure above shows the overall architecture of RMS, divided into the pipeline layer, module layer, hardware adaptation layer, and hardware layer. Pipelines for various business scenarios can be assembled from the module layer. The module layer is the core of audio and video processing, implementing the various AI and effects modules with low latency and high quality.

End + cloud collaborative media processing: enabling RTC+

Taking cloud-rendered editing as an example, traditional editing schemes struggle to guarantee a consistent experience and smooth performance across devices. Our idea is to deliver only the editing instructions from the device, while video composition and rendering happen in the cloud, which supports heavy effects and guarantees consistent results on every device.

Let's look at the pipeline for cloud-rendered editing. The editing web page is responsible for delivering signaling; the editing instructions are forwarded through the scheduling layer to the RMS media processing engine, which renders them in the cloud. After composition, the result is encoded and pushed through the SFU, and the editing effect is finally viewed back on the editing web page. As the demo above shows, the web page composes many tracks into a single track. A 4x4 grid like this would struggle to run on a low-end machine if processed on the device, but the cloud runs such an effect easily.

After a high-bitrate, low-definition video is streamed to the cloud, Alibaba Cloud Video Cloud's narrowband HD technology can deliver higher definition at a lower bitrate. In the demo below, after narrowband HD is enabled, the video's clarity improves noticeably (and the bitrate also drops significantly).

Combining RTC's low latency with AI effects processing also enables many interesting scenarios. A real person's stream is pushed to the cloud, the cloud generates a cartoon-portrait version, and the cartoon portraits are then used for real-time communication, so the participants see each other's cartoon avatars.

With cloud-side matting, virtual education and virtual meeting room scenarios are easy to build. For example, in a virtual classroom, the presenter can be composited into the PowerPoint presentation to make the whole presentation more immersive; in a virtual meeting room, the different participants are matted out and arranged into the virtual room scene to achieve the virtual meeting room effect.
