In the media and content communication industry, video accounts for an ever larger share of transmitted information. To suit different playback scenarios, a video usually needs changes to its container format, encoding format, resolution, bit rate, and other media stream properties. These processing steps are collectively referred to as video transcoding.

NetEase Yunxin builds on NetEase's 21 years of IM and audio/video technology to provide first-class integrated communication cloud services. Among its offerings, the Yunxin video-on-demand service, based on a distributed processing cluster and large-scale allocation of system resources, meets the playback requirements of all terminal devices and provides enterprise users with fast, stable cloud functions such as video upload, storage, transcoding, playback, and download. Yunxin's distributed task processing system carries the media processing workload. Its main functions include audio/video remuxing, transcoding, merging, screenshots, encryption, and watermarking, as well as a series of pre-processing functions such as image scaling, image enhancement, and volume normalization.

As the core function of media processing, video transcoding usually takes a long time for large video files. To improve quality of service, we focus on raising transcoding speed. This article centers on shard transcoding and introduces NetEase Yunxin's work on transcoding speed and the improvements achieved.

Factors influencing transcoding performance and directions for optimization

A common video transcoding pipeline is shown in the following figure:



In the transcoding pipeline, the bottleneck lies mainly in the video stream, so our discussion of speed improvement focuses on video stream processing; audio stream processing is not considered here. For video stream processing, the following aspects matter:

  • Source video: generally, the longer the source video, the longer decoding and encoding take.
  • Encapsulation and codec: processing that requires no decoding or encoding, such as remuxing and keyframe-aligned clipping, needs very little computation and generally finishes within 1~2 s. If re-encoding is required, the time varies with the source video and the output parameters, chiefly coding format, bit rate, resolution, and frame rate. Coding algorithms differ in compression rate and computational complexity, and therefore in time consumption; for example, AV1 encoding takes longer than H.264. The larger the target bit rate, resolution, and frame rate, the longer the computation usually takes.


  • Horizontal and vertical scaling of computing resources: the stronger the single-core power of a general-purpose processor, the less time transcoding takes. GPUs, which are better suited to image computation, also help reduce transcoding time, as does raising the concurrency of the transcoding execution stream. Concurrency here can be multi-threaded or multi-process: multi-threading raises parallelism within a single process, while the multi-process approach splits the file into slices and transcodes them with multiple processes.
  • Cluster task scheduling: a multi-tenant cloud service system usually designs its task priority scheduling algorithm around resource allocation among tenants and tenant priorities. Scheduling efficiency shows up mainly in three questions: how to schedule tasks in less time, how to achieve high throughput with fewer cluster resources, and how to balance priority against starvation.

Given the above factors, we pursue the following optimization directions: improving hardware capability, optimizing encoding, shard transcoding, and improving cluster scheduling efficiency.

Dedicated hardware acceleration

Multimedia processing is a typical computation-intensive scenario, so optimizing the overall computational performance of multimedia applications is very important. The CPU is a general-purpose computing resource, and offloading video and image computation to dedicated hardware is a common approach. Chip manufacturers such as Intel, NVIDIA, AMD, ARM, and TI all offer multimedia hardware acceleration solutions that improve encoding efficiency for high-bit-rate, high-resolution video scenarios.

Our transcoding system is mainly based on the FFmpeg multimedia processing framework. Supported vendor solutions on the Linux platform include Intel's VA-API (Video Acceleration API) and NVIDIA's VDPAU (Video Decode and Presentation API for Unix), and both vendors also offer the more proprietary Intel Quick Sync Video and NVENC/NVDEC acceleration solutions. At present we mainly use the video acceleration capability of Intel integrated graphics, combining the FFmpeg community's QSV and VAAPI plugins to hardware-accelerate three modules: AVDecoder, AVFilter, and AVEncoder. Hardware acceleration technologies keep improving on the vendor and community side, and we will detail further practices in this area in the next articles of this series.
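As an illustration, here is a minimal sketch of what a QSV-accelerated H.264 transcode looks like when driven through FFmpeg's command line (wrapped in Python for convenience); the file names and bitrate are placeholders, and the available hardware codecs depend on the FFmpeg build and driver stack:

```python
import subprocess

def transcode_h264_qsv(src: str, dst: str, bitrate: str = "3M") -> None:
    """Hardware-accelerated H.264 transcode via FFmpeg's QSV plugin."""
    cmd = [
        "ffmpeg", "-y",
        "-hwaccel", "qsv",        # decode on the Intel iGPU
        "-c:v", "h264_qsv",       # QSV H.264 decoder
        "-i", src,
        "-c:v", "h264_qsv",       # QSV H.264 encoder
        "-b:v", bitrate,
        "-c:a", "copy",           # audio is not the bottleneck; pass it through
        dst,
    ]
    subprocess.run(cmd, check=True)

transcode_h264_qsv("input.mp4", "output.mp4")
```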

AMD high-core-count servers

This mainly refers to servers equipped with AMD EPYC series processors. Compared with our previous online servers, they offer stronger single-core computing power and better parallelism. The stronger single core yields an across-the-board improvement in decoding, pre-processing, and encoding, while the very large core count makes the single-machine multi-process scenario in our shard transcoding scheme even more advantageous and largely avoids cross-machine IO transfer of media data.

Self-developed codec

NE264/NE265 is a video encoder independently developed by NetEase Yunxin and already applied in Yunxin's NERTC and live/on-demand services. Beyond its encoding performance, NE264's more important technical advantage is low bandwidth with high picture quality. It suits high-bit-rate, high-definition live scenarios (such as game streaming, online concerts, and product launches), saving 20%-30% bit rate on average while subjective picture quality remains unchanged to the human eye. We will not expand on it here; interested readers can follow the NetEase Smart Enterprise Technology+ WeChat official account.

Shard transcoding

If the performance optimizations above are vertical, the shard transcoding described in this section is horizontal. A video stream is essentially a sequence of images, divided into a series of GOPs with IDR frames as boundaries, and each GOP is an independently decodable set of images. This property of video files lets us borrow the algorithmic idea of MapReduce: cut the video file into multiple slices, transcode the slices in parallel, and finally merge them back into a complete file.
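As a concrete illustration of the splitting step, the sketch below uses FFmpeg's segment muxer with stream copy: since stream copy can only cut at keyframes, every slice starts on an IDR frame and stays independently decodable. The segment length and output pattern are illustrative:

```python
import subprocess

def split_at_keyframes(src: str, seg_seconds: int = 60) -> None:
    """Cut a video into slices without re-encoding, using FFmpeg's
    segment muxer; cuts land on the keyframe boundaries nearest to
    the requested segment duration."""
    subprocess.run([
        "ffmpeg", "-i", src,
        "-c", "copy",                 # no decode/encode: pure remux
        "-f", "segment",
        "-segment_time", str(seg_seconds),
        "-reset_timestamps", "1",     # each slice gets timestamps from 0
        "slice_%03d.mp4",
    ], check=True)
```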

Task scheduling

Besides optimizing the execution flow of individual transcodes, we also need to improve the overall scheduling efficiency of cluster resources. In the scheduling algorithm, the scheduling node must not only accept task submissions but also complete the key step of task dispatch, whose design has to balance multi-tenant allocation, task priority preemption, and maximizing cluster resource utilization.

We designed two types of task issuing mechanisms:

  1. The Master node pushes the task to the compute node
  2. The compute node actively pulls the task from the Master node

The former has the advantage of better real-time behavior; its disadvantage is that the Master's view of compute node resources is a snapshot, and in some cases stale snapshot information may overload certain nodes. The latter has the advantage that nodes take tasks on demand, so no node is overloaded, and task selection is easier to program; its disadvantage is that the Master has weaker real-time control over global resource allocation.
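To make the pull model concrete, here is a rough sketch of a compute node's fetch loop. The endpoint name, payload, and the `free_slots`/`run_task` callbacks are hypothetical stand-ins for our real RPC layer, and the HTTP transport is assumed for illustration:

```python
import time
import requests  # assumed HTTP transport; the real RPC layer may differ

def fetch_loop(master_url: str, node_id: str, free_slots, run_task) -> None:
    """Pull-mode dispatch: the node asks for work only when it has free
    slots, so the Master can never push it into overload."""
    while True:
        if free_slots() == 0:          # fully loaded: don't ask for more work
            time.sleep(1)
            continue
        resp = requests.get(f"{master_url}/fetch_job", params={"node": node_id})
        if resp.status_code == 204:    # no eligible task right now
            time.sleep(1)
            continue
        run_task(resp.json())          # execute the dispatched task
```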

Practice of shard transcoding scheme

Media process

The simplified media processing flow is shown in the figure below and consists of four steps: pre-shard remuxing (when needed), video sharding, parallel transcoding, and video merging.

Sharding process

When cluster resources are sufficient, that is, when task scheduling and dispatch show essentially no backlog or resource contention, the processing and computation of the video stream itself generally consumes 80%-90% of the whole task cycle. Optimization at this stage therefore yields the highest return.

Improving hardware capability and optimizing encoding both raise the computing efficiency of a single transcoding process, but the resources a single process can call on are limited, and so is the speedup for large video files. Here we therefore discuss how to apply the distributed MapReduce idea to shorten the time consumed by one transcoding task. The following sections describe the shard transcoding scheme in detail.



The basic architecture of the shard transcoding process is shown in the figure above. Let's first introduce the following concepts:

  • Parent task: similar to a Job in Hadoop; the transcoding job submitted by the client, whose video is split into several small slices;
  • Subtask: similar to a Task in Hadoop; each small shard is packaged into a Task that can be independently scheduled and executed;
  • Parent worker: the compute node that executes the parent task;
  • Sub-worker: the compute node that executes a subtask.

The main process of shard transcoding:

  1. The Dispatch Center dispatches a transcoding job to Worker0. Worker0 decides whether to perform shard transcoding according to the master switch, the job configuration, the video file size, and other policies;
  2. If shard transcoding is chosen, the video is split into N slices;
  3. The N slices are packaged into N transcoding subtasks and submitted to the Dispatch Center;
  4. The Dispatch Center dispatches these N subtasks to N workers that meet the requirements;
  5. After Worker1~N finish transcoding, each sends a callback to Worker0;
  6. Worker0 downloads the transcoded shard videos from the N workers; once every shard is transcoded, it merges them into one file;
  7. Worker0 sends a callback to the client.
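The condensed sketch below shows the MapReduce shape of this flow. For brevity, the distributed dispatch is replaced by a local thread pool, and the slice names, codec, and bitrate are placeholders; the merge uses FFmpeg's concat demuxer, which requires all slices to share identical codec parameters (exactly why machine type and code version matter, as discussed later):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def transcode_slice(slice_path: str) -> str:
    """Stand-in for dispatching one subtask; here it just runs locally."""
    out = slice_path.replace(".mp4", "_out.mp4")
    subprocess.run(["ffmpeg", "-y", "-i", slice_path,
                    "-c:v", "libx264", "-b:v", "2M", out], check=True)
    return out

def merge(slices: list[str], dst: str) -> None:
    """Losslessly concatenate transcoded slices with the concat demuxer."""
    with open("list.txt", "w") as f:
        f.writelines(f"file '{s}'\n" for s in slices)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "list.txt", "-c", "copy", dst], check=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    outputs = list(pool.map(transcode_slice,
                            [f"slice_{i:03d}.mp4" for i in range(4)]))
merge(outputs, "merged.mp4")
```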

Subtask scheduling

In the scheduling system, each user's task queue is independent and has its own task quota. When the Dispatch Center receives a FETCH JOB request from a compute node, the dispatcher first selects, among the user queues, the one with the smallest used-quota ratio (a simple model: number of scheduled tasks / the user's total quota), then takes from its head a subtask that the compute node can run and returns it. Subtask scheduling differs from ordinary task scheduling in scheduling priority and node selection, which need separate design; both are introduced briefly below, with a combined sketch after the list.

  • Subtask priority


    Subtasks do not need to be re-queued in their respective user queues; the goal of subtask scheduling is to be scheduled as soon as possible. The parent task has already been scheduled once, and sharding exists precisely to accelerate its execution; if its subtasks had to compete again with tasks that have never been scheduled, that would be unfair to the job and would weaken the acceleration. Subtasks produced by sharding are therefore placed in a global high-priority queue for selection.

  • Subtask scheduling node selection

    The following factors affect which node a subtask is scheduled to:

    1. Machine type: machines are divided into hardware-transcoding and ordinary-transcoding machines. Because the encoders used in the two environments differ, merging shards produced by both could yield a defective video, so we schedule subtasks to the same machine type as the parent task.
    2. Code version: shards produced by different code versions may not merge cleanly. When such a version iteration occurs, the code version on the parent worker determines which other compute nodes its subtasks may be scheduled to.
    3. Data storage: when tasks on the parent worker are highly concurrent, multiple network uploads and downloads run simultaneously, increasing the IO time of shard files. Executing subtasks on the parent worker first therefore saves network IO and upload/download time.
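Putting these pieces together, a simplified model of the FETCH JOB handler might look like the sketch below; `node.accepts` is a hypothetical stand-in for the machine-type, code-version, and data-locality checks:

```python
from dataclasses import dataclass, field

@dataclass
class TenantQueue:
    scheduled: int                               # tasks already dispatched
    quota: int                                   # tenant's total task quota
    tasks: list = field(default_factory=list)    # pending tasks, head first

def fetch_job(high_priority: list, tenants: list[TenantQueue], node):
    """Handle a FETCH JOB request: shard subtasks in the global
    high-priority queue win first; otherwise pick the tenant queue
    with the smallest used-quota ratio."""
    for task in high_priority:
        if node.accepts(task):                   # node-compatibility checks
            high_priority.remove(task)
            return task
    eligible = [q for q in tenants if q.tasks and q.scheduled < q.quota]
    if not eligible:
        return None                              # nothing schedulable right now
    best = min(eligible, key=lambda q: q.scheduled / q.quota)
    best.scheduled += 1
    return best.tasks.pop(0)
```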

Stragglers

In the shard transcoding scenario, the straggler problem refers to the situation where most subtasks have finished but a few remain, so the parent worker cannot enter the next stage for a long time and the whole job is blocked. This is a common phenomenon in distributed systems, and research papers on it are not rare in the systems field.

How this problem is handled greatly affects system efficiency. If the parent worker simply keeps waiting for the straggling subtasks, the job may wait far too long, defeating the original goal of acceleration. On the principle that a job must complete within bounded time, there are the following optimization directions:

1. Redundant scheduling

This scheme borrows Hadoop MapReduce's answer to stragglers: when the timeout threshold is reached and a subtask is still unfinished, the parent worker submits a new Task for the same shard file to the Dispatch Center for rescheduling and re-execution. When either copy of the subtask completes, the other is canceled.

The idea is to trade space for time: rather than pinning hopes on a single node, it adopts a horse-racing mechanism. But when this happens often it creates a lot of redundant tasks, and there is no guarantee that the newly created subtasks will not block as well.

2. The parent worker takes over

To remedy the shortcoming of redundant scheduling, we optimized it: when the timeout threshold is reached and subtasks are still unfinished, the parent worker selects the shard with the least progress and transcodes it itself. As before, when either copy of a task completes, the redundant copy is canceled. If unfinished subtasks remain, the parent worker keeps selecting shards and transcoding them itself until all subtasks are done.

The difference from the first scheme is that redundant tasks are not rescheduled to other workers; the parent worker itself executes them with priority and keeps transcoding shards until the whole job completes. The biggest advantage is that the parent worker is guaranteed never to wait indefinitely, while resource consumption stays bounded. Only in a few cases, when the parent worker's own load is high, are other workers with idle resources considered.
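A schematic of the takeover loop; the `progress_of` (explained in the next section), `transcode_locally`, and `cancel_remote` callbacks are illustrative stand-ins:

```python
import time

def takeover_stragglers(unfinished, timeout_s,
                        progress_of, transcode_locally, cancel_remote):
    """Parent-worker takeover: after the timeout, repeatedly pick the
    slowest unfinished shard, transcode it locally, and cancel the
    remote copy that lost the race."""
    time.sleep(timeout_s)                 # normal completion window
    while unfinished:
        slowest = min(unfinished, key=progress_of)   # least-progress shard
        transcode_locally(slowest)        # redundant local execution
        cancel_remote(slowest)            # drop the now-useless remote copy
        unfinished.remove(slowest)
```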

Subtask progress tracking

When the parent worker selects subtasks to take over, it needs to collect subtask progress and pick the slowest one for redundant execution. For progress calculation we divide a transcode into four stages: waiting for scheduling, download and preparation, transcoding execution, and upload and wrap-up.

Reaching each stage marks a corresponding progress milestone:

Waiting for scheduling 0% → Download and preparation 20% → Transcoding execution 30% → Upload and wrap-up 90% → Finished 100%

Among these, transcoding execution accounts for 70% of the progress scale and is the stage whose speed is least predictable, so it needs fine-grained progress calculation. The transcoding execution stream periodically outputs a Metric log that is used for monitoring; current transcoding progress = duration already transcoded (the time field) / total duration to transcode.
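Assuming the Metric log carries FFmpeg's standard `-progress` key-value output (the real log format may differ), progress extraction could look like this sketch:

```python
def transcode_progress(metric_line: str, total_s: float):
    """Map one FFmpeg progress line to the stage-internal progress ratio.

    Assumes the execution stream logs FFmpeg's 'out_time=HH:MM:SS.micro'
    key as emitted by `ffmpeg -progress`; returns None for other keys.
    """
    if not metric_line.startswith("out_time="):
        return None
    h, m, s = metric_line.split("=", 1)[1].split(":")
    done = int(h) * 3600 + int(m) * 60 + float(s)   # seconds transcoded so far
    # stage-internal ratio; the scheduler maps it into the transcoding
    # stage's band of the overall progress scale
    return min(done / total_s, 1.0)
```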

HLS/DASH encapsulation

HLS differs from other packaging formats in that its output comprises multiple TS files plus M3U8 playlists, so shard-transcoding a job whose target is HLS would complicate the transfer and management of the shard videos. Our solution is therefore to first transcode the source video into MP4 shards and, after merging on the parent worker, convert the complete video into HLS packaging.
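The final repackaging step can be a pure remux, since the merged MP4 is already transcoded. A sketch using standard FFmpeg HLS muxer options (segment duration and file names are placeholders):

```python
import subprocess

def mp4_to_hls(merged_mp4: str, playlist: str = "index.m3u8") -> None:
    """Convert the merged MP4 into HLS packaging without re-encoding,
    producing one M3U8 playlist plus numbered TS segments."""
    subprocess.run([
        "ffmpeg", "-y", "-i", merged_mp4,
        "-c", "copy",             # remux only; video was already transcoded
        "-f", "hls",
        "-hls_time", "10",        # target segment duration in seconds
        "-hls_list_size", "0",    # keep every segment in the playlist (VOD)
        playlist,
    ], check=True)
```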

Test data

By recording and comparing transcoding speeds for the same video at different target resolutions, we find that each individual optimization improves transcoding speed to a different degree. In real online scenarios, we usually combine several optimizations according to user settings and video properties.





Test video 1 properties:

Duration: 00:47:19.21, bitrate: 6087 kb/s

Stream #0:0: Video: h264 (High), yuv420p, 2160×1216, 5951 kb/s, 30 fps

Stream #0:1: Audio: aac (LC), 44100 Hz, stereo, fltp, 127 kb/s

Test video 2 properties:

Duration: 02:00:00.86, bitrate: 4388 kb/s

Stream #0:0: Video: h264 (High), yuvj420p, 1920×1080, 4257 kb/s, 25 fps

Stream #0:1: Audio: aac (LC), 48000 Hz, stereo, fltp, 125 kb/s

Conclusion

That's all for this article. The NetEase Yunxin transcoding team has improved video transcoding speed mainly along the dimensions of scheduling optimization, hardware capability, self-developed codecs, and shard transcoding, and test results show a significant speedup. This article focused on the design of the shard transcoding module in the Yunxin transcoding system. We will keep exploring faster transcoding and broader scenario coverage, and in later articles of this series we will share our cluster resource scheduling algorithm, hardware transcoding practice, and more. Please stay tuned.

About the author

Luo Weiheng is a senior server development engineer at NetEase Yunxin and holds a master's degree from the School of Computer Science, Wuhan University. He is a core member of the NetEase Yunxin transcoding team, currently responsible for the design and development of the Yunxin media task processing system, and is committed to improving the service quality of Yunxin video transcoding.