With the rise of cloud editing, cloud broadcast, and online collaboration in audio and video production, processing in the production environment is attracting more and more attention. Production-environment processing places much higher demands on control accuracy: precise time control and precise picture control, from server to client, are what distinguish it from processing in the distribution environment, and subtle discrepancies between server and client are easy to introduce.

By Jiang Yuqing

Edited by LiveVideoStack

Hello everyone, my name is Jiang Yuqing, and I am head of audio and video R&D at Media Track. I previously worked at Panda Live on front-end player development, then had the good fortune to work at Bytedance, and recently I have been working on this startup project together with former Panda Live colleagues. The project targets the production environment: instead of a traditional 2C video distribution solution, we build a solution for collaboration and cooperation among video creators, which is different from the traditional 2C viewing side.

This talk is divided into four parts: architecture, workflow, consistency, and scalability.

First, a look at our product. The Web and Mini Program clients provide revision and annotation features, which were the core of the earliest version we launched. When building a production-environment solution, I personally prefer to start by understanding how users will actually use the product in production.

I often use editing software to cut films myself, and this picture shows my personal editing setup. First of all, I need frame-accurate control: the timestamp of every segment has to be exact, and I need to know precisely where to insert content; for example, the position of a subtitle in the picture needs to be accurate down to the pixel. Ordinary network video distribution does not guarantee this kind of consistency, because casual viewing does not require frame accuracy, and that is exactly what makes our service challenging.

Our two core businesses are media transcoding, and video annotation and screenshots. Media transcoding exists because of network distribution: what users watch cannot be the source stream, since the source may be very large and may not even be decodable on the Web or Mini Program side. That means transcoding, and whether the transcoded stream stays consistent with the source stream becomes a big problem. Video annotations and screenshots raise consistency issues as well.

architecture

This picture shows the overall architecture of Media Track today. The naming convention carries over from Panda: every project is named after a League of Legends champion. The two most user-visible projects are Sona and Neeko, the Mini Program and Web clients. Behind them sit Riven and Kayn, the second-layer APIs that interact with the front end; they are highly flexible and add interfaces as product requirements evolve.

The last part is the microservice cluster, and the focus here is Ahri, the audio and video service. To the other services in the system, Ahri is simply the audio and video service, no different from any other microservice.

Ahri is a collection of microservices for all media-related operations, including media transcoding, file format correction, media information acquisition, screenshots, audio waveform sampling, annotation point drawing, image processing, and so on. To the outside world Ahri is a single microservice; internally it is a group of microservices, which is where the name comes from: Ahri is a nine-tailed fox who stores energy in her orbs and releases it, and her nine tails are as integral to her as the internal microservices are to Ahri.

This figure shows Ahri's architecture. The externally visible service is only Ahri's gateway, which performs no actual media operations itself: everything Ahri does is to create tasks for the internal microservices and aggregate their results, while transcoding itself still relies on cloud vendors' transcoding services.
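As a rough illustration of this pattern, here is a minimal TypeScript sketch of a gateway that does no media work itself and only creates and aggregates tasks for internal workers; the task shapes and service interfaces are assumptions for illustration, not Ahri's actual API.

```typescript
// Hypothetical sketch of a gateway that performs no media work itself:
// it only creates tasks for internal services and aggregates their results.
type TaskKind = "transcode" | "screenshot" | "waveform" | "media-info";

interface Task {
  id: string;
  kind: TaskKind;
  fileId: string;
  status: "pending" | "done" | "failed";
}

interface MediaWorker {
  // Each internal microservice accepts a task and reports back when finished.
  submit(task: Task): Promise<Task>;
}

class MediaGateway {
  constructor(private workers: Record<TaskKind, MediaWorker>) {}

  // Fan a file out to the relevant internal services and summarize the results.
  async process(fileId: string, kinds: TaskKind[]): Promise<Task[]> {
    const tasks = kinds.map<Task>((kind, i) => ({
      id: `${fileId}-${i}`,
      kind,
      fileId,
      status: "pending",
    }));
    // The gateway itself does no media processing; it only dispatches and waits.
    return Promise.all(tasks.map((t) => this.workers[t.kind].submit(t)));
  }
}
```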

workflow

We have two important workflows. The first is the invocation workflow: before doing any work, Ahri needs to know what has to be done, and it uses the file's magic number to determine the file type. Because our system lets users upload arbitrary files, the extension may be wrong, and the front end cannot reliably tell whether something is an audio or video file. Ahri makes that judgment and, if it is a video file, notifies the other services to correct it and enter the actual media processing workflow.
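A minimal sketch of magic-number sniffing, assuming a Node.js environment; the signatures covered and the function name are illustrative, not the actual Ahri code.

```typescript
// Illustrative sketch: detect the container type from magic numbers instead of
// trusting the file extension. Only a few common signatures are shown.
import { open } from "node:fs/promises";

async function sniffContainer(path: string): Promise<string> {
  const handle = await open(path, "r");
  const header = Buffer.alloc(12);
  await handle.read(header, 0, 12, 0);
  await handle.close();

  // MP4/MOV: bytes 4-7 spell "ftyp".
  if (header.subarray(4, 8).toString("ascii") === "ftyp") return "mp4/mov";
  // Matroska/WebM: EBML header 1A 45 DF A3.
  if (header.subarray(0, 4).equals(Buffer.from([0x1a, 0x45, 0xdf, 0xa3]))) return "mkv/webm";
  // MPEG-TS: sync byte 0x47 at offset 0 (a weak signal on its own).
  if (header[0] === 0x47) return "mpeg-ts";
  // WAV: "RIFF" followed by "WAVE".
  if (
    header.subarray(0, 4).toString("ascii") === "RIFF" &&
    header.subarray(8, 12).toString("ascii") === "WAVE"
  ) return "wav";
  return "unknown";
}
```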

The media processing workflow also includes a deduplication correction: you do not actually want to process the same file more than once, so when multiple users upload the same file it is hashed first. A message is broadcast whenever a task completes or its status is updated.
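To make the idea concrete, here is a hedged sketch of hash-based deduplication with status broadcasting, assuming a Node.js environment; the in-memory map and event bus are placeholders for the real storage and messaging services.

```typescript
// Sketch: hash the uploaded file so the same content is only processed once,
// and broadcast status updates as events.
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";
import { EventEmitter } from "node:events";

const bus = new EventEmitter(); // stand-in for the real message broadcast

async function hashFile(path: string): Promise<string> {
  const hash = createHash("sha256");
  for await (const chunk of createReadStream(path)) hash.update(chunk as Buffer);
  return hash.digest("hex");
}

const processed = new Map<string, string>(); // contentHash -> taskId

async function submitUpload(path: string): Promise<string> {
  const digest = await hashFile(path);
  const existing = processed.get(digest);
  if (existing) {
    // Same content already processed: reuse the earlier task instead of redoing work.
    bus.emit("task-status", { taskId: existing, status: "done", reused: true });
    return existing;
  }
  const taskId = `task-${digest.slice(0, 8)}`;
  processed.set(digest, taskId);
  bus.emit("task-status", { taskId, status: "pending" });
  return taskId;
}
```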

If media information is missing, a correction step fetches it. Screenshots have a couple of pitfalls: first, the timestamps need to be aligned; second, the annotation marks to be drawn have to be located.
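One plausible way to handle the timestamp alignment for screenshots is sketched below: read the stream's start_time with ffprobe and shift the requested time before asking ffmpeg for a frame. This is an assumed approach for illustration, not the actual service implementation.

```typescript
// Sketch of timestamp-aligned screenshots in a Node.js environment.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function streamStartTime(input: string): Promise<number> {
  const { stdout } = await run("ffprobe", [
    "-v", "error",
    "-show_entries", "format=start_time",
    "-of", "json",
    input,
  ]);
  return parseFloat(JSON.parse(stdout).format.start_time ?? "0");
}

// If the requested time is an absolute PTS from the stream, it must be shifted
// by start_time, because ffmpeg's input-side -ss counts from the beginning of
// the file rather than from the stream's first timestamp.
async function screenshotAtPts(input: string, absolutePts: number, out: string) {
  const start = await streamStartTime(input);
  const seek = Math.max(0, absolutePts - start);
  await run("ffmpeg", ["-ss", String(seek), "-i", input, "-frames:v", "1", "-y", out]);
}
```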

consistency

Time consistency. A video file can be viewed as shown in the figure: the container format's time zero comes first, followed by the timestamp of the first audio frame, the timestamp of the first video frame, and finally the annotation points.

The upper part of the picture is the original video on the server. Because our users are professionals, most uploaded videos look like this: the audio time origin and the video time origin are almost identical, and some even carry timecodes.

But the transcoding services we use today often turn the transcoded stream into the lower part of the figure, meaning audio and video no longer start at the same time. So when looking for a frame you need a reference point. The usual reference point is the first video frame, i.e. the start time, and annotation timestamps are then expressed relative to that video start time.

I used to work on Web players, and the Web player has to handle start time. With the transcoded stream in the figure, if startTime is 4 seconds, the first frame would take 4 seconds to appear. So a base time is usually computed: the smaller of the audio and video start times is taken as the base time, the zero point, and it is subtracted from the timestamp of every subsequent frame. With this kind of time control, the real start time is no longer the PTS of the first frame, so the playhead has to seek to the right point in the MSE buffer again.
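A minimal sketch of this base-time handling, assuming an MSE-based player; the function and variable names are illustrative, not the production player code.

```typescript
// Minimal sketch of the base-time idea for an MSE player: take the smaller of
// the audio/video start times as the zero point, shift every appended segment
// by that amount, and seek to where the first video frame actually lands.
function setupBaseTime(
  video: HTMLVideoElement,
  sourceBuffer: SourceBuffer,
  audioStartPts: number, // e.g. 0.0 s in the transcoded stream
  videoStartPts: number  // e.g. 4.0 s in the transcoded stream
): void {
  // Base time = the earlier of the two start times; it becomes time zero.
  const baseTime = Math.min(audioStartPts, videoStartPts);

  // Every appended frame is shifted back by baseTime (set this before
  // appending segments), so playback can start immediately instead of
  // waiting out the 4-second gap.
  sourceBuffer.timestampOffset = -baseTime;

  // The real first video frame no longer sits at time 0, so the playhead must
  // be moved to where it actually is on the shifted timeline.
  video.currentTime = videoStartPts - baseTime;
}
```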

There is a pitfall here: the current time point in the browser may be cleared, because the browser has a mechanism that evicts earlier cached data after playing for a while to save memory, and at that moment the start time point becomes incorrect. Therefore the first segment should be appended to the MSE buffer again at the video's time point.

With the processing shown in the figure, the goal is first to speed up the start time, and second to preserve as much of the presentation data as possible.

The Mini Program player is the applet's built-in, bottom-layer player, and its starting point is the first video frame; timestamps are mapped between the transcoded stream and the source stream PTS, and the applet's base point is 0.

The Mini Program plays with the video as the benchmark and needs no special handling there. Another pitfall of the applet is that, to keep resource consumption down, its timeupdate event only fires about every 250 ms; since we need frame accuracy, we have to build our own timer. However, too many timers may crash the program, so a single global timer is recommended.
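A sketch of the single global timer idea: one shared ticker fans out to all subscribers instead of each component running its own timer. The class and interval values are illustrative assumptions.

```typescript
// One shared timer for frame-accurate updates: timeupdate fires only about
// every 250 ms, so the UI runs its own ticker, but a single global timer
// serves all subscribers instead of one timer per component.
type Listener = (now: number) => void;

class GlobalTicker {
  private listeners = new Set<Listener>();
  private handle: ReturnType<typeof setInterval> | null = null;

  // Tick at roughly the frame interval (e.g. ~33 ms for 30 fps content).
  constructor(private intervalMs = 33) {}

  subscribe(fn: Listener): () => void {
    this.listeners.add(fn);
    if (!this.handle) {
      this.handle = setInterval(() => {
        const now = Date.now();
        this.listeners.forEach((l) => l(now));
      }, this.intervalMs);
    }
    // Return an unsubscribe function; the timer stops when nobody listens.
    return () => {
      this.listeners.delete(fn);
      if (this.listeners.size === 0 && this.handle) {
        clearInterval(this.handle);
        this.handle = null;
      }
    };
  }
}

export const ticker = new GlobalTicker();
```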

This part is about picture positioning, which is much easier than time consistency. The figure shows a special case: the dot is the reference point, the inner rectangle is the actual picture, and the outer one is the player box. Annotation points are recorded as percentages of the actual picture's width and height, so even after the video has been processed, the real position can still be recovered from the relative coordinates.
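A small sketch of this relative-coordinate scheme; the names are illustrative.

```typescript
// Resolution-independent annotation points: store positions as percentages of
// the actual picture (not the player box), so they survive transcoding and resizing.
interface RelativePoint { xPercent: number; yPercent: number; }

function toRelative(px: number, py: number, frameWidth: number, frameHeight: number): RelativePoint {
  return { xPercent: px / frameWidth, yPercent: py / frameHeight };
}

function toPixels(p: RelativePoint, frameWidth: number, frameHeight: number) {
  return { x: p.xPercent * frameWidth, y: p.yPercent * frameHeight };
}

// Example: a point marked at (960, 540) on a 1920x1080 source maps to the same
// spot on a 1280x720 transcoded rendition.
const rel = toRelative(960, 540, 1920, 1080); // { xPercent: 0.5, yPercent: 0.5 }
const onSmall = toPixels(rel, 1280, 720);     // { x: 640, y: 360 }
```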

scalability

This part introduces the scalability of our system. The diagram shows Ahri's overall architecture. Ahri first creates workflows and dispatches tasks, which may run on Ahri's own services or on cloud vendors' services. If a cloud vendor's quality does not meet our precision requirements, we can migrate directly at the workflow layer to services we build ourselves; if we need to add a third-party transcoder, we simply extend the workflow horizontally.
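As a hedged sketch of what this kind of workflow-layer extensibility might look like, transcoding can sit behind an interface so that a cloud vendor, a self-built service, or a third-party transcoder can be swapped without touching the rest of the workflow; the types and class names here are assumptions, not the actual system.

```typescript
// Hypothetical workflow-layer abstraction: the workflow depends only on an
// interface, so the transcoding backend can be swapped or added horizontally.
interface TranscodeJob { fileId: string; preset: string; }
interface TranscodeResult { fileId: string; outputUrl: string; }

interface TranscodeProvider {
  transcode(job: TranscodeJob): Promise<TranscodeResult>;
}

class CloudVendorTranscoder implements TranscodeProvider {
  async transcode(job: TranscodeJob): Promise<TranscodeResult> {
    // Call the vendor's transcoding API here (omitted).
    return { fileId: job.fileId, outputUrl: `https://vendor.example/${job.fileId}` };
  }
}

class SelfHostedTranscoder implements TranscodeProvider {
  async transcode(job: TranscodeJob): Promise<TranscodeResult> {
    // Run an in-house, precision-controlled pipeline here (omitted).
    return { fileId: job.fileId, outputUrl: `https://media.example/${job.fileId}` };
  }
}

// Switching providers is then a configuration change at the workflow layer.
class TranscodeWorkflow {
  constructor(private provider: TranscodeProvider) {}
  run(job: TranscodeJob) { return this.provider.transcode(job); }
}
```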