This article is compiled from a talk given on LiveVideoStack by Li Minglu, technical lead for Baidu Intelligent Cloud's audio and video SDK products. Starting from the perspective of audio and video data, it traces the development and technological evolution of audio and video SDKs, analyzes in detail the problems and challenges encountered in common audio and video modules, and presents the corresponding solutions and engineering practices.

By: Li Minglu

Organized by: LiveVideoStack

Video playback: https://www.livevideostack.cn…

This talk covers engineering practices in developing a mobile audio and video SDK. The content is divided into five parts:

  • Technological evolution of the audio and video SDK
  • Design and implementation of the data acquisition pipeline
  • Design and implementation of the special effects module data middleware
  • Design and implementation of the co-streaming (Lianmai) module data middleware
  • Design and implementation of the rendering module data middleware

Technological evolution of the audio and video SDK

1.1 Digitalization Process

Multimedia technology is a long-established field, yet it keeps developing and advancing. A vivid analogy is cell division: multimedia technology keeps subdividing internally, and the audio and video SDK is one of the branches that has split off.

Several important stages in the evolution of audio and video technology are outlined below. First, multimedia technology was already widely used in the early days of the Internet: thanks to advances in communication technology, digital signals could be delivered to users over broadcast, cable TV, satellite, and similar channels.

Decoders have played a very important role throughout the development of multimedia technology. The decoders of that era were still quite different in capability from today's, but they could already separate decoding from the container format.

Later, as the Internet spread, IP network protocols and optical fiber came into wide use, and data could be delivered to terminal devices over WiFi and cellular networks (at that time, "terminal" mostly meant the PC). Online audio and video services appeared as a result, such as VOD and online voice, which are typical audio and video scenarios. In these scenarios, the SDK was still largely server-oriented.

From 2010 to 2015, as mobile phone hardware advanced, the computing power of terminals kept improving and codec chips developed rapidly. Meanwhile, changes in consumer habits gave rise to more fragmented scenarios and products, such as live streaming and short video. With the popularity of mobile devices, the mobile SDK gradually entered the consumer field and developed into an independent technology stack.

In recent years, with the development of 5G and AI, technologies such as VR/AR and voice interaction are undergoing new changes and the application scenarios are becoming broader. The audio and video SDK is no longer limited to mobile devices; in the future it will appear across all kinds of device screens and product forms.

1.2 Mobile audio and video framework

Mobile audio and video frameworks differ greatly from other mobile frameworks. An audio and video framework must first build on the system capabilities of the mobile platform, including the system frameworks and the hardware. As shown in the figure, the system framework layer includes the iOS/Android multimedia frameworks, hardware codecs, hardware processing capabilities such as the CPU, GPU, and NPU, and some open-source image libraries.

Above the system framework layer sit third-party frameworks: FFmpeg and OpenSSL (encryption and decryption); IJKPlayer/ExoPlayer, which are widely used in VOD scenarios; for the underlying transport, bidirectional, low-latency WebRTC, RTMP for one-way streaming, and the currently popular low-latency SRT scheme; and image processing frameworks such as GPUImage.

On top of this are modules that are more closely tied to specific scenarios; some of the core modules are listed here.

Data starts in the multimedia acquisition module, passes through single-stream or multi-stream mixing (depending on the actual scenario), and then moves into the multimedia authoring module, where short-video capabilities are implemented by the editing and processing units, and on to the multimedia post-processing module, which covers AR effects and other interactive capabilities.

Once the content is produced, we distribute it under different protocols depending on whether the scenario is VOD, live streaming, or short video. Next comes the consumption side, where the data can be read directly, or read indirectly through a cache to speed up subsequent loads. After the data is obtained, it is demultiplexed according to the container format, such as FLV, TS, or MP4, and then decoded and rendered.

Compared with other mobile frameworks, the most distinctive part of an audio and video framework is the pipeline. An audio and video SDK differs from other products in that data streams must be processed in real time while constantly shuttling between modules; the question of how to guarantee efficient transfer between modules is what gives rise to the concept of a pipeline.

1.3 Only the Fast Is Unbreakable

The previous sections briefly introduced the audio and video SDK framework; this section looks at the problems and challenges the framework currently faces. Many business scenarios are chasing new forms of gameplay: some want higher resolution, some want higher refresh rates, some want even more extreme experiences. However, the audio and video SDK is largely constrained by platform capabilities, and as platforms grow more diverse, the SDK runs into many problems during development. Data interaction between modules is also involved, so performance is one of the biggest bottlenecks on mobile.

In scenarios such as VOD and short video, 720p resolution is no longer sufficient, and in some sports scenarios even a 30 fps refresh rate cannot keep up. These, of course, are mostly business-level challenges. In addition, as mentioned earlier, the encoder has always played a pivotal role in the development of multimedia technology, so the underlying technology keeps pursuing higher coding efficiency.

On these points everyone can reach a consensus: speed is the issue an audio and video SDK framework has to solve. Whatever the surface phenomenon, speed ultimately comes from how data is transmitted, which is the fundamental prerequisite. Therefore, making data transmission and processing more efficient is the most fundamental problem a mobile SDK must solve.

Design and implementation of the data acquisition pipeline

Here we introduce the data-link designs of some current open-source products, taking two as examples. The first is GPUImage. Anyone who has built effects for a mobile SDK should be familiar with it: it is genuinely easy to use and provides a large number of shaders. Setting the shaders aside, let's look at how its data chain is designed. GPUImage provides an Output class that transmits data and an Input protocol that receives data. Production modules such as the camera capture module and the local album module are classes that implement Output; once a frame is produced, it is handed to the Output for rendering. The strength of this design is that the Output is tightly coupled with programmable shaders, which largely drives the efficiency of OpenGL. The rendered RGB data is then wrapped in a FrameBuffer, a concrete object, and the whole chain passes data along as FrameBuffers. Each component on the chain implements the Input protocol, the Output hands it the FrameBuffer, and the FrameBuffer is passed on to the next link, so the chain stays clean and simple.
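
To make the chain style concrete, here is a minimal Kotlin sketch of this Output/Input/FrameBuffer idea (GPUImage itself is an Objective-C library; the class shapes below are simplified stand-ins, not its real API):

```kotlin
// Minimal sketch of a GPUImage-style chain: producers push a FrameBuffer
// to every registered target, and a filter is both an Input and an Output.
class FrameBuffer(val textureId: Int, val width: Int, val height: Int)

interface Input {
    fun newFrameReady(frame: FrameBuffer, timeUs: Long)
}

open class Output {
    private val targets = mutableListOf<Input>()

    fun addTarget(target: Input) { targets += target }

    // Called after this node has rendered into its framebuffer.
    protected fun notifyTargets(frame: FrameBuffer, timeUs: Long) {
        targets.forEach { it.newFrameReady(frame, timeUs) }
    }
}

// A filter sits in the middle of the chain: it receives a frame,
// applies its effect, then forwards the result to its own targets.
class GrayscaleFilter : Output(), Input {
    override fun newFrameReady(frame: FrameBuffer, timeUs: Long) {
        val processed = render(frame)      // run the shader (omitted here)
        notifyTargets(processed, timeUs)   // pass it on down the chain
    }
    private fun render(frame: FrameBuffer) = frame
}
```

Wiring a chain is then just `camera.addTarget(filter)` followed by `filter.addTarget(view)`, which is exactly what makes this style feel clean.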

Now look at FFmpeg on the right. Besides offering a rich set of data-chain processing mechanisms, it demultiplexes data into packets; it can repackage streams without decoding them, or decode the data and apply post-processing, so the whole chain is relatively clear.

Having introduced these two multimedia frameworks, let's look at how our team thinks about these issues. In real development, the pipeline inside an audio and video SDK framework is effectively uncontrollable, so let's weigh the strengths and weaknesses of these two approaches.

The advantages of GPUImage are obvious: the protocol is simple and clear, and driving OpenGL through programmable shaders makes it very fast. Its large collection of shaders also helps developers learn the language, making it an excellent open-source image processing framework. But there are problems too. Most of the shaders it provides are for image processing, and they run essentially synchronously. As a result, when you combine GPUImage with another framework and one of your modules performs time-consuming processing, an object may be released by another thread midway, making the whole chain thread-unsafe.

FFmpeg also has many advantages; its use of function pointers, for example, is one of its best practices: data is split layer by layer and passed along through pointers. But it has problems as well. FFmpeg is process-oriented, and for developers unfamiliar with it, the link design can feel very complex and not particularly handy to use.

Therefore, after many rounds of discussion and thinking, we set out to build our own data pipeline. Like GPUImage, it should have a simple protocol, be easy to use, and be developer-friendly. It also has to be controllable, because as SDK modules multiply we run into uncontrollable situations in many scenarios. Finally, we want it to support both synchronous and asynchronous use, schedule work safely, and keep link configuration simple and reliable.

We also distilled our own methodology and practice: the chained approach is a core idea of audio and video frameworks, so we followed this principle in developing our own data pipeline.

The first step is to define the data protocol, which solves the most basic problem: efficient, stable data transfer between modules. The diagram on the left is similar to the GPUImage scheme, but with differences. We likewise provide an AVOutput module and an AVInput data receiving protocol, but we do not bind to OpenGL the way GPUImage does; the Output simply records and manages the components in the chain, which we call targets. A dispatcher mechanism then distributes the video frames reported by the producer, passing them continuously to the targets of each link, and each target implements the AVInput protocol methods, for example frame and type. frame carries the data passed down by the Output's raiseFrame call, and type mainly distinguishes audio from video.
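
As a rough illustration, the protocol might be sketched like this in Kotlin (AVOutput, AVInput, target, frame, type, and raiseFrame are the names used in the talk; the concrete shape below is an assumption, not the actual SDK code):

```kotlin
enum class MediaType { AUDIO, VIDEO }

class AVFrame(val type: MediaType, val data: ByteArray, val pts: Long)

// Each target in the chain implements AVInput: frame() receives data,
// type() tells the dispatcher whether this target consumes audio or video.
interface AVInput {
    fun frame(frame: AVFrame)
    fun type(): MediaType
}

// AVOutput is not bound to OpenGL; it only records and manages the
// downstream targets, and raiseFrame fans a frame out to each of them.
open class AVOutput {
    private val targets = mutableListOf<AVInput>()

    fun addTarget(target: AVInput) { targets += target }

    // Producers (camera, album reader, ...) call this when a frame is ready.
    fun raiseFrame(frame: AVFrame) {
        targets
            .filter { it.type() == frame.type }
            .forEach { it.frame(frame) }
    }
}
```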

We also support distributing some binary payloads, mainly as a protocol extension to meet the data distribution needs of scenarios such as live streaming. At the very end of the chain we implemented AVControl, a control protocol we did not have before, whose job is to control how data flows into and out of the system. Why do this? The core of an audio and video SDK is the continuous transmission of data; if a module misbehaves and the data is not protected by some mechanism, the whole SDK may become unstable. For example, when distributing in a live-streaming scenario and we detect network jitter, we can adjust the transmission rate and speed accordingly.
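
Building on the sketch above, the control protocol could look roughly like this (AVControl is the name from the talk; the callback names and the frame-rate policy are hypothetical):

```kotlin
// Sketch of a control protocol that lets the tail of the chain push back
// on producers, e.g. when the network jitters during a live push.
interface AVControl {
    fun onBackPressure(queuedFrames: Int)   // too much data buffered downstream
    fun onResume()                          // congestion has cleared
}

class CameraSource : AVOutput(), AVControl {
    @Volatile private var targetFps = 30

    override fun onBackPressure(queuedFrames: Int) {
        // Lower the capture frame rate instead of letting queues grow.
        if (queuedFrames > 60) targetFps = 15
    }

    override fun onResume() { targetFps = 30 }
}
```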

Here is a simple picture of our data flow. One mode is push: data is collected directly from a module such as the camera and handed straight to the next module. The other is pull, which mainly covers reading local files or obtaining data indirectly, for example first downloading data from the network and then passing it to each module. Compared with the earlier GPU processing on an asynchronous thread, we also added some compatibility and protection to prevent objects from being released while asynchronous processing is in flight. So we basically follow GPUImage's simple protocol idea and add mechanisms such as the control protocol on top of it to make the whole link controllable.
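
In code, the two directions differ mainly in who drives the loop; a small sketch, reusing the hypothetical types above:

```kotlin
// Push: the capture callback drives the chain.
class PushCameraSource(private val output: AVOutput) {
    fun onCameraFrame(nv21: ByteArray, pts: Long) {
        output.raiseFrame(AVFrame(MediaType.VIDEO, nv21, pts))
    }
}

// Pull: a reader thread drains a file or network source and feeds the chain.
class PullFileSource(private val output: AVOutput,
                     private val reader: Iterator<AVFrame>) {
    fun start() = Thread {
        while (reader.hasNext()) {
            output.raiseFrame(reader.next())
        }
    }.start()
}
```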

In addition, we found in day-to-day work that a data pipeline alone could not adequately support the development of business scenarios. Why?

We had only solved data transfer between modules; that does not solve every problem in building the final product scenario, because a scenario is very different from a module. Capabilities such as VOD, live streaming, and even special effects are general-purpose, but a real scenario combines many modules, so simply passing data along cannot by itself string a scenario together.

Here are a few typical examples from daily work. In VOD image-quality optimization, for instance, type conversion turns out to be neither smooth nor simple. For the co-streaming (Lianmai) scenario, the question is how to make our SDK and products simpler to use. Face effects, as another example, involve a diverse set of capabilities, and achieving compatibility cannot be solved by the data link or by a single module alone. That is why we introduced the concept of middleware: bridging data so resources can be shared, thereby improving the applicability of data collection and processing when each module or business produces or consumes data. Usability, then, is the other key factor in our data acquisition pipeline.

Design and implementation of the special effects module data middleware

Next, let's look at some of the problems we ran into in practice and how we solved them.

The special effects module is typically a PaaS-style structure: various models sit on top of it and can be plugged in and out, and it is also resource-hungry. So how do we make better use of this module inside the audio and video SDK when exposing capabilities externally? Take the advanced beauty interface of facial effects as an example. Advanced beauty involves many feature points, such as eye enlargement, face slimming, and chin adjustment, and these cannot be covered by a single iteration or a single model; they may require multiple iterations and combinations of several models. This creates a problem: when integrating the special effects module, constantly changing capabilities introduce insecurity and instability for users of the module. So how do we solve, or at least contain, this problem?

Our approach is a proxy concept: instead of calling the capability directly, we abstract and encapsulate it, and the encapsulated model is then used to connect the different underlying algorithms. When others integrate the SDK, they may not take everything you offer but only some of the capabilities, which can leave the special effects SDK versions inconsistent. Without a proxy layer, version inconsistencies would force many interface adjustments on the upper layer, which is time-consuming and costly. With the proxy, we can keep the upper-layer interface stable and drive different AR module versions through our model and a set of abstract objects.
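
A hedged sketch of this proxy idea in Kotlin (all names here are hypothetical, not the real AR SDK interfaces): the upper layer sees only a stable abstraction, and one adapter per bundled AR SDK version maps it onto the actual implementation.

```kotlin
// Stable upper-layer abstraction: callers set beauty parameters without
// knowing which AR SDK version is integrated underneath.
data class BeautyParams(val bigEye: Float, val thinFace: Float, val chin: Float)

interface EffectEngine {
    fun setBeauty(params: BeautyParams)
    fun process(textureId: Int, width: Int, height: Int): Int
}

// One adapter per bundled AR SDK version; each maps the abstract
// parameters onto that version's own calls (omitted here).
class ArSdkV1Adapter : EffectEngine {
    override fun setBeauty(params: BeautyParams) { /* map to v1 API */ }
    override fun process(textureId: Int, width: Int, height: Int) = textureId
}

class ArSdkV2Adapter : EffectEngine {
    override fun setBeauty(params: BeautyParams) { /* map to v2 API */ }
    override fun process(textureId: Int, width: Int, height: Int) = textureId
}

// The proxy shields the upper layer: callers depend only on EffectEngine,
// and the bundled adapter can be swapped without interface changes.
class EffectProxy(private val engine: EffectEngine) : EffectEngine by engine
```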

Looking at the data pipeline, the recording module passes data to the Effect interface and then to the AR SDK. Each AR SDK has its own processing and detection capability and periodically checks key on-screen indicators. Why do we do this?

Current effects, scenarios, and gameplay are extremely complex, and the quality of processing cannot be fully guaranteed. To protect the processing of every frame and the stability of the overall link, we continuously monitor performance indicators and feed them back to the upper-layer callers. For example, if data is arriving too fast, or more frames are being delivered than can be processed, we can pass that information back through the data pipeline for control, and even adjust the frame rate captured by the recording module. The data then returns to the recording module, which passes it on to other modules, such as the preview for rendering. Through the data pipeline and the proxy scheme, we can integrate different AR versions and capabilities while keeping the external interface unified.
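
A minimal sketch of such a feedback loop (hypothetical names and thresholds, reusing the AVControl sketch above): the effect node measures per-frame cost and, when it falls behind for a sustained period, asks upstream via the control protocol to lower the capture frame rate.

```kotlin
// Monitors per-frame processing time in the effect node and feeds the
// result back upstream so the recorder can lower its capture frame rate.
class EffectMonitor(private val control: AVControl) {
    private var overBudget = 0
    private var throttled = false

    fun onFrameProcessed(costMs: Long, budgetMs: Long = 33) {
        if (costMs > budgetMs) {
            overBudget++
            if (overBudget >= 30 && !throttled) {   // ~1 s of sustained overload
                control.onBackPressure(overBudget)
                throttled = true
            }
        } else {
            overBudget = 0
            if (throttled) {
                control.onResume()
                throttled = false
            }
        }
    }
}
```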

Design and implementation of the co-streaming (Lianmai) module data middleware

The co-streaming (Lianmai) module is widely used in today's audio and video products, such as online education, live streaming, PK battles, and entertainment, and has gradually become a standard capability. Nevertheless, many problems remain in how it is used.

Standard live streaming is one-way communication with relatively simple signaling. Once the co-streaming module is integrated, communication becomes two-way, signaling becomes complex, and the media goes from a single stream to multiple streams. The lower-left figure shows the integration structure of a standard live-streaming SDK in the white area: the standard live-streaming flow runs from creating a live room to establishing the connection, encoding and packaging, and distribution through queues. Integrating the co-streaming capability adds more links: the RTC server is introduced, with its media server and signaling server, and business-level messaging mechanisms and IM servers may be introduced as well.

If a user initiates co-streaming, follow the red arrows in the left image. First, a co-streaming request is sent to the RTC signaling server, and a request is also sent to the IM server. The IM request is mainly for business handling, such as updating the UI or the scene flow, while the signaling server's job is to relay the request to join the room. When the request reaches the anchor's room, the anchor responds through the signaling server, choosing to accept or refuse. On acceptance, the answer is returned to the co-anchor/viewer via the signaling server. The co-anchor/viewer then pushes media to the RTMP media server, and the anchor pushes media to the same server. After the two streams converge on the RTMP server, they are broadcast externally through a bypass relay, and the audience sees the combined picture of two anchors, or of an anchor and a viewer.
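
The flow above could be modeled roughly as follows (a simplified sketch; the message names and fields are hypothetical, not the actual signaling protocol):

```kotlin
// Rough model of the co-streaming signaling flow described above.
sealed class LianmaiSignal {
    data class JoinRequest(val roomId: String, val userId: String) : LianmaiSignal()
    data class JoinResponse(val accepted: Boolean) : LianmaiSignal()
    data class PublishMedia(val rtmpUrl: String) : LianmaiSignal()
}

class RtcSignalingClient {
    // 1. The viewer/co-anchor asks the signaling server to join the anchor's room;
    //    the anchor accepts or rejects, and the answer is relayed back
    //    (network round trip omitted, result simplified to a synchronous return).
    fun requestJoin(roomId: String, userId: String): LianmaiSignal.JoinResponse {
        val request = LianmaiSignal.JoinRequest(roomId, userId)
        // send(request) ... wait for the relayed answer
        return LianmaiSignal.JoinResponse(accepted = true)
    }

    // 2. Once accepted, both sides push their media to the RTMP media server,
    //    which relays the converged streams out to ordinary viewers.
    fun startPublishing(signal: LianmaiSignal.PublishMedia) {
        // hand signal.rtmpUrl to the push module
    }
}
```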

This implementation is complicated and tedious, and can be hard for users (anchors) to understand. The anchor has to care about many things, such as the IM server and the bypass broadcast, which also involves configuring templates, so it is troublesome to use. The consumer side has problems as well: the same playback address may carry two streams with different encodings, which can challenge some players and cause stuttering because encoding parameters keep being reset. In addition, because this solution switches the push through the bypass relay, the original live stream may break, and frequent switching between the live stream and the mixed stream may introduce delays, so the pull side suffers from many problems.

Given these problems, and seeing how inconvenient the process is for users, it is clear that simple data transfer alone cannot make this scenario easy to use, so we did a large amount of end-to-end integration on the data chain. First, considering the player problems on the consumption side, switching encoding parameters is not a viable solution, so we adopted a local mixing scheme: all parameters of the pushed stream stay consistent and never change, and for today's mobile hardware, locally mixing two or three streams is not particularly stressful. Second, we consolidated the IM server with the RTC signaling server, because making users care about that much signaling is unnecessary and tends to confuse them; by integrating messaging on the server side, the SDK becomes simpler to use. This end-to-end integration of media and signaling keeps the data flow simple and improves the usability of the overall integration.
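
A simplified sketch of the local mixing idea (hypothetical names): the remote co-anchor's decoded frame is composited onto the local frame before encoding, so the pushed stream keeps a single, unchanging set of encoding parameters.

```kotlin
// Local mixing: composite the remote frame onto the local frame before
// encoding, so the pushed stream's codec parameters never change.
data class VideoFrame(val textureId: Int, val width: Int, val height: Int)
data class Layout(val x: Int, val y: Int, val w: Int, val h: Int)

class LocalMixer(private val remoteLayout: Layout) {
    fun mix(local: VideoFrame, remote: VideoFrame?): VideoFrame {
        // Draw the local frame full-screen, then the remote frame into its
        // layout rectangle (GL drawing omitted). When no one is co-streaming,
        // the local frame passes through untouched.
        return if (remote == null) local else compose(local, remote, remoteLayout)
    }

    private fun compose(local: VideoFrame, remote: VideoFrame, at: Layout) = local
}
```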

Design and implementation of the rendering module data middleware

Why is rendering one of the key technologies in an audio and video SDK? From the perspective of the overall technology link, the rendering module is the one users perceive most directly. In complex scenarios it also carries some data interaction and processing, so its work is no longer just rendering and may involve scenario-specific requirements, such as screen clearing, which in a multi-person meeting amounts to turning a participant's picture off. The rendering module can also serve as part of a technical solution, such as the multi-stream solution or the RTC solution mentioned earlier, as well as mini programs, Android same-layer rendering solutions, and so on.

Every module, and even every processing node, may have rendering requirements, so we split the rendering module further. The first piece is the data pipeline's ability to gather data from each module into the rendering module and then process it, which usually means converting the data into each platform's processing primitives, such as CPU buffers and surfaces.

Data processing is a new node compared with the earlier production flow. It mainly handles complex scenarios, such as Android same-layer rendering, where the creation and the drawing of the Surface are separated: the business module holds the Surface, while the rendering module references it indirectly and draws into it. There may also be parameterization for rendering multiple streams, where each stream's layout parameters are stored, layers are bound to video frames, and the multiple streams are then drawn simultaneously on the rendering thread.
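
A sketch of the Surface separation idea for Android same-layer rendering (hypothetical wrapper, reusing the VideoFrame and Layout types from the mixing sketch above): the business module owns the Surface, while the rendering module only holds a reference and draws into it on the render thread.

```kotlin
import android.view.Surface

// The business module (e.g. a view inside a mini-program container) creates
// and owns the Surface; the rendering module only registers it and draws,
// so creation and drawing stay decoupled.
class RenderSink {
    @Volatile private var surface: Surface? = null
    @Volatile private var layout: Layout? = null   // where this stream is drawn

    fun attach(surface: Surface, layout: Layout) {
        this.surface = surface
        this.layout = layout
    }

    fun detach() { surface = null }

    fun draw(frame: VideoFrame) {
        val target = surface ?: return   // the business side has released it
        // Bind the EGL surface for `target` and draw `frame` into `layout`
        // on the render thread (GL calls omitted).
    }
}
```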

As we all know, rendering must be independent of the calling thread; otherwise the impact on the CPU and the overall overhead is considerable. Therefore, after the GL environment is created, the data is traversed and split according to the GL queue to draw single or even multiple streams. Then comes the standard rendering flow: setting the vertices to render, obtaining the render texture, adding a blend mode, and binding the texture so the whole thing is drawn with OpenGL.
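
A condensed Kotlin sketch of that flow on Android (GLES 2.0; the queue-based render thread and the function shape are assumptions, not the SDK's actual code): work is posted to a dedicated GL thread, and each draw sets the vertices, binds the texture, enables blending, and issues the draw call.

```kotlin
import android.opengl.GLES20
import java.nio.FloatBuffer
import java.util.concurrent.LinkedBlockingQueue

// Dedicated render thread: all GL work is queued and executed on the thread
// that owns the EGL context, never on the caller's thread.
class GlRenderThread : Thread("gl-render") {
    private val queue = LinkedBlockingQueue<Runnable>()

    fun post(task: Runnable) = queue.put(task)

    override fun run() {
        // EGL context/surface setup omitted for brevity.
        while (!isInterrupted) {
            queue.take().run()
        }
    }
}

// One draw pass per frame: vertices -> texture -> blend -> draw.
fun drawFrame(program: Int, textureId: Int, positionHandle: Int, vertices: FloatBuffer) {
    GLES20.glUseProgram(program)
    GLES20.glEnableVertexAttribArray(positionHandle)
    GLES20.glVertexAttribPointer(positionHandle, 2, GLES20.GL_FLOAT, false, 0, vertices)
    GLES20.glActiveTexture(GLES20.GL_TEXTURE0)
    GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, textureId)
    GLES20.glEnable(GLES20.GL_BLEND)
    GLES20.glBlendFunc(GLES20.GL_SRC_ALPHA, GLES20.GL_ONE_MINUS_SRC_ALPHA)
    GLES20.glDrawArrays(GLES20.GL_TRIANGLE_STRIP, 0, 4)
}
```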