We are urgently recruiting talent for our browser rendering engine and Flutter rendering engine teams. Welcome to join us.

Preface

On January 16, the UC Technology Committee, together with Juejin and the Google Developers community, held its first Flutter Engine technology salon of 2021. More than 150 people applied to attend, but because of the on-site capacity limit we could only host 50 of them in person; in addition, more than 2,000 people watched the live stream. During the event, five technical experts from Alibaba Group shared their experience building, developing, and optimizing an R&D system on top of Flutter, their dynamic solutions, and the advantages and new features of Hummer, UC's customized Flutter enhancement engine.

The first session, “Building the UC Mobile Technology Center Based on Flutter”, was presented by Hui Hong, head of the UC/Quark client and of the UC Mobile Technology Center.

The second session, “Hummer (Flutter Custom Engine) Optimization and Systematic Construction Exploration”, was presented by Lu Long, technical director of UC's Hummer Flutter engine.

The third session, “A Flutter-Based Mobile Middleware Technology System and Audio and Video Technology”, was presented by Li Yuan, a video technology expert on the UC browser team.

Talk content

Hello everyone, and welcome to today's talk. My name is Li Yuan, and I am from the UC information-flow team. Today I will introduce the audio and video technology built on top of our Flutter mobile middleware system.

Today's content is divided into four parts. First, I will introduce the thinking behind building the Flutter mobile middleware. Then I will focus on how we built the video playback and video editing capabilities. Finally, I will talk about our future plans.

When building mobile middleware, the main question we consider is how to let innovative businesses in specific channels explore new directions more quickly.

Our core idea is to split the middleware into a general layer and a business layer, which further improves efficiency. The following describes the middleware's application scenarios:

1. Content businesses inside the UC browser: we abstract what these businesses have in common, distill a set of business middleware on top of the information-flow architecture, and use it to rapidly incubate new vertical scenarios inside the app. In actual iterations this reduced costs by more than 70%;

2. Innovative apps within UC: this scenario is handled much like the in-app scenario above, except that the backend uses a unified middle-platform architecture;

3. Scenarios in other BUs: other business units also need similar capabilities but bring their own backends, so beneath the business middleware we distill a set of general-purpose middleware.

Next, we will focus on how the multimedia capability is built. It is organized in three layers:

1. The shell layer solves business pain points, lowers the cost of accessing the capability, and supports differentiated customization of business functions. It contains the Auror plug-in player and the Kaleido multi-video editor.

2. The channel layer, besides using the common shared textures and MethodChannels, also explores higher-performance FFI channels (a minimal sketch follows this list);

3. The kernel layer holds the general multimedia capability and is built in C++. On the playback side it uses UC's Apollo playback kernel, and other business parties can switch in their own customized kernels. On the editing side it uses the LLVO editing engine, which supports not only the Kaleido multi-video editor above but also shooting and transcoding in other scenarios.
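
As a rough illustration of the FFI channel mentioned in point 2, the sketch below binds a hypothetical C symbol exported by the C++ kernel; the library name and function signature are assumptions, not the real Hummer/LLVO API.

```dart
import 'dart:ffi';
import 'dart:io';

// Hypothetical native signature: the C++ kernel is assumed to export
// `int32_t kernel_seek(int64_t handle, int64_t position_ms)`.
typedef _SeekNative = Int32 Function(Int64 handle, Int64 positionMs);
typedef _SeekDart = int Function(int handle, int positionMs);

// Load the (assumed) kernel library once.
final DynamicLibrary _kernel = Platform.isAndroid
    ? DynamicLibrary.open('libmedia_kernel.so')
    : DynamicLibrary.process();

// A direct synchronous call into C++; unlike a MethodChannel call, no message
// is serialized or bounced through the platform thread.
final _SeekDart kernelSeek =
    _kernel.lookupFunction<_SeekNative, _SeekDart>('kernel_seek');
```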

How did we build a Flutter player with a unified architecture and high extensibility that supports a wide variety of playback scenarios?

To give you an overview of the current state of the player scenario:

1. Shell layer: whether across multiple apps or across multiple business scenes within one app, each scene may have its own playback shell. These shells are usually not designed for extensibility, so adding features often requires changes to their frameworks. They are also written entirely in platform-layer code, so even though their functions are similar, repeated development is unavoidable.

2. Kernel layer: UC uses the Apollo kernel, but business parties also need to use other playback kernels.

So how did we use Flutter to build a universal playback shell that thoroughly solves the reuse and customization problems across business scenarios? The following introduces four aspects.

There are two common techniques for rendering images into Flutter: Platform Views and shared textures. In terms of performance, shared textures are faster because they do not require merging the UI and platform threads. With shared textures, however, the upfront cost is higher, because the shell has to be redesigned and redeveloped; in return, the cross-platform capability it brings is far more reusable in later iterations. Considering our business scenarios, we ultimately adopted the shared-texture solution.
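
As a minimal sketch of the shared-texture approach (not Auror's actual API), the widget below asks a hypothetical platform channel to create a native player backed by an external texture, then renders that texture id with Flutter's Texture widget.

```dart
import 'package:flutter/material.dart';
import 'package:flutter/services.dart';

/// Hypothetical channel: the platform side is assumed to create a player,
/// attach it to an external texture, and return the texture id.
const MethodChannel _playerChannel = MethodChannel('demo/texture_player');

class TextureVideoView extends StatefulWidget {
  const TextureVideoView({super.key, required this.url});
  final String url;

  @override
  State<TextureVideoView> createState() => _TextureVideoViewState();
}

class _TextureVideoViewState extends State<TextureVideoView> {
  int? _textureId;

  @override
  void initState() {
    super.initState();
    _playerChannel.invokeMethod<int>('create', {'url': widget.url}).then((id) {
      if (mounted) setState(() => _textureId = id);
    });
  }

  @override
  Widget build(BuildContext context) {
    // Frames rendered by the native player show up through this texture,
    // so no Platform View (and no UI/platform thread merging) is needed.
    // A real shell would also release the native player in dispose().
    return _textureId == null
        ? const SizedBox.shrink()
        : Texture(textureId: _textureId!);
  }
}
```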

For kernel selection, we decided to hand the choice back to the business side: we provide a playback adapter at the platform layer, and a kernel SDK is plugged in simply by implementing the adapter's interface, without worrying about the shell's internal details.
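
In the talk the adapter lives at the platform layer; purely to illustrate the shape of the contract, here is a hypothetical Dart-level version, with Apollo or any other kernel plugged in behind the same interface. All names are made up.

```dart
/// Hypothetical adapter interface illustrating "implement the interface,
/// ignore the shell internals"; not the real platform-layer API.
abstract class PlaybackKernelAdapter {
  Future<void> setDataSource(String url);
  Future<void> start();
  Future<void> pause();
  Future<void> seekTo(Duration position);
  Future<void> release();
}

/// A business party swaps kernels by providing its own implementation.
class ApolloKernelAdapter implements PlaybackKernelAdapter {
  @override
  Future<void> setDataSource(String url) async {/* call the Apollo SDK */}
  @override
  Future<void> start() async {/* ... */}
  @override
  Future<void> pause() async {/* ... */}
  @override
  Future<void> seekTo(Duration position) async {/* ... */}
  @override
  Future<void> release() async {/* ... */}
}
```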

Next, we describe how we abstracted the commonalities between businesses to achieve a unified architecture that supports flexible customization and extension of the playback shell. The four images on the right show the business scenarios in which we use the player. You can see that the shell differences across these scenarios lie mainly in the layout structure and control styles; the last image also has some customized functions such as following and sharing. Of course, there is plenty in common between them. To distill what the scenarios share and still support differentiated customization by each business side, we implemented a plug-in-based playback shell.

1. First, the plug-in definition: a plug-in is a UI element on the playback panel, built on top of the general playback-control management and message-receiving mechanisms.

2. Before plug-ins are used, they are configured by building a plug-in tree, which describes the hierarchical relationships between plug-ins and allows a plug-in to have its own child plug-ins.

3. At run time, you can pass in a custom style describing common playback-control properties, such as the titleBar font color and the play-button icon, and the plug-ins are displayed on the playback panel according to their configuration. At the bottom layer, the common playback functions are abstracted out, the playback-control state is managed with Redux and listened to by plug-ins on demand, and plug-ins remain decoupled from one another by sending messages through a unified EventManager. A minimal configuration sketch follows.
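
The configuration step might look roughly like the sketch below; every type here is a made-up stand-in, since the Auror plug-in API is not public.

```dart
import 'package:flutter/material.dart';

/// Illustrative stand-ins for the plug-in shell's building blocks.
class PlayControlStyle {
  final Color titleColor;
  final IconData playIcon;
  const PlayControlStyle({required this.titleColor, required this.playIcon});
}

abstract class PlayerPlugin {
  final List<PlayerPlugin> children;
  const PlayerPlugin({this.children = const []});

  /// Each plug-in renders one UI element of the playback panel.
  Widget build(BuildContext context, PlayControlStyle style);
}

class TitleBarPlugin extends PlayerPlugin {
  const TitleBarPlugin({super.children});
  @override
  Widget build(BuildContext context, PlayControlStyle style) =>
      Text('title', style: TextStyle(color: style.titleColor));
}

class PlayButtonPlugin extends PlayerPlugin {
  const PlayButtonPlugin();
  @override
  Widget build(BuildContext context, PlayControlStyle style) =>
      Icon(style.playIcon);
}

class SharePlugin extends PlayerPlugin {
  const SharePlugin();
  @override
  Widget build(BuildContext context, PlayControlStyle style) =>
      const Icon(Icons.share);
}

/// The plug-in tree describes the panel hierarchy; a business side adds or
/// removes nodes here instead of modifying the shell's framework code.
const pluginTree = TitleBarPlugin(children: [
  SharePlugin(),
  PlayButtonPlugin(),
]);

/// The custom style carries the shared playback-control properties.
const style = PlayControlStyle(
  titleColor: Colors.white,
  playIcon: Icons.play_arrow,
);
```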

In addition to the plug-in shell, we also worked on the performance of playback scenes. The video card list is the most common video playback pattern in the information flow, so the next part introduces how we achieve near-instant video start in this scene.

The top-right corner shows the list of tasks involved between scrolling the list and displaying the video picture. The main optimization idea is to perform the time-consuming tasks in advance and to preload the next video source while the current one is playing:

1. First, the playback kernel supports pre-downloading 200–300 KB of data; the exact size depends on the average bit rate delivered by the server, and the downloaded content should cover roughly 100–200 ms of playback.

2. Next, the kernel uses the pre-downloaded data to initialize the demuxer, decoder, and other resources, and renders the first frame of the video.

3. With preloading in place, players are created and destroyed more frequently, so we added a player instance pool to avoid creating them repeatedly.

4. When the list is being flung quickly, there is no need to preload the content of every card, so we added a target-detection mechanism that decides whether a video card should be preloaded based on the list's scroll speed (see the sketch after this list).

5. To limit the impact of preloading on app memory and on the server, a guard mechanism reduces the pool capacity and the number of preload requests when the memory watermark is too high or a QPS warning is received from the server.
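
A rough sketch of the scroll-speed gate from point 4: the threshold, the callback, and the widget itself are illustrative assumptions, not the production implementation.

```dart
import 'package:flutter/widgets.dart';

/// Skips preload while the list is being flung; triggers it once the
/// estimated scroll speed drops below a (tunable, assumed) threshold.
class PreloadGate extends StatefulWidget {
  const PreloadGate({super.key, required this.child, required this.onPreload});
  final Widget child;
  final VoidCallback onPreload; // preload the upcoming video cards

  @override
  State<PreloadGate> createState() => _PreloadGateState();
}

class _PreloadGateState extends State<PreloadGate> {
  static const double _maxPreloadSpeed = 2000; // px/s, illustrative value
  double _lastPixels = 0;
  DateTime _lastTime = DateTime.now();

  bool _onScroll(ScrollNotification n) {
    if (n is ScrollUpdateNotification) {
      final now = DateTime.now();
      final dt = now.difference(_lastTime).inMicroseconds / 1e6;
      final double speed =
          dt > 0 ? (n.metrics.pixels - _lastPixels).abs() / dt : 0.0;
      _lastPixels = n.metrics.pixels;
      _lastTime = now;
      if (speed < _maxPreloadSpeed) widget.onPreload();
    }
    return false; // let the notification keep bubbling
  }

  @override
  Widget build(BuildContext context) =>
      NotificationListener<ScrollNotification>(
        onNotification: _onScroll,
        child: widget.child,
      );
}
```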

After this series of optimizations, the average start-play time dropped from 561 ms to 118 ms, and subjectively the video now feels like it starts instantly.

The next section describes how to build an audio and video editor on Flutter.

In recent years, tools for video content production have attracted frequent investment, and the market has proved large enough. Within UC, demand for multi-video editing keeps growing, and creators' expectations for editing performance and output video quality keep rising as well. Implementing these features, from the editing interface down to the kernel capabilities, is complex, and we hoped that a Flutter & C++ technology stack would solve the problem of repeated iteration once and for all.

To sum up, the challenges we face fall into two areas, functionality and performance, which I will introduce next.

First, what is multi-video editing? Multiple pieces of multimedia material can be imported before editing: video files, audio files, or pictures. After import, the editing framework manages the material in a graph-like structure built around two important concepts. One is the fragment: a fragment maps the content of a piece of material within a specified time range. The other is the track: a track contains multiple fragments and comes in two types, video tracks and audio tracks. The material goes through frame decoding, effects processing, and track fusion, and the resulting audio and video data feeds the higher-level advanced functions. During editing, clipping, splitting, reordering, and various spatial and temporal effects are supported, and the result can be previewed in real time.
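
To make the fragment/track vocabulary concrete, here is an illustrative data model; the real Kaleido/LLVO types are not public, so these names are only stand-ins.

```dart
/// Illustrative timeline model: fragments map a time range of a source
/// material, tracks hold fragments, and the timeline holds tracks.
enum TrackType { video, audio }

class Fragment {
  final String sourcePath;    // imported material (video, audio, or picture)
  final Duration sourceStart; // start of the mapped range inside the source
  final Duration sourceEnd;   // end of the mapped range inside the source
  const Fragment(this.sourcePath, this.sourceStart, this.sourceEnd);

  Duration get length => sourceEnd - sourceStart;
}

class Track {
  final TrackType type;
  final List<Fragment> fragments;
  Track(this.type, [List<Fragment>? fragments]) : fragments = fragments ?? [];
}

class Timeline {
  final List<Track> tracks = [];
}
```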

This is the edit shell implemented in Flutter. The top layer encapsulates a series of editing UI components that the business side can use to build its own editing interface. Below the UI component layer sit two important modules: task management and state management. Task management wraps the atomic operations provided by the kernel into higher-level task operations; for example, after fragment A is deleted, fragments B and C need to be shifted at the same time, and a single high-level task can complete the whole operation. Because the task-management module interacts with the kernel very frequently, it calls into C++ directly via FFI instead of going through the traditional MethodChannel, avoiding the cost of passing messages between threads. The state-management module stores the fragment and track structure after each edit in a stack, which supports undo and redo during editing and can also be handed to the draft-box module for persistence.
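
A minimal sketch of the state-management idea: each edit pushes an immutable snapshot of the fragment/track structure onto a stack that supports undo and redo. `TimelineState` is a stand-in for whatever the real draft model looks like.

```dart
/// Snapshot of the fragment/track structure after one edit.
class TimelineState {
  final String json; // serialized structure, e.g. for the draft box
  const TimelineState(this.json);
}

class EditHistory {
  final List<TimelineState> _undo = [];
  final List<TimelineState> _redo = [];

  void commit(TimelineState next) {
    _undo.add(next);
    _redo.clear(); // a new edit invalidates the redo branch
  }

  TimelineState? undo() {
    if (_undo.length < 2) return null; // keep at least the initial state
    _redo.add(_undo.removeLast());
    return _undo.last;
  }

  TimelineState? redo() {
    if (_redo.isEmpty) return null;
    final s = _redo.removeLast();
    _undo.add(s);
    return s;
  }

  TimelineState? get current => _undo.isEmpty ? null : _undo.last;
}
```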

This is the overall architecture of Kaleido. On top is the shell introduced above. The engine layer below is divided into two parts: the upper part contains advanced editing functions such as preview, screenshot generation, and the synthesizer, while the lower part contains the core modules of multi-video editing, mainly the multi-fragment, multi-track decoding scheduler and the audio/video effects processing and fusion module. The following sections detail some of the core challenges encountered while implementing a multi-video editor.

Seek is a high-frequency operation in editing, so how do we keep the picture updating in real time? Video frames come in three types: I, B, and P frames. B and P frames cannot be decoded independently and greatly outnumber I frames, so a seek is much more likely to land on a B or P frame. As a result, decoding during a seek generally cannot keep up with rendering.

1. General optimizations first. A GOP contains some non-reference frames that no other frame depends on for decoding, so in the case shown in the diagram the intermediate non-reference frames can simply be skipped. In addition, when using FFmpeg or the system multimedia APIs, triggering a seek usually jumps back to a key frame before the specified time, which in the pictured case also leads to repeated decoding; so we first query the frame information around the current frame and the target frame to decide whether to keep decoding forward or to trigger an actual seek.

2. We also optimize for specific scenarios. When the user slowly drags the SeekBar to adjust an editing effect, the cache queue is searched first and the screen is updated immediately if a usable frame is found. The cache queue is a mechanism designed to improve the framework's concurrency: it is divided into two areas, a recycle area holding frames that have already been used and a preload area holding frames that are decoded but not yet used. When a seek lands in the recycle area and there is not enough usable data there, the sizes of the two areas are adjusted and the decoding thread back-fills the cache queue in reverse.

3. When the user drags the SeekBar quickly, the accuracy requirement for the picture drops, so each seek task is given a timeout; once it expires, the last decoded frame is used to update the screen, while ensuring that the picture changes in the same direction as the SeekBar drag. This way the picture keeps updating continuously and quickly while dragging, which improves the subjective feeling of smoothness (a minimal sketch of the timeout idea follows).
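
Sketch of the fast-drag behaviour from point 3: give each seek a small time budget and fall back to the last decoded frame when it expires. `Frame`, `decodeNearest`, and the budget value are illustrative stand-ins for the real kernel calls.

```dart
/// A decoded frame with its presentation timestamp.
class Frame {
  final Duration pts;
  const Frame(this.pts);
}

/// Resolve a seek within a time budget; on timeout, reuse the last decoded
/// frame so the picture keeps moving while the SeekBar is dragged quickly.
Future<Frame> seekWithBudget({
  required Future<Frame> Function(Duration target) decodeNearest,
  required Duration target,
  required Frame lastDecoded,
  Duration budget = const Duration(milliseconds: 80),
}) {
  return decodeNearest(target).timeout(
    budget,
    onTimeout: () => lastDecoded, // accuracy traded for continuous updates
  );
}
```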

The SeekBar displays preview thumbnails of the video to help users locate content quickly, so how do we speed up thumbnail loading?

The normal loading process is shown above; it is simple, but it has some problems. The image below shows the optimized solution: on the Flutter side the thumbnail list is divided into a visible area and a preload area, so that content about to be displayed can be loaded in advance. When a thumbnail's loading request reaches the native side, it first checks whether a usable texture already exists in the texture cache pool; if not, the request is handed over for decoding and rendering, but it is preprocessed first. Preprocessing has two aspects:

First, requests are reordered so that the decoder can keep decoding forward as much as possible.

Second, we check whether any keyframe is available within the range specified by each request. Since a preview thumbnail in editing corresponds to a time slice rather than a single point in time, this range query can noticeably reduce decoding time.

Finally, used textures are recycled to avoid crashes caused by excessive memory use. A sketch of the preprocessing step follows.
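
The two preprocessing steps might look roughly like this; the types and function names are illustrative, not the real native-side code (which lives in C++).

```dart
/// A thumbnail request covers a time slice of the video, not a single point.
class ThumbRequest {
  final Duration start;
  final Duration end;
  const ThumbRequest(this.start, this.end);
}

/// Step 1: reorder pending requests by start time so the decoder mostly
/// keeps decoding forward instead of seeking backwards.
List<ThumbRequest> reorder(List<ThumbRequest> pending) =>
    [...pending]..sort((a, b) => a.start.compareTo(b.start));

/// Step 2: range query. Any keyframe inside [start, end) can satisfy the
/// request cheaply, because keyframes decode without depending on others.
Duration? keyframeFor(ThumbRequest r, List<Duration> keyframes) {
  for (final k in keyframes) {
    if (k >= r.start && k < r.end) return k;
  }
  return null;
}
```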

Using hardware codecs to speed up video synthesis is common practice, but hardware codecs have their own problems, and our real business scenarios have different requirements for picture quality, output size, and synthesis speed, so simply using hardware encoding cannot meet every need. How, then, should the synthesizer be implemented? To satisfy the different demands of each business party, we implemented a device-adaptive codec mechanism: the client automatically runs benchmarks over the various combinations of software/hardware codecs and encoding gears and records a final score for each combination; when resources are needed, the best parameters are chosen according to the scene's requirements.
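
The selection step could be sketched as below; the profile fields, scores, and constraints are assumptions used only to show the "benchmark once, pick per scene" idea.

```dart
/// One benchmarked (codec, output gear) combination on this device.
class CodecProfile {
  final String codec;   // e.g. 'hw-h264' or 'sw-x264' (illustrative labels)
  final int height;     // output gear, e.g. 720 or 1080
  final double fps;     // measured encode speed on this device
  final double quality; // measured quality score from the benchmark
  const CodecProfile(this.codec, this.height, this.fps, this.quality);
}

/// Pick the best profile that satisfies a scene's speed/quality constraints,
/// preferring the largest output gear among the candidates.
CodecProfile? pickProfile(
  List<CodecProfile> benchmarked, {
  required double minFps,
  required double minQuality,
}) {
  final candidates = benchmarked
      .where((p) => p.fps >= minFps && p.quality >= minQuality)
      .toList()
    ..sort((a, b) => b.height.compareTo(a.height));
  return candidates.isEmpty ? null : candidates.first;
}
```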

Now that a software encoder has been added, how do we get the most out of it? In terms of performance, besides tuning the software codec parameters, we found that the biggest bottleneck was the rendering stage. The figure above shows the general flow; there are time-consuming operations on both the decoding side and the encoding side. For these, the main optimizations are to convert pixel formats on the GPU and to use higher-performance data-transfer paths. On the decoding side, PBOs are used to transfer data; compatibility issues need attention here, so the speed should be compared against the traditional path before enabling it, and once the GPU receives the YUV data a shader converts it into an RGBA texture. On the encoding side, ImageReader, which has better compatibility, is used. Since OpenGL outputs RGBA data, the YUV data can be stored as shown in the example. It is worth noting that the UV data should be output interleaved whenever possible: that way the RGBA data only needs to be sampled once to compute both U and V, which roughly doubles performance compared with non-interleaved UV output.

In video production scenarios, composition and publishing usually coexist but are executed serially; we wanted to run them concurrently to speed up publishing overall. Implementations, however, often hit the problem shown in the figure: the composition process modifies content that has already been uploaded. To understand why, it helps to look at the structure of MP4. An MP4 file consists of multiple boxes; when FFmpeg muxes the video, it first fills a box's body with content and then writes the content's size into the box header. The mdat box stores the encoded media data, and one mdat can serve as a minimum upload shard. However, FFmpeg normally stores all media data in a single mdat, so we optimized on top of FFmpeg, replacing the single-mdat structure with multiple mdat boxes so that composition and upload can proceed side by side.

With the rapid development of mobile hardware, ultra-high-definition cameras are becoming the norm and the videos in users' albums keep growing in resolution. A 4K video frame in RGBA format reaches about 30 MB, and even in YUV format it is around 10 MB. To improve concurrency, an editing framework keeps various cache queues, which means many audio and video frames may be alive at the same time. So how do we optimize memory usage during editing?

1. On the decoding side, prefer the hardware decoder: its memory mostly lives in the multimedia process, so the impact on the current process is smaller.

2. Fragments from the same source should reuse the same decoder.

3. Because the number of hardware decoder instances on Android is limited, the software decoder (FFmpeg) is also needed. The software decoder maintains a frame memory pool, which is FFmpeg's way of reducing memory churn, but the pool grows without shrinking, which does not suit multi-video editing. So the pool's resources should be released immediately when playback transitions between two fragments from different sources.

4. In the preview scene, the preview resolution may be smaller than the actual video's, so video frames can be downscaled in advance and the memory held by the original frames released immediately.

5. On the rendering side, we encapsulated a series of common base GL filters to avoid memory abuse caused by poorly implemented effects.

6. Finally, a GL memory pool is built to reduce the total memory footprint: scenarios that share a GL context request resources from the pool and return them immediately after use (a minimal sketch of the pooling idea follows this list).
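
A minimal sketch of the pooling idea in point 6, using plain byte buffers as a stand-in for GL resources (the real pool is C++ and GL-specific): borrow from the pool, return right after use, and cap how much the pool may keep alive.

```dart
import 'dart:typed_data';

class FramePool {
  FramePool(this.maxPooled);
  final int maxPooled;            // cap on idle buffers kept alive
  final List<Uint8List> _free = [];

  /// Borrow a buffer at least `bytes` large, reusing a pooled one if possible.
  Uint8List acquire(int bytes) {
    final i = _free.indexWhere((b) => b.lengthInBytes >= bytes);
    return i >= 0 ? _free.removeAt(i) : Uint8List(bytes);
  }

  /// Return a buffer immediately after use so the total footprint stays bounded.
  void release(Uint8List buf) {
    if (_free.length < maxPooled) _free.add(buf); // otherwise let GC reclaim it
  }
}
```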

Finally, our future plans.

1. First, on the editing side we hope to try H.265 on some high-end devices. Compared with the H.264 we currently use, it has a higher compression ratio, meaning smaller files at the same picture quality, but the challenge is that decoding becomes correspondingly slower.

2. On the playback side, whose architecture is relatively simple, we will try upgrading to the next-generation AV1 codec.

3. The industry trend toward intelligent mobile production is increasingly clear, and the barrier to using mobile production tools will keep dropping. Here we plan to combine our current business needs to implement two functions: intelligent one-tap video creation and adaptive encoding.

4. The community currently has relatively few open-source multimedia development libraries. We hope to build a standalone audio and video development library and open-source it to give back to the community.

Search for “U4 kernel technology” to get the latest technology updates.