Background

The Lunar New Year is just around the corner. In recent years, alongside the Spring Festival Gala, a new nationwide activity has emerged: Alipay's AR "scan Fu" (scan-a-blessing) campaign. Starting with the 2017 Spring Festival, Alipay introduced AR real-scene red envelopes: a red envelope is hidden in a specific physical scene, and users open the AR camera to go looking for it. This novel LBS-based red envelope set off a craze.

For the 2018 Spring Festival, Alipay upgraded the technology and launched an activity in which anyone could scan any "Fu" character to collect blessings: whether the character was printed or handwritten, Alipay could recognize it. Beyond scanning Fu, Alipay has many similar requirements, such as bank card recognition and ID card recognition, and the xMedia on-device multimedia intelligent application framework grew out of these needs.

“End Intelligence”

AR Fu scanning recognizes very quickly even under poor network conditions. Unlike the typical flow of uploading a picture for recognition, it does not require users to select and upload images. Why? Besides algorithm and model optimization, an important reason is that the recognition engine runs on the device, which is also one reason on-device AI has become so popular recently. We call this "end intelligence" (on-device intelligence). Why run the engine on the device instead of in the cloud, and how does it differ from intelligence running in the cloud? There are two main reasons. First, many demands on the device side cannot be well met by the cloud, so the device has clear advantages in many scenarios. Second, the conditions on the device side have matured.

  • Advantages of the on-device recognition engine:

    Cost: AI computing is not cheap. For a large-scale activity like Fu scanning, if all the computation were deployed in the cloud, the server traffic, computing and storage costs would be unimaginable. The mobile device itself already holds these resources, adding up to an enormous pool of local edge computing power.

    Experience: At present the average Fu scan takes a few hundred milliseconds. If recognition went through the cloud, the time spent transmitting image data over the network would badly hurt the experience, especially on poor networks. Many people have experienced grabbing train tickets at New Year: when a huge amount of traffic arrives at once, users face endless spinners and retries. On the device there is no large data transfer over the network and no massive concurrent access, so this scenario behaves much better.

    Privacy: In the era of big data, people are increasingly aware of personal privacy, and the cloud carries the risk of uploading sensitive data. For example, ZAO, a recently popular face-swapping app, requires uploading a photo of your face, which led many people to give up on it. If the computation stays entirely on the device, all data remains on the user's phone, and there is no risk of uploading sensitive data as there is with the cloud.
  • Conditions on the device side are mature:

Besides the advantages of the device over the cloud, another, more fundamental reason is that the conditions on the device side are now mature enough to support intelligent computing. The main factors are as follows:

  1. First, the computing power of terminal devices has increased. Powerful CPUs and GPUs can handle very complex computation, and dedicated AI chips are being integrated into many phones, further strengthening that capability.
  2. Second, mobile AI inference engines, such as Alipay's xNN and Google's TensorFlow Lite, make it possible to run AI on mobile devices. At the same time, advances in model compression allow models that used to be hundreds of megabytes in the cloud to be stored on a phone.
  3. Finally, many real-time application scenarios similar to Wufu (the Five Blessings campaign) are emerging on the device, such as Alipay's AR flower recognition and car recognition, all of which recognize live video streams.

AI has brought a great deal of innovation to the business, and booming business has in turn driven rapid progress in AI technology. Landing these scenarios quickly on mobile has therefore become extremely important, and the xMedia on-device intelligent application framework was built to solve exactly this problem.

The on-device intelligent application framework xMedia

“The On-Device Intelligence Workflow”

Since the AI runs on the device, what does the overall development process look like? Taking Fu scanning as an example, consider the following flow chart:

It is divided into two main parts. The first is producing the model for the Fu-scanning scenario. This differs little from building a model for the cloud and is generally handled by algorithm engineers; it includes data collection, algorithm design, model training, model compression, model deployment and other steps.

Once the model is ready, the next step is the device-side work, which is handled by mobile developers. The first stage is data collection: Fu scanning, for example, requires opening the camera, and some scenarios also need data from other sensors, for instance to judge whether the phone is being held still. After collecting the data, you cannot feed the image directly to the engine. Why? The raw image resolution is very high, perhaps 1080p, and the engine cannot process such an image directly, so the image must first be cropped and scaled. There is also the handling of the detection region: only the area in the center of the screen is passed to the algorithm, which is a performance optimization. At runtime, not every collected frame can simply be pushed downstream for processing; scheduling strategies are needed, such as controlling the frame rate at which the algorithm runs. Then comes the AI engine itself. After the engine computes a result, it cannot be returned to the business as-is; some post-processing is needed to present it, such as the AR animation shown during Fu scanning.
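As a rough illustration of the crop-and-scale step described above (the class name, helper and 300x300 input size are assumptions for this sketch, not Alipay's actual implementation), a high-resolution camera frame might be reduced to a small, centered model input like this:

```java
import android.graphics.Bitmap;

/** Illustrative pre-processing sketch: crop the central region of a camera frame
 *  and scale it down to the model's input size (300x300 is an assumed value). */
public final class FramePreprocessor {
    private static final int MODEL_INPUT_SIZE = 300; // assumed model input size

    public static Bitmap prepare(Bitmap cameraFrame) {
        // Crop a centered square so only the middle of the screen reaches the algorithm.
        int side = Math.min(cameraFrame.getWidth(), cameraFrame.getHeight());
        int left = (cameraFrame.getWidth() - side) / 2;
        int top = (cameraFrame.getHeight() - side) / 2;
        Bitmap centered = Bitmap.createBitmap(cameraFrame, left, top, side, side);

        // Scale the high-resolution crop (e.g. from a 1080p frame) to the engine's input size.
        return Bitmap.createScaledBitmap(centered, MODEL_INPUT_SIZE, MODEL_INPUT_SIZE, true);
    }
}
```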

The whole flow looks simple, but a great deal of detail is involved. For example: what if the next frame arrives before the previous frame has finished processing? What if the device heats up badly? What if device coverage is too low? And so on. The success of on-device multimedia AI applications therefore depends not only on a good model, but also on resource scheduling, data synchronization, how quickly models can be deployed and how many users can be reached, which requires joint effort from algorithm engineers and mobile developers.

Beyond AR Fu scanning, Alipay has many other similar intelligence requirements. The xMedia framework exists to solve the problems these scenarios have in common.

“The Challenges”

As the workflow above shows, on-device AI application scenarios face challenges in four main areas:

  • Engineering efficiency and barrier to entry
  • Diversity of device hardware and software
  • Device resources versus algorithm demands
  • Limitations of on-device intelligence

First, engineering efficiency and the barrier to entry. Speed is one of the most important conditions for success in the Internet era ("in all the world's martial arts, only speed is unbeatable"). The challenge is how to meet varied business needs in an elastic, flexible way, how to deploy and upgrade models without shipping a new app version, and how vertical businesses with limited technical resources can integrate quickly. There are two main answers. One is atomizing capabilities, so that businesses can compose them quickly and conveniently. The other is encapsulating vertical scenarios, packaging bank card OCR, ID card OCR and similar scenarios into components that can be used directly, lowering the barrier for such use cases.

The second challenge is the diversity of device hardware and software, something Android developers know all too well. With thousands of phone models and more than 50 GPU variants, even "simple" compatibility and coverage problems are a developer's nightmare. The challenge is how to reach more users and give low-end devices a decent experience. We address this diversity mainly through configuration strategies along different dimensions, with compatibility handling targeted at the characteristics of the various GPUs.

The third challenge is device resources versus algorithm demands: how to balance quality against power consumption. If Alipay's AR scanning drained 30% of the battery in one minute, Alipay would face a flood of user complaints. Besides aggressive optimization of the algorithm itself, we also use adaptive scheduling strategies, described in detail below.

Finally, although on-device intelligence has advantages in many scenarios, it still has limitations compared with the cloud: recognition accuracy and processing capacity cannot match what the cloud can do. This can be addressed by combining the device with the cloud.

As noted above, more and more intelligent services in Alipay face these same problems. Solving them is the central task of the xMedia framework, and the solutions to these challenges are what define its main characteristics.

“The xMedia Technology Big Picture”

To address the challenges mentioned earlier, the xMedia framework evolved into the following architecture, with the most important parts at the framework layer and capability layer.

Start with the capability layer. As mentioned earlier, the idea behind the framework is to atomize processing capabilities so they can be combined quickly and easily to describe an entire business scenario. This resembles the operators in an AI engine, but with a difference: AI-engine operators are very fine-grained, whereas the operators here are more abstract, for example a detection/classification algorithm, a general OCR algorithm, or an image recognition capability, and they are oriented toward business developers rather than algorithm engineers. There is another difference as well: an operator here wraps not just an algorithm, but also rendering capabilities, data-source collection capabilities and so on. To distinguish them, we call these atomized capabilities functors. The business assembles these functors to fulfill its requirements, and how they are assembled is the job of the framework layer.

The framework layer plays a role similar to a model in AI, which combines operators into a directed acyclic graph describing the algorithm's overall processing flow. In xMedia, we assemble the functors of the capability layer into a graph via a protocol that describes the processing flow of the whole business. Because these functors contain not only algorithms but also rendering, capture and so on, they can form a complete processing pipeline. Besides building these graphs that represent the business, the framework layer is also responsible for algorithm scheduling and control, data synchronization, dynamic parameter configuration and other concerns. In a nutshell, the capability layer is responsible for functors, and the framework layer is responsible for making functors work as efficiently as possible.
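As a rough sketch of what such an abstraction could look like (the interface and class names here are illustrative assumptions, not xMedia's actual API), a functor simply consumes data from upstream nodes and forwards its output downstream:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative functor abstraction: each node consumes one piece of data and emits a result. */
interface Functor {
    Object process(Object input);          // e.g. camera frame in, detection result out
}

/** Minimal DAG node: runs its functor, then pushes the result to all downstream nodes. */
final class Node {
    private final Functor functor;
    private final List<Node> downstream = new ArrayList<>();

    Node(Functor functor) { this.functor = functor; }

    Node connectTo(Node next) { downstream.add(next); return next; }

    void onData(Object input) {
        Object output = functor.process(input);
        for (Node next : downstream) {
            next.onData(output);           // a real framework would schedule and queue here
        }
    }
}
```

A gesture pipeline could then be wired as camera source, gesture detector, renderer; in xMedia that wiring is driven by a protocol rather than hard-coded, as described next.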

At the interface layer, xMedia provides a variety of APIs, covering Android and iOS on the device. To make front-end usage easier, H5 and Alipay Mini Program APIs are also provided. For external users, xMedia is also delivered independently through mPaaS (mPaaS is a complete app solution from Ant Financial, used in a large number of bank and insurance-company apps).

Since there are many features and limited space, we will introduce three of them, which roughly follow the full process of developing a business:

  1. How to respond flexibly to demands
  2. How to make the algorithm work better
  3. How to improve coverage

1. How to respond flexibly to demands

As mentioned in the previous section, there are two main ideas for dealing with complex business requirements:

  1. Atomize capabilities into functors, which can be extended flexibly;
  2. Use a protocol to assemble these functors into a directed acyclic graph that describes the business process.

Take the following figure as an example:

The figure has two parts. The upper part runs on the client, where there are multiple business entry points. A business passes the protocol describing its process into the framework, which parses the protocol and assembles the corresponding functors. For different businesses, the protocol can be configured from the cloud.

A graph generally has three parts: input, intermediate processing and output. Input can be data from a microphone, camera, sensors and so on; once collected, it flows into the processing functors behind it. Because each input source is itself an abstract operator, a functor, inputs are easy to extend: when a new type of input data appears, a new functor can be added and joined to the rest of the graph. The middle part usually consists of algorithm functors, such as inference, OCR or image recognition; they are the consumers of the input data. Each functor processes a data frame and then passes it to the next node in the graph. The last part presents the output to the user, for example overlaying gesture key points on camera frames, or compositing 3D models of an AR scene into camera frames.

Because functors can be freely extended and assembled into graphs, the business gains flexibility. This mirrors how deep learning frameworks work: they grow their capabilities by adding operators, and combine operators through model files to implement different algorithms.

Gesture detection

Let’s use an online example to illustrate how this works.

Take the "chick exercise" of the Ant Manor Sports Meeting as an example (it can be experienced via Alipay – Ant Manor – Sports Meeting – hand exercise). This is a gesture recognition scenario: the AI engine detects the user's gestures in the camera feed, and success is judged when the user's gesture matches the gesture suggested on the UI. So, beyond the gesture recognition model itself, how does this scenario use the ideas described above?

Below is a schematic of the protocol description file for this business. It describes the whole process: which functor executes each step, what its parameters are, and which functor the output of each functor flows to:

From this description file, the xMedia framework generates at runtime the flow graph shown below: data starts at the source, is passed to each functor for processing, and is finally rendered to the screen by the preview functor.
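To make the idea concrete, a protocol of this kind might look roughly like the following sketch (the schema, field names, functor identifiers and model file name are assumptions for illustration, not xMedia's real protocol format), wiring a camera source through a gesture-detection functor into a preview/render functor:

```java
/** Illustrative only: a protocol string describing the gesture scene as a functor DAG.
 *  The JSON schema and functor names below are assumed, not xMedia's actual format. */
final class GestureSceneProtocol {
    static final String JSON =
            "{\n" +
            "  \"nodes\": [\n" +
            "    {\"id\": \"camera\",  \"functor\": \"CameraSource\",    \"params\": {\"fps\": 30}},\n" +
            "    {\"id\": \"detect\",  \"functor\": \"GestureDetector\", \"params\": {\"model\": \"gesture.model\"}},\n" +
            "    {\"id\": \"preview\", \"functor\": \"PreviewRenderer\",  \"params\": {}}\n" +
            "  ],\n" +
            "  \"edges\": [\n" +
            "    {\"from\": \"camera\", \"to\": \"detect\"},\n" +
            "    {\"from\": \"detect\", \"to\": \"preview\"}\n" +
            "  ]\n" +
            "}";
}
```

Because the graph is data rather than code, a new business flow or a changed parameter can be pushed from the cloud without shipping a new app version.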

This solves the business flexibility problem well. On our side, all we need to do is extend the atomized capabilities and the control capabilities of the framework layer. On the business side, besides providing models, the business can quickly and clearly describe its requirements with the protocol, and dynamic updates become very convenient.

In this example the protocol describes the overall business process, but beyond the process itself there are other factors that strongly determine the final experience, such as data synchronization, scheduling and coverage. Let's look at one very important part: algorithm scheduling.

2. Adaptive scheduling

Look back at the chick-exercise example above: camera frames are collected and passed to the algorithm. The camera frame rate is generally 30 FPS, and if every frame were handed to the algorithm, the results would be largely redundant, because consecutive frames differ very little; running the algorithm at 30 FPS means most of the work is wasted. At the same time, such a high frame rate pushes CPU usage very high, causing serious heating and draining the battery at a visibly fast rate.

On the other hand, if the frame rate is too low, the algorithm cannot produce results in time, which hurts the user experience. Choosing an appropriate frame rate is therefore very important. But with so many phone models, different devices have different processing capabilities for different algorithms, so the same frame rate cannot be used for all of them. What then?

The most direct way is to configure a pre-validated frame rate per algorithm for high-end and low-end devices. This is simple and crude, and the brute-force approach runs into new problems:

  • What if the CPU load from other apps on the phone differs from the test environment?
  • How do we configure Android's thousands of device models? Even with the dimensions reduced, this is very heavy manual work.
  • How do we configure newly released devices?

To solve these problems, we need a way to set the frame rate dynamically based on the current device. One idea is to bucket devices into tiers by hardware parameters such as CPU and memory, and configure a frame rate per tier, which would save a lot of manual work. But there is still a problem: a phone's background load varies. Sometimes background tasks are very light, sometimes very heavy, and the CPU frequency may even be throttled. Controlling the frame rate based only on the device's static configuration cannot give the best result; a more "dynamic" way of choosing the frame rate is needed.

Our approach is to keep CPU usage at a low level while choosing a frame rate that does not hurt the user experience, achieving adaptive algorithm scheduling and thereby controlling power consumption. The idea is much like an air conditioner's temperature regulation, which holds a room at a fixed temperature, a very common pattern in industrial control: the controller holds the CPU at a fixed target value and thereby controls the frame rate automatically. It involves three main aspects:

First, a CPU monitoring module obtains the current CPU usage in real time and feeds it to the controller. Second, the CPU usage is given a target value and controlled dynamically: the controller takes the current CPU usage and the current frame rate as input and computes the frame rate (or interval) for the next algorithm run. Finally, to protect the user experience, maximum and minimum frame rates are set as guard rails against abnormal cases.
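The text does not spell out the exact control law, so the following is only a minimal sketch of such a feedback controller, assuming a simple proportional adjustment of the interval around a CPU target; the gain, target and bounds are assumed values, not xMedia's real parameters:

```java
/** Illustrative adaptive frame-interval controller: hold CPU usage near a target
 *  by stretching or shrinking the interval between algorithm runs. */
public final class AdaptiveScheduler {
    private static final double TARGET_CPU = 0.45;     // desired CPU usage (45%), assumed
    private static final double GAIN = 2.0;            // proportional gain, assumed
    private static final long MIN_INTERVAL_MS = 100;   // guard rail: highest allowed frame rate
    private static final long MAX_INTERVAL_MS = 2000;  // guard rail: lowest allowed frame rate

    private long intervalMs = 250;                      // current interval between algorithm runs

    /** Called with the latest CPU usage (0.0 to 1.0); returns the interval for the next run. */
    public long nextIntervalMs(double currentCpu) {
        double error = currentCpu - TARGET_CPU;         // above target: slow the algorithm down
        intervalMs = Math.round(intervalMs * (1.0 + GAIN * error));
        // Clamp to the guard rails so the experience never degrades past acceptable limits.
        intervalMs = Math.max(MIN_INTERVAL_MS, Math.min(MAX_INTERVAL_MS, intervalMs));
        return intervalMs;
    }
}
```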

Let's look at the effect of this adjustment in the graph below. It has two parts: the upper part shows CPU usage over time, and the lower part shows the algorithm's output interval over time. (Why interval rather than frame rate? The two are the same thing, since frame rate is the reciprocal of the interval, but the interval is more intuitive here, so the controller outputs the interval until the next frame.) The graph is an example of how effective the control is.

As you can see, the curve can be divided into three sections:

  • In the first section, before 241s, the CPU stays at a low level of about 45% while the interval is about 250ms.
  • At 241s, other tasks suddenly add about 20% of CPU load, pushing overall CPU usage up rapidly to 65%, after which it falls back to 45%. At the same time, the interval in the lower plot rises rapidly from 250ms to about 1700ms, which is exactly why the CPU in the upper plot falls back so quickly: the longer processing interval lowers the frame rate, so CPU usage drops.
  • In the third section, around 840s, the pressure from the other tasks disappears, so the CPU drops quickly to about 20% and then climbs back to about 45%, the initial level. The interval in the lower plot drops back to 250ms and stabilizes, which is what actually brings the CPU back up.

Thus, by automatically controlling the frame rate, this method keeps the CPU at a suitable level, striking a balance between experience and phone heating. With this adaptive scheduling, different devices can each reach a good user experience while avoiding the power drain that comes with an overheating phone.

3. Coverage

Third, let's talk about coverage. Every business wants to deliver its carefully designed experience to as many users as possible. Here we use an AR scenario to describe what the xMedia framework does to improve coverage; AR is a good example because it is typical and clearly illustrates how device diversity affects coverage.

In the video above, the AR algorithm uses vision, simply put, to locate the phone's current position in space; for example, the flower seen through the doorway can always stay anchored to the same spot on the ground. With the algorithm itself already optimized (currently about 18ms per frame on a Huawei P9 Plus), what follows describes the optimizations at the xMedia framework layer on Android.

First, look at the overall process in the figure above. After the camera captures a frame, two things happen: the camera frame is rendered into a texture, and at the same time the frame data is preprocessed and passed to the algorithm together with sensor data. The algorithm then returns the phone's current pose, which is used to render the 3D scene. The next step is key: unlike the earlier gesture recognition example, the data must be synchronized. What does synchronization mean? That the 3D scene is rendered against exactly the same frame as the camera image being shown. The per-frame processing time is therefore the maximum of the camera rendering time and the algorithm-plus-3D-rendering time, and this determines the overall frame rate.

This process doesn’t seem particularly complicated, so what’s the problem? Those of you who have experience in Android camera development probably know that there are two ways to get camera data:

  • One is to directly fetch YUV data for each frame
  • The other is to get the texture of each frame

At first glance it looks easy: pass the YUV data straight to the algorithm, use the camera texture for camera rendering, and combine the two paths once both are done. The biggest problem is that the two are out of sync: the YUV data you receive may be several frames older than the camera texture you receive, and there is no way to match them up. So if the processing must stay synchronized, only one of the two ways of obtaining camera frames can be used.

So which one should we choose? As you may have guessed, the answer is not a simple either-or, but a combination of strategies.

Multi-strategy adaptation

The processing pipelines for the two acquisition methods are completely different, as shown in the figure below:

The first method obtains the YUV data of each frame directly, which is convenient for the algorithm: the data goes straight to the algorithm to compute the current camera pose, and the 3D engine renders the scene from that pose. For camera rendering, however, it is less convenient: the YUV data must be uploaded to the GPU and rendered as a texture frame, which is finally composited with the 3D scene. This approach is intuitive.

The method of taking the camera texture each frame is completely different. For camera rendering, the texture is already what we receive, so no extra step is needed to turn the frame into a texture; it can be used directly. But the algorithm needs pixel data and cannot consume the texture directly, so the YUV data has to be read back from the GPU to the CPU, and that readback is itself complicated, as described in detail below.

Strategy selection

Having introduced the two methods, how do we choose between them? The main idea is to decide based on the characteristics of the GPU, but with more than 50 GPU variants on Android alone, the choice is far from easy. Fortunately, the GPUs in the vast majority of devices fall into two categories:

  • For most Adreno GPUs, texture upload is fast and rendering YUV data directly poses no performance problem. The YUV data from the camera can be rendered while it is also handed to the algorithm, whose pose drives the 3D engine rendering, and the two results are composited once both finish. So this type of GPU uses the first option. Most Qualcomm-platform devices carry such GPUs, accounting for roughly 50% of devices.
  • For Mali GPUs, texture upload is very slow: even without any algorithm processing, just rendering the YUV data from the camera callback produces noticeable stutter, so the first method cannot be used on this type of GPU. For these devices we render directly from the camera preview's GLES11Ext.GL_TEXTURE_EXTERNAL_OES texture as the camera frame (under the hood Android uses EGLImage, an EGL extension that converts the YUV data into this type of texture; it is hardware dependent and not exposed to the application layer).

Then glReadPixels is used to read this frame's texture back from the GPU, but a naive read will not work, for the following reasons:

  1. The camera texture is usually large, say 1280×720, while the algorithm needs something very small, say only 300×300; reading the full data back from the GPU is very time-consuming.
  2. The algorithm only needs the Y channel of the YUV data, so reading everything wastes a lot of bandwidth.
  3. glReadPixels outputs the texture in RGBA format, but all we need is the Y channel; the pixels do not map one-to-one, so the readback cannot be consumed directly without extra processing.

For these reasons, an extra render pass is needed before the read, and that pass does several things:

  1. Crop and scale the data to the target size.
  2. Discard the UV channels and keep only the Y channel.
  3. Pack the Y values in a specific layout so that the data glReadPixels returns is contiguous.

After this render pass, the data glReadPixels returns is already cropped, scaled and Y-only, and can be fed directly to the algorithm. Together these two steps usually take 5-10ms.
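As a rough sketch of this packing trick (the shader, helper and buffer sizes here are assumptions for illustration, not xMedia's actual code), the render pass can sample the external camera texture, keep only luminance, and write four consecutive Y values into the R, G, B and A channels of each output pixel, so that one glReadPixels call returns a dense Y-plane:

```java
/** Illustrative fragment shader: samples the OES camera texture and packs the luminance
 *  of four horizontally adjacent source pixels into one RGBA output pixel. Drawing into
 *  a small (width/4 x height) RGBA target then lets glReadPixels return a contiguous Y-plane. */
static final String PACK_Y_FRAGMENT_SHADER =
        "#extension GL_OES_EGL_image_external : require\n" +
        "precision mediump float;\n" +
        "uniform samplerExternalOES uCameraTex;\n" +
        "uniform float uTexelWidth;  // 1.0 / source width, set from Java\n" +
        "varying vec2 vTexCoord;\n" +
        "float luma(vec2 uv) {\n" +
        "    vec3 rgb = texture2D(uCameraTex, uv).rgb;\n" +
        "    return dot(rgb, vec3(0.299, 0.587, 0.114)); // BT.601 luminance\n" +
        "}\n" +
        "void main() {\n" +
        "    gl_FragColor = vec4(luma(vTexCoord),\n" +
        "                        luma(vTexCoord + vec2(uTexelWidth, 0.0)),\n" +
        "                        luma(vTexCoord + vec2(2.0 * uTexelWidth, 0.0)),\n" +
        "                        luma(vTexCoord + vec2(3.0 * uTexelWidth, 0.0)));\n" +
        "}";

/** After drawing a full-screen quad into the small FBO with the shader above,
 *  a single readback yields the packed Y-plane for the algorithm. */
static java.nio.ByteBuffer readPackedY(int width, int height) {
    // width/4 RGBA pixels per row hold `width` Y values per row.
    java.nio.ByteBuffer yPlane = java.nio.ByteBuffer.allocateDirect(width * height);
    android.opengl.GLES20.glReadPixels(0, 0, width / 4, height,
            android.opengl.GLES20.GL_RGBA, android.opengl.GLES20.GL_UNSIGNED_BYTE, yPlane);
    return yPlane;
}
```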

After the algorithm computes the camera pose, the 3D engine renders the scene frame, which is then composited with the camera preview texture to produce the final frame. Most Huawei and MTK devices use Mali GPUs, accounting for about 40% of the total.

Configuring each device model with the appropriate strategy for its GPU greatly improves coverage. But it is still not enough, and other techniques are needed to push coverage further.

Parallel + cache

The time is spent in two places: GPU rendering and the algorithm (which runs on the CPU). The two are not fully dependent on each other, so running them in parallel improves performance. In addition, using a multi-level FBO cache to hold several camera frames and their YUV data also helps. Note, though, that too many cached FBOs cost graphics memory and extra rendering passes; generally two levels at most are used, depending on the device configuration.
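The following is only a minimal sketch of how that overlap could be organized, assuming a two-slot cache as described above; renderAndReadY() and runAlgorithm() are hypothetical placeholders, and backpressure handling (dropping frames when the algorithm falls behind) is omitted:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative two-slot pipeline: the GL thread renders/reads frame N into one slot
 *  while a worker thread runs the CPU algorithm on frame N-1 from the other slot. */
final class ParallelPipeline {
    private final ByteBuffer[] slots = new ByteBuffer[2];   // two-level cache, as in the text
    private final ExecutorService algoThread = Executors.newSingleThreadExecutor();
    private long frameIndex = 0;

    /** Called on the GL thread for every camera frame. */
    void onCameraFrame() {
        int slot = (int) (frameIndex % 2);
        // GPU work for the current frame: packing render pass + readback of the Y-plane.
        slots[slot] = renderAndReadY(slot);
        // CPU work runs on the worker thread, overlapping the next frame's GPU pass.
        final ByteBuffer yPlane = slots[slot];
        algoThread.submit(() -> runAlgorithm(yPlane));
        frameIndex++;
    }

    private ByteBuffer renderAndReadY(int slot) { /* FBO render + glReadPixels, omitted */ return null; }
    private void runAlgorithm(ByteBuffer yPlane) { /* CPU pose estimation, omitted */ }
}
```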

After these two steps, overall coverage of the algorithm exceeds 75% while maintaining a frame rate of 22fps. The complete strategy is shown here:

Algorithm degradation

With thousands of Android device models, 75% coverage is already high for such a demanding algorithm, but a large number of devices still cannot run it. To reach the remaining users, we raised coverage further through algorithm degradation, for example computing the camera's orientation from the gyroscope and giving up frame-accurate data synchronization. In the end, overall coverage exceeds 99%, at the cost of sacrificing part of the experience.
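A possible shape for the gyroscope-based fallback is sketched below; how xMedia actually wires this in is not described in the text, and the rotation-vector sensor used here fuses the gyroscope with other sensors, so treat this purely as an assumption-laden illustration:

```java
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;

/** Illustrative degraded mode: when the visual algorithm cannot run on a device,
 *  approximate the camera's orientation from the rotation-vector sensor instead. */
final class GyroFallback implements SensorEventListener {
    private final float[] rotationMatrix = new float[9];
    private final float[] orientation = new float[3];   // azimuth, pitch, roll in radians

    void start(SensorManager sm) {
        Sensor rotation = sm.getDefaultSensor(Sensor.TYPE_ROTATION_VECTOR);
        sm.registerListener(this, rotation, SensorManager.SENSOR_DELAY_GAME);
    }

    @Override public void onSensorChanged(SensorEvent event) {
        SensorManager.getRotationMatrixFromVector(rotationMatrix, event.values);
        SensorManager.getOrientation(rotationMatrix, orientation);
        // orientation[] now drives the 3D camera instead of the visual pose estimate.
    }

    @Override public void onAccuracyChanged(Sensor sensor, int accuracy) { }
}
```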

xMedia application scenarios

Finally, beyond scanning Fu for the Five Blessings during Alipay's Spring Festival campaign, the application scenarios can be summarized into four categories:

  • Smart interaction
  • Smart services
  • Intelligent reasoning
  • Risk control safety

Intelligent inference is mainly used for intelligent preloading of Mini Programs in Alipay and for recommendations in the home-page banner strip. Risk-control security includes content security detection and video authentication. Here we briefly introduce the first two categories:

“Intelligent Interaction”

Intelligent interaction uses algorithms to interact with users, mainly in AR. AR replaces traditional touch interaction and brings a completely different experience. The AR Fu scanning and chick exercise mentioned above are such scenarios; the chick exercise can be experienced in Alipay – Ant Manor – Sports Meeting – hand exercise.

The Ant Forest AR tree-viewing activity described in the coverage section can be experienced as follows.

Another head-based interaction is the neck exercise, which can be experienced from Alipay – Small Goals – neck exercise. It is well suited to office workers: set yourself a goal of exercising your neck every day to relieve fatigue.

All of the interaction modes above use AR/AI, and the experience differs greatly from traditional interaction.

“Smart Services”

The second category is intelligent services, which mainly apply to vertical scenarios closely tied to Alipay's business, such as bank card OCR, ID card OCR and Compensation Bao; they can be experienced in the videos below and need no further detail here. On-device recognition is characterized by high accuracy, fast speed and so on, and many of these vertical scenarios have been opened up in Alipay Mini Programs.

How can you experience Alipay's intelligent capabilities?

At present, xMedia has been integrated into mPaaS. By linking the app side with Mini Program scenarios, it delivers intelligent analysis and operation capabilities of the form "data computation and analysis + analysis and decision engine + mPaaS scenarios".

In addition, with capabilities such as intelligent creative copywriting, intelligent distribution, intelligent delivery and A/B experiments, a fully automated marketing process can be formed, genuinely raising the level of marketing automation.

You are also welcome to search the group number "23124039" in DingTalk to join the mPaaS technical exchange group; we look forward to talking with you.