SmileAR is a mobile AR solution developed by iQiyi on top of TensorFlow Lite. It has already been deployed in many iQiyi products, including the iQiyi app, which has more than 100 million daily active users, the popular children's product Qibabu, and the Gingerbread short video app.

iQiyi, one of China's largest online video companies, has a vision of "becoming a great entertainment company driven by technological innovation". Innovation is part of the platform's DNA. iQiyi hopes to keep improving users' entertainment experience through continuous technological innovation, including innovative research in the fields of AI and AR.

Figure 1. SmileAR Logo

SmileAR is a mobile AR solution developed by iQiyi on top of TensorFlow Lite. It includes basic algorithms such as face detection, face key point detection, human body key point detection, portrait segmentation, gesture recognition and object recognition. On top of these basic algorithms we have built AR applications such as face beautification, body reshaping, a dance machine and AR scan. SmileAR has been deployed in many iQiyi products, including the iQiyi app, Qibabu, the Gingerbread short video app and the iQiyi livestreamer app.

Figure 2. SmileAR mobile AR solution system framework

Face key point recognition and tracking

For live streaming and video shooting, face key point localization is one of the most important basic capabilities. Face key points are the positions within the face region that reflect expression, including the eyes, mouth, nose and other parts. Accurate key point locations make it possible to implement face reshaping, local beauty makeup, virtual accessories and similar functions. These functions have shipped in the short video camera of the iQiyi app, which can automatically apply effects such as face slimming and eye enlargement and switch between different styles of headwear, so that ordinary users can look like celebrities.




Figure 3. SmileAR face key point algorithm demo

Gesture tracking and recognition

Gesture recognition is an important interaction mode on mobile devices: different gestures can drive different short video interactions. Our gesture recognition uses an SSD detection model with MobileNet as the backbone network, and quantization is applied at deployment time to improve execution speed, which makes real-time gesture detection feasible on mobile devices. Gesture recognition has been deployed in the iQiyi client, iQiyi livestreamer, the Gingerbread short video app and other apps.
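As a rough illustration of what running such a detector looks like, the sketch below loads a quantized SSD-MobileNet gesture model with the TensorFlow Lite Python interpreter and keeps only confident detections. The model file name, preprocessing and output layout are assumptions for illustration, not the actual SmileAR pipeline (which runs through the C++ interface on device).

```python
import numpy as np
import tensorflow as tf

# Hypothetical quantized SSD-MobileNet gesture detector exported to TFLite;
# the file name and tensor layout are assumptions made for this sketch.
interpreter = tf.lite.Interpreter(model_path="gesture_ssd_mobilenet_quant.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def detect_gestures(frame_rgb, score_threshold=0.6):
    """Run one RGB frame through the detector and keep confident detections."""
    h, w = input_details[0]["shape"][1:3]
    img = tf.image.resize(frame_rgb[None, ...], (h, w))
    img = tf.cast(img, input_details[0]["dtype"])
    interpreter.set_tensor(input_details[0]["index"], img.numpy())
    interpreter.invoke()
    # Typical post-processed SSD outputs: boxes, class ids, scores (assumed order).
    boxes = interpreter.get_tensor(output_details[0]["index"])[0]
    classes = interpreter.get_tensor(output_details[1]["index"])[0]
    scores = interpreter.get_tensor(output_details[2]["index"])[0]
    keep = scores > score_threshold
    return boxes[keep], classes[keep], scores[keep]
```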

Figure 4. SmileAR gesture sticker demo

Human body key point recognition

As part of SmileAR, the human body key point recognition algorithm has shipped in the "Cute Baby Dance Room" module of Qibabu, an app designed by iQiyi for children aged 0 to 12. Children imitate the dance moves of professional children's dance coaches, which exercises their coordination. While the child imitates and learns, our algorithm continuously recognizes the child's movements and computes their similarity with the coach's movements to judge how accurate they are; when a movement is accurate enough, a corresponding dynamic effect (such as various fruits) is triggered to keep encouraging the child to learn and improve. Since launch, the feature has been praised by parents as a "clever tool for keeping children happily occupied".
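The article does not spell out the similarity measure, so the following is only a sketch of one common approach: normalize both key point sets for position and scale, then compare them with a cosine-style score. The key point format and threshold are assumptions.

```python
import numpy as np

def pose_similarity(child_kps, coach_kps):
    """Compare two (N, 2) key point sets, e.g. the child's pose vs. the coach's.
    Illustrative only: remove translation and scale, then use cosine similarity."""
    def normalize(kps):
        kps = np.asarray(kps, dtype=np.float32)
        kps = kps - kps.mean(axis=0)            # remove translation
        scale = np.linalg.norm(kps) + 1e-8      # remove scale
        return (kps / scale).ravel()
    a, b = normalize(child_kps), normalize(coach_kps)
    return float(np.dot(a, b))                  # 1.0 means an identical pose

# A dynamic effect could be triggered once the score passes a threshold, e.g.:
# if pose_similarity(child_kps, coach_kps) > 0.9: show_fruit_effect()
```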

Figure 5. SmileAR dance scoring demo

Mobile algorithm optimization

Deep learning inference requires a large amount of computing power, which is exactly what mobile devices such as phones and tablets lack. Moreover, most SmileAR algorithms have to run in real time, which poses great challenges for on-device acceleration and algorithm optimization. Different business scenarios also place different requirements on the algorithms, so we made corresponding optimizations for each of them.

Conventional acceleration optimization

To improve inference speed, we first apply conventional model acceleration methods and replace the complex backbone network with MobileNet V2, which is well suited to mobile devices. For business scenarios with lower accuracy requirements, we further reduce the network input size and the number of channels in the MobileNet V2 network. Together these changes significantly improve execution speed.
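For illustration, shrinking the backbone in this way can be expressed in a few lines with Keras; the input resolution and width multiplier below are example values, not SmileAR's actual settings.

```python
import tensorflow as tf

# Illustrative slimming only: 128x128 input and alpha=0.5 are example values.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(128, 128, 3),   # reduced input resolution
    alpha=0.5,                   # width multiplier: fewer channels per layer
    include_top=False,           # keep only the feature extractor
    weights=None,                # train from scratch for the target task
)
backbone.summary()               # compare parameter count against alpha=1.0
```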

Acceleration through quantization-aware training

Since floating-point computation is slow on mobile devices, we apply TensorFlow's quantization-aware training to our models. Recent versions of TensorFlow support quantization well: adding just two lines to the training code is enough to enable quantization-aware training. We first train the model to convergence in the normal way and then continue with quantization-aware training. The resulting CPU inference speed-ups on mainstream chips such as Qualcomm Snapdragon, Huawei Kirin and MediaTek Helio are shown below:

Figure 6. Time comparison of face detection before and after quantization
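The "two lines" most likely refer to the graph-rewrite calls in the TensorFlow 1.x tf.contrib.quantize API; a minimal sketch is shown below, with an illustrative quant_delay value.

```python
import tensorflow as tf  # TensorFlow 1.x

# Training graph: insert fake-quantization ops; quant_delay lets the float
# model converge for a number of steps before quantization kicks in.
tf.contrib.quantize.create_training_graph(
    input_graph=tf.get_default_graph(), quant_delay=20000)

# Evaluation/export graph (built separately): rewrite it for quantized
# inference before freezing and converting with the TFLite converter, e.g.:
# tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)
```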

Face key point jitter optimization

Key points predicted from a single frame are prone to jitter, so we take multiple frames into account: using the temporal information in the video, the key point information from previous frames is fused with a Gaussian mixture model to obtain stable face key points for the current frame.
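The exact fusion model is not detailed here; as a simplified stand-in, the sketch below smooths the current prediction with Gaussian-weighted history over the last few frames. The window size and sigma are illustrative.

```python
import numpy as np

class TemporalKeypointSmoother:
    """Fuse the current frame's key points with recent history using Gaussian
    weights over time (newest frame weighted highest). A simplified stand-in
    for the fusion described above."""

    def __init__(self, window=5, sigma=1.5):
        self.window = window
        self.sigma = sigma
        self.history = []  # list of (N, 2) key point arrays, oldest first

    def smooth(self, keypoints):
        self.history.append(np.asarray(keypoints, dtype=np.float32))
        self.history = self.history[-self.window:]
        ages = np.arange(len(self.history))[::-1]        # current frame has age 0
        weights = np.exp(-(ages ** 2) / (2 * self.sigma ** 2))
        weights /= weights.sum()
        stacked = np.stack(self.history, axis=0)         # (T, N, 2)
        return np.tensordot(weights, stacked, axes=1)    # weighted average, (N, 2)
```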

Figure 7. Key point stability of the single-frame model versus the temporal model

Multi-task training optimization

To get better results, both the face model and the human body key point model adopt multi-task learning. Because face pose and expression vary greatly, the face model adds auxiliary predictions for pose, expression and other attributes. The model structure is shown below:

Figure 8. Structure of face multi-task model

On the test data, adding the pose estimation task to face key point localization reduced the error by 8%, and adding facial expression recognition reduced the error by 2%. If the human body key point model predicts only heat maps, the resulting point positions deviate substantially, so multi-task learning is introduced into that model as well.
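A multi-task model of this kind is typically trained as a shared backbone with several heads and a weighted sum of per-task losses. The sketch below illustrates the idea for the face model; the head sizes, loss choices and weights are assumptions rather than SmileAR's actual configuration.

```python
import tensorflow as tf

def build_face_multitask_model(backbone, num_kps=106, num_expressions=7):
    """Shared backbone with key point, pose and expression heads (illustrative)."""
    features = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    kps = tf.keras.layers.Dense(2 * num_kps, name="keypoints")(features)
    pose = tf.keras.layers.Dense(3, name="pose")(features)      # yaw, pitch, roll
    expr = tf.keras.layers.Dense(num_expressions, activation="softmax",
                                 name="expression")(features)
    return tf.keras.Model(backbone.input, [kps, pose, expr])

model = build_face_multitask_model(
    tf.keras.applications.MobileNetV2(input_shape=(128, 128, 3),
                                      include_top=False, weights=None))
model.compile(
    optimizer="adam",
    loss={"keypoints": "mse", "pose": "mse",
          "expression": "sparse_categorical_crossentropy"},
    # Auxiliary tasks get smaller weights so key points remain the main task.
    loss_weights={"keypoints": 1.0, "pose": 0.1, "expression": 0.1})
```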

Escaping local optima

The human body key point algorithm uses multi-task learning, and heat map prediction is only one of its tasks. During training, the network sometimes outputs an all-zero heat map for certain key points (it falls into a local optimum, since the cost of an all-zero output is small). To address this, we designed an auxiliary heat map loss function that adds a penalty for all-zero heat maps, which solves the problem of getting stuck in this local optimum.
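One way such a penalty can be formulated is to compare the peak response of each predicted heat map with the ground-truth peak, so that collapsing to an all-zero map stops being cheap. The sketch below is an assumed formulation, not the exact loss used in SmileAR.

```python
import tensorflow as tf

def heatmap_loss_with_antizero_penalty(y_true, y_pred, aux_weight=0.5):
    """Per-pixel heat map loss plus a penalty that punishes all-zero predictions
    for key points that are actually present. Assumed formulation; shapes are
    (batch, height, width, num_keypoints)."""
    mse = tf.reduce_mean(tf.square(y_true - y_pred))

    # Peak response per key point channel: near zero if the map has collapsed.
    pred_peak = tf.reduce_max(y_pred, axis=[1, 2])   # (batch, num_keypoints)
    true_peak = tf.reduce_max(y_true, axis=[1, 2])

    # Only penalize channels whose ground-truth key point exists (peak > 0).
    present = tf.cast(true_peak > 0, y_pred.dtype)
    aux = tf.reduce_sum(present * tf.square(true_peak - pred_peak)) / (
        tf.reduce_sum(present) + 1e-8)

    return mse + aux_weight * aux
```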


Figure 9. Loss curve after adding the auxiliary heat map loss function

Hard Mining

In the face key point localization task, the data distribution is unbalanced: difficult scenes such as large head angles, closed eyes and open mouths have relatively few samples. Once the model has been trained to convergence, hard mining can further reduce false detections and improve the stability of the model.
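One common way to do this offline is to run the converged model over the training set, rank samples by their loss, and oversample the hardest ones in a fine-tuning pass; a minimal sketch follows, with illustrative names and fractions.

```python
import numpy as np

def mine_hard_samples(model, images, targets, hard_fraction=0.2):
    """Return indices of the hardest samples under the converged model.
    Illustrative offline hard mining: rank by per-sample key point error."""
    preds = model.predict(images, batch_size=64)
    per_sample_loss = np.mean((preds - targets) ** 2,
                              axis=tuple(range(1, targets.ndim)))
    k = max(1, int(hard_fraction * len(images)))
    return np.argsort(per_sample_loss)[-k:]          # indices of the top-k hardest

# Fine-tune with the hard subset oversampled alongside the full data, e.g.:
# hard_idx = mine_hard_samples(model, train_images, train_targets)
# model.fit(np.concatenate([train_images, train_images[hard_idx]]),
#           np.concatenate([train_targets, train_targets[hard_idx]]), epochs=2)
```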


Figure 10. Comparison of results before and after hard mining

Mobile Deployment

Cross-platform deployment

We use the C++ interface of TensorFlow Lite for deployment across Android, iOS and Windows. TensorFlow Lite static libraries compiled for each platform are brought into the project, and Objective-C and Java wrappers are then built on top for iOS and Android respectively. This lets the native part of the code be reused and improves cross-platform development efficiency. On Windows we do the same packaging in C++, and TensorFlow Lite has been successfully applied there to face detection and other functions, running in real time on 1080p input. When the SDK is delivered, it only has to be packaged according to each business team's needs, and the business side simply calls the corresponding Java/Objective-C layer API without worrying about the underlying implementation, which makes integration across mobile product lines straightforward. Since TensorFlow Lite runs directly on Windows, we have also applied it in iQiyi's Windows client.

SDK authentication and model protection

For model delivery, the model files are encrypted with iQiyi's in-house encryption algorithm to improve the security of on-device models. To protect iQiyi's intellectual property, we also added license-based authorization verification.

Package size management

In complex mobile network environments, package size can seriously affect download success rates and user experience. We further reduce the size of the TensorFlow Lite library with link-time optimization. TensorFlow Lite registers all operators by default, but we only need a subset of them, so the unneeded operators can be stripped out: when static libraries are linked into dynamic libraries or executables, the linker does not include unused code in the output. To take advantage of this, we subclass MutableOpResolver and register only the operators required by our models. This approach does not touch the TensorFlow Lite static library code, and operators can be added or removed flexibly.

Summary and outlook

Taking SmileAR as an example, this article has introduced how TensorFlow Lite is applied at iQiyi. As an engine tailored for mobile devices, TFLite's cross-platform support (Android, iOS, Windows 64-bit) lets us deploy a single set of models across multiple platforms, greatly reducing R&D costs. Its excellent execution efficiency allows many algorithms to run in real time on mobile devices, providing a strong guarantee for shipping products. Its rich tooling, such as the benchmark tool, can quickly profile model execution time and is a great help for model trimming. With the help of TFLite, SmileAR has shipped successfully on multiple platforms and in multiple apps.

As an important application of iQiyi's AI technology, SmileAR will continue to explore research directions such as temporal stabilization of video information, on-device model acceleration, model accuracy optimization and GPU acceleration on different platforms, backed by the Google TensorFlow ecosystem. It will keep providing efficient and stable AR solutions for all of iQiyi's business lines, helping iQiyi grow and retain users, and letting users conveniently and equally enjoy the immersive AR experiences and delightful entertainment brought by technological innovation.
