Since deep learning was first proposed, it has followed the philosophy of "think big", and since pre-training techniques became widespread, more data combined with more model parameters has continued to improve model performance. This law has been repeatedly confirmed by recently released large models: in 2021, ERNIE 3.0 in Baidu's Wenxin family, MT-NLG jointly launched by Microsoft and NVIDIA, and Google's Switch Transformer all pushed parameter counts into the hundreds of billions or even trillions.

Once a high-performance large model has been obtained, how to connect it with real business becomes particularly important. The current deployment pipeline for pre-trained models can be summarized as: for a specific task with only a small amount of annotated data, fine-tune the pre-trained model on the task data and deploy it online. However, as the parameter counts of pre-trained models grow, this pipeline faces two unavoidable problems. First, the computing resources required to fine-tune a large model become enormous and are usually unaffordable for ordinary developers. Second, with the development of AIoT, more and more AI applications are migrating from the cloud to edge and end devices, and large models cannot be deployed directly on such hardware with extremely limited storage and computing power.

To address these problems, Baidu proposed Unified Feature Optimization (UFO), which makes full use of big data and large models while accounting for the cost and efficiency of deployment. The UFO technical scheme has two main parts:

1. All in One: a multi-task collaborative training scheme for visual representations that removes the downstream fine-tuning step and lets a single model lead overall across multiple core smart-city tasks.

2. One for All: a pioneering supernetwork and training scheme for visual multi-task learning that supports flexible deployment across tasks and hardware, solving the poor inference performance of large models.

All in One:

A more powerful and versatile visual model

Before pre-training became mainstream, visual models were usually produced with a single-task "train from scratch" scheme: each task is trained from zero and nothing can be reused across tasks. Because a single task never has enough data, such models suffer from bias, depend heavily on the distribution of the task data, and often generalize poorly to new scenes. The big-data pre-training technology that has boomed over the past two years learns more general knowledge from a large amount of data and then transfers it to downstream tasks; in essence, different tasks borrow what the others have learned. A pre-trained model built on massive data has good knowledge completeness, so fine-tuning on a small amount of data can still achieve good downstream results. However, this pre-train-then-fine-tune pipeline still trains a separate model for every task, consuming substantial R&D resources.

The UFO All in One model proposed by Baidu is trained with data from multiple tasks and can then handle those tasks directly: cross-task information improves the performance of each single task, and the downstream fine-tuning step disappears. This R&D mode can be applied widely to multi-task AI systems. Taking the smart-city multi-task large model as an example, UFO All in One reaches SOTA accuracy on multiple tasks with a single model, and the multi-task model is clearly better than the corresponding single-task models, demonstrating the effectiveness of the cross-task information-sharing mechanism.
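To make the All in One idea concrete, here is a minimal sketch in PyTorch of joint multi-task training with one shared backbone and one lightweight head per task. Everything here (module names, the toy backbone, the head sizes) is illustrative and assumed, not taken from the UFO release:

```python
import torch
import torch.nn as nn

class AllInOneModel(nn.Module):
    """One shared backbone plus one head per task; gradients from every
    task flow into the shared backbone, which is how cross-task
    information gets exchanged."""
    def __init__(self, tasks, feat_dim=512):
        super().__init__()
        # Toy stand-in backbone; the real model is a large visual network.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # One classification head per task (face / body / vehicle / object).
        self.heads = nn.ModuleDict(
            {t: nn.Linear(feat_dim, n) for t, n in tasks.items()})

    def forward(self, x, task):
        return self.heads[task](self.backbone(x))

tasks = {"face": 1000, "body": 751, "vehicle": 576, "object": 1000}
model = AllInOneModel(tasks)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One joint optimization step over two of the tasks (dummy data):
opt.zero_grad()
for task in ("face", "body"):
    x = torch.randn(4, 3, 224, 224)
    y = torch.randint(0, tasks[task], (4,))
    loss_fn(model(x, task), y).backward()
opt.step()
```

Because every task updates the shared backbone, each task can benefit from the others' data; the real scheme adds the sampling and regularization techniques described in the following sections.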

A single model covering four smart-city tasks

Smart city is one of the most important application scenarios for computer vision today. Its tasks often require processing faces, bodies, vehicles and general objects at the same time, which places very high demands on the multi-task capability of an AI system. Most existing visual models can only detect or identify one type of target. Through the multi-task collaborative learning technology in the UFO scheme, Baidu produced an urban visual UFO model that handles all four task types simultaneously and achieves SOTA on 10 public datasets. The UFO multi-task collaborative training scheme is detailed below.

Task setup and data

To verify the effectiveness of the scheme and allow fair comparison, 10 public datasets were used for training and testing. The statistics of each dataset are shown in the table below:


[Table: statistics of the 10 public datasets]

Unified task configuration

From the perspective of model optimization, different tasks were previously trained with different batch sizes, learning rates and optimizers. To make subsequent multi-task training easier, the UFO scheme unifies the model structure and optimization method across tasks. The task configuration is as follows:


[Table: unified task configuration]

Heterogeneous data sampling strategy and Drop Path regularization technique

The first problem in multi-task learning is how to construct a batch. There are two common approaches. The first builds each batch from a single data domain: all samples in a batch come from the same task, and successive batches rotate through the tasks so that every task gets trained. The second builds each batch from mixed data domains: the samples in one batch come from different tasks. The problem with single-domain batches is that when the model contains the common BatchNorm operation, the statistics seen during training (single-task statistics) differ greatly from those seen during testing (multi-task statistics), which hurts model quality. Experiments with a ResNet50 backbone on the person task (Market1501) and the product task (SOP) show that the mixed-data-domain scheme greatly improves both.
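A minimal sketch of mixed-data-domain batching, assuming each task's dataset is simply a list of samples (the equal-split policy and names are illustrative):

```python
import random

def mixed_domain_batches(task_datasets, batch_size, steps):
    """Yield batches that always mix samples from every task, so the
    BatchNorm statistics accumulated in training match the multi-task
    statistics encountered at test time."""
    tasks = list(task_datasets)
    per_task = batch_size // len(tasks)  # assumes batch_size divides evenly
    for _ in range(steps):
        batch = []
        for t in tasks:
            batch += [(t, s) for s in random.sample(task_datasets[t], per_task)]
        random.shuffle(batch)
        yield batch

# Toy usage: two tasks, batches of 8 with 4 samples from each.
data = {"person": list(range(100)), "product": list(range(100))}
for batch in mixed_domain_batches(data, batch_size=8, steps=2):
    print([t for t, _ in batch])
```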

Of the four tasks, person and object had the smallest training sets, only about 60,000 images each, while face and vehicle had about 5 million and 400,000 images respectively. As a result, during multi-task training the person and object tasks over-fit quickly while the face and vehicle tasks under-fit. To solve the over-fitting caused by this data imbalance, the Drop Path regularization method was used during training, bringing an mAP improvement of 1%~3% on the person and object tasks while the other tasks stayed equal or improved.
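Drop Path (stochastic depth) is a standard technique, so a generic PyTorch sketch should be close to what is used here; the block structure and drop probability are illustrative:

```python
import torch
import torch.nn as nn

def drop_path(x, drop_prob, training):
    """During training, zero the residual branch for a random subset of
    samples and rescale the survivors; at test time it is a no-op."""
    if drop_prob == 0.0 or not training:
        return x
    keep = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining dims.
    mask = x.new_empty((x.shape[0],) + (1,) * (x.dim() - 1)).bernoulli_(keep)
    return x * mask / keep

class BlockWithDropPath(nn.Module):
    """Residual block whose branch is randomly dropped, limiting
    over-fitting on the small person/object training sets."""
    def __init__(self, dim, drop_prob=0.1):
        super().__init__()
        self.fn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                nn.Linear(dim, dim))
        self.drop_prob = drop_prob

    def forward(self, x):
        return x + drop_path(self.fn(x), self.drop_prob, self.training)
```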


Refreshing SOTA on 10 public benchmarks with a single model

Compared with previous single-task SOTA results, the urban-vision All in One UFO model produced by the multi-task collaborative training scheme sets new SOTA on the 10 test sets of the 4 tasks. Meanwhile, compared with single-task models using the same architecture, UFO also performs better on most tasks, again demonstrating the effectiveness of the cross-task information-sharing mechanism.

In the figure above, gray represents the urban-vision All in One UFO model, orange represents single-task results with the same architecture as the UFO model, and blue represents the previous best single-task results on the same datasets. None of these results uses pre-training data or a re-ranking strategy.

One for All:

A flexible and scalable deployment solution



Limited by computing power and storage, large models cannot be deployed directly on edge devices. A model developed for the cloud usually has to be compressed or completely redesigned before it can run on edge or end devices, and compressing a pre-trained large model itself consumes substantial resources.


In addition, different tasks place different requirements on a model's functionality and performance. For example, a face-recognition access-control system only needs face recognition, while the control system of a smart community needs both face recognition and human-body analysis, and some scenes additionally require vehicle-type recognition and license-plate recognition. Even for the same face-recognition task, an access-control system and a financial payment system demand different accuracy and performance from the model. Today these tasks are typically handled by custom-developing several single-task models, and the need to adapt them to different hardware platforms further multiplies the AI development workload.

For the development and deployment of large models, UFO offers a One for All solution by introducing the concept of a supernetwork. A supernetwork consists of many sparse subnetworks, each of which is one path through the supernetwork, and models of different sizes, tasks, functions and accuracies are folded into a single supernetwork training process. Once trained, the One for All UFO supernetwork can generate corresponding plug-and-play small models for different tasks and devices at low cost, realizing One for All Tasks and One for All Chips.



Supernetwork design and training scheme

UFO designs a multi-task, multi-path supernetwork based on the Vision Transformer structure. The supernetwork is divided into a multi-path FFN supernetwork and a scalable self-attention supernetwork. Unlike Google's Switch Transformer, the UFO supernetwork can select different FFN units on different paths; meanwhile, the attention module can be flexibly scaled for different tasks, enlarging the search space, providing more candidate subnetworks for hardware deployment, and improving accuracy.

UFO also designs a dedicated training scheme for the multi-task supernetwork. For the FFN supernetwork module, each block of each task automatically learns weight coefficients for a shared FFN (FFN-shared) and a task-specific FFN (FFN-taskX); the parameters of the shared FFN are updated by all tasks, while a task-specific FFN is updated only by its own task. During FFN supernetwork training, each subnetwork therefore has three path choices per block: the shared FFN, the task-specific FFN, or their weighted combination. In addition, every FFN can choose among different scaling coefficients. The FFN supernetwork thus contains (T × 3 × ratio)^L different FFN paths, where T is the number of tasks, L is the number of layers, and ratio is the number of scaling coefficients; for example, with T = 4 tasks, ratio = 2 and L = 12 layers there would be 24^12 ≈ 3.7 × 10^16 paths. For the self-attention supernetwork, each subnetwork can choose a different number of heads and a different number of block repeats.
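A minimal PyTorch sketch of one such multi-path FFN block, assuming a simple softmax over two learned mixing coefficients; the class and parameter names are illustrative, not from the UFO code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPathFFN(nn.Module):
    """One block of an FFN supernetwork: a shared FFN plus one
    task-specific FFN per task. A subnetwork path picks the shared FFN,
    the task FFN, or their learned weighted combination."""
    def __init__(self, dim, hidden, tasks):
        super().__init__()
        def make_ffn():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
        self.ffn_shared = make_ffn()  # updated by all tasks
        self.ffn_task = nn.ModuleDict({t: make_ffn() for t in tasks})
        # Learned per-task mixing logits (shared vs. task-specific).
        self.alpha = nn.ParameterDict(
            {t: nn.Parameter(torch.zeros(2)) for t in tasks})

    def forward(self, x, task, path="weighted"):
        if path == "shared":
            return self.ffn_shared(x)
        if path == "task":
            return self.ffn_task[task](x)
        w = F.softmax(self.alpha[task], dim=0)  # learned weight coefficients
        return w[0] * self.ffn_shared(x) + w[1] * self.ffn_task[task](x)

block = MultiPathFFN(dim=256, hidden=1024, tasks=["face", "body"])
out = block(torch.randn(4, 197, 256), task="face", path="weighted")
```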

One for All Tasks

Because every task's data participates in supernetwork training and task-specific constraints are imposed on the supernetwork, related tasks share more parameters while interference between unrelated tasks is minimized, yielding an optimal subnetwork model for each task. In a business application, one simply evaluates how different subnetworks perform on the specific task, extracts the structure and parameters of the best one, and deploys it directly without any retraining.
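As a sketch of this extraction step, the snippet below pulls one task's FFN weights out of a trained supernetwork checkpoint into a compact standalone state dict; the key naming convention is hypothetical and would depend on the actual checkpoint format:

```python
import torch

def extract_task_subnet(supernet_state, task, n_layers):
    """Copy only the weights on one task's chosen path (here: the
    task-specific FFN in every layer) into a deployable state dict.
    Layers whose chosen path is the shared FFN would copy the
    'ffn_shared' keys instead."""
    sub_state = {}
    for layer in range(n_layers):
        prefix = f"blocks.{layer}.ffn_task.{task}."
        for key, value in supernet_state.items():
            if key.startswith(prefix):
                new_key = key.replace(f"ffn_task.{task}", "ffn")
                sub_state[new_key] = value.clone()
    return sub_state

# Toy usage with a fake two-layer checkpoint:
fake_ckpt = {f"blocks.{i}.ffn_task.face.0.weight": torch.zeros(4, 4)
             for i in range(2)}
print(extract_task_subnet(fake_ckpt, task="face", n_layers=2).keys())
```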

One for All Chips

Given the different storage capacities and computing power of different hardware platforms, subnetworks of appropriate size and compute cost are selected from the trained UFO supernetwork for deployment. Because the supernetwork contains far too many subnetworks to test each one's accuracy and latency individually, UFO uses GP-NAS [1], a Gaussian-process-based hyperparameter estimation technique: after sampling and evaluating only a small number of subnetworks, it can accurately predict the accuracy and speed of the remaining ones.
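The GP-NAS paper describes its own Gaussian-process formulation; as a generic stand-in for the idea, the sketch below fits an off-the-shelf GP regressor on a few evaluated subnetworks and predicts the accuracy of the rest. The subnetwork encoding and all numbers are fabricated for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# Encode each subnetwork by its per-layer path choice (12 layers,
# choice id 0/1/2), and pretend accuracy was measured for 20 of them.
encodings = rng.integers(0, 3, size=(30, 12)).astype(float)
measured = rng.uniform(0.85, 0.95, size=20)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), normalize_y=True)
gp.fit(encodings[:20], measured)  # only a few evaluated subnetworks

# Predict accuracy (with uncertainty) for the 10 unevaluated subnetworks
# and shortlist deployment candidates without testing every path.
mean, std = gp.predict(encodings[20:], return_std=True)
shortlist = np.argsort(-mean)[:3]
print(shortlist, mean[shortlist])
```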

A single supernetwork supporting flexible multi-task deployment for smart cities

Based on the above scheme, the One for All UFO supernetwork model trained on public data achieves SOTA accuracy on six public test sets across the four smart-city tasks of face, body, vehicle and object. Meanwhile, subnetworks extracted from the UFO supernetwork, with parameter counts reduced by 20%~30%, can still exceed the previous SOTA results.



Conclusion

Although large models keep setting new records and showing impressive results, industry inevitably faces the question of how to put them into practice. The Unified Feature Optimization (UFO) technique proposed by Baidu offers an alternative beyond plain pre-training. At the model-production level, the All in One scheme makes full use of the benefits of big data and large models, integrating multiple tasks into one training framework and one model and improving each task through cross-task information. At the deployment level, the One for All scheme lets a single supernetwork support the adaptive deployment of different task models on different platforms and computing devices, making models plug-and-play.

Currently, the UFO All in One model is available on PaddlePaddle, and the UFO One for All model will be released in the near future. For more UFO technical details, visit: github.com/PaddlePaddl…

References
[1] GP-NAS: Gaussian Process Based Neural Architecture Search, CVPR 2020