During the OpenI/O 2020 Kickstart Developer Conference, Megvii brought together a team of star lecturers to share our explorations and breakthroughs in deep learning with both on-site and online developers. To let more people benefit from the course content, we are compiling its highlights into a collection, and we look forward to discussing it with you.
Today we share the course "Practical Modern Deep Vision Models", delivered by Zhang Xiangyu, head of the Base Model Group at Megvii Research Institute. The lecture focuses on the design of convolutional neural network (CNN) models, the core of deep visual recognition systems. Around the research question of "how to design a good and fast convolutional neural network model", this installment presents the Base Model group's results on efficient model design from three angles: (1) model search, (2) dynamic models, and (3) reparameterization and auxiliary supervision.
At the same time, we provide development practice cases for the relevant research based on MegEngine (Tianyuan). By sharing the Base Model group's academic and practical explorations, we hope to offer the deep learning community more reference and inspiration.
Without further ado, let's get straight to the good stuff.
Development of deep vision models
Base models largely determine how the field evolves, and a good base model has a huge impact on downstream vision tasks. Next, we share some of our explorations in model design, covering both advanced, popular deep vision models and some less well-known but highly practical techniques for business scenarios. We will also share how these models are implemented in MegEngine.
First, the CNN model, which has always been the core of deep visual recognition systems. Its central position shows in two respects: prediction accuracy and inference speed. On prediction accuracy, improvements to the base model bring the most significant and fundamental gains to downstream recognition tasks. On inference speed, CNN model design also plays a decisive role in model deployment.
In recent years, convolutional neural networks have advanced rapidly in prediction accuracy. In 2015 we proposed ResNet, which brought a great leap in prediction accuracy on the ImageNet task. However, as accuracy rises, inference slows down with the growing computational complexity of the model. How to strike a better balance between accuracy and speed has therefore always been a great challenge in model design. In addition, differences across target tasks and hardware platforms, the gap between theoretical complexity and actual speed, and extra platform or business constraints all greatly increase the difficulty of model design.
To solve the problems above and design more practical models, the Base Model group mainly explores six directions: lightweight architectures, model pruning, low-precision quantization, model search, dynamic models, and reparameterization and auxiliary supervision. There are many other ideas, of course, but these six techniques are the ones we use most often in our business, and the MegEngine framework supports all six.
It is important to note that in practical efficient model design, we cannot rely on a single technique; we usually need a combination of techniques to achieve the desired result. Today, we focus on three relatively new ideas: model search, dynamic models, and reparameterization and auxiliary supervision.
Model search
The earliest model search algorithms closely resembled the process researchers go through in manual model design: trial and error. Design a structure, train it to obtain its final predictive performance, use the result to guide the next attempt, and repeat. Because trial and error requires training models repeatedly, the design process is very inefficient. So although model search is a direct way to attack the model design problem, many difficulties remain before model search can genuinely outperform manual model design.
Model search algorithm design can be described by an impossible triangle of efficiency, performance, and flexibility. It is usually difficult for a model search algorithm to satisfy all three at once, so in practice the design must trade them off according to the specific requirements.
How do we challenge the impossible triangle?
The first vertex of the impossible triangle is efficiency, which we improve through weight inheritance. The Base Model group recently proposed Single Path One-Shot NAS, which constructs a supernet containing all candidate subnetworks. Once the supernet is trained, a candidate subnetwork needs no training of its own: its corresponding weights are taken directly from the supernet for evaluation. Whether these inherited weights faithfully represent the performance of a candidate network trained alone is debatable, and it remains an open question under active study in the NAS field. But there is no denying that the parameter-sharing mechanism of weight inheritance greatly reduces the time cost of search.
The second vertex is accuracy. Single Path One-Shot NAS trains the supernet by uniformly sampling a single random path at each step, decoupling the paths as much as possible. This training mechanism makes the performance of a single path within the supernet a better approximation of the performance of the same model trained alone, yielding higher model-evaluation accuracy.
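The uniform single-path sampling scheme can be sketched in a few lines. Here the layer/choice counts are placeholders and a visit counter stands in for an actual gradient step on the sampled blocks:

```python
import random

random.seed(0)

# Hypothetical supernet: each of 4 layers offers 3 candidate blocks
# (e.g. different kernel sizes). Sizes are illustrative.
NUM_LAYERS, NUM_CHOICES = 4, 3

def sample_path(num_layers=NUM_LAYERS, num_choices=NUM_CHOICES):
    """Uniformly sample one block per layer -> a single-path subnetwork."""
    return [random.randrange(num_choices) for _ in range(num_layers)]

def train_supernet(num_steps):
    """Each step updates only the blocks on one uniformly sampled path,
    so all candidate paths share weights but are trained one at a time."""
    visits = [[0] * NUM_CHOICES for _ in range(NUM_LAYERS)]
    for _ in range(num_steps):
        path = sample_path()
        for layer, choice in enumerate(path):
            visits[layer][choice] += 1   # stand-in for a gradient step
    return visits

visits = train_supernet(3000)
```

Because sampling is uniform, every candidate block receives roughly equal training, which is what lets an inherited-weight path approximate a stand-alone model.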
The third vertex is flexibility. At the business level, with a genetic (evolutionary) algorithm as the search framework, different model metrics, such as power consumption or latency limits, can be used as search constraints, enabling very flexible search.
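A minimal sketch of such constraint-aware evolutionary search. The `accuracy` and `latency` functions here are mock stand-ins for evaluating an inherited-weight subnetwork and benchmarking it on the target device; all numbers are invented:

```python
import random

random.seed(0)

NUM_LAYERS, NUM_CHOICES, LATENCY_BUDGET = 4, 3, 9.0

def latency(path):            # pretend a bigger block index = a slower block
    return sum(1.0 + c for c in path)

def accuracy(path):           # pretend bigger blocks are more accurate
    return sum(path) / (NUM_LAYERS * (NUM_CHOICES - 1))

def mutate(path):
    p = list(path)
    p[random.randrange(len(p))] = random.randrange(NUM_CHOICES)
    return p

def evolve(generations=30, pop_size=16):
    pop = [[random.randrange(NUM_CHOICES) for _ in range(NUM_LAYERS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Constraint first: only paths within the latency budget compete.
        feasible = [p for p in pop if latency(p) <= LATENCY_BUDGET] or pop
        feasible.sort(key=accuracy, reverse=True)
        parents = feasible[:max(2, pop_size // 4)]
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(pop_size - len(parents))]
    return max((p for p in pop if latency(p) <= LATENCY_BUDGET), key=accuracy)

best = evolve()
```

Swapping the constraint for power consumption, memory, or any other measurable quantity requires no change to the search loop, which is what makes this framework flexible.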
So what exactly can Single Path One-Shot NAS do? It can search structural units, search channel numbers, and search mixed-precision quantization. For quantization, Single Path One-Shot NAS is used to search the bit width of each layer, achieving a better quantization result.
Although Single Path One-Shot NAS has been around for two years and may look old, its strong extensibility has let us adapt it to many different tasks. In the fast-moving wave of deep learning, very few things are still effective two years later. For example, for object detection we proposed DetNAS. For channel pruning we proposed MetaPruning last year, and joint multi-dimension pruning this year, which jointly searches network input size, channel numbers, and depth. This search can be done in a single training run and is a very efficient, multi-dimensional joint search.
As mentioned above, the weights a subnetwork inherits from the supernet cannot fully represent its final prediction performance. One reason is that the search space is too large; search-space shrinking is a common and important technique for narrowing the gap between a model's performance under weight inheritance and its final trained performance. Many methods have been derived in this area, and we proposed the AngleNAS search-space shrinking method. AngleNAS looks at how fast different paths' parameters are updated: the faster a path's weights move, the more important the path is considered. By shrinking the search space with AngleNAS, we further improved the search performance of Single Path One-Shot NAS.
Dynamic models
A dynamic model is a data-adaptive model. Once an ordinary model is trained, its inference computation is fixed regardless of the input. In a dynamic model, different inputs change the weights the model uses, the inference path it takes, and even architectural hyperparameters such as width and depth.
Dynamic models are essentially a space-for-time technique. As shown in the figure below, we use a forest-like model to illustrate the idea. The tree has many branches; the whole tree is complex, but each individual branch is efficient. One branch is chosen as the inference model according to the characteristics of the input, so the inference process changes dynamically. The goal is to select inference models of different computational complexity for different inputs, further speeding up inference while preserving prediction accuracy. This is the basic idea of dynamic models.
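The branch-selection idea can be illustrated with a toy gate; the branches and the gating rule below are invented for illustration, not taken from any of our models:

```python
import numpy as np

# Toy "tree of branches": each branch stands in for a subnetwork of
# different capacity; only the chosen branch's compute is paid at inference.
branches = [lambda x, s=s: s * x for s in (0.5, 1.0, 2.0)]

def gate(x):
    """A cheap gate routes low-energy ("easy") inputs to cheaper branches."""
    energy = float(np.mean(np.abs(x)))
    return 0 if energy < 0.5 else (1 if energy < 1.0 else 2)

def dynamic_forward(x):
    return branches[gate(x)](x)
```

The memory cost of storing all branches is the "space" traded for the "time" saved by running only one of them per input.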
According to the dimension along which they vary, we introduce three research ideas for dynamic models. The first is dynamic routing: in our CVPR 2020 work, different inference paths are chosen for different inputs to achieve efficient inference.
The second idea is dynamic weights: for different inputs, the parameters in the convolution kernels differ. The third is dynamic architecture hyperparameters: the model adaptively chooses the most suitable network form based on the input.
In practice all of these are good ideas, but we prefer dynamic weights. With dynamic weights, only the kernel parameters change while the inference path stays the same, so the resource consumption and latency of each inference are relatively stable, which makes this approach the most commonly used in business.
WeightNet is an example of the dynamic-weight idea.
In model design there are two very important techniques. One is SENet, which can be regarded as a form of channel attention. The other is CondConv, proposed by Google last year and initially called Soft Conditional Computation; its basic idea is to treat the convolution kernel as a mixture of many experts. So Megvii asked: can we merge these two into a unified framework?
After a series of mathematical derivations, we found a very simple model that unifies them. An ordinary Conv layer has fixed parameters, but we introduce two computational branches to make the weights dynamic: a fully connected branch predicts weighting coefficients for the convolution kernel, these are combined with the learned kernel branch to obtain data-dependent kernel parameters, and finally the weighted kernel performs the actual convolution.
We call the second FC layer of this fully connected branch a grouped FC. When the group number is very small, or the layer is not grouped at all, the framework reduces to Google's CondConv; when the group number is very large, it reduces to SENet. We find that the optimal group number lies between these two extremes: by tuning it to an intermediate value, inference can be both fast and accurate.
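A toy NumPy sketch of grouped-FC weight generation; the shapes, the sigmoid code, and the layout are illustrative stand-ins, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

C_IN, C_OUT, K = 8, 8, 3     # generated conv kernel: C_OUT x C_IN x K x K
CODE = 16                     # length of the attention code (hypothetical)

def dynamic_kernel(x_pooled, w_fc, w_grouped, groups):
    """x_pooled: (C_IN,) globally pooled features for one sample.
    w_fc: (CODE, C_IN). w_grouped: list of `groups` matrices, each
    (kernel_size // groups, CODE // groups)."""
    code = 1.0 / (1.0 + np.exp(-(w_fc @ x_pooled)))   # sigmoid attention code
    kernel = np.zeros(C_OUT * C_IN * K * K)
    out_per_g = kernel.size // groups
    code_per_g = CODE // groups
    # Grouped FC: each output group only sees its own slice of the code.
    # groups=1 -> every kernel entry depends on the whole code (CondConv-like);
    # many groups -> each slice scales its own block of weights (SENet-like).
    for g in range(groups):
        seg = w_grouped[g] @ code[g * code_per_g:(g + 1) * code_per_g]
        kernel[g * out_per_g:(g + 1) * out_per_g] = seg
    return kernel.reshape(C_OUT, C_IN, K, K)
```

Different inputs produce different codes and hence different kernels, which is the essence of the dynamic-weight idea.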
Experiments show that at the same FLOPs or parameter count, this attention mechanism performs best, while SENet and CondConv also bring clear improvements.
Another example of a dynamic network is FReLU, which we proposed this year. Its experimental results are better than most previous activation functions. It uses a very small convolution and a max operation, avoiding extra performance overhead while providing better predictive performance. So how are WeightNet and FReLU implemented in the MegEngine framework? Click the video link below for detailed answers: mp.weixin.qq.com/s/S5rNebrB8… (please view the video in the official WeChat account)
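FReLU itself is simple enough to sketch directly: y = max(x, T(x)), where T is a per-channel (depthwise) spatial condition, here a 3x3 depthwise convolution. A plain NumPy version with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_conv3x3(x, kernels):
    """x: (C, H, W); kernels: (C, 3, 3). 'Same' padding, stride 1."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.empty_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * kernels[c])
    return out

def frelu(x, kernels):
    """FReLU: elementwise max of x and its depthwise spatial condition T(x)."""
    return np.maximum(x, depthwise_conv3x3(x, kernels))

x = rng.standard_normal((2, 4, 4))
kernels = rng.standard_normal((2, 3, 3))
y = frelu(x, kernels)
```

With all-zero kernels T(x) = 0, so FReLU degenerates exactly to ReLU; the learned spatial condition is what gives it extra capacity.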
Reparameterization and auxiliary supervision
Reparameterization is an area that is just beginning to catch on. What is reparameterization? The model uses different network structures and parameters in the training and inference stages; after training, the model is transformed from the structure and parameters used for training into those used for inference. The constraint on this transformation is that the two forms must be mathematically equivalent as functions. Generally, a reparameterized model structure should satisfy two conditions: the training-stage form should be easy to optimize, and the inference-stage form should be efficient to deploy. How can we get the benefits of both? The most typical case is BatchNormalization.
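BatchNorm folding is easy to verify numerically. In inference mode, BN is a fixed per-channel affine transform, so it can be absorbed into the preceding layer's weights and bias. A sketch with a linear layer standing in for the convolution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inference-mode BN after a linear layer:
#   y = gamma * (Wx + b - mean) / sqrt(var + eps) + beta
#     = (s * W) x + (s * (b - mean) + beta),  where s = gamma / sqrt(var + eps)
def fold_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    s = gamma / np.sqrt(var + eps)
    return s[:, None] * W, s * (b - mean) + beta

C_OUT, C_IN = 4, 6
W = rng.standard_normal((C_OUT, C_IN))
b = rng.standard_normal(C_OUT)
gamma, beta = rng.standard_normal(C_OUT), rng.standard_normal(C_OUT)
mean, var = rng.standard_normal(C_OUT), rng.random(C_OUT) + 0.5

x = rng.standard_normal(C_IN)
y_ref = gamma * (W @ x + b - mean) / np.sqrt(var + 1e-5) + beta  # layer + BN
Wf, bf = fold_bn(W, b, gamma, beta, mean, var)                   # folded layer
```

The folded layer computes exactly the same function with one operation fewer at inference, which is precisely the reparameterization contract: easy-to-optimize at training time, cheap at deployment.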
Auxiliary supervision uses additional auxiliary loss functions to achieve higher performance; these losses exist only during training. What reparameterization and auxiliary supervision have in common is that neither adds any cost at inference time; both are training-stage-only techniques, which makes them very attractive in real model training and production.
For reparameterization and auxiliary supervision, we present several relatively new methods: the reparameterization methods are ACNet and RepVGG, and the auxiliary supervision method is LabelEnc.
ACNet
ACNet holds that different parameters within a trained convolution kernel have different importance; for a 3x3 kernel, the center pixel matters most. The training-stage structure therefore uses three parallel branches: the first a 3x3 convolution, the second a 1x3, and the third a 3x1, with the three outputs summed. At inference, the three convolutions are merged into a single 3x3 convolution, which adds no complexity compared with a plain 3x3 yet achieves a clear performance gain over training with the 3x3 alone.
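The fusion is pure kernel arithmetic: embed the 1x3 and 3x1 kernels into the middle row and column of the 3x3 kernel and add. A single-channel NumPy check (BatchNorm folding omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Single-channel 'same' convolution; x: (H, W), k has odd sizes."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def fuse_acnet(k33, k13, k31):
    """Embed the 1x3 / 3x1 kernels into the centre row / column of the 3x3
    kernel and sum; the fused kernel reproduces the three-branch sum."""
    fused = k33.copy()
    fused[1, :] += k13[0]        # 1x3 -> middle row
    fused[:, 1] += k31[:, 0]     # 3x1 -> middle column
    return fused

x = rng.standard_normal((5, 5))
k33 = rng.standard_normal((3, 3))
k13 = rng.standard_normal((1, 3))
k31 = rng.standard_normal((3, 1))
branch_sum = conv2d(x, k33) + conv2d(x, k13) + conv2d(x, k31)
fused_out = conv2d(x, fuse_acnet(k33, k13, k31))
```

Because convolution is linear in the kernel, the fused 3x3 convolution is mathematically identical to the three-branch sum, so nothing is lost at inference.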
RepVGG
RepVGG has many advantages. The inference-stage network is a plain stack of 3x3 convolutions; removing the shortcut yields a single-path structure that saves runtime memory, is memory-access friendly, and eases cross-layer data flow. The original VGG was no match for ResNet on a variety of computer vision tasks, so to improve prediction performance, 1x1 conv and identity auxiliary branches are added to the VGG backbone during training to ease optimization. Since both the 1x1 conv and the identity are linear operations, they can be merged into the 3x3 convolution at inference, eliminating the overhead of the extra branches and delivering accuracy gains at no inference cost.
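The merge again reduces to kernel arithmetic: the 1x1 branch lands in the centre of the 3x3 kernel, and the identity branch adds 1 there. A single-channel sketch (BatchNorm folding omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Single-channel 'same' 3x3 convolution; x: (H, W), k: (3, 3)."""
    xp = np.pad(x, 1)
    H, W = x.shape
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

def fuse_repvgg(k33, k11):
    """Fold the 1x1 branch and the identity branch into one 3x3 kernel
    (single channel; for multi-channel the identity is a per-channel delta)."""
    fused = k33.copy()
    fused[1, 1] += k11 + 1.0     # 1x1 -> kernel centre; identity -> centre + 1
    return fused

x = rng.standard_normal((6, 6))
k33 = rng.standard_normal((3, 3))
k11 = rng.standard_normal()
branch_sum = conv2d(x, k33) + k11 * x + x     # 3x3 + 1x1 + identity branches
fused_out = conv2d(x, fuse_repvgg(k33, k11))
```

The training-time three-branch block and the inference-time single 3x3 convolution compute the same function, which is why the deployment network can be a pure VGG-style stack.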
LabelEnc
The core idea of this method is to supervise the backbone directly when training an object detection model, so that the backbone no longer depends entirely on the detection head. The first step is to learn a LabelEncoding Function that maps the labels back into the feature space. The second step is to use the LabelEncoding Function to supervise the network's intermediate features, training the detection model end-to-end with a method similar to model distillation.
Two-stage training:
Step 1: learn the LabelEncoding Function;
Step 2: train the detection model end-to-end with a method similar to model distillation.
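A toy version of the auxiliary-supervision loss in the spirit of LabelEnc; the encoder, the shapes, and the weighting are invented stand-ins, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def label_encode(labels, W_enc):
    """Hypothetical LabelEncoding Function: maps labels into feature space."""
    return labels @ W_enc                       # (N, D) pseudo label features

def total_loss(det_loss, feat, labels, W_enc, alpha=0.1):
    """Detection loss plus an MSE pulling backbone features toward the
    label-encoded features; the aux term exists only at training time."""
    aux = np.mean((feat - label_encode(labels, W_enc)) ** 2)
    return det_loss + alpha * aux

N, L, D = 4, 10, 16
labels = rng.random((N, L))
W_enc = rng.standard_normal((L, D))
feat = label_encode(labels, W_enc)              # perfectly aligned features
```

At inference the encoder and the auxiliary term are simply dropped, so, like reparameterization, the technique adds no deployment cost.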
Finally, you can click the links below for detailed explanations of some of the technologies and algorithms mentioned above.
Technology and algorithm details:
- WeightNet: ECCV 2020, a flexible and efficient network framework for weight generation
- FReLU: ECCV 2020, a novel activation function that significantly outperforms ReLU on visual tasks
- LabelEnc: ECCV 2020, a new intermediate supervision method for improving object detection
Open source model + code:
- Github.com/megvii-mode…
- megengine.org.cn/model-hub/