Background

Meituan's growing consumer-facing and merchant-facing businesses have a broad and strong demand for artificial intelligence (AI) technology. On the consumer side, Meituan covers more than 200 life-service scenarios, including in-store consumption, hotels, and travel in addition to food delivery, all of which rely on AI to improve the user experience. On the merchant side, AI helps merchants improve efficiency and understand their operations: for example, fine-grained analysis of user reviews can profile a merchant's service quality, assess its competitiveness, and provide insight into the surrounding business district, yielding refined business suggestions for merchants.

At present, Meituan's AI research and development spans natural language understanding, knowledge graphs, search, speech recognition, speech generation, face recognition, text recognition, video understanding, image editing, AR, environment prediction, behavior planning, motion control, and more. Two key ingredients for deploying AI in these scenarios are large-scale data and advanced deep learning models. Designing and updating high-quality models is a pain point of AI production and development, and automation is urgently needed to assist engineers and improve productivity. The technology that emerged from this need is automated machine learning (AutoML). AutoML is seen as the future of model design, freeing AI algorithm engineers from the trial and error of manual design.

In 2017, Google formally proposed Neural Architecture Search (NAS) [1] for automatically generating model architectures. The technique has attracted high expectations from industry and has become a core part of AutoML. With increasing computing power and continuous iteration of NAS algorithms, NAS has produced a series of highly influential vision architectures such as EfficientNet and MobileNetV3, and it has also been applied in many directions across vision, NLP, and speech [2,3]. As AI that designs AI models, the significance of NAS is self-evident. Meituan has carried out in-depth research on NAS and continues to explore this field actively.

This paper introduces DARTS- [4], a collaboration between Meituan and Shanghai Jiao Tong University accepted at ICLR 2021. ICLR (International Conference on Learning Representations) was founded in 2013 by two deep learning pioneers, Turing Award winners Yoshua Bengio and Yann LeCun. Although only seven years old, ICLR is already widely recognized in academia as a premier conference in deep learning. With an H5 index of 203, it ranks 17th among all scientific publications, ahead of NeurIPS, ICCV, and ICML. A total of 2,997 papers were submitted to ICLR this year and 860 were accepted, including 53 oral papers (about 6% of accepted papers), 114 spotlight papers, and 693 poster papers, for an overall acceptance rate of 28.7%.

Introduction to neural network architecture search

The main task of neural architecture search (NAS) is to find the optimal model within limited time and resources. NAS consists of three parts: the search space, the search algorithm, and model evaluation. NAS was first validated on visual classification tasks. Common search spaces for classification fall into two types: cell-based and block-based. The former features a rich graph structure, with identical cells stacked in series to form the final network; the latter is a straight-through (chain) structure, where the search focuses on choosing the block at each stage.

Classified by search algorithm, NAS methods are mainly based on Reinforcement Learning (RL), Evolutionary Algorithms (EA), or gradient-based optimization. RL methods generate models, evaluate them to obtain a reward, adjust the generation policy according to that feedback, and repeat the cycle until an optimum is reached. EA methods encode the model structure as "genes" that can be crossed over and mutated, producing new generations through genetic operations until the best is found. The advantage of EA is that it can handle multiple objectives at once, such as parameter count, inference latency, and accuracy, so it is well suited to exploration and evolution along several dimensions. However, both RL and EA are time-consuming, with the bottleneck mainly in the model-evaluation step, so they generally rely on proxy training with reduced data or few epochs. The more recent one-shot approach greatly improves the efficiency of NAS by training a supernet that contains all substructures and using it to evaluate every subnet. Building on this idea, the gradient-based DARTS method is even more efficient and has become the mainstream choice among current NAS methods.

DARTS (Differentiable Architecture Search) [5], proposed by Hanxiao Liu of Carnegie Mellon University (CMU), greatly improves search efficiency and is widely recognized by the industry. The differentiable method is based on gradient optimization. It first defines a substructure (cell) as a directed acyclic graph (DAG): the DAG has four intermediate nodes (gray boxes in Figure 1 below), and each edge offers multiple candidate operators (represented by edges of different colors). The outputs of the candidate operators on an edge are combined by a softmax-weighted sum and fed as input to the next node. Stacking such cells forms the backbone of a network. DARTS views the search process as the optimization of this stacked backbone (also called the supernet, or over-parameterized network): each candidate operator on each edge is assigned an architecture weight, and these architecture weights are updated by gradient descent alternately with the network weights. After optimization, the operator with the largest architecture weight on each edge (bold lines) is kept to form the final subnet, which is returned as the search result (Figure 1d shows the final cell structure). This step (from Figure 1c to 1d) rigidly truncates continuous architecture weights to discrete values, for example 0.2 to 1 and 0.02 to 0, which results in the so-called discretization gap.
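
To make the softmax-weighted mixture concrete, below is a minimal PyTorch-style sketch of a single edge. The class name `MixedEdge`, the candidate-operator list, and the initialization scale are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEdge(nn.Module):
    """One DARTS edge: a softmax-weighted sum over candidate operators."""
    def __init__(self, candidate_ops):
        super().__init__()
        # candidate_ops: list of nn.Module, e.g. [conv3x3, conv5x5, maxpool, skip, zero]
        self.ops = nn.ModuleList(candidate_ops)
        # one architecture weight (alpha) per candidate operator
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=-1)        # continuous relaxation
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def derive(self):
        """Discretization: keep only the operator with the largest alpha."""
        return self.ops[self.alpha.argmax().item()]
```

The hard truncation in `derive()` is exactly where the discretization gap arises: a continuous mixture is replaced by a single operator per edge.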

Difficulties of neural network architecture search

The main difficulties to be solved in current neural architecture search can be briefly summarized as follows:

  • Efficiency of the search process: the computational resources and time consumed by the search algorithm should stay within an acceptable range, so that it can be widely used in practice and directly support architecture search on business datasets.
  • Effectiveness of search results: the model found by the search should perform well on multiple datasets and generalize and transfer well across domains; for example, a classification backbone found by search should transfer to detection and segmentation tasks and still perform well.
  • Robustness of search results: beyond being effective, repeated searches should give relatively stable results, improving the reliability of search and reducing the cost of trial and error.

Shortcomings and improvements of the differentiable method

The weakness of differentiable architecture search is its poor robustness: it is prone to performance collapse. In other words, the supernet performs well during search, but the derived subnets contain a large number of skip connections, which severely weakens the final model. Many improvements on DARTS have emerged, such as Progressive DARTS [6], Fair DARTS [7], RobustDARTS [8], and Smooth DARTS [9]. Among them, the ICLR 2020 full-score paper RobustDARTS proposed using the Hessian eigenvalue as an indicator of impending performance collapse, but computing the eigenvalue is very time-consuming. Moreover, the models found by RobustDARTS on CIFAR-10 were not outstanding in the standard DARTS search space. This led us to ask how to improve robustness and effectiveness at the same time. For these two problems, different analyses and solutions exist in the community; representative ones are Fair DARTS (ECCV 2020), RobustDARTS (ICLR 2020), and Smooth DARTS (ICML 2020).

Fair DARTS observed the large number of skip connections and focused on analyzing their possible causes. The paper argues that, during differentiable optimization, skip connections enjoy an unfair advantage in a competitive environment, which lets them win the competition too easily. Fair DARTS therefore relaxes the competitive environment (softmax-weighted sum) into a cooperative one (sigmoid-weighted sum), so that the unfair advantage no longer takes effect. Its final selection rule also differs from DARTS: it uses threshold truncation, for example keeping operators whose architecture weight exceeds 0.8. Skip connections can then coexist with other operators, but this effectively enlarges the search space: in the original formulation, only one operator is selected between two nodes.
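
For contrast with the `MixedEdge` sketch above, here is what the cooperative, sigmoid-based aggregation might look like; the threshold of 0.8 follows the description above, and the rest is an illustrative assumption rather than the Fair DARTS code.

```python
class FairMixedEdge(MixedEdge):
    """Fair DARTS-style edge: independent sigmoid gates remove the zero-sum competition."""
    def forward(self, x):
        gates = torch.sigmoid(self.alpha)              # each operator gated independently
        return sum(g * op(x) for g, op in zip(gates, self.ops))

    def derive(self, threshold=0.8):
        # keep every operator whose gate exceeds the threshold (possibly more than one)
        keep = (torch.sigmoid(self.alpha) > threshold).nonzero(as_tuple=True)[0]
        return [self.ops[i] for i in keep.tolist()]
```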

RobustDARTS (R-DARTS for short) computes Hessian eigenvalues to judge whether the optimization process is collapsing. The paper argues that the loss landscape contains sharp local minima (the point on the right of Figure 5a); the discretization step (from α* to α_disc) shifts the solution from a sharp point with good loss to a region with much worse loss, degrading the final model. R-DARTS found that this process is closely related to the Hessian eigenvalues (Figure 5b). It therefore suggests stopping the optimization when the Hessian eigenvalue changes too much, or using regularization to prevent such large changes.
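
The indicator R-DARTS tracks is the dominant eigenvalue of the Hessian of the validation loss with respect to the architecture weights. As a rough sketch of why this is expensive, it can be estimated with Hessian-vector products and power iteration roughly as follows; the function name and iteration count are illustrative assumptions.

```python
import torch

def dominant_hessian_eigenvalue(val_loss, alphas, iters=20):
    """Estimate the largest eigenvalue of the Hessian of `val_loss` with respect to
    the architecture parameters `alphas` via power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(val_loss, alphas, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v = v / v.norm()
    eigenvalue = None
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. alpha again
        hv = torch.autograd.grad(flat_grad @ v, alphas, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eigenvalue = torch.dot(v, hv).item()           # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eigenvalue
```

Each iteration needs an extra backward pass, which is why tracking this indicator throughout the search adds noticeable overhead.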

Smooth DARTS (SDARTS for short) follows R-DARTS's diagnosis and adopts a perturbation-based regularization to implicitly constrain the Hessian eigenvalues. Specifically, SDARTS applies a certain amount of random perturbation to the architecture weights, making the supernet more robust to interference and smoothing the landscape of the loss function.
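
A minimal sketch of this random-smoothing idea: perturb the architecture weights with small uniform noise before each network-weight update and then restore them. The accessor `model.arch_parameters()` and the noise scale `epsilon` are hypothetical names, not SDARTS's released API.

```python
import torch

def perturbed_weight_step(model, w_optimizer, loss_fn, batch, epsilon=1e-3):
    """One SDARTS-style step: update network weights under randomly perturbed alphas."""
    x, y = batch
    backups = []
    for alpha in model.arch_parameters():              # assumed accessor for all alphas
        backups.append(alpha.data.clone())
        alpha.data.add_(torch.empty_like(alpha).uniform_(-epsilon, epsilon))
    w_optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    w_optimizer.step()
    for alpha, backup in zip(model.arch_parameters(), backups):
        alpha.data.copy_(backup)                       # restore the unperturbed alphas
```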

DARTS-

Analysis of the working mechanism of skip connections

We first analyze performance collapse from the perspective of how skip connections work. ResNet [11] introduced skip connections so that, during back-propagation, shallow layers always receive gradients from deeper layers, alleviating the vanishing-gradient problem. The formula is as follows (i, j, k denote layer indices, x is the input, w are the weights, and f is the computation unit):
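
With the layer update written as $x_k = x_{k-1} + f_k(x_{k-1}, w_k)$, the standard residual derivation gives (reconstructed here; the notation may differ slightly from the paper's):

$$
\frac{\partial \mathcal{L}}{\partial x_i}
= \frac{\partial \mathcal{L}}{\partial x_j}\,\frac{\partial x_j}{\partial x_i}
= \frac{\partial \mathcal{L}}{\partial x_j}\prod_{k=i+1}^{j}\left(1 + \frac{\partial f_k(x_{k-1}, w_k)}{\partial x_{k-1}}\right).
$$

The constant 1 contributed by each skip connection keeps the product from vanishing, so the shallow layer $i$ always receives gradient signal from the deeper layer $j$.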

To clarify the impact of the skip connection on the performance of a residual network, we ran a set of verification experiments on ResNet, attaching a learnable architecture-weight parameter β to the skip connection, so that the gradient calculation becomes the following formula:
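
With the skip branch scaled as $x_k = \beta\, x_{k-1} + f_k(x_{k-1}, w_k)$, the same derivation gives (again a reconstruction consistent with the description above):

$$
\frac{\partial \mathcal{L}}{\partial x_i}
= \frac{\partial \mathcal{L}}{\partial x_j}\prod_{k=i+1}^{j}\left(\beta + \frac{\partial f_k(x_{k-1}, w_k)}{\partial x_{k-1}}\right),
$$

so the closer $\beta$ is to 1, the more directly deep-layer gradients reach the shallow layers.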

In three experiments, β was initialized to 0, 0.5, and 1.0 respectively. In every case β quickly grew toward 1 (Figure 2), increasing the flow of gradients from deep layers to shallow layers and thus alleviating gradient vanishing.

In DARTS, skip connections behave similarly to those in ResNet: when given learnable parameters, their architecture weights show the same tendency to grow, which facilitates training of the supernet. The problem, as pointed out by Fair DARTS [7], is that this gives skip connections an unfair advantage over other operators.

Solution to collapse: adding an auxiliary skip connection

Based on the above analysis, DARTS- points out that the skip connection (Skip in Figure 1 below) plays a dual role:

  • As a candidate operator in its own right, it participates in building the subnet.
  • Together with other operators it forms a residual structure, which promotes the optimization of the supernet.

The first role is the one it is expected to play, competing on an equal footing with other operators. The second role is what gives the skip connection its unfair advantage: it promotes optimization, but it interferes with the inference of the final search result.

To strip off the second role, we propose adding an extra, auxiliary skip connection and decaying its weight β from 1 to 0 (linear decay is used for simplicity), which keeps the supernet and the subnet structurally consistent by the end of search. Figure 1(b) illustrates the connection between two nodes in a cell.
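
Continuing the `MixedEdge` sketch above, the idea can be written as follows; the class name, the linear schedule implementation, and the stride-1 assumption are illustrative, not the released DARTS- code.

```python
class DartsMinusEdge(MixedEdge):
    """DARTS- edge: the usual candidate operators plus an auxiliary skip connection
    whose fixed (not learned) coefficient beta decays from 1 to 0 during search."""
    def __init__(self, candidate_ops, total_epochs):
        super().__init__(candidate_ops)
        self.total_epochs = total_epochs
        self.beta = 1.0                                # auxiliary skip weight

    def set_epoch(self, epoch):
        # linear decay: beta is 1 at epoch 0 and reaches 0 at the last epoch
        self.beta = 1.0 - epoch / max(1, self.total_epochs - 1)

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=-1)
        mixed = sum(w * op(x) for w, op in zip(weights, self.ops))
        # the auxiliary skip sits outside the softmax competition; this sketch
        # assumes the edge preserves the spatial shape (stride-1 edges)
        return mixed + self.beta * x
```

Because β reaches 0 at the end of search, the supernet at that point is structurally identical to a plain DARTS supernet, so the usual discretization can be applied without any extra gap introduced by the auxiliary branch.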

The DARTS- optimization process is much the same as DARTS apart from the auxiliary skip connection. First, the supernet is constructed according to Figure 1(b) and a β decay schedule is chosen. Then the network weights w and the architecture weights α of the supernet are optimized by alternating iterations, as described in Algorithm 1 below.
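
A sketch of the alternating loop with β decay, using a first-order update for α; `train_queue`, `val_queue`, the two optimizers, and the `edges` list of `DartsMinusEdge` modules are assumed to be set up as in standard DARTS training code.

```python
def search(model, edges, w_optimizer, alpha_optimizer, criterion,
           train_queue, val_queue, epochs):
    """Alternate w / alpha updates while linearly decaying the auxiliary skip weight."""
    for epoch in range(epochs):
        for edge in edges:
            edge.set_epoch(epoch)                      # beta: 1 -> 0 over the search
        for (x_train, y_train), (x_val, y_val) in zip(train_queue, val_queue):
            # 1) update architecture weights alpha on validation data
            alpha_optimizer.zero_grad()
            criterion(model(x_val), y_val).backward()
            alpha_optimizer.step()
            # 2) update network weights w on training data
            w_optimizer.zero_grad()
            criterion(model(x_train), y_train).backward()
            w_optimizer.step()
    return [edge.derive() for edge in edges]           # discretize after search
```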

In this method we drop the practice of detecting performance collapse with indicators, such as the Hessian eigenvalues in R-DARTS, while still eliminating the performance collapse of DARTS, hence the name DARTS-. In addition, according to the convergence theory of PR-DARTS [12], the auxiliary skip connection balances the competition among operators, and fair competition among operators is preserved after β decays.

Analysis and validation

Trend of the Hessian eigenvalues

In several search spaces used by R-DARTS and DARTS, we found with DARTS- that subnet performance can improve (Figure 4b) even though the Hessian eigenvalue changes drastically (Figure 4a). This result is a counterexample to the criterion proposed by R-DARTS: relying on that criterion alone, some good models would be missed. It also shows that DARTS- can produce model structures different from those of R-DARTS.

Landscape of validation accuracy

The landscape of validation accuracy reflects, to some extent, how hard the model is to optimize. DARTS (Figure 3a) shows a relatively steep landscape with uneven contour lines near the optimum, while DARTS- shows a gentle, smooth landscape with more uniform contours. A smoother landscape is also less prone to sharp local optima, which in turn reduces the discretization gap to some extent.

The experimental results

Model structure

Figure 9 shows the network structures we obtained in the DARTS search space S0 and the RobustDARTS search spaces S1-S4. Figure 10 shows the result of searching directly on the ImageNet dataset in the MobileNetV2 search space.

Classification results

DARTS- achieves industry-leading results on both of the standard classification datasets CIFAR-10 and ImageNet, as shown in the following table:

In the search spaces S1-S4 proposed by RobustDARTS to test robustness, the models found by DARTS- outperform those of R-DARTS and SDARTS.

NAS algorithm evaluation

NAS-Bench-201 [10] is one of the benchmark tools for evaluating NAS algorithms. On it, DARTS- also achieves better results than other NAS algorithms, coming close to the best model in the benchmark.

Transferability

Used as a backbone network, DARTS-A also outperforms previous NAS models on the COCO object detection task, reaching 32.5% mAP.

Overall, DARTS- inherits the efficiency of DARTS, and its robustness and effectiveness have been demonstrated on standard datasets, NAS benchmarks, and the R-DARTS search spaces, with its domain transfer ability confirmed on detection tasks. This validates the strength of the search method itself. It addresses several problems in current neural architecture search and should promote further research and application of NAS.

Summary and Outlook

DARTS-, accepted at ICLR 2021, re-examines why DARTS search results lack robustness, analyzes the dual role of the skip connection, and proposes separating the two roles by adding an auxiliary skip connection with a decaying coefficient, so that the original skip connection plays only its role as a candidate operator. We also analyze in depth the Hessian eigenvalue indicator that R-DARTS relies on and find counterexamples to its use as a sign of performance collapse. Going forward, DARTS-, as an efficient, robust, and versatile search method, is expected to be extended to other tasks and application areas. Please refer to the original paper for more details; the experimental code is open-sourced on GitHub.

AutoML can be applied to computer vision, speech, NLP, search and recommendation, and other fields. The AutoML algorithm team of the Vision Intelligence Center aims to empower the company with AutoML technology and accelerate the deployment of algorithms. A patent has been filed for this work, and the algorithm has been integrated into Meituan's automated vision platform to speed up the production and iteration of models. Beyond vision scenarios, we will explore applications in search and recommendation, autonomous delivery vehicles, optimization, speech, and other business scenarios in the future.

About the authors

Xiang Xiang, Xiao Xing, Zhang Bo, Xiao Lin, et al., all from the Meituan Vision Intelligence Center.

References

  1. Learning Transferable Architectures for Scalable Image Recognition, arxiv.org/abs/1707.07… .
  2. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection, arxiv.org/abs/1904.07… .
  3. Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation, arxiv.org/abs/1901.02… .
  4. DARTS-: Robustly Stepping out of Performance Collapse Without Indicators, openreview.net/forum?id=KL… .
  5. DARTS: Differentiable Architecture Search, arxiv.org/pdf/1806.09… .
  6. Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation, arxiv.org/pdf/1904.12… .
  7. Fair DARTS: Eliminating Unfair Advantages in Differentiable Architecture Search, arxiv.org/pdf/1911.12… .
  8. Understanding and Robustifying Differentiable Architecture Search, openreview.net/pdf?id=H1gD… .
  9. Stabilizing Differentiable Architecture Search via Perturbation-based Regularization, arxiv.org/abs/2002.05… .
  10. NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search, openreview.net/forum?id=HJ… .
  11. Deep Residual Learning for Image Recognition, arxiv.org/abs/1512.03… .
  12. Theory-Inspired Path-Regularized Differential Network Architecture Search, arxiv.org/abs/2006.16… .


This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit it as "Content reprinted from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to apply for authorization.