Adapted from arXiv by Hanxiao Liu, Karen Simonyan and Yiming Yang.

Finding the optimal neural network architecture usually requires machine learning experts to invest a great deal of time, and recently proposed automatic architecture search methods reduce the human effort but consume a large amount of computing power. The "differentiable architecture search" (DARTS) approach, developed by CMU PhD candidate Hanxiao Liu, DeepMind researcher Karen Simonyan, and CMU professor Yiming Yang, is based on gradient descent over a continuous search space, allowing computers to search for neural network architectures far more efficiently.

According to the researchers, this method achieves state-of-the-art results on both convolutional and recurrent neural networks, while in some cases requiring roughly 700 times less GPU compute than previous search methods, meaning that a single GPU can complete the task. The research paper "DARTS: Differentiable Architecture Search" has attracted the attention of Andrej Karpathy, Oriol Vinyals and other scholars.

DARTS code: github.com/quark0/dart…

Introduction

Finding the optimal neural network architecture requires substantial effort from human experts. Recently, there has been growing interest in developing algorithms that automate the architecture design process. Automated architecture search has achieved highly competitive performance on tasks such as image classification and object detection.

The current best architecture search algorithms deliver superior performance but at a high computational cost. For example, obtaining the state-of-the-art architectures for CIFAR-10 and ImageNet required 1800 GPU days of reinforcement learning (Zoph et al., 2017) or 3150 GPU days of evolutionary search (Real et al., 2018). Many acceleration methods have been proposed, such as imposing a particular structure on the search space (Liu et al., 2017b,a), predicting weights or performance for each individual architecture (Brock et al., 2017; Baker et al., 2018), and sharing weights across multiple architectures (Pham et al., 2018b; Cai et al., 2018), but the fundamental challenge of scalability remains. An inherent cause of inefficiency in the dominant approaches (such as those based on reinforcement learning, evolutionary algorithms, MCTS (Negrinho and Gordon, 2017), SMBO (Liu et al., 2017a) or Bayesian optimization (Kandasamy et al., 2018)) is that architecture search is treated as a black-box optimization problem over a discrete domain, which leads to a large number of architecture evaluations.

In this study, we approach the problem from a different angle and propose an efficient architecture search method called DARTS (Differentiable Architecture Search). Instead of searching over a discrete set of candidate architectures as in previous work, we relax the search space to be continuous, so that the architecture can be optimized by gradient descent with respect to its performance on the validation set. The data efficiency of gradient-based optimization, in contrast to inefficient black-box search, allows DARTS to achieve results competitive with the state of the art while using several orders of magnitude fewer computing resources. It also outperforms another recent efficient architecture search method, ENAS (Pham et al., 2018b). Notably, DARTS is simpler than many existing approaches because it does not involve any controllers (Zoph and Le, 2016; Baker et al., 2016; Zoph et al., 2017; Pham et al., 2018b), hypernetworks (Brock et al., 2017) or performance predictors (Liu et al., 2017a), yet it is generic enough to search both convolutional and recurrent architectures.
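To make the continuous relaxation concrete, below is a minimal PyTorch-style sketch (not the authors' released code) of a single relaxed edge: each edge computes a softmax-weighted mixture of candidate operations, and the mixing parameters become ordinary differentiable parameters. The candidate operation list, the `MixedOp` name, and keeping the architecture parameter inside the module are illustrative assumptions; in the paper the architecture parameters are maintained separately from the network weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical candidate operations for one edge of a cell
# (the real DARTS search space contains more, e.g. dilated and separable convolutions).
def candidate_ops(channels):
    return nn.ModuleList([
        nn.Identity(),                                            # skip connection
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # 3x3 convolution
        nn.MaxPool2d(3, stride=1, padding=1),                     # 3x3 max pooling
    ])

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: a softmax-weighted sum of candidate ops."""
    def __init__(self, channels):
        super().__init__()
        self.ops = candidate_ops(channels)
        # One architecture parameter (alpha) per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)  # mixing probabilities over candidates
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

Because the mixing weights enter the forward pass, gradients of the validation loss flow back into the architecture parameters, which is what makes the search differentiable.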

The idea of searching for architectures in a continuous domain is not new (Saxena and Verbeek, 2016; Ahmed and Torresani, 2017; Shin et al., 2018), but there are several major differences. While previous work attempted to fine-tune specific aspects of an architecture, such as the filter shapes or branching patterns of convolutional networks, DARTS is able to discover high-performance architectures with complex graph topologies within a rich search space. Moreover, DARTS is not restricted to any particular architecture family and can search both convolutional and recurrent networks.

Experimental results show that the convolutional cell designed by DARTS achieves a test error of 2.83 ± 0.06% on CIFAR-10, comparable to the current best result obtained by regularized evolution (Real et al., 2018), which used three orders of magnitude more computing resources. When the same convolutional cell is transferred to ImageNet (mobile setting), it also achieves a top-1 error of 26.9%, comparable to the best reinforcement learning method (Zoph et al., 2017). On the language modeling task, the recurrent cell found by DARTS reaches a perplexity of 56.1 on PTB within a single GPU day of training, outperforming extensively tuned LSTMs (Melis et al., 2017) and all existing automatically searched cells based on NAS (Zoph and Le, 2016) and ENAS (Pham et al., 2018b).

The contributions of this paper are as follows:

  • This paper introduces a novel algorithm for differentiable network architecture search that applies to both convolutional and recurrent architectures.
  • Extensive experiments on image classification and language modeling tasks show that gradient-based architecture search achieves highly competitive results on CIFAR-10 and outperforms the previous state of the art on PTB. This is an interesting result given that the best current architecture search methods use non-differentiable techniques such as reinforcement learning (Zoph et al., 2017) or evolution (Real et al., 2018; Liu et al., 2017b).
  • Remarkable architecture search efficiency (with 4 GPUs: 2.83% error on CIFAR-10 after one day of search; a perplexity of 56.1 on PTB after six hours of search), which the researchers attribute to gradient-based optimization.
  • Demonstrates that the architectures learned by DARTS on CIFAR-10 and PTB can be transferred to ImageNet and WikiText-2, respectively.

Figure 1: Overview of DARTS: (a) The operations on the edges are initially unknown. (b) The search space is made continuous by placing a mixture of candidate operations on each edge. (c) The mixing probabilities and the network weights are jointly optimized by solving a bilevel optimization problem. (d) The final architecture is derived from the learned mixing probabilities.
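The bilevel optimization in step (c) can be approximated by alternating gradient updates: the architecture parameters α are updated with respect to the validation loss, and the network weights w with respect to the training loss. The sketch below shows the cheaper first-order variant described in the paper, assuming two optimizers over disjoint parameter groups (the exact interface is illustrative, not the authors' API).

```python
import torch

def search_step(model, train_batch, val_batch, w_optimizer, alpha_optimizer, criterion):
    """One alternating step of the (first-order) bilevel optimization.

    `w_optimizer` is assumed to hold the network weights w and
    `alpha_optimizer` the architecture parameters alpha.
    """
    x_train, y_train = train_batch
    x_val, y_val = val_batch

    # 1) Update architecture parameters alpha by descending the validation loss.
    alpha_optimizer.zero_grad()
    val_loss = criterion(model(x_val), y_val)
    val_loss.backward()
    alpha_optimizer.step()

    # 2) Update network weights w by descending the training loss.
    w_optimizer.zero_grad()
    train_loss = criterion(model(x_train), y_train)
    train_loss.backward()
    w_optimizer.step()

    return train_loss.item(), val_loss.item()
```

The paper also derives a second-order approximation that takes a virtual gradient step on w before computing the architecture gradient; the sketch above omits it for brevity.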

Figure 3: Search progress of DARTS for convolutional cells on CIFAR-10 and recurrent cells on Penn Treebank. We keep track of the most recent architectures over time. Each architecture snapshot is retrained from scratch on the training set (100 epochs on CIFAR-10 and 300 epochs on PTB) and then evaluated on the validation set. For each task, we repeated the experiment four times with different random seeds and report the median and best validation performance of the architectures over time. For reference, we also report (under the same evaluation settings) the results of the best existing cells discovered by evolution or reinforcement learning, including NASNet-A (Zoph et al., 2017) (1800 GPU days), AmoebaNet-A (3150 GPU days) (Real et al., 2018) and ENAS (0.5 GPU days) (Pham et al., 2018b).

Architecture evaluation

To select the architectures for evaluation, the researchers ran DARTS four times with different random seeds and chose the best cell based on validation performance. This is particularly important for recurrent cells, because the optimization outcome is strongly sensitive to initialization (Figure 3).

Figure 4: Normal cell learned on CIFAR-10.

Figure 5: Reduction cell learned on CIFAR-10.

Figure 6: Recurrent cell learned on PTB.

To evaluate a selected architecture, the researchers randomly initialize its weights (the weights learned during the search are discarded), train it from scratch, and report its performance on the test set. The test set is never used for architecture search or architecture selection.
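As a rough sketch of this protocol, the discrete cell can be read off from the learned mixing probabilities before retraining; the simplified rule below keeps the single strongest candidate per edge (the paper's actual derivation rule is slightly more involved, e.g. it excludes the zero operation) and reuses the `MixedOp` modules from the earlier sketch. The function name is illustrative.

```python
import torch

def derive_discrete_cell(mixed_ops):
    """Read off a discrete architecture from the learned mixing probabilities.

    `mixed_ops` is assumed to be an iterable of MixedOp modules, one per edge;
    each edge keeps the index of its highest-weighted candidate operation.
    """
    genotype = []
    for edge in mixed_ops:
        probs = torch.softmax(edge.alpha.detach(), dim=0)
        genotype.append(int(probs.argmax()))
    return genotype

# The derived cell is then re-instantiated with randomly initialized weights
# (the search-time weights are discarded), trained from scratch on the
# training set, and evaluated exactly once on the held-out test set.
```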

Table 1: Comparison with state-of-the-art image classifiers on CIFAR-10. Results marked with † were obtained by training the corresponding architecture using the setup in this paper.

Table 2: Comparison with state-of-the-art language models on Penn Treebank. Results marked with † were obtained by training the corresponding architecture using the setup in this paper.


DARTS: Differentiable Architecture Search

Paper link: arxiv.org/abs/1806.09…

Abstract: This paper formulates the architecture search task in a differentiable manner to address its scalability problem. Unlike conventional approaches that apply evolutionary algorithms or reinforcement learning over a discrete, non-differentiable search space, our method uses gradient descent to search architectures efficiently, based on a continuous relaxation of the architecture representation. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank, and WikiText-2 show that our algorithm excels at discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than the previous best non-differentiable methods.