Compiled from arXiv by Chenxi Liu et al., Heart of Machine.

In the past, neural network architectures were largely designed by hand, a time-consuming and error-prone process. Neural architecture search (NAS) automates this work and improves model efficiency: in large-scale image classification, automatically found models have already surpassed human-designed ones.

Auto-DeepLab, proposed by Chenxi Liu, Fei-Fei Li, and their collaborators, outperforms many of the industry's best models in semantic image segmentation, and even matches the performance of ImageNet-pretrained models without any pretraining. Auto-DeepLab develops a continuous relaxation of discrete architectures that exactly matches its hierarchical architecture search space, significantly improving the efficiency of architecture search and reducing the compute required.

Deep neural networks have been successful in many artificial intelligence tasks, including image recognition, speech recognition, and machine translation. Although better optimizers [36] and normalization techniques [32, 79] have played an important role, much of the progress comes from the design of neural network architectures. In computer vision, this holds for both image classification and dense image prediction.

Table 1: Comparison between Auto-DeepLab, the model proposed in this study, and other two-level CNN architectures. The main differences are: (1) Auto-DeepLab directly searches a CNN architecture for semantic segmentation; (2) Auto-DeepLab searches both the network-level and the cell-level architecture; (3) Auto-DeepLab completes the search efficiently in 3 days on one P100 GPU.

More recently, with the democratization of AI and AutoML, there has been great interest in automatically designing neural network architectures instead of relying heavily on expert experience and knowledge. Importantly, neural architecture search (NAS) has succeeded in finding network architectures that surpass human-designed ones on large-scale image classification tasks [92, 47, 61].

Image classification is a good starting point for NAS because it is the most fundamental and well-studied high-level recognition task. In addition, the existence of relatively small benchmark datasets (e.g., CIFAR-10) reduces computation and speeds up training. However, image classification should not be the end point of NAS, and its current success suggests it can be extended to more demanding domains. In this paper, we study neural architecture search for semantic image segmentation, an important computer vision task that assigns a label, such as "person" or "bicycle," to each pixel of an input image.

Simply transplanting image classification methods is not enough for semantic segmentation. In image classification, NAS usually exploits transfer learning from low-resolution to high-resolution images [92], whereas an optimal architecture for semantic segmentation must operate natively on high-resolution images. This suggests that this study needs: (1) a more relaxed and general search space to capture the architectural variants brought by higher resolution; (2) a more efficient architecture search technique, because higher resolution requires more computation.

The authors note that modern CNN designs generally follow a two-level hierarchy, in which the outer network level controls changes in spatial resolution and the inner cell level governs the layer-wise computation. The vast majority of current work on NAS follows this two-level design, but automates only the search for the inner cell while hand-designing the outer network. This limited search space is a problem for dense image prediction, which is sensitive to changes in spatial resolution. Therefore, this study proposes a trellis-shaped network-level search space that augments the common cell-level search space first proposed in [92], forming a hierarchical architecture search space. The goal is to jointly learn a good combination of repeatable cell structure and network structure for semantic image segmentation.
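To make concrete how much larger the joint space is than a hand-picked backbone, here is a small illustrative sketch (our own toy calculation, not taken from the paper) that counts candidate resolution paths in a trellis with 12 layers and four downsampling rates, under the simplifying assumption that every path starts at the highest-resolution row and each layer may keep the resolution or move to an adjacent rate:

```python
# Toy path-counting sketch for a trellis-shaped network-level search
# space: 12 layers, 4 candidate downsampling rates (4, 8, 16, 32), and
# each layer may stay at the same rate or move to an adjacent one.
# All numbers here are illustrative assumptions, not the paper's.

L = 12          # number of searched layers
NUM_RATES = 4   # downsampling rates 4, 8, 16, 32

def count_paths():
    # dp[s] = number of paths ending at rate index s; start at rate 4
    dp = [1, 0, 0, 0]
    for _ in range(L - 1):
        dp = [sum(dp[j] for j in range(NUM_RATES) if abs(j - s) <= 1)
              for s in range(NUM_RATES)]
    return sum(dp)

print(count_paths())
```

Even this simplified count yields tens of thousands of network-level candidates before multiplying by the cell-level choices, which is why exhaustive search is infeasible and a differentiable relaxation is used instead.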

In terms of architecture search methods, reinforcement learning and evolutionary algorithms tend to be computationally intensive even on low-resolution datasets such as CIFAR-10, so they are not well suited to semantic image segmentation. Inspired by the differentiable formulation of NAS [68, 49], this study develops a continuous relaxation of discrete architectures that exactly matches the hierarchical architecture search space. The hierarchical architecture search is carried out by stochastic gradient descent. When the search terminates, the best cell architecture is decoded greedily, and the best network architecture is decoded efficiently with the Viterbi algorithm. The authors search directly on 321×321 crops from the Cityscapes dataset. The search is very efficient, taking only 3 days on one P100 GPU.

Experiments were performed on several semantic segmentation benchmark datasets, including Cityscapes, PASCAL VOC 2012, and ADE20K. Without ImageNet pretraining, the best Auto-DeepLab model outperformed FRRN-B by 8.6% and GridNet by 10.9% on Cityscapes. In experiments using the Cityscapes coarse annotations, Auto-DeepLab performed on par with some current best ImageNet-pretrained models. Notably, the best model in this study (without pretraining) performed similarly to DeepLab v3+ (with pretraining), while being 2.23 times faster in Multi-Adds. In addition, Auto-DeepLab's lightweight model was only 1.2% lower in performance than DeepLab v3+, with 76.7% fewer parameters, and was 4.65 times faster in Multi-Adds. On PASCAL VOC 2012 and ADE20K, the best Auto-DeepLab model outperformed many current best models while using minimal data for pretraining.

The main contributions of this paper are as follows:

  • This is one of the first attempts to extend NAS from image classification to dense image prediction.

  • This study proposes a network-level architecture search space that augments and complements the already well-studied cell-level search space, along with the more challenging joint search over network-level and cell-level architectures.

  • This study develops a differentiable, continuous formulation that runs the two-level hierarchical architecture search efficiently, in only 3 days on one GPU.

  • Without ImageNet pretraining, the Auto-DeepLab model significantly outperforms FRRN-B and GridNet on the Cityscapes dataset and is comparable to the current best ImageNet-pretrained models. On the PASCAL VOC 2012 and ADE20K datasets, the best Auto-DeepLab model outperforms several current best models.

Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation



Address: arxiv.org/pdf/1901.02…

Abstract: Recently, neural network architectures found by neural architecture search (NAS) have outperformed human-designed networks in image classification. This paper studies NAS for semantic image segmentation, an important computer vision task that assigns a semantic label to each pixel of an image. Existing work usually searches for a repeatable cell structure while hand-designing the outer network structure that controls changes in spatial resolution. This simplifies the search space, but is problematic for dense image prediction, which admits many network-level architectural variants. Therefore, this study searches the network-level architecture in addition to the cell structure, forming a hierarchical architecture search space. The proposed network-level search space incorporates several popular network designs, and an efficient gradient-based formulation for architecture search is developed (3 days on one P100 GPU for Cityscapes images). This study demonstrates the effectiveness of the method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Without any ImageNet pretraining, the proposed framework achieves the current best performance on semantic image segmentation.

4 Method

This section first introduces continuous relaxation of discrete architectures that precisely match the hierarchical architecture search described above, then discusses how to perform architectural searches through optimization, and how to decode discrete architectures after the search terminates.

4.2 Optimization

The scalars controlling the connection strengths between different hidden states are now part of the differentiable computation graph, so they can be optimized efficiently with gradient descent. The authors adopt the first-order approximation in [49] and partition the training data into two disjoint sets, trainA and trainB. The optimization alternates between:

1. Update the network weights W with ∇_W L_trainA(W, α, β);

2. Update the architecture parameters α, β with ∇_{α,β} L_trainB(W, α, β).

The loss function L is the cross-entropy computed on mini-batches for semantic segmentation.
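The alternation can be sketched with a toy numerical example. Everything below is a stand-in: the quadratic losses replace the actual segmentation cross-entropies on trainA/trainB, and a single scalar α replaces the full (α, β) set; only the alternating update pattern mirrors the method.

```python
# First-order alternating optimization, sketched on scalar stand-ins.
# loss_trainA / loss_trainB are illustrative quadratics, not the real
# segmentation losses; W and alpha are scalars for readability.

def loss_trainA(W, alpha):
    return (W - alpha) ** 2 + 0.1 * W ** 2

def loss_trainB(W, alpha):
    return (W - alpha) ** 2 + 0.1 * alpha ** 2

def num_grad(f, x, eps=1e-6):
    # central-difference gradient, standing in for backprop
    return (f(x + eps) - f(x - eps)) / (2 * eps)

W, alpha, lr = 5.0, -3.0, 0.1
for _ in range(200):
    # step 1: update network weights W on trainA
    W -= lr * num_grad(lambda x: loss_trainA(x, alpha), W)
    # step 2: update architecture parameters on trainB
    alpha -= lr * num_grad(lambda a: loss_trainB(W, a), alpha)

print(loss_trainA(W, alpha))
```

Alternating the two gradient steps drives both losses down jointly, which is the essence of the bilevel first-order scheme.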

4.3 Decoding discrete architecture

Cell architecture

As in [49], this study first retains the two strongest predecessors of each block, and then uses argmax to select the most likely operation on each retained edge, thereby decoding the discrete cell architecture.
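A minimal sketch of this decoding step (the operation names, block counts, and random α values are illustrative; only the "top-2 predecessors, then argmax over operations" rule follows the text):

```python
import numpy as np

# Illustrative operation set; the real candidate set is the paper's.
OPS = ["sep_3x3", "sep_5x5", "atr_3x3", "atr_5x5", "max_pool", "identity"]

def decode_cell(alpha_per_block):
    """alpha_per_block[i]: array (num_predecessors_i, num_ops) of
    architecture weights for block i. Returns, per block, the two
    strongest incoming edges and the argmax operation on each."""
    cell = []
    for alpha in alpha_per_block:
        edge_strength = alpha.max(axis=1)            # best-op weight per edge
        top2 = sorted(np.argsort(edge_strength)[-2:])  # two strongest predecessors
        cell.append([(int(j), OPS[int(alpha[j].argmax())]) for j in top2])
    return cell

rng = np.random.default_rng(0)
alphas = [rng.random((2 + i, len(OPS))) for i in range(3)]  # 3 toy blocks
print(decode_cell(alphas))
```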

Network architecture

Equation 7 essentially says that the "outgoing probabilities" at each blue node in Figure 1 sum to 1. In fact, β can be understood as "transition probabilities" between different "states" (spatial resolutions) at different "time steps" (layers). The goal is to find the path with "maximum probability" from start to end. In the implementation, the authors decode this path efficiently with the classic Viterbi algorithm.
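A self-contained sketch of that Viterbi decoding (the β values here are random placeholders; the four states for the downsampling rates, the 12 layers, and the adjacent-transition mask mirror the trellis of Figure 1):

```python
import numpy as np

def viterbi_path(log_beta):
    """log_beta: (L-1, S, S); log_beta[t, i, j] = log transition prob
    from resolution-state i at layer t to state j at layer t+1.
    Returns the maximum-probability state sequence of length L."""
    steps, S, _ = log_beta.shape
    score = np.full(S, -np.inf)
    score[0] = 0.0                        # all paths start at the top row
    back = np.zeros((steps, S), dtype=int)
    for t in range(steps):
        cand = score[:, None] + log_beta[t]   # (from-state, to-state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in reversed(range(steps)):      # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Random placeholder betas; transitions allowed only between adjacent rates.
rng = np.random.default_rng(1)
L, S = 12, 4
beta = rng.uniform(0.1, 1.0, size=(L - 1, S, S))
adjacent = np.abs(np.subtract.outer(np.arange(S), np.arange(S))) <= 1
log_beta = np.where(adjacent, np.log(beta), -np.inf)
print(viterbi_path(log_beta))
```

Dynamic programming makes this decoding linear in the number of layers, instead of enumerating every candidate path through the trellis.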

Figure 1: The left figure shows the network-level search space with L = 12. Grey nodes represent the fixed stem layers, and a path along the blue nodes represents a candidate network-level architecture. The right figure shows that during the search, each cell is a densely connected structure.

5 Experimental Results

Figure 3: The best network architecture and cell architecture found with the hierarchical neural architecture search method proposed in this study. Gray dashed arrows represent the connection with the maximum β value at each node. "atr" refers to atrous convolution, and "sep" refers to depthwise-separable convolution.

Figure 4: Validation accuracy during the 40-epoch architecture search optimization, over 10 random trials.

Table 2: Results of different Auto-DeepLab model variants on the Cityscapes validation set. F: the filter multiplier controlling model capacity. All Auto-DeepLab models are trained from scratch and use single-scale input at inference.

Table 3: Results on the Cityscapes validation set. Experiments were conducted with different numbers of training iterations (500K, 1M, and 1.5M) and with the Scheduled Drop Path (SDP) method. All models are trained from scratch.

Table 4: Results on the Cityscapes test set with multi-scale input at inference. ImageNet: models pretrained on ImageNet. Coarse: models that also exploit the coarse annotations.

Table 5: PASCAL VOC 2012 validation set results. This study adopts multi-scale inference (MS) and a COCO-pretrained checkpoint (COCO). Without any pretraining, the best model proposed in this study (Auto-DeepLab-L) outperformed DropBlock by 20.36%. None of the models were pretrained on ImageNet.

Table 6: PASCAL VOC 2012 test set results. The proposed Auto-DeepLab-L achieved results comparable to many top models pretrained on ImageNet and COCO.

Table 7: ADE20K validation set results. Multi-scale input is used at inference. † indicates results obtained from the authors' latest model-zoo website. ImageNet: models pretrained on ImageNet. Avg: mean of mIoU and pixel accuracy.

Figure 5: Visualization results on the Cityscapes validation set. The last row shows a failure mode of the proposed approach: the model confuses some of the harder semantic categories, such as people and cyclists.

Figure 6: Visualization results on the ADE20K validation set. The last row shows a failure mode of the proposed approach: the model fails to segment very fine-grained objects (such as chair legs) and confuses difficult semantic categories (such as floor and carpet).