Preface:

In this post, we show how to search for a semantic segmentation model based on ProxylessNAS. The resulting model reaches 36 FPS in tests on a CPU, demonstrating the application of Neural Architecture Search (NAS) to semantic segmentation.

With the advent of Neural Architecture Search, deep learning has gradually moved into the stage of automated network design and hyperparameter configuration. This matters especially for putting AI into production, where many models must be deployed on mobile devices. Depending on the target device (GPU, CPU, dedicated chip, etc.) and the model requirements (latency, model size, FLOPs), using NAS to automatically search for an optimal network structure is a promising direction. The previous article introduced the basic framework of NAS and DARTS [1], a must-read for beginners, as well as its application to semantic segmentation. In just the few months since then, the number of NAS papers has grown markedly. On the theory side, the methods for search strategy and performance estimation seem to have stabilized. Notably, RegNet [2] from the FAIR team recently examined search-space design and verified common model-design heuristics one by one through a large number of experiments; following its conclusions, we can narrow the search space and improve search efficiency. On the application side, object detection dominates, with segmentation, re-ID, GANs, and other fields also represented.


NAS is a new technology, but semantic segmentation is well-trodden ground. Since the advent of FCN, SegNet, and UNet, such simple encoder-decoder structures have achieved acceptable results on a wide variety of images, and the DeepLab series represents the peak on open-source datasets. From an academic perspective, semantic segmentation seems to have hit a bottleneck, so researchers have turned to few-shot learning, semi-supervised learning, domain adaptation, point clouds, and other directions. In practice, however, deploying semantic segmentation is very difficult. In real-world scenarios, the common backbones (the ResNet or YOLO series) can handle various object detection tasks but perform poorly on segmentation:

  1. Due to lighting and other factors, the intensity distribution of real-scene images is complex, while segmentation must delineate boundaries precisely, so per-pixel decisions matter greatly. Yet labeling segmentation data is far more expensive than labeling detection data, which means less training data and limited gains from data augmentation and similar tricks.
  2. Segmentation is a pixel-wise task: because it processes every pixel, the model is generally much larger than an object detection model (such models tend to be both deep and wide). If the model must run in real time (>16 FPS), accuracy and speed are bound to conflict. Double kill!
  3. When semantic segmentation is applied to a video stream, the accuracy requirement is even higher. Even if consecutive frames differ by only a few pixels, and even if the mIoU values are identical, the result will not look stable to the human eye: the boundary will "jitter". Triple kill!
  4. When the semantic segmentation model leaves the cloud and is deployed on mobile terminals with limited compute, the underlying chip may not support many operations, so a model that runs happily on a GPU can be knocked back to square one on a CPU. Quadra kill!
Deploying semantic segmentation requires balancing model accuracy and speed, and designing such a network by hand is very difficult. We tried a series of small models such as BiSeNet [3], ShuffleNetv2 [4], and MobileNetv3 [5], but neither the accuracy nor the speed met our requirements. As the saying goes, it is better to rely on oneself than on others; in the end we hoped NAS could automatically find a model that meets the requirements. NAS for semantic segmentation is still in an exploratory stage: existing work runs on GPUs and tries to reduce FLOPs or parameter counts. However, FLOPs and parameter count are not positively correlated with inference speed, and reducing parameters alone cannot satisfy real-time inference. The later FasterSeg [6], which appeared to achieve amazing speed, actually relied on TensorRT for acceleration. This post attempts real-time human segmentation on a CPU, selecting ProxylessNAS as the baseline to search the model structure. The experimental results prove that ProxylessNAS [7] stands the test: the conscience of the industry.

1. Overview of ProxylessNAS

We chose ProxylessNAS [7] not only because it comes from a distinguished group, has open-source code, and stands out from other NAS models in accuracy on the CIFAR-10 and ImageNet datasets, but also because it was an early work to consider hardware performance (speed, model size, parameter count). In addition, unlike the DAG cells searched by the DARTS [1] line of work, the backbone of ProxylessNAS [7] uses a simple chain structure. A chained structure has a significant speed advantage over DAG cells because the connections between operators are simple.


1.1 Super-net setting

We again use the basic NAS framework to break down ProxylessNAS [7].

Figure 1: NAS framework

  • Search space: the operation candidates are blocks from MobileNetv2 [8] with different kernel sizes (3, 5, 7) and expansion rates (3, 6); adding identity and zero operations gives a total of eight ops (cf. Figure 1). The macro structure of the network is a common chain structure for classification, with the 8 candidate ops at every layer (cf. Figure 2). As mentioned above, overly complicated connections between operators slow the network down, and common small models are chain-structured. (A sketch of this candidate set follows this list.)
  • Search strategy: differentiable search has become common in the past two years. Although less stable than RL or evolutionary algorithms, it greatly improves search speed.
  • Performance evaluation: one-shot weight sharing is the most common form of super-net. For teams and individuals with limited computing resources, this approach improves search efficiency and reduces memory footprint.
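To make the candidate set concrete, here is a minimal PyTorch sketch of the eight per-layer ops: six MobileNetv2 blocks (kernel sizes {3, 5, 7} × expansion rates {3, 6}) plus identity and zero. The helper names are my own, not the paper's code, and stride/residual details are omitted.

```python
import torch
import torch.nn as nn

class Zero(nn.Module):
    """Zero op: outputs all zeros, effectively removing the layer."""
    def forward(self, x):
        return torch.zeros_like(x)

def mbconv(c_in, c_out, k, e):
    """MobileNetv2 inverted residual: 1x1 expand -> depthwise kxk -> 1x1 project."""
    c_mid = c_in * e
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
        nn.Conv2d(c_mid, c_mid, k, 1, k // 2, groups=c_mid, bias=False),
        nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
        nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
    )

def candidate_ops(c):
    """8 candidates per searchable layer (identity assumes matching channels)."""
    ops = [mbconv(c, c, k, e) for k in (3, 5, 7) for e in (3, 6)]  # 6 MBConv variants
    return nn.ModuleList(ops + [nn.Identity(), Zero()])
```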
1.2 Super-net training

Super-net parameters consist of two parts: the operation parameters themselves and the architecture weight of each operation (denoted {α, β, σ, …} in Figure 2). The training data is likewise split in two: one part trains the operation weights in the super-net, and the other updates the architecture weights (alphas) of the ops.

  • Training: at the start of each iteration, one operation is randomly activated in each layer (cf. the binary gates in Figure 2). The activated operations connect to form a subnet, whose weights are updated by backpropagation. Inactive ops are not loaded into memory, meaning only one subnet resides in memory during training, which allows the entire search to run on a single GPU.
  • Searching: the weight alpha of each operation represents its importance, i.e. the probability of being selected in the end: probability = Softmax(alphas). In other words, the search process continually updates the alpha weights. As in training, each iteration randomly activates a subnet, but this time the operation weights are fixed and the subnet's alphas are updated by backpropagation. Eq. (4) in the paper gives the calculation: since the binary gate is proportional to the probability, the derivative of the loss with respect to the probability is replaced by the derivative with respect to the binary gate, which has already been computed and saved during backpropagation (the paper does not detail this part; refer to the source code). A sketch of this update follows this list.
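Paraphrasing Eq. (4) of the paper: with p = Softmax(alphas) and binary gates g, the gradient is approximated as ∂L/∂α_i ≈ Σ_j (∂L/∂g_j) · p_j (δ_ij − p_i). Below is a minimal PyTorch-style sketch of this update; the function name and the usage at the bottom are my own assumptions, not the authors' code.

```python
import torch

def alpha_grad(alphas, grad_gates):
    """Approximate d(loss)/d(alpha) as in ProxylessNAS Eq. (4):
    dL/dalpha_i ~= sum_j dL/dg_j * p_j * (delta_ij - p_i),
    where p = softmax(alphas) and dL/dg_j was saved during backprop."""
    p = torch.softmax(alphas, dim=0)           # selection probabilities
    jac = torch.diag(p) - torch.outer(p, p)    # jac[i, j] = dp_j / dalpha_i
    return jac @ grad_gates                    # chain rule through the binary gates

# Hypothetical usage: one layer with 8 candidate ops
alphas = torch.zeros(8)
grad_gates = torch.randn(8)                      # stand-in for saved gate gradients
alphas -= 0.01 * alpha_grad(alphas, grad_gates)  # one gradient step on the alphas
```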
Figure 2 illustrates the architecture of the super-net: the chained-structure searchable backbone (left) and each layer of the searchable backbone (right).


The ProxylessNAS process shown in Figure 2 is, in effect, updating the architecture weights (alphas) while training the operation parameters; finally, Softmax selects the operation with the highest probability in each layer. After reading the paper, I find many points worth learning from, but also some open questions (cf. Table 1).

Table 1 discusses the advantages and remaining issues of ProxylessNAS

2. Real-time Semantic Segmentation using ProxylessNAS on CPU

Although ProxylessNAS still has problems to solve, its single-GPU search and training saves so much time and effort that the flaws do not obscure the merits. With the help of the Intel OpenVINO inference toolkit, this post attempts to use ProxylessNAS to search for a real-time semantic segmentation model that runs on a CPU (x86) for human segmentation. The algorithmic improvements and experimental results are detailed below.


2.1 Super-net setting

  • Search space: when setting up the search space, in the spirit of "brute force works miracles", I threw in the usual operations: MBv3 (3×3), MBv3 (5×5), DilConv (3×3), DilConv (5×5), SepConv (3×3), SepConv (5×5), and ShuffleBlock. Here MBv3 is the basic module from MobileNetv3 [5]; DilConv and SepConv are the dilated separable convolutions and separable convolutions from DARTS [1]; ShuffleBlock is the basic module from ShuffleNetv2 [4]. The first three operation types each come in two kernel sizes. For the macro network structure, we adopt the deeplabv3+ [9] layout (cf. Figure 3): head + searchable backbone + ASPP + decoder. As in UNet, the encoder's feature maps are directly "added" into the decoder; "concatenation" is avoided so that a "wide" model does not slow us down. S2, S4, S8, S16, and S32 denote feature-map resolutions reduced by 2, 4, 8, 16, and 32 times, respectively. As in ProxylessNAS, the super-net parameters consist of two parts: the operation weights themselves and the architecture weights alpha. (A sketch of this macro structure follows this list.)
  • Search strategy: differentiable search, continuing the ProxylessNAS approach
  • Performance evaluation: one-shot weight sharing
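To make this layout concrete, here is a minimal runnable PyTorch sketch of such a super-net skeleton; every module name, channel count, and the simplified candidate set are illustrative assumptions, not our production code.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """One searchable layer: all candidate ops plus their architecture weights."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alphas = nn.Parameter(torch.zeros(len(ops)))  # updated separately, per Eq. (4)

    def forward(self, x):
        idx = torch.multinomial(torch.softmax(self.alphas, 0), 1).item()
        return self.ops[idx](x)  # binary gate: only the sampled op runs

class SegSuperNet(nn.Module):
    """DeepLabv3+-style macro structure: head + searchable backbone + context + decoder."""
    def __init__(self, layers=12, ch=32):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, ch, 3, 2, 1), nn.BatchNorm2d(ch), nn.ReLU())
        candidates = lambda: [nn.Sequential(nn.Conv2d(ch, ch, k, 1, k // 2, groups=ch),
                                            nn.Conv2d(ch, ch, 1)) for k in (3, 5)]
        self.backbone = nn.ModuleList([MixedOp(candidates()) for _ in range(layers)])
        self.context = nn.Conv2d(ch, ch, 3, padding=6, dilation=6)  # stand-in for ASPP
        self.decoder = nn.Conv2d(ch, 2, 1)                          # person/background logits

    def forward(self, x):
        low = self.head(x)
        h = low
        for layer in self.backbone:
            h = layer(h)
        h = self.context(h) + low  # low-level features fused by addition, as in the text
        return self.decoder(h)
```

The MixedOp samples a single op per forward pass, mirroring the binary-gate idea from Section 1.2, and the decoder fuses features by addition rather than concatenation, as described above.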
Figure 3 illustrates the macro-architecture of our super-net (top) and the searchable backbone (bottom)


2.2 Improvements over ProxylessNAS

  • Decoupling the training and searching process: in ProxylessNAS, "training" and "searching" proceed simultaneously. In my experiments I separated them completely: first 50 epochs updating only the operation parameters in the super-net, then updating the alpha weights of the operations after this training. This avoids letting some alphas unduly influence later decisions while the operation parameters are still unstable.
  • Treating latency as a hard constraint: since inference speed is critical and cannot be estimated by simply summing per-op latencies, the inference speed of each randomly activated subnet is measured; if it fails the requirement (e.g. latency > 30 ms), a new subnet is sampled. To some extent, this keeps slow operations from being selected and trained. (A sketch of both changes follows this list.)
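Below is a minimal sketch of the latency-gated search step; supernet.sample_subnet and the optimizer setup are hypothetical stand-ins, and backpropagating directly to the alphas (with operation parameters assumed frozen) is a simplification of the Eq. (4) update described in Section 1.2.

```python
import time
import torch

def measure_latency_ms(subnet, shape=(1, 3, 224, 224), runs=10):
    """Rough wall-clock latency of one subnet on the current device (CPU here)."""
    x = torch.randn(shape)
    subnet.eval()
    with torch.no_grad():
        subnet(x)                          # warm-up
        t0 = time.perf_counter()
        for _ in range(runs):
            subnet(x)
    return (time.perf_counter() - t0) / runs * 1000

# Phase 1 (not shown): train only the operation parameters for 50 epochs.
# Phase 2: freeze operation parameters and update the alphas, resampling
# until the latency hard constraint (e.g. 30 ms) is met.
def search_step(supernet, data, target, loss_fn, alpha_opt, budget_ms=30.0):
    while True:
        subnet = supernet.sample_subnet()            # hypothetical: activate one op per layer
        if measure_latency_ms(subnet) <= budget_ms:  # hard constraint: reject slow subnets
            break
    loss = loss_fn(subnet(data), target)
    alpha_opt.zero_grad()
    loss.backward()                                  # op params frozen, so only alphas update
    alpha_opt.step()
```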
2.3 Experiments

Experimental setup:

  • Task: real-time image segmentation on CPU (x86)
  • DL platform: Intel OpenVINO (software.intel.com/content/www…)
  • Dataset: >20K images, partly from the COCO/PASCAL datasets ("person" category) and partly private data
  • Data augmentation: random crop, cutout, random brightness/contrast adjustment, random Gaussian blur/sharpen
  • Search time: 2 GPU-days on a single K80, covering both training and searching
Experimental results:

Under the same network structure, MobileNetv3 [5] is used as the backbone for comparison; the results are shown in Table 2.

Table 2 illustrates the experimental results

According to the experimental data, MobileNetv3 [5] has roughly half the parameters and FLOPs of our searched backbone, yet the inference speeds on the K80 are very close, while the mIoU differs greatly. Considering accuracy and speed together, the backbone searched by ProxylessNAS [7] is clearly superior to the MobileNetv3 [5] backbone. As the results in Figure 4 show, when the features are more complex, MobileNetv3 [5] performs much worse.

Figure 4 compares the segmentation results of our searched network and MobileNetv3

The model can be converted to OpenVINO format and deployed on a CPU (Intel Core i7-8700), running at about 27 ms per frame (36 FPS). The result is shown in Figure 5.
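For reference, here is a minimal deployment sketch using the 2020-era OpenVINO Inference Engine Python API; the model file names and the input image are placeholders rather than our actual artifacts.

```python
import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# IR files produced by the Model Optimizer from the trained model (names are placeholders)
net = ie.read_network(model="seg_model.xml", weights="seg_model.bin")
input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))
exec_net = ie.load_network(network=net, device_name="CPU")

frame = cv2.imread("person.jpg")
_, _, h, w = net.input_info[input_name].input_data.shape
blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis]  # HWC -> NCHW
result = exec_net.infer({input_name: blob})[output_name]
mask = result.argmax(axis=1)[0]  # per-pixel person/background prediction
```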

Figure 5 shows the segmentation results in real application scenario

Finally, it is time to show off the searched backbone itself. It looks like this:

Figure 6 illustrates the searched backbone structure

3. Future work

The experiments show that the ProxylessNAS search strategy can be transferred from classification to segmentation: at similar speed, the searched network is considerably more accurate than the original MobileNetv3 [5]. Given the current state of affairs, however, it would be rash to say that manually designed models are inferior or will necessarily be replaced (although MobileNetv3 was itself found by NAS). In specific scenarios with specific requirements, designing network structures with NAS is indeed more efficient than manual design plus a large number of tuning experiments, and it has great prospects for putting AI into production. This post is only a preliminary study of ProxylessNAS; the following directions will be explored next.

  • The experimental results show that weight sharing in the super-net is reasonable. However, during structure search, taking the highest-probability operation in each layer as the subnet output is still questionable: because search and training are coupled within the subnets, the operation chosen at each layer is a compromise. The per-layer best operations will be selected, yet their combination may not satisfy the preset hard constraint. There is room for improvement here; for example, one could score the sub-paths formed by the operations of adjacent layers instead of scoring each layer's operations independently.
  • ProxylessNAS was an early work from Song Han's team at MIT; the follow-up OFA (also to be read on one's knees) has since appeared. In OFA, the authors thoroughly separate training from searching and combine it with knowledge distillation: first train a teacher model, then use NAS-style search to find the best student model inside the teacher. OFA can be understood as automated network pruning or automated distillation. If the OFA experiments go well, practical experience with it will be shared later.
  • In the live results of Figure 5, the portrait blends fairly naturally with the background, but semantic segmentation is, in the end, a classification task: each boundary pixel is either foreground or background. To blend naturally with the background, one must estimate the alpha matte, i.e. the transparency of the foreground, which is the domain of background matting; combined with segmentation it gives better results. In fact, the lower image in Figure 5 shows hair that segmentation failed to capture but that is retained in the result, precisely because background matting is used. Matting can not only refine segmentation results but also enable background replacement (cf. Figure 7), Photoshop-style editing, and other features.
In the next article I will share hands-on experience with background matting. Stay tuned.

Figure 7 shows the demo of background matting


References

[1] Liu, Hanxiao, Karen Simonyan, and Yiming Yang. “Darts: Differentiable architecture search.” ICLR (2019).

[2] Radosavovic, Ilija, et al. “Designing Network Design Spaces.” arXiv preprint arXiv:2003.13678 (2020).

[3] Yu, Changqian, et al. “Bisenet: Bilateral segmentation network for real-time semantic segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018.

[4] Ma, Ningning, et al. “Shufflenet v2: Practical guidelines for efficient cnn architecture design.” Proceedings of the European conference on computer vision (ECCV). 2018.

[5] Howard, Andrew, et al. “Searching for mobilenetv3.” Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019.

[6] Chen, Wuyang, et al. “FasterSeg: Searching for Faster Real-time Semantic Segmentation.” ICLR (2020).

[7] Cai, Han, Ligeng Zhu, and Song Han. “Proxylessnas: Direct neural architecture search on target task and hardware.” ICLR (2019).

[8] Sandler, Mark, et al. “Mobilenetv2: Inverted residuals and linear bottlenecks.” Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2018.

[9] Chen, Liang-Chieh, et al. “Encoder-decoder with atrous separable convolution for semantic image segmentation.” Proceedings of the European conference on computer vision (ECCV). 2018.

