Taking You Through the Computer Vision Paper Series: ResNet and ResNeXt

ResNet is strong!

ResNet was released in 2015, yet many computer vision tasks still use it as a backbone (especially as the standard baseline in experimental comparisons), and residual modules now appear in a great many networks.

Deep Residual Learning for Image Recognition

https://arxiv.org/abs/1512.03385

Code:

pytorch:https://github.com/fastai/fastai
tensorflow:https://github.com/tensorflow/models/tree/master/research/deeplab

Highlights in the network:

  • Ultra-deep network structure (more than 1,000 layers)
  • Proposes the residual module
  • Uses Batch Normalization to accelerate training (dropping Dropout)

The introduction

Given the importance of depth, a question arises: is learning better networks as easy as stacking more layers? One obstacle to answering it is the notorious problem of vanishing/exploding gradients, which hampers convergence from the very beginning. This problem, however, has been largely addressed by normalized initialization, intermediate normalization layers, and batch normalization, which allow networks with tens of layers to start converging under stochastic gradient descent (SGD) with back-propagation. (Note that ResNet is not about solving the vanishing/exploding gradient problem.)

When deeper networks can begin to converge, a degradation problem is exposed: as network depth increases, accuracy saturates and then degrades rapidly. Unexpectedly, this degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error. In this paper we introduce a deep residual learning framework to address the degradation problem. Instead of expecting every few stacked layers to directly fit the desired underlying mapping, we explicitly let these layers fit a residual mapping. We hypothesize that the residual mapping is easier to optimize than the original, unreferenced mapping. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers.

Identity Shortcut Connections add neither additional parameters nor computational complexity.

We find that: 1) our extremely deep residual networks are easy to optimize, whereas the corresponding "plain" networks (simple stacks of layers) exhibit higher training error as depth increases; 2) our deep residual networks easily gain accuracy from greatly increased depth, producing results substantially better than previous networks.

Residual representations. In image recognition, VLAD is a representation that encodes residual vectors with respect to a dictionary, and the Fisher vector can be seen as a probabilistic version of VLAD. Both are powerful shallow representations for image retrieval and classification. For vector quantization, encoding residual vectors has been shown to be more effective than encoding the original vectors.

In low-level vision and computer graphics, to solve partial differential equations (PDE), the widely used Multigrid method reconstructs the system into subproblems at multiple scales, where each subproblem is responsible for residuals at both coarser and finer scales. An alternative to Multigrid is hierarchical base preprocessing, which relies on variables representing residual vectors between two scales. It has been shown that these solvers converge faster than standard solvers that do not know the residual properties of the solution. These methods show that good refactoring or preprocessing can simplify optimization.

Residual learning

This reformulation is motivated by a counterintuitive aspect of the degradation problem. If the added layers can be constructed as identity mappings, a deeper model should have a training error no greater than that of its shallower counterpart. The degradation problem suggests that solvers may have difficulty approximating identity mappings with multiple nonlinear layers. With the residual reformulation, if identity mappings are optimal, the solver can simply drive the weights of the multiple nonlinear layers toward zero to approach the identity mapping.

In practice, identity mappings are unlikely to be optimal, but our reformulation may help precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the small perturbations around the identity than to learn the function from scratch. Experiments show that the learned residual functions generally have small responses, suggesting that identity mappings provide reasonable preconditioning.

Identity mappings are sufficient to address the degradation problem, so the projection Ws (a 1×1 convolution) is used only to match dimensions. The form of the residual function F is flexible.

ResNet introduces the residual network: a shortcut connection that feeds the input forward to the output of a stack of layers (the addition layer), somewhat like a "short circuit" in an electrical circuit. The shortcut is the identity mapping y = x. Where the original network learns the input-to-output mapping H(x), the residual network instead learns F(x) = H(x) − x (see the residual building block figure in the paper). The authors point out that deeper networks generally have higher training error than shallower ones; yet a shallow network extended with identity-mapping layers (y = x) becomes a deep network that can match the training error of the shallow one. This suggests that layers which only need to realize an identity mapping should be easy to train.

Assume that, for the residual network, when the residual is zero the addition layer performs only the identity mapping; by the reasoning above, network performance should then, in theory, at least not degrade. This was the authors' insight, and the final experiments show that residual networks are indeed very effective.
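To make the F(x) = H(x) − x formulation concrete, here is a minimal PyTorch sketch of a basic residual block. It is only an illustration (channel counts and names are my own, not code from the repositories linked above):

```python
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Two 3x3 conv layers whose output is added to the identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # first layer of the residual branch F(x)
        out = self.bn2(self.conv2(out))        # second layer, activation deferred
        return F.relu(out + x)                 # H(x) = F(x) + x, then ReLU
```

Because the shortcut is a pure identity, the block adds no extra parameters or computation.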

But why is residual learning easier? Intuitively, there is less to learn: the residual is generally small, so it is easier to fit. We can also look at the question mathematically. A residual unit can be written as

y_l = h(x_l) + F(x_l, W_l),    x_{l+1} = f(y_l),

where x_l and x_{l+1} are the input and output of the l-th residual unit (each unit generally contains several layers), F is the residual function to be learned, h(x_l) = x_l is the identity mapping, and f is the ReLU activation. Treating f as an identity as well gives x_{l+1} = x_l + F(x_l, W_l), and unrolling from a shallow unit l to a deeper unit L yields

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i).

Applying the chain rule, the gradient in the backward pass is

∂loss/∂x_l = (∂loss/∂x_L) · (∂x_L/∂x_l) = (∂loss/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i)).

The first factor is the gradient of the loss with respect to x_L. The 1 inside the parentheses means the shortcut propagates this gradient without attenuation, while the remaining residual gradient has to pass through the weighted layers and is not transmitted directly. Since the residual gradients will not all happen to be exactly −1, and even when they are small the 1 keeps the total from vanishing, residual learning is easier.

The network architecture

Plain networks. The convolutional layers mostly use 3×3 filters and follow two simple design rules: (i) for the same output feature-map size, the layers have the same number of filters; (ii) if the feature-map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. Downsampling is performed directly by convolutional layers with stride 2.

Compared with VGG network, our model has fewer filters and lower complexity. Our 34-layer benchmark has 3.6 billion flops, only 18% of VGG-19 (19.6 billion flops).

We use batch normalization (BN) after each convolution and before activation.

Figure (from the paper): deeper residual functions F for ImageNet. Left: a building block of ResNet-34 (on 56×56 feature maps). Right: a "bottleneck" building block for ResNet-50/101/152.

Training: each image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation. A 224×224 crop is randomly sampled from the image or its horizontal flip, with the per-pixel mean subtracted, and standard color augmentation is used. Weights are initialized and all plain/residual networks are trained from scratch. We use SGD with a mini-batch size of 256. The learning rate starts at 0.1 and is divided by 10 when the error plateaus; models are trained for up to 60×10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. Dropout is not used.
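A minimal sketch of that training recipe in PyTorch, assuming `model`, `train_loader`, `validate()` and `num_epochs` exist elsewhere (they are placeholders, not names from the paper or the linked repositories):

```python
import torch
import torch.nn as nn

# Placeholders assumed to exist: model, train_loader, validate(model) -> validation error.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 whenever the validation error plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    model.train()
    for images, targets in train_loader:      # mini-batch size 256 in the paper
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
    scheduler.step(validate(model))           # plateau detection drives the lr drops
```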

For comparative studies we adopt the standard 10-crop test. For best results, we use the fully convolutional form and average the scores over multiple scales (images are resized so that the shorter side lies in {224, 256, 384, 480, 640}).
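For reference, torchvision provides a TenCrop transform that matches this 10-crop protocol; the snippet below is only a sketch of averaging the crop scores at test time (`model` and `img` are assumed to exist):

```python
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # 4 corners + center, plus their horizontal flips
    transforms.Lambda(
        lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

# crops = ten_crop(img)                                   # shape (10, 3, 224, 224)
# with torch.no_grad():
#     scores = model(crops).softmax(dim=1).mean(dim=0)    # average the 10 predictions
```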

Table (from the paper): error rates (%) of single-model results on the ImageNet validation set (except results reported on the test set). Table: error rates (%) of ensembles; top-5 error on the ImageNet test set as reported by the test server.

Figure (from the paper): training on CIFAR-10. Dashed lines denote training error, bold lines denote test error. Left: plain networks (the plain-110 has error above 60% and is not shown). Middle: ResNets. Right: ResNets with 110 and 1202 layers.

Figure (from the paper): standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3×3 layer, after BN and before the nonlinearity. Top: layers shown in their original order. Bottom: responses ranked in descending order.

Conclusions

1. When the input x and the output y of the addition layer have different dimensions (i.e., F maps from a lower to a higher dimension), the identity mapping can no longer be used directly as the shortcut connection; instead a linear projection Ws, which acts as a connection matrix, is added. This structure is called the projection shortcut (a minimal sketch of such a projection appears after this list).

2. Residual learning applies not only to fully connected layers but also to convolutional layers.

3. The authors compare the experimental results of the projection shortcut and the identity shortcut. The projection shortcut gives a slight improvement, but this is attributable to the extra parameters introduced by the projection connections.

4. The authors also tested a residual network with more than 1,000 layers. Its test error was worse than that of the 110-layer network, although the training errors were similar, so the phenomenon is most likely overfitting: for a small dataset, a 1,000-layer network is simply unnecessary.

5. ResNet generalizes very well to other recognition tasks; the authors also apply it to object detection.
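As mentioned in point 1 above, the projection Ws is simply a 1×1 convolution (with stride 2 when it also has to downsample). A minimal sketch, with illustrative channel counts:

```python
import torch.nn as nn


def projection_shortcut(in_channels, out_channels, stride=2):
    """1x1 convolution Ws that matches the channels and spatial size of the main branch."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels),
    )

# e.g. when a stage doubles the width from 64 to 128 and halves the feature map:
# downsample = projection_shortcut(64, 128)   # used as: out = F(x) + downsample(x)
```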


ResNeXt

Aggregated Residual Transformations for Deep Neural Networks

https://arxiv.org/abs/1611.05431

Code and models are published in

https://github.com/facebookresearch/ResNeXt

Highlights:

  • Proposes a concise, highly modular network
  • Its main feature is the aggregated transformation
  • Increasing cardinality improves accuracy more efficiently than increasing depth or width
  • Second place in the ILSVRC classification task; surpasses ResNet on ImageNet-5K and COCO

Preface

An upgrade of ResNet: ResNeXt. The main motivation is that, traditionally, improving a model's accuracy means making the network deeper or wider; but as the number of hyperparameters grows (channel counts, filter sizes, and so on), both the difficulty of network design and the computational overhead increase. The ResNeXt structure proposed in this paper improves accuracy without increasing parameter complexity, and at the same time reduces the number of hyperparameters (thanks to the shared topology of its sub-modules).

First, VGG is built mainly by stacking layers of the same shape, an idea that ResNet also adopts. The Inception family instead follows a split-transform-merge strategy. However, the Inception networks have a problem: their hyperparameter settings are highly tailored, and many parameters must be retuned when moving to other datasets, so their extensibility is mediocre.

ResNeXt adopts both VGG's stacking idea and Inception's split-transform-merge idea, yet remains highly extensible. It can be seen as improving accuracy without increasing (or even while reducing) model complexity. The paper introduces the term cardinality: the size of the set of transformations to be aggregated. On the right of Figure 1, cardinality = 32, and every aggregated path has the same topology (unlike Inception, which keeps the design burden low). Increasing cardinality is more effective than increasing depth or width.

Related work

1. Multi-branch networks are widely used. ResNets can be viewed as two-branch networks in which one branch is the identity mapping. Deep neural decision forests are tree-structured multi-branch networks with learned split functions, and deep network decision trees are also multi-branch structures. Multi-path designs therefore have a broad foundation.

2. Grouped convolution is widely used, yet there has been little evidence that grouping improves accuracy.

3. Model compression has been studied extensively. Unlike model compression, the structure designed in this paper achieves strong performance with low computational cost by design.

4. Model ensembling is an effective way to improve accuracy. The model in this paper is not an ensemble, because its paths are trained jointly rather than independently.

The network structure

The paper lists the internal structure of ResNet-50 and ResNeXt-50; the last two rows of the table show that the two have almost the same parameter complexity. (Left) ResNet-50. (Right) ResNeXt-50 with a 32×4d template (using the reformulation of Figure 3(c)). Inside the brackets are the shapes of the residual blocks; outside the brackets is the number of blocks stacked in each stage. "C = 32" means the grouped convolution has 32 groups. The two models have a similar number of parameters and similar FLOPs.

The blocks all share the same topology and follow two simple rules inspired by VGG/ResNet: (i) blocks that produce spatial maps of the same size share the same hyperparameters (width and filter size); (ii) each time the spatial map is downsampled by a factor of 2, the block width is multiplied by 2. The second rule keeps the computational complexity, in FLOPs (floating-point operations, i.e., multiply-adds), roughly the same for all blocks.

Three equivalent ResNeXt building blocks. (a) Aggregated residual transformations, the same as the right of Figure 1. (b) A block equivalent to (a), implemented as early concatenation. (c) A block equivalent to (a) and (b), implemented as grouped convolution. Bold text highlights the reformulation changes. A layer is denoted as (# input channels, filter size, # output channels).

Fig. 3(a): aggregated residual transformations. Fig. 3(b): concatenation after the two-layer paths followed by a convolution, similar to Inception-ResNet except that all paths share the same topology. Fig. 3(c): a more concise implementation using grouped convolution.

The authors state explicitly that these three structures are strictly equivalent, and the three yield exactly the same results. The results reported in the paper are those of Fig. 3(c), because this structure is simpler and faster.
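A minimal PyTorch sketch of the Fig. 3(c) form for the 256-channel stage (the 32×4d template: width 128 = 32 groups × 4 channels each). This is an illustration, not the released implementation:

```python
import torch.nn as nn
import torch.nn.functional as F


class ResNeXtBlock(nn.Module):
    """1x1 reduce -> 3x3 grouped conv (cardinality groups) -> 1x1 expand, plus shortcut."""

    def __init__(self, channels=256, width=128, cardinality=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, width, kernel_size=1, bias=False)    # 256 -> 128
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, padding=1,
                               groups=cardinality, bias=False)                # 32 groups of 4
        self.bn2 = nn.BatchNorm2d(width)
        self.conv3 = nn.Conv2d(width, channels, kernel_size=1, bias=False)    # 128 -> 256
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)   # identity shortcut, exactly as in ResNet
```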

(Left) A transformation of depth 2. (Right) An equivalent block that is slightly wider. ResNeXt improves accuracy without increasing the number of parameters or the amount of computation.

Grouping convolution

Grouped convolution first appeared in AlexNet. Hardware resources were limited at the time, and the convolutions could not all be processed on a single GPU during training, so the authors split the feature maps across multiple GPUs and fused the results from the GPUs at the end.

Interestingly, grouped convolution was an engineering compromise at the time. AlexNet is easy to train today, but back then Hinton and his students had to split the network across two GTX 580 GPUs and train it for a week, and communication between the two GPUs was rather complicated (fortunately, libraries such as TensorFlow now handle the communication problem of multi-GPU training for us). That is how grouped convolution was born. What they did not anticipate is how far-reaching the idea would be: many of today's lightweight state-of-the-art (SOTA) networks use grouped convolution to save computation.

Question: if the groups of a grouped convolution are placed on different GPUs, the computation per GPU drops to 1/groups of the original. But if everything stays on the same GPU, does the overall amount of computation remain unchanged?

In fact, it does not: grouped convolution itself greatly reduces the number of parameters. For example, with input_channel = 256, output_channel = 256, and kernel size = 3×3, an ordinary (ungrouped) convolution has 256×256×3×3 parameters.

With grouped convolution, say groups = 2, each group has input_channel = output_channel = 128, so the parameter count is 2×128×128×3×3, half of the original.
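The arithmetic above can be checked directly with the `groups` argument of PyTorch's `nn.Conv2d`:

```python
import torch.nn as nn

full = nn.Conv2d(256, 256, kernel_size=3, bias=False)               # 256*256*3*3 weights
grouped = nn.Conv2d(256, 256, kernel_size=3, groups=2, bias=False)  # 2*(128*128*3*3) weights

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(grouped))   # 589824 294912 -> the grouped version has half
```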

The final output feature maps are combined by concatenation rather than element-wise addition. If the two groups are placed on two GPUs, each GPU handles roughly a quarter of the original computation, for about a 4× speedup.

The experimental results

Figure (from the paper): ImageNet-1K training curves. Left: ResNet-50 vs. ResNeXt-50 with complexity preserved (4.1 billion FLOPs, 25 million parameters). Right: ResNet-101 vs. ResNeXt-101 with complexity preserved (7.8 billion FLOPs, 44 million parameters).

Table (from the paper): ablation experiments on ImageNet-1K. Top: preserving the complexity of ResNet-50 (4.1 billion FLOPs). Bottom: preserving the complexity of ResNet-101 (7.8 billion FLOPs). Error rates are evaluated on single 224×224-pixel crops.

Table (from the paper): comparison on ImageNet-1K when the number of FLOPs is increased to 2× that of ResNet-101. Error rates are evaluated on single 224×224-pixel crops. The highlighted factors are the ones that increase complexity.

Conclusion:

  • ResNeXt combines the advantages of Inception and ResNet (and, in fact, grouped convolution): the residual/shortcut structure makes it easy to train, and the concatenation of feature groups lets the network understand features from multiple angles. This resembles model fusion, where models with complementary strengths work better together.
  • The core innovation is the proposed aggregated transformations: the original three-layer convolution block of ResNet is replaced by a parallel stack of blocks with the same topology, which improves accuracy without significantly increasing the number of parameters. Because the paths share the same topology, the number of hyperparameters is also reduced, which makes the model easier to transfer.
