GitHub address: github.com/zhanghang19…
Paper address: hangzhang.org/files/resne…
2020.06.23
Video on ResNeSt: www.bilibili.com/video/BV1PV…
2020.05.26
Why was ResNeSt rejected by ECCV 2020? zhuanlan.zhihu.com/p/143214871
ResNeSt tops the COCO dataset leaderboards (object detection, instance segmentation, panoptic segmentation) zhuanlan.zhihu.com/p/140236141
2020.04.24
Some discussion of ResNeSt:
Is the ResNeSt implementation wrong?
A bit of confusion about ResNeSt.
—————————————————————————————————————–
The key: split-attention blocks
Let’s start with a set of images:
In image classification on ImageNet, ResNeSt outperforms its predecessors ResNet, ResNeXt, SENet, and EfficientNet. Faster-RCNN with ResNeSt-50 as its backbone achieves an mAP 3.08% higher than with ResNet-50, and DeepLabV3 with ResNeSt-50 as its backbone achieves an mIoU 3.02% higher than with ResNet-50. The gains are substantial.
1. Motivation
The authors argue that basic convolutional neural networks such as ResNet were designed for image classification. Due to their limited receptive field size and lack of cross-channel interaction, these networks may not be well suited to other fields such as object detection and image segmentation. This means that improving performance on a given computer vision task requires "network surgery" to modify ResNet for that particular task. For example, some methods add pyramid modules [8,69], introduce long-range connections [56], or use cross-channel feature-map attention [15,65]. While these approaches do improve learning performance on some tasks, they raise the question: can we create a general-purpose backbone with universally improved feature representations that raises performance across multiple tasks simultaneously? Cross-channel information has been used successfully in downstream applications [56,64,65], while recent image classification networks focus more on group or depthwise convolution [27,28,54,60]. Despite their excellent computational cost and accuracy on classification tasks, these models do not transfer well to other tasks, because their isolated representations fail to capture relationships across channels [27,28]. A network with cross-channel representations is therefore worth building.
2. Contributions of this paper
First contribution: it proposes split-attention blocks and uses them to construct ResNeSt, which requires no additional computation compared with existing ResNet variants, and which can serve as a backbone for other tasks.
Second contribution: a large-scale benchmark of image classification and transfer-learning applications. Models using the ResNeSt backbone achieve state-of-the-art performance on several tasks, namely image classification, object detection, instance segmentation, and semantic segmentation. Compared with the latest CNN models produced by neural architecture search [55], the proposed ResNeSt outperforms all existing ResNet variants at the same computational efficiency, and even achieves a better speed-accuracy trade-off. A single Cascade-RCNN [3] model with a ResNeSt-101 backbone achieves 48.3% box mAP and 41.56% mask mAP on MS-COCO instance segmentation. A single DeepLabV3 [7] model, also with a ResNeSt-101 backbone, reaches 46.9% mIoU on the ADE20K scene-parsing validation set, more than 1% mIoU above the previous best result.
3. Related work (not covered here)
4. Split-attention networks
Look directly at ResNeSt Block:
First, following the idea of the ResNeXt network, the input is divided into K groups, named Cardinal 1 through Cardinal K; each cardinal group is then further divided into R splits, named Split 1 through Split R, so there are G = K×R groups in total, as in the shape sketch below.
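A minimal shape sketch of this grouping (hypothetical sizes, not the official code): with cardinality K = 2 and radix R = 2, a single grouped convolution can produce all G = K×R feature-map groups at once, and the channel axis is then split into R chunks:

import torch
from torch import nn

# hypothetical sizes: K cardinal groups, R splits, 64 channels per split
K, R, channels = 2, 2, 64
conv = nn.Conv2d(64, channels * R, kernel_size=3, padding=1, groups=K * R)

x = torch.randn(1, 64, 56, 56)
u = conv(x)                               # (1, channels*R, 56, 56)
splits = torch.split(u, channels, dim=1)  # R chunks, one per split
print(len(splits), splits[0].shape)       # 2 torch.Size([1, 64, 56, 56])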
Then here is what each Cardinal looks like:
This is inspired by the Squeeze-and-Excitation Network (SENet), a channel-wise attention mechanism: per-channel weights are learned to model the importance of each channel. The basic structure of an SE block is as follows:
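For reference, a minimal SE-block sketch (my own illustration, not code from the paper): squeeze with global average pooling, excite with a two-layer bottleneck, then rescale each channel by its learned weight.

import torch
from torch import nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = x.mean(dim=(2, 3))           # squeeze: (b, c)
        w = self.fc(w).view(b, c, 1, 1)  # excite: per-channel weights in (0, 1)
        return x * w                     # rescale channels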
It also borrows from SKNet, whose core is the selective-kernel module:
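And a minimal selective-kernel sketch (again my own illustration, with assumed layer sizes): two branches with different receptive fields, fused by a softmax over branches so each channel picks how much of each kernel to use.

import torch
from torch import nn
import torch.nn.functional as F

class SKUnit(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # 5x5 receptive field via a dilated 3x3 conv
        self.branch5 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False)
        mid = max(channels // reduction, 8)
        self.fc = nn.Linear(channels, mid)
        self.fcs = nn.Linear(mid, channels * 2)  # one score per branch and channel

    def forward(self, x):
        b, c = x.shape[:2]
        u = torch.stack([self.branch3(x), self.branch5(x)], dim=1)  # (b, 2, c, h, w)
        s = u.sum(dim=1).mean(dim=(2, 3))                           # fuse + squeeze: (b, c)
        a = self.fcs(F.relu(self.fc(s))).view(b, 2, c)              # branch logits
        a = F.softmax(a, dim=1).view(b, 2, c, 1, 1)                 # select across branches
        return (u * a).sum(dim=1)                                   # weighted sum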
Reference: blog.csdn.net/qixutuo6087…
Going back to the paper: for the k-th cardinal group, the outputs of its R splits are summed to form a combined representation:

$$\hat{U}^k = \sum_{j=R(k-1)+1}^{Rk} U_j$$

Channel-wise weight statistics can then be obtained by global average pooling:

$$s_c^k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \hat{U}_c^k(i, j)$$

$V^k$ denotes the output of the cardinal group after weighting its channels; its c-th channel is a weighted fusion of the splits:

$$V_c^k = \sum_{i=1}^{R} a_i^k(c) \, U_{R(k-1)+i}$$

where $a_i^k(c)$ is the weight computed by a softmax over the R splits:

$$a_i^k(c) = \frac{\exp\left(\mathcal{G}_i^c(s^k)\right)}{\sum_{j=1}^{R} \exp\left(\mathcal{G}_j^c(s^k)\right)}$$

If R = 1, all channels in the cardinal group are treated as one split, and the weight reduces to a sigmoid: $a_i^k(c) = 1 / \left(1 + \exp\left(-\mathcal{G}_i^c(s^k)\right)\right)$.

The outputs of the K cardinal groups are then concatenated along the channel dimension:

$$V = \mathrm{Concat}\{V^1, V^2, \dots, V^K\}$$

Assuming the output of each ResNeSt block is Y, we have:

$$Y = V + T(X)$$

where T stands for the shortcut (skip-connection) mapping. This form is consistent with the residual-block output in ResNet.
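To make the equations concrete, here is a toy walk-through of one cardinal group with plain tensor ops (hypothetical shapes; the learned mappings G_i are replaced by random linear maps):

import torch
import torch.nn.functional as F

# toy shapes: R splits, C channels, H x W feature maps
R, C, H, W = 2, 4, 8, 8
U = torch.randn(R, C, H, W)               # outputs of the R splits

U_hat = U.sum(dim=0)                      # fused representation, (C, H, W)
s = U_hat.mean(dim=(1, 2))                # global average pooling, (C,)

G = torch.randn(R, C, C)                  # stand-in for the learned mappings G_i
logits = torch.einsum('rij,j->ri', G, s)  # G_i(s) per split and channel, (R, C)
a = F.softmax(logits, dim=0)              # weights a_i(c); sum to 1 over the R splits

V = (a[:, :, None, None] * U).sum(dim=0)  # weighted fusion, (C, H, W)
print(V.shape)                            # torch.Size([4, 8, 8])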
5. Problems with the residual network
(1) The residual network uses strided convolution, e.g. a 3×3 convolution with stride 2, to reduce the spatial dimensions of the feature map, which loses a lot of spatial information. Spatial information is very important in areas such as object detection and segmentation. Moreover, convolution layers generally zero-pad the image boundary, which is not the best choice when transferring to other dense-prediction problems. Therefore, this paper uses average pooling with a 3×3 kernel to reduce the spatial dimensions.
(2) Two further tweaks (see the sketch after this list):
- The 7×7 convolution in the residual network is replaced by three 3×3 convolutions with the same receptive field.
- In the skip connection, a 2×2 average pooling is added before the 1×1 convolution that previously had stride 2.
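A minimal sketch of the tweaks described in (1) and (2) (illustrative channel sizes, not the official ResNeSt code):

import torch
from torch import nn

# (1) downsample with a 3x3 average pool followed by a stride-1 conv,
# instead of a strided 3x3 conv
downsample = nn.Sequential(
    nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
)

# (2a) deep stem: three 3x3 convs replace the single 7x7 stem conv
deep_stem = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, 3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, 3, stride=1, padding=1, bias=False),
)

# (2b) shortcut: 2x2 average pool first, then a stride-1 1x1 conv,
# so no activations are simply discarded by the stride
shortcut = nn.Sequential(
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 128, kernel_size=1, stride=1, bias=False),
)

x = torch.randn(1, 64, 56, 56)
print(downsample(x).shape, shortcut(x).shape)  # both halve the spatial dims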
6. Training strategies
Here’s a brief list (a small code sketch follows); see the paper for details.
(1) Large mini-batch, with a cosine learning-rate decay schedule, warmup, and specific BN-layer parameter settings.
(2) Label smoothing
(3) Auto augmentation
(4) Mixup training
(5) Large crop-size settings
(6) Regularization
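As a minimal illustration of a few of these (illustrative hyper-parameters only; label_smoothing needs PyTorch 1.10+, and warmup/augmentation are omitted):

import torch
from torch import nn

# stand-in model and a paper-style SGD setup (illustrative values)
model = nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# (1) cosine learning-rate decay over the whole run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=270)

# (2) label smoothing: soften the one-hot targets
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# (4) mixup: train on convex combinations of sample pairs and their labels
def mixup(x, y_onehot, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y_onehot + (1 - lam) * y_onehot[idx]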
7. Related results
There are also some results in the appendix, which will not be posted.
Split-attention block code:
import torch
from torch import nn
import torch.nn.functional as F
from torch.nn import Conv2d, Module, Linear, BatchNorm2d, ReLU
from torch.nn.modules.utils import _pair

__all__ = ['SplAtConv2d']

class DropBlock2D(object):
    def __init__(self, *args, **kwargs):
        raise NotImplementedError

class SplAtConv2d(Module):
    """Split-Attention Conv2d"""
    def __init__(self, in_channels, channels, kernel_size, stride=(1, 1), padding=(0, 0),
                 dilation=(1, 1), groups=1, bias=True,
                 radix=2, reduction_factor=4,
                 rectify=False, rectify_avg=False, norm_layer=None,
                 dropblock_prob=0.0, **kwargs):
        super(SplAtConv2d, self).__init__()
        padding = _pair(padding)
        self.rectify = rectify and (padding[0] > 0 or padding[1] > 0)
        self.rectify_avg = rectify_avg
        # bottleneck width of the attention MLP (fc1 -> fc2)
        inter_channels = max(in_channels * radix // reduction_factor, 32)
        self.radix = radix
        self.cardinality = groups
        self.channels = channels
        self.dropblock_prob = dropblock_prob
        # one grouped conv produces all cardinality*radix feature-map groups at once
        if self.rectify:
            from rfconv import RFConv2d
            self.conv = RFConv2d(in_channels, channels * radix, kernel_size, stride,
                                 padding, dilation, groups=groups * radix, bias=bias,
                                 average_mode=rectify_avg, **kwargs)
        else:
            self.conv = Conv2d(in_channels, channels * radix, kernel_size, stride,
                               padding, dilation, groups=groups * radix, bias=bias,
                               **kwargs)
        self.use_bn = norm_layer is not None
        if self.use_bn:
            self.bn0 = norm_layer(channels * radix)
        self.relu = ReLU(inplace=True)
        # attention MLP, implemented as 1x1 grouped convolutions
        self.fc1 = Conv2d(channels, inter_channels, 1, groups=self.cardinality)
        if self.use_bn:
            self.bn1 = norm_layer(inter_channels)
        self.fc2 = Conv2d(inter_channels, channels * radix, 1, groups=self.cardinality)
        if dropblock_prob > 0.0:
            self.dropblock = DropBlock2D(dropblock_prob, 3)

    def forward(self, x):
        x = self.conv(x)
        if self.use_bn:
            x = self.bn0(x)
        if self.dropblock_prob > 0.0:
            x = self.dropblock(x)
        x = self.relu(x)

        batch, channel = x.shape[:2]
        if self.radix > 1:
            # split along the channel axis into radix groups and sum them (U-hat)
            splited = torch.split(x, channel // self.radix, dim=1)
            gap = sum(splited)
        else:
            gap = x
        # global average pooling gives the channel statistics s
        gap = F.adaptive_avg_pool2d(gap, 1)
        gap = self.fc1(gap)
        if self.use_bn:
            gap = self.bn1(gap)
        gap = self.relu(gap)

        # attention weights: softmax over the radix dimension (sigmoid if radix == 1)
        atten = self.fc2(gap).view((batch, self.radix, self.channels))
        if self.radix > 1:
            atten = F.softmax(atten, dim=1).view(batch, -1, 1, 1)
        else:
            atten = torch.sigmoid(atten).view(batch, -1, 1, 1)

        if self.radix > 1:
            # weighted fusion of the splits (V)
            atten = torch.split(atten, channel // self.radix, dim=1)
            out = sum([att * split for (att, split) in zip(atten, splited)])
        else:
            out = atten * x
        return out.contiguous()
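A quick smoke test of the block above (hypothetical shapes, with radix 2 and cardinality 2):

conv = SplAtConv2d(64, 64, kernel_size=3, padding=1, groups=2, radix=2,
                   norm_layer=nn.BatchNorm2d, bias=False)
out = conv(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])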
Please point out any mistakes.
My blog is synced to the Tencent Cloud+ Community; everyone is welcome to visit: cloud.tencent.com/developer/s…