Vision MLP: CycleMLP, A MLP-like Architecture for Dense Prediction

The original document: www.yuque.com/lart/papers…

Reading the abstract

This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions, unlike modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated to image size and thus are infeasible in object detection and segmentation.

It looks like yet another pyramid-shaped MLP architecture, so the main work must once again be around the spatial MLP. This time, the goal is to remove the architecture's dependency on the input size. In fact, in the original MLP-Mixer paper, the authors also tried a pyramid structure, which did converge faster than the fixed-resolution form.

We tried using the token-mixing MLP to reduce the number of tokens by mapping from S input tokens to S’

CycleMLP has two advantages compared to modern approaches.

  • (1) It can cope with various image sizes.
  • (2) It achieves linear computational complexity to image size by using local windows.

    Linear computational complexity is achieved here through local windows; the fixed local windows are also what allow images of different sizes to be processed.

In contrast, previous MLPs have quadratic computations because of their fully spatial connections.

This points out the necessity of improving the spatial MLP: its computational complexity is too high.

We build a family of models that surpass existing MLPs and achieve a comparable accuracy (83.2%) on ImageNet-1K classification compared to the state-of-the-art Transformer such as Swin Transformer (83.3%) but using fewer parameters and FLOPs.

As you can see, the performance of this approach is quite strong. It is unclear whether any strategies beyond the architecture itself are used.

We expand the MLP-like models’ applicability, making them a versatile backbone for dense prediction tasks. CycleMLP aims to provide a competitive baseline on object detection, instance segmentation, and semantic segmentation for MLP models. In particular, CycleMLP achieves 45.1 mIoU on ADE20K val, comparable to Swin (45.2 mIoU). Code is available at this HTTPS URL.

The main content

The overall structure

As can be seen, the core lies in the improved spatial MLP. Existing MLP structures have three shortcomings:

  1. Most of them are single-scale structures, which are not easy to transfer to other tasks requiring feature pyramids, such as detection and segmentation.
  2. The spatial MLP connects all points in the input feature space, which ties the model to a fixed input size. This is not conducive to multi-scale training and multi-scale testing, or even to using different training and testing resolutions.
  3. Spatial MLP has quadratic computational complexity, which makes it inconvenient to process high-resolution images.
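
To make points 2 and 3 concrete, here is a minimal toy sketch (my own illustration, not from the paper's code) contrasting a spatial MLP with a channel MLP:

import torch
import torch.nn as nn

# A spatial MLP mixes across tokens, so its weight matrix is
# (num_tokens x num_tokens): quadratic in the token count and frozen to one
# input size. A channel MLP mixes across channels and is resolution-agnostic.
num_tokens, channels = 14 * 14, 96

spatial_mlp = nn.Linear(num_tokens, num_tokens)  # 196x196 weights, tied to 14x14 inputs
channel_mlp = nn.Linear(channels, channels)      # 96x96 weights, any resolution

x_small = torch.randn(1, 14 * 14, channels)  # flattened 14x14 feature map
x_large = torch.randn(1, 28 * 28, channels)  # flattened 28x28 feature map

spatial_mlp(x_small.transpose(1, 2))  # OK
channel_mlp(x_large)                  # OK: independent of spatial size
# spatial_mlp(x_large.transpose(1, 2))  # fails: weights are tied to 196 tokens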

The authors respond to this from two aspects:

  1. Aiming at problem 1, a hierarchical structure is designed to generate a feature pyramid.
  2. For problems 2 and 3, which both stem from the spatial MLP, the authors designed a special channel MLP to realize local spatial processing. Since it only operates on a local region, there is no strong dependence on the input size. And since it is still a channel MLP (a spatially shared pointwise operation), the computational complexity is reduced to linear.

Differences from S2-MLP

Although the channel MLP is used to replace the spatial MLP, the specific way and the overall form of the model are different:

  1. In S2-MLP, the features are grouped along the channel dimension, and different groups are shifted by different relative offsets in different spatial directions. This introduces extra grouping and shifting operations on the feature map. In this paper, there is no need to modify the features themselves; only the operating form of the channel MLP is adjusted. This gives better versatility and pluggability.
  2. S2-MLP is still a single-scale structure, while this paper introduces a pyramid structure to better adapt to detection and segmentation tasks.

As a matter of fact, regarding point 1, the S2-MLP paper already gives a strategy based on depthwise convolution: the shift can be realized by a depthwise convolution with a specifically constructed kernel, so the grouping and shifting of the input can be carried out directly by the convolution itself. So point 1 does not really hold; it can only be said that the implementation of Cycle FC may be more direct, unlike S2-MLP, which requires some processing operations unrelated to the actual computation. Going further: in terms of implementation, can Cycle FC also be realized cheaply with depthwise convolution? The answer is yes, and I offer a simple attempt right below (with more in the code analysis later).
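
As a sketch of that claim (my own attempt, not the official implementation), the stair-like sampling of a (3, 1) Cycle FC can be imitated by a fixed depthwise convolution whose per-channel kernels encode the cyclic shifts:

import torch
import torch.nn.functional as F

def shift_as_depthwise_conv(x: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    """Shift each channel along H by a per-channel offset cycling over
    {-1, 0, +1}, implemented as a fixed (non-learned) depthwise convolution.
    A full Cycle FC would follow this with a learned 1x1 convolution
    (the channel MLP)."""
    B, C, H, W = x.shape
    weight = torch.zeros(C, 1, kernel_size, 1)
    for c in range(C):
        # place the single 1 so channel c reads from row i + ((c + k//2) % k - k//2)
        weight[c, 0, (c + kernel_size // 2) % kernel_size, 0] = 1.0
    return F.conv2d(x, weight.to(x), padding=(kernel_size // 2, 0), groups=C)

x = torch.randn(2, 6, 8, 8)
y = shift_as_depthwise_conv(x)  # same shape, channels sampled from shifted rows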

The model details

  • Patch embedding module: window size 7, stride 4; the resulting features are downsampled 4×.
  • In the middle stages, strided convolutions realize 2× downsampling.
  • The final output features are downsampled 32×.
  • Ultimately, a fully connected layer is used to integrate all token features.
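
A hypothetical sketch of the stem and transition layers just described (the layer names and padding values are my assumptions, not from the official repo):

import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 96, kernel_size=7, stride=4, padding=2)   # 4x downsampling
transition = nn.Conv2d(96, 192, kernel_size=3, stride=2, padding=1)  # 2x per stage

x = torch.randn(1, 3, 224, 224)
x = patch_embed(x)  # -> (1, 96, 56, 56)
x = transition(x)   # -> (1, 192, 28, 28); repeated until 32x total downsampling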

Core operation — Cycle FC

The core idea of Cycle FC proposed in this paper is to exploit the channel MLP's independence from the feature size (relaxing the restriction on input shape and reducing the computational complexity to linear), while trying to enlarge its receptive field to better integrate context features.

As can be seen from the form given in the figure, Cycle FC is actually a channel MLP with a specific position offset along the channels (a stair-like sampling pattern), so its requirements on the input shape are loose. Of course, the offsets must at least stay within the pseudo-kernel size defined on H and W. As the code shows, this is a bounded range obtained by taking the channel index modulo the kernel size, which yields a cyclic offset. The implementation here is interesting, using deformable convolution to apply the offsets to the kernel parameters. Specifically, the original channel MLP is computed as

$$Y_{i,j} = \sum_{c=0}^{C_i - 1} \mathcal{F}^{\top}_{j,c} \cdot X_{i,c},$$

where $\mathcal{F} \in \mathbb{R}^{C_i \times C_o}$ is the learnable weight of the channel MLP, and $i$ and $j$ index space and output channel respectively. Cycle FC extends this to

$$Y_{i,j} = \sum_{c=0}^{C_i - 1} \mathcal{F}^{\top}_{j,c} \cdot X_{i + c \% S_{\mathcal{P}},\, c}.$$

This introduces an offset range parameter $S_{\mathcal{P}}$, the pseudo-kernel size, which describes the area of the HW plane covered by all sampled positions after the channel offsets, while the index $i$ denotes the starting position of the offset. In the code, this starting position is the relative coordinate of the center inside the rectangular pseudo-kernel region (indexing from the upper-left corner of the region starting at 0, in row-major order), namely start_idx = (self.kernel_size[0] * self.kernel_size[1]) // 2.
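
To make the formula concrete: for a (1, 3) pseudo-kernel we have $S_{\mathcal{P}} = 3$ and start_idx = (1 * 3) // 2 = 1, so the per-channel offsets cycle as follows (a small check, matching gen_offset below):

# channel c samples X at spatial offset ((c + start_idx) % S_P) - S_P // 2
S_P, start_idx = 3, 1
for c in range(6):
    print(c, (c + start_idx) % S_P - S_P // 2)
# 0  0
# 1  1
# 2 -1
# 3  0
# 4  1
# 5 -1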

Code parsing

Part of the core code is analyzed in the form of comments.

import torch
import torch.nn as nn
from torch import Tensor
from torch.nn.modules.utils import _pair
from torchvision.ops.deform_conv import deform_conv2d as deform_conv2d_tv


class CycleFC(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size,  # re-defined kernel_size, represents the spatial area of the staircase FC
        stride: int = 1,
        padding: int = 0,
        dilation: int = 1,
        groups: int = 1,
        bias: bool = True,
    ):
        """kernel_size is actually (1, 3) or (3, 1)."""
        super(CycleFC, self).__init__()

        if in_channels % groups != 0:
            raise ValueError('in_channels must be divisible by groups')
        if out_channels % groups != 0:
            raise ValueError('out_channels must be divisible by groups')
        if stride != 1:
            raise ValueError('stride must be 1')
        if padding != 0:
            raise ValueError('padding must be 0')

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = _pair(stride)
        self.padding = _pair(padding)
        self.dilation = _pair(dilation)
        self.groups = groups

        # Since the deformable convolution function provided by torchvision is used
        # later, the weight of the offset-adjusted 1x1 convolution has to be
        # constructed manually.
        self.weight = nn.Parameter(torch.empty(out_channels, in_channels // groups, 1, 1))
        # kernel size == 1

        if bias:
            self.bias = nn.Parameter(torch.empty(out_channels))
        else:
            self.register_parameter('bias', None)
        # Note that a buffer is registered here: it is a constant that cannot be
        # learned, but it is saved along with the model weights.
        self.register_buffer('offset', self.gen_offset())

    def gen_offset(self):
        """The core operation that generates the offsets of the convolution kernel.

        To understand this function, you need to understand the usage of the
        deform_conv2d_tv call that follows. See:
        https://pytorch.org/vision/0.10/ops.html#torchvision.ops.deform_conv2d

        The requirement on the offset parameter there is:
        offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width,
        out_height, out_width]): offsets to be applied for each position in the
        convolution kernel.

        That is, for position (x, y) of the output feature map of sample s, the
        kernel of shape kernel_height * kernel_width uses the set of parameters
        offset[s, 0:2*offset_groups*kernel_height*kernel_width, x, y]. Different
        positions can have different offsets, or all share the same one (the
        implementation below is the latter).

        The count 2 * offset_groups * kernel_height * kernel_width reflects the
        grouping of input channels: they are divided into offset_groups groups,
        each with its own set of relative offsets with respect to the kernel
        center, 2 * kernel_height * kernel_width numbers in total. For each kernel
        parameter, two values describe the offset: the h and w offsets relative to
        the center, hence the subtraction of kernel_height // 2 or
        kernel_width // 2 in the code below.

        Note that when an offset falls outside the padded tensor boundary, the
        sampling grid is completed with zeros; if the grid straddles the boundary,
        the boundary values and the zero-filled grid vertices are used to compute
        the bilinear interpolation result.
        """
        offset = torch.empty(1, self.in_channels * 2, 1, 1)
        start_idx = (self.kernel_size[0] * self.kernel_size[1]) // 2
        assert self.kernel_size[0] == 1 or self.kernel_size[1] == 1, self.kernel_size
        for i in range(self.in_channels):
            if self.kernel_size[0] == 1:
                offset[0, 2 * i + 0, 0, 0] = 0
                # A relative offset position is computed here; deform_conv2d uses
                # offset coordinates centered on the corresponding output position.
                offset[0, 2 * i + 1, 0, 0] = (
                    (i + start_idx) % self.kernel_size[1] - (self.kernel_size[1] // 2)
                )
            else:
                offset[0, 2 * i + 0, 0, 0] = (
                    (i + start_idx) % self.kernel_size[0] - (self.kernel_size[0] // 2)
                )
                offset[0, 2 * i + 1, 0, 0] = 0
        return offset

    def forward(self, input: Tensor) -> Tensor:
        """Args:
            input (Tensor[batch_size, in_channels, in_height, in_width]): input tensor
        """
        B, C, H, W = input.size()
        return deform_conv2d_tv(input,
                                self.offset.expand(B, -1, H, W),
                                self.weight,
                                self.bias,
                                stride=self.stride,
                                padding=self.padding,
                                dilation=self.dilation)

Since there was some confusion about the generation of the offset here, I also opened an issue with the author (Github.com/ShoufaChen/…), together with a small example of deform_conv2d. There, the author provides a clearer illustration of the code.
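
For reference, below is a minimal toy example of deform_conv2d's offset semantics; this is my own sketch, not the example from the issue:

import torch
from torchvision.ops import deform_conv2d

# A 1x1 kernel whose single sampling point is shifted by +1 in W at every
# output position, i.e. output[y, x] = input[y, x + 1].
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
weight = torch.ones(1, 1, 1, 1)   # identity 1x1 kernel
offset = torch.zeros(1, 2, 4, 4)  # (dh, dw) per output position
offset[:, 1] = 1.0                # shift the sampling point right by 1
y = deform_conv2d(x, offset, weight)
print(y[0, 0])  # each value comes from its right neighbor; out-of-bounds -> 0

The CycleFC operation above is then combined into a three-branch block: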

class CycleMLP(nn.Module):
    def __init__(self, dim, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.mlp_c = nn.Linear(dim, dim, bias=qkv_bias)

        self.sfc_h = CycleFC(dim, dim, (1, 3), 1, 0)
        self.sfc_w = CycleFC(dim, dim, (3, 1), 1, 0)

        self.reweight = Mlp(dim, dim // 4, dim * 3)

        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, H, W, C = x.shape
        h = self.sfc_h(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        w = self.sfc_w(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        c = self.mlp_c(x)

        # Branch reweighting: pool the sum of the three branches, then predict
        # a softmax weight per branch (Mlp is the two-layer MLP from the repo).
        a = (h + w + c).permute(0, 3, 1, 2).flatten(2).mean(2)
        a = self.reweight(a).reshape(B, C, 3).permute(2, 0, 1).softmax(dim=0).unsqueeze(2).unsqueeze(2)

        x = h * a[0] + w * a[1] + c * a[2]

        x = self.proj(x)
        x = self.proj_drop(x)

        return x

From the code, it can be seen that in practice, two parallel 1×3 and 3×1 operations are constructed, in a form similar to the factorized convolutions in Inception V3, alongside an ordinary channel MLP that processes single positions. A three-branch structure is thus constructed.

which is inspired by the factorization of convolution [47] and criss-cross attention [26].
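
For completeness, a hedged usage sketch of the block above (it assumes the CycleFC from earlier and the repo's Mlp helper are in scope):

import torch

block = CycleMLP(dim=64)
x = torch.randn(2, 14, 14, 64)  # (B, H, W, C) layout, as expected by forward()
out = block(x)
print(out.shape)                # torch.Size([2, 14, 14, 64])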

The experimental results

Comparison with the corresponding MLP method

Here the author mentions GFNet, which uses the FFT to learn spatial features with less computation and performance similar to CycleMLP. However, it is limited by the input resolution: changing the input resolution requires interpolating its parameters, which can hurt performance on dense prediction tasks. (Does that really matter much?)
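
For context, a minimal sketch of the GFNet-style idea (my own reconstruction, not GFNet's official code): token mixing is a pointwise product with a learned filter in the Fourier domain, whose shape is tied to (H, W), which is exactly the resolution dependence mentioned above.

import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """GFNet-style token mixing: multiply by a learned filter in the Fourier
    domain. The filter shape is tied to (H, W), so changing the input
    resolution requires interpolating the parameters."""
    def __init__(self, dim: int, h: int = 14, w: int = 8):
        super().__init__()
        # rfft2 halves the last spatial dim: w = W // 2 + 1 for input width W
        self.filter = nn.Parameter(torch.randn(h, w, dim, 2) * 0.02)

    def forward(self, x):  # x: (B, H, W, C)
        freq = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(freq, s=x.shape[1:3], dim=(1, 2), norm='ortho')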

In addition, the authors added ablation experiments with different test resolutions and different branches.

The two experiments revealed some interesting phenomena.

  • Looking first at the effect of resolution, it can be seen that the optimal test resolution may not match the training resolution. The results reflect the robustness of CycleFC to the test size.
    • However, it should be noted that, as the code shows, the test-time procedure is to resize to 256/224 times the specified size and then center-crop to the target size (see the snippet after this list). How the model behaves under other forms of data preprocessing still needs to be verified by more experiments.
  • In Table 4, ablation experiments are performed on the three branches of the multi-branch structure. As can be seen, operation diversity brings a positive gain to the multi-branch structure.
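
For reference, a sketch of that test-time preprocessing, assuming the common timm-style pipeline (my reconstruction, not necessarily the repo's exact code):

from torchvision import transforms

def eval_transform(size: int = 224):
    # resize the shorter side to size * 256 / 224, then center-crop to size
    return transforms.Compose([
        transforms.Resize(int(size * 256 / 224)),
        transforms.CenterCrop(size),
        transforms.ToTensor(),
    ])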

The effectiveness of the proposed structure (more flexible input sizes, more efficient spatial computation) is also demonstrated through its performance on detection and segmentation tasks.

link

  • Thesis: arxiv.org/pdf/2107.10…
  • Code: github.com/ShoufaChen/…