Hire-MLP: Vision MLP via Hierarchical Rearrangement

The original document: www.yuque.com/lart/papers…

The article is very easy to read, with no complicated words or sentence patterns, and is very smooth from beginning to end. I like this writing style very much.

Understanding the paper from the abstract

This paper presents Hire-MLP, a simple yet competitive vision MLP architecture via hierarchical rearrangement.

Questions:

  1. What is it? A simplified processing strategy for the spatial MLP.
  2. Why do it? Divide and conquer, to simplify the problem.
  3. How is it done? The original undifferentiated global spatial MLP is replaced by local processing inside each region, followed by correlating the regions.

Previous vision MLPs like MLP-Mixer are not flexible for various image sizes and are inefficient to capture spatial information by flattening the tokens.

The purpose of this paper:

  1. Remove the old MLP method’s dependence on input data size
  2. A more efficient way to capture spatial information

Hire-MLP innovates the existing MLP-based models by proposing the idea of hierarchical rearrangement to aggregate the local and global spatial information while being versatile for downstream tasks.

Integration of local and global spatial information: the local part may rely on convolution-like operations, but in an MLP context, what handles global spatial information if you do not use a spatial MLP? Pooling? Reading further, the process is actually closer to Swin Transformer: dense connections at a local scale first, and then correlation between the regions. Being more general for downstream tasks (detection, segmentation, etc.) suggests that the method adopts a multi-scale structure. Then how are the features downsampled here: pooling, strided convolution, or patch merging? From the paper, strided convolution is used.
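A minimal sketch of that kind of strided-convolution downsampling between stages (the kernel size, stride, and channel widths here are illustrative assumptions, not the paper's exact settings):

import torch
import torch.nn as nn

# Downsampling between stages with a stride-2 convolution: the spatial
# resolution is halved and the channel width is typically increased.
downsample = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 64, 56, 56)   # (B, C, H, W) feature map from one stage
y = downsample(x)
print(y.shape)                   # torch.Size([1, 128, 28, 28])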

Specifically, the inner-region rearrangement is designed to capture local information inside a spatial region. Moreover, to enable information communication between different regions and capture global context, the cross-region rearrangement is proposed to circularly shift all tokens along spatial directions.

It looks a bit like Swin Transformer: local processing first, followed by global shifts to associate the regions.

The proposed Hire-MLP architecture is built with simple channel-mixing MLPs and rearrangement operations, thus enjoys high flexibility and inference speed.

The processing here does not seem to mention a spatial MLP, so what exactly happens inside the local regions?

Experiments show that our Hire-MLP achieves state-of-the-art performance on the ImageNet-1K benchmark. In particular, Hire-MLP achieves 83.4% top-1 accuracy on ImageNet, which surpasses previous Transformer-based and MLP-based models with a better trade-off between accuracy and throughput.

The main content

As you can see, the main change is to the original spatial MLP, which is replaced with the Hire Module.

Hierarchical Rearrangement

The processing here is region-based, so the feature blocks in the module first need to be divided along the H axis and the W axis.

The rearrangement operation here is carried out along two dimensions, one along the height and one along the width.

Inner-region Rearrangement

As can be seen from the figure, the inner-region rearrangement works as follows: along the height direction, adjacent rows (a local strip region) on the H axis are stacked onto the channel dimension C; the same applies along the width direction. By stacking onto the channel dimension, local region features can be processed directly by a channel MLP.
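A minimal sketch of my reading of the height-direction inner-region rearrangement (not the official code): split H into non-overlapping regions of h rows, stack each region's rows onto the channel dimension, apply a channel MLP, then restore the layout; the width direction is symmetric. The region size h and the channel MLP shape are assumptions.

import torch
import torch.nn as nn

def inner_region_rearrange_h(x, h, mlp):
    # x: (B, C, H, W) with H divisible by h; h adjacent rows form one region.
    B, C, H, W = x.shape
    # Split H into H//h regions of h rows and stack the h rows onto channels.
    x = x.reshape(B, C, H // h, h, W)
    x = x.permute(0, 3, 1, 2, 4).reshape(B, h * C, H // h, W)
    # Channel-only mixing now sees all h rows of a region at every position.
    x = mlp(x)
    # Undo the stacking to restore the original (B, C, H, W) layout.
    x = x.reshape(B, h, C, H // h, W).permute(0, 2, 3, 1, 4).reshape(B, C, H, W)
    return x

C, h = 32, 2
# A channel MLP written as 1x1 convolutions (pointwise fully connected layers).
channel_mlp = nn.Sequential(nn.Conv2d(h * C, h * C, 1), nn.GELU(), nn.Conv2d(h * C, h * C, 1))
x = torch.randn(1, C, 8, 8)
print(inner_region_rearrange_h(x, h, channel_mlp).shape)  # torch.Size([1, 32, 8, 8])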

The idea here is interesting.

But if you think about it, this can actually be viewed as a decomposition of convolution. In PyTorch, a convolution implemented with nn.Unfold works in a similar way: the data inside each local window is stacked onto the channel dimension and then passed through a fully connected layer, which is equivalent to a large-kernel convolution.

And here, you can roughly view it as a sliding window without overlap. Perhaps follow-up work will try overlapping forms.

But this way, it’s more like convolution.

>>> # Convolution is equivalent with Unfold + Matrix Multiplication + Fold (or view to output shape)
>>> import torch
>>> inp = torch.randn(1, 3, 10, 12)
>>> w = torch.randn(2, 3, 4, 5)
>>> inp_unf = torch.nn.functional.unfold(inp, (4, 5))
>>> out_unf = inp_unf.transpose(1, 2).matmul(w.view(w.size(0), -1).t()).transpose(1, 2)
>>> out = torch.nn.functional.fold(out_unf, (7, 8), (1, 1))
>>> # or equivalently (and avoiding a copy),
>>> # out = out_unf.view(1, 2, 7, 8)
>>> (torch.nn.functional.conv2d(inp, w) - out).abs().max()
tensor(1.9073e-06)

Another point is that the processing of a local square window is decomposed into two one-dimensional strip windows along the H and W directions, i.e., a k×k window is split into a 1×k and a k×1 window.
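Continuing the convolution analogy, the k×k to 1×k plus k×1 split is the same idea as a spatially separable convolution. A tiny illustration (the kernel size and channel count are arbitrary, and this is only the convolutional analogue, not Hire-MLP's actual operator):

import torch
import torch.nn as nn

k, C = 5, 16
square = nn.Conv2d(C, C, kernel_size=(k, k), padding=k // 2)
strip_w = nn.Conv2d(C, C, kernel_size=(1, k), padding=(0, k // 2))   # 1 x k
strip_h = nn.Conv2d(C, C, kernel_size=(k, 1), padding=(k // 2, 0))   # k x 1

x = torch.randn(1, C, 32, 32)
# Stacking the two 1-D ops covers a k x k receptive field with far fewer parameters.
print(square(x).shape, strip_h(strip_w(x)).shape)  # both torch.Size([1, 16, 32, 32])
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(square), params(strip_w) + params(strip_h))  # 6416 vs 2592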

It seems that the many designs explored for convolutional models have almost exhausted the basic building blocks of network structures (^__^).

Cross-region Rearrangement

For the cross-region rearrangement, the features are circularly shifted as a whole (torch.roll) along the H or W axis. This operation may not seem useful by itself, but when the inner-region processing designed earlier is applied after the rearrangement, cross-window interaction between local regions is achieved exactly.
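A minimal sketch of how the cross-region shift could wrap around the inner-region step (reusing the hypothetical inner_region_rearrange_h from the sketch above; the shift step s is an assumption):

import torch

def cross_region_rearrange_h(x, s, h, mlp):
    # Circularly shift all rows by s so that each region now mixes rows that
    # originally belonged to neighbouring regions, then restore the order.
    x = torch.roll(x, shifts=s, dims=2)      # shift along H
    x = inner_region_rearrange_h(x, h, mlp)  # inner-region mixing on the shifted features
    x = torch.roll(x, shifts=-s, dims=2)     # shift back to the original positions
    return x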

However, there is an issue to note here. The processing of local regions is only applied after the window features are shifted; local processing of the features before the shift is not considered. A more reasonable pipeline might be: process inside the windows -> shift the window features -> process inside the windows -> shift the window features back to their original positions -> (optionally) process inside the windows again. Such interleaved processing seems to cover a wider spatial scope, whereas now the area handled by each window is always fixed.

The experimental results

The ablation experiments mainly discussed the following points:

  • Number of windows the feature is divided into: by default, the same window width is used along the H axis and the W axis. Smaller windows place more emphasis on local information. In the experiments, a larger window width is used empirically in shallower layers to obtain a larger receptive field.

As you can see, performance degrades as the window width increases. The authors speculate that as the region size increases, some information may be lost in the bottleneck structure.

  • Discussion of step size s for cross-window offset.

As you can see, a larger offset of the shallow window will work better. Perhaps it is because increasing the receptive field of shallow features can bring some benefits.

  • Discussion of different padding forms. For a 224 input, the feature size at stage 4 is 7×7, which cannot be evenly divided into windows. Therefore, padding is needed for the non-overlapping window setting used in this paper. Several strategies are discussed here.

  • The importance of different branches in the Hire Module.

As you can see, the processing inside regions is very important. This is understandable: if there were no such local operation, a simple cross-window shift would not mean anything to the channel MLP either, because the channel MLP is a pointwise (per-token) operation (see the small check after this list).

  • Different forms of cross-window communication.

This compares the shift (which preserves some adjacency, i.e., relative position information) with a ShuffleNet-style inter-group shuffle. You can see that relative position information is still quite important.
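A quick numerical check of the point about the channel MLP being pointwise: a pure channel MLP commutes with a circular shift, so shifting alone gives it nothing new; only after rows or columns are stacked onto the channels (the inner-region step) does the shift change what the MLP sees. The sizes below are arbitrary.

import torch
import torch.nn as nn

channel_mlp = nn.Sequential(nn.Conv2d(16, 16, 1), nn.GELU(), nn.Conv2d(16, 16, 1))
x = torch.randn(1, 16, 8, 8)

a = channel_mlp(torch.roll(x, shifts=2, dims=3))   # shift, then channel-mix
b = torch.roll(channel_mlp(x), shifts=2, dims=3)   # channel-mix, then shift
print((a - b).abs().max())                         # ~0: the shift alone adds nothing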

Link

  • Hire-MLP: Vision MLP via Hierarchical Rearrangement: arxiv.org/pdf/2108.13…