Preface

Semantic segmentation methods mainly adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder gradually reduces spatial resolution and learns more abstract/semantic visual concepts through a larger receptive field. Since context modeling is critical to segmentation, recent work has focused on enlarging the receptive field with dilated/atrous convolutions or by inserting attention modules. However, the encoder-decoder FCN architecture itself remains unchanged. In the paper introduced here, the authors aim to provide an alternative by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, they deploy a pure Transformer (i.e., no convolution and no resolution reduction) to encode the image as a sequence of patches. With the global context modeled in every Transformer layer, this encoder can be combined with a simple decoder to provide a powerful segmentation model called SEgmentation TRansformer (SETR).

Paper: arxiv.org/abs/2012.15…

01

The network structure

Firstly, the image is decomposed into a grid of fixed-size patches to form a patch sequence. The pixel vector of each patch is fed into a linear embedding layer, yielding a sequence of feature embedding vectors that serve as the input to the Transformer. Given the features learned by the Transformer encoder, a decoder is then used to restore the original image resolution. The key point is that there is no spatial downsampling: global context is modeled in every encoder layer, i.e. the encoder is realized entirely with the attention mechanism, which provides a new perspective for semantic segmentation.

The proposed model is essentially a ViT + decoder structure, as shown in the figure below. The ViT paper is linked here for readers who want more detail: arxiv.org/abs/2010.11…

Firstly, the input image needs to be turned into a sequence. Processing the image pixel by pixel would require far too much computation, so the authors adopt a patch-wise approach: an H × W × 3 image is flattened and serialized into HW/256 patches of size 16 × 16 × 3, i.e. an H/16 × W/16 grid. The Transformer input sequence therefore has length L = (H/16) × (W/16). Each vectorized patch is passed through a linear projection to obtain the embedding vector e_i, and the input to the Transformer layers is then:

E = {e_1 + p_1, e_2 + p_2, … , e_L + p_L}

where e_i is the patch embedding and p_i is the position embedding.
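As a rough illustration, a minimal PyTorch sketch of this patchification and embedding step might look like the following. The patch size 16 and embedding dimension 1024 follow the description above; the 512 × 512 input size, the class name and the use of a strided convolution (equivalent to flattening each patch and applying a shared linear projection) are my own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches, linearly project each one,
    and add a learnable position embedding (sketch of the step above)."""
    def __init__(self, img_size=512, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # L = (H/16) * (W/16)
        # Strided conv == per-patch flatten + shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        e = self.proj(x)                       # (B, C, H/16, W/16)
        e = e.flatten(2).transpose(1, 2)       # (B, L, C) -- the e_i vectors
        return e + self.pos_embed              # E = {e_i + p_i}

# Example: a 512x512 image yields a sequence of 32*32 = 1024 patch tokens.
tokens = PatchEmbedding()(torch.randn(1, 3, 512, 512))
print(tokens.shape)  # torch.Size([1, 1024, 1024])
```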

The serialized image is then fed into the Transformer, as shown below.

Each Transformer layer consists of multi-head self-attention (MSA), layer normalization (LN) and MLP blocks. The input E obtained in the previous step is passed through 24 stacked Transformer layers, producing the outputs {Z_1, Z_2, …, Z_Le}. Since every token attends to every other token, the receptive field of each Transformer layer is the whole image.
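As a sketch, such a 24-layer encoder can be approximated with PyTorch's built-in pre-norm Transformer encoder layer (LN + MSA + MLP with residual connections). The hyperparameters below (width 1024, 16 heads, 4096-dim MLP) match the ViT-Large configuration the paper builds on, but this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# One layer: LN -> multi-head self-attention -> LN -> MLP, with residuals;
# 24 such layers are stacked to form the encoder.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=1024, nhead=16, dim_feedforward=4096,
    activation="gelu", batch_first=True, norm_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=24)

tokens = torch.randn(1, 1024, 1024)   # (B, L, C) from the patch embedding step
z = encoder(tokens)                   # same shape; every token attends to all
print(z.shape)                        # others, so the receptive field is global
```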

Finally, the authors propose three decoder designs:

  • Naive upsampling (Naive)

The Transformer output features are first reduced in dimension to the number of classes by a 2-layer head (1 × 1 conv + sync batch norm (w/ ReLU) + 1 × 1 conv); the original resolution is then restored through bilinear upsampling.
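A minimal sketch of this Naive head is given below. The 1024 input channels and 19 Cityscapes classes come from the text; the 256-channel intermediate width is my assumption, and plain BatchNorm2d stands in for the synchronized version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveHead(nn.Module):
    """1x1 conv -> BN + ReLU -> 1x1 conv to class scores, then one bilinear
    upsample back to the input resolution (sketch of the Naive decoder)."""
    def __init__(self, in_channels=1024, num_classes=19):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1))

    def forward(self, z2d, out_size):          # z2d: (B, C, H/16, W/16)
        logits = self.cls(z2d)
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)

head = NaiveHead()
print(head(torch.randn(1, 1024, 32, 32), (512, 512)).shape)  # (1, 19, 512, 512)
```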

  • Progressive UPsampling (PUP)

To recover from H/16 × W/16 × 1024 to H × W × 19 (19 being the number of classes in Cityscapes), four steps that alternate a convolution layer with a 2× upsampling operation are applied to restore the original resolution, as shown in the figure below.
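A rough sketch of this progressive-upsampling head, assuming a 256-channel intermediate width and 3 × 3 convolutions (both my assumptions); the exact block layout in the authors' code may differ.

```python
import torch
import torch.nn as nn

class PUPHead(nn.Module):
    """Alternate conv and 2x bilinear upsampling four times to go from
    H/16 x W/16 back to H x W (sketch of the PUP decoder)."""
    def __init__(self, in_channels=1024, channels=256, num_classes=19):
        super().__init__()
        blocks = []
        for d in [in_channels, channels, channels, channels]:
            blocks += [nn.Conv2d(d, channels, 3, padding=1),
                       nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                       nn.Upsample(scale_factor=2, mode="bilinear",
                                   align_corners=False)]
        self.up = nn.Sequential(*blocks)       # 4 x (conv + 2x upsample) = 16x
        self.cls = nn.Conv2d(channels, num_classes, 1)

    def forward(self, z2d):                    # z2d: (B, 1024, H/16, W/16)
        return self.cls(self.up(z2d))          # (B, 19, H, W)

print(PUPHead()(torch.randn(1, 1024, 32, 32)).shape)  # (1, 19, 512, 512)
```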

  •  Multi-Level feature Aggregation (MLA)

First, the Transformer outputs {Z_1, Z_2, …, Z_Le} are evenly divided into M groups, and one feature map is taken from each group. As shown in the figure below, the outputs of the 24 Transformer layers are divided into 4 groups and the last output of each group is taken, i.e. {Z_6, Z_12, Z_18, Z_24}. The subsequent decoder operates only on these selected features.

Specifically, each selected Z_l is first reshaped from 2D (HW/256 × C) back to 3D (H/16 × W/16 × C), then passed through a 3-layer convolution (1 × 1, 3 × 3, 3 × 3) and 4× bilinear upsampling, with top-down fusion across the streams. The last Z_l in the figure below therefore, in principle, carries the information of all three preceding features. After fusion and a 3 × 3 convolution, the output is restored to the original resolution by bilinear interpolation.
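The following sketch shows one way to wire this multi-level aggregation head together, following the description above. The 256-channel intermediate width, the direction of the top-down addition, and the channel-wise concatenation before the final convolution are assumptions on my part; the authors' exact fusion order may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLAHead(nn.Module):
    """Sketch of the MLA decoder: each selected Transformer output is reshaped
    to H/16 x W/16, passed through 1x1 / 3x3 / 3x3 convs with top-down
    (element-wise) fusion, upsampled 4x, concatenated, and a final 3x3 conv
    plus bilinear interpolation restores the full resolution."""
    def __init__(self, in_channels=1024, channels=256, num_classes=19, num_streams=4):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(in_channels, channels, 1) for _ in range(num_streams))
        self.refine = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(num_streams))
        self.cls = nn.Conv2d(channels * num_streams, num_classes, 3, padding=1)

    def forward(self, feats, out_size):        # feats: list of (B, 1024, H/16, W/16)
        reduced = [r(f) for r, f in zip(self.reduce, feats)]
        # Top-down aggregation: each stream accumulates the deeper streams.
        for i in range(len(reduced) - 2, -1, -1):
            reduced[i] = reduced[i] + reduced[i + 1]
        streams = [F.interpolate(ref(x), scale_factor=4, mode="bilinear",
                                 align_corners=False)
                   for ref, x in zip(self.refine, reduced)]
        fused = torch.cat(streams, dim=1)      # channel-wise concatenation
        logits = self.cls(fused)
        return F.interpolate(logits, size=out_size, mode="bilinear",
                             align_corners=False)

feats = [torch.randn(1, 1024, 32, 32) for _ in range(4)]   # {Z_6, Z_12, Z_18, Z_24}
print(MLAHead()(feats, (512, 512)).shape)                  # (1, 19, 512, 512)
```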

02

The experimental results

Experiments on Cityscapes, ADE20K and PASCAL Context show that SETR outperforms traditional FCN-based feature extraction (with and without attention modules). Qualitative comparisons with FCN on the ADE20K dataset are shown below:

Qualitative comparisons with FCN on the PASCAL Context dataset are shown below:

The results on the three datasets are summarized in the following table:

END