
Preface

This article presents two experiments that demonstrate the influence of padding on deep learning models.

Experiment 1

Convolution is translation-equivariant: shift the input image by 1 pixel, and the output image shifts by 1 pixel (see Figure 1). If we apply global average pooling to the output (here implemented as a sum over all pixel values, which differs from the average only by a constant factor), we get a translation-invariant model: no matter how we translate the input image, the output remains the same.

In PyTorch, the model looks like this: y = torch.sum(conv(x), dim=(2, 3)), with input x and output y.
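
A minimal, runnable sketch of this model (assuming one input channel, a single 3×3 convolution with “same” padding, and no bias; these details are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Single 3x3 convolution with "same" padding (padding=1), no bias.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

def model(x):
    # x: batch of grayscale images, shape (N, 1, H, W); returns shape (N, 1)
    return torch.sum(conv(x), dim=(2, 3))
```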

Figure 1: Top: Input image with one white pixel (original and a version shifted by 1 pixel). Middle: Convolution kernel. Bottom: Output image and its pixel sum.

Can you use this model to detect the absolute position of pixels in an image?

For a translation-invariant model like the one described, this should be impossible.

Let’s train this model to classify images containing a single white pixel: if the pixel is in the upper-left corner, the model should output 1, otherwise 0. Training converges quickly, and testing the binary classifier on some images shows that it detects the pixel position perfectly (see Figure 2).

Figure 2: Top: Input image and classification results. Bottom: Output image and pixel sum.
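
Below is a hedged sketch of this training setup; the 10×10 image size, balanced batches, Adam optimizer, and BCE loss are assumptions for illustration, not details taken from the original article:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
opt = torch.optim.Adam(conv.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

def make_batch(n=64, size=10):
    # Images with a single white pixel; label is 1 iff the pixel
    # is in the upper-left corner.
    x = torch.zeros(n, 1, size, size)
    y = torch.zeros(n, 1)
    for i in range(n):
        if i % 2 == 0:   # positive example: pixel in the upper-left corner
            r = c = 0
            y[i] = 1.0
        else:            # negative example: pixel anywhere else
            r, c = 0, 0
            while r == 0 and c == 0:
                r, c = torch.randint(0, size, (2,)).tolist()
        x[i, 0, r, c] = 1.0
    return x, y

for step in range(300):
    x, y = make_batch()
    logits = torch.sum(conv(x), dim=(2, 3))  # the model from above
    loss = loss_fn(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```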

How does the model learn to classify absolute pixel positions? This is only possible because of the type of padding we used:

  1. Figure 3 shows the convolution kernel after a few epochs of training
  2. With “same” padding (used in many models), the kernel center is moved across all image pixels (implicitly assuming that pixel values outside the image are 0)
  3. This means that the right column and bottom row of the kernel never “touch” the upper-left pixel of the image (otherwise the kernel center would have to move outside the image)
  4. However, when moving across the image, the right column and/or bottom row of the kernel touch every other pixel
  5. Our model exploits this difference in how pixels are processed
  6. Only positive (yellow) kernel values are applied to the upper-left white pixel, which yields a positive sum
  7. For all other pixel positions, the strongly negative kernel values (blue, green) are also applied, which yields a negative sum

Figure 3: The 3×3 convolution kernel.

Although the model should be translation-invariant, it is not. The problem arises near the image boundary and is caused by the type of padding used.
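
To make the mechanism concrete, here is a hand-built kernel in the spirit of Figure 3 (the values are assumptions, not the actual trained weights): positive values in the upper-left region and strongly negative values in the right column and bottom row, which under “same” padding can never touch the corner pixel:

```python
import torch
import torch.nn.functional as F

# Positive (1) weights in the upper-left 2x2 block, strongly negative (-5)
# weights in the right column and bottom row.
kernel = torch.tensor([[[[ 1.,  1., -5.],
                         [ 1.,  1., -5.],
                         [-5., -5., -5.]]]])

def score(r, c, size=10):
    # Summed conv output for an image whose only white pixel is at (r, c).
    x = torch.zeros(1, 1, size, size)
    x[0, 0, r, c] = 1.0
    return F.conv2d(x, kernel, padding=1).sum().item()

print(score(0, 0))  # corner pixel: only the positive weights reach it -> 4.0
print(score(5, 5))  # interior pixel: all weights reach it -> -21.0
```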

Experiment 2

Does an input pixel’s influence on the output depend on its absolute position?

Let’s try again with a black image containing a single white pixel. The image is fed into a single convolutional layer whose kernel weights are all set to 1 and whose bias is set to 0. The influence of an input pixel is measured by summing the pixel values of the output image. With “valid” padding, the full kernel always stays within the bounds of the input image; “same” padding is as defined above.

Figure 4 shows the influence of each input pixel. For “valid” padding, the result is as follows:

  1. There is only one kernel position at which the kernel touches the corner pixel of the image, and the corner’s influence value of 1 reflects this
  2. For each edge pixel, the 3×3 kernel touches it from three positions
  3. For an interior pixel, there are nine kernel positions at which the kernel touches the pixel

Figure 4: A single convolution layer applied to a 10×10 image. Left: “same” padding. Right: “valid” padding.

Pixels near the boundary have much less influence on the output than central pixels, which may cause the model to fail when relevant image details lie near the boundary. With “same” padding the effect is less severe, but there are still fewer “paths” from boundary input pixels to the output.
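
A minimal sketch that reproduces these influence counts, using a single all-ones 3×3 kernel and zero bias (padding=0 corresponds to “valid”, padding=1 to “same”):

```python
import torch
import torch.nn.functional as F

kernel = torch.ones(1, 1, 3, 3)  # all weights 1, no bias

def influence(size=10, padding=0):
    # influence[r, c] = summed output response to a single white pixel
    # at (r, c), i.e. the number of kernel positions that touch it.
    out = torch.zeros(size, size)
    for r in range(size):
        for c in range(size):
            x = torch.zeros(1, 1, size, size)
            x[0, 0, r, c] = 1.0
            out[r, c] = F.conv2d(x, kernel, padding=padding).sum()
    return out

print(influence(padding=0))  # "valid": corners 1, edges 3, interior 9
print(influence(padding=1))  # "same": corners 4, edges 6, interior 9
```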

The final experiment (see Figure 5) starts with a 28×28 input image (for example, from the MNIST dataset) and feeds it into a neural network with five convolutional layers (a simple MNIST classifier might look like this). In particular, with “valid” padding there are now large image regions that the model almost completely ignores.

Figure 5: Five convolution layers applied to a 28×28 image. Left: “same” padding. Right: “valid” padding.
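
A sketch of this multi-layer version, stacking five of the same all-ones convolutions as an assumed stand-in for the five-layer classifier:

```python
import torch
import torch.nn.functional as F

kernel = torch.ones(1, 1, 3, 3)

def influence(size=28, padding=0, layers=5):
    # Influence of each input pixel on the summed output after
    # several stacked convolutions.
    out = torch.zeros(size, size)
    for r in range(size):
        for c in range(size):
            x = torch.zeros(1, 1, size, size)
            x[0, 0, r, c] = 1.0
            for _ in range(layers):
                x = F.conv2d(x, kernel, padding=padding)
            out[r, c] = x.sum()
    return out

# With "valid" padding, a corner pixel has a single path to the output,
# while a fully interior pixel has 9**5 = 59049 paths.
print(influence(padding=0)[:4, :4])
print(influence(padding=1)[:4, :4])
```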

Conclusion

These two experiments show that the choice of padding matters: a poor choice can lead to poor model performance. For more details, see the following papers, which also propose solutions to the problem:

1. Mind the Pad — CNNs Can Develop Blind Spots

2. On Translation Invariance in CNNs: Convolutional Layers can Exploit Absolute Spatial Location

By Harald Scheidl

Compiled by: CV Technical Guide

Original link: harald-scheidl.medium.com/does-paddin…
