Convolutional neural networks (CNNs) have greatly improved the performance of computer vision and brought about a revolution in the field. However, as is well known,

if the input image is changed or shifted by even a single pixel, the output of a CNN can change greatly, which makes CNNs easy targets for adversarial attacks.

Today we share a paper from ICML 2019 dedicated to this loss of shift invariance in CNNs: Making Convolutional Networks Shift-Invariant Again.

Loss of translation invariance

The root cause is downsampling. Whether it is Max Pooling, Average Pooling, or Strided Convolution, any operation with a stride larger than 1 destroys translation invariance.

For a simple example, take the one-dimensional input sequence 00110011.

MaxPooling with shift = 0:

MaxPooling with the input shifted left by 1:

Shifting by just one pixel produces a hugely different MaxPooling result. Downsampling is the main culprit behind CNNs' loss of translation invariance.
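The effect can be verified with a few lines of numpy (a minimal sketch; `max_pool_1d` is my own helper, and the shifted input is wrapped circularly so both calls see all 8 samples):

```python
import numpy as np

def max_pool_1d(x, kernel=2, stride=2):
    """Naive 1-D max pooling: max over each window, step by stride."""
    return np.array([x[i:i + kernel].max()
                     for i in range(0, len(x) - kernel + 1, stride)])

x = np.array([0, 0, 1, 1, 0, 0, 1, 1])

print(max_pool_1d(x))               # shift = 0 -> [0 1 0 1]
print(max_pool_1d(np.roll(x, -1)))  # shift = 1 -> [1 1 1 1]
```

A one-pixel shift flips the pooled output from [0, 1, 0, 1] to [1, 1, 1, 1].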

A strict definition of translation invariance and translation equivariance

The previous section gives only an intuitive feel for how sensitive downsampling is to translation. Strictly speaking, what it demonstrates is a failure of translation equivariance, not of translation invariance. Let us give precise definitions of both.

Let X denote an image with resolution H × W; then:

An L-layer CNN can be represented by the following feature extractor:

Each Feature Map F can be upsampled to the same resolution as X:

1) Translation equivariance

If shifting the input shifts the output identically, the mapping is translation-equivariant.
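Written out, with $\mathrm{Shift}_{\Delta h,\Delta w}$ denoting the shift operator and $\widetilde{F}$ the upsampled feature map (notation adapted from the paper; treat the exact symbols as an approximation):

```latex
\mathrm{Shift}_{\Delta h,\Delta w}\big(\widetilde{F}(X)\big)
  = \widetilde{F}\big(\mathrm{Shift}_{\Delta h,\Delta w}(X)\big)
  \qquad \forall\, (\Delta h, \Delta w)
```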

Note that F in the above formula is upsampled and has the same resolution as the input image X.

2) Translation invariance

If shifting the input leaves the output unchanged, the mapping is translation-invariant.
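In the same adapted notation (again an approximation of the paper's formula), invariance drops the shift on the output side entirely:

```latex
\widetilde{F}\big(\mathrm{Shift}_{\Delta h,\Delta w}(X)\big)
  = \widetilde{F}(X)
  \qquad \forall\, (\Delta h, \Delta w)
```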

Note that F in the above formula is upsampled and has the same resolution as the input image X.

For the example in the previous section, let us compare shift = 0 with shift = 1 and upsample the results:

Obviously, both translation equivariance and translation invariance are lost.

Now compare shift = 0 with shift = 2:

This clearly preserves translation equivariance but not translation invariance. Translation equivariance holds whenever the number of shifted pixels is a multiple of the pooling stride.
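A quick numpy check of this claim (my own sketch, using circular shifts): shifting the input by the full stride of 2 simply shifts the pooled output by one position.

```python
import numpy as np

def max_pool_1d(x, kernel=2, stride=2):
    """Naive 1-D max pooling."""
    return np.array([x[i:i + kernel].max()
                     for i in range(0, len(x) - kernel + 1, stride)])

x = np.array([0, 0, 1, 1, 0, 0, 1, 1])

out0 = max_pool_1d(x)               # [0 1 0 1]
out2 = max_pool_1d(np.roll(x, -2))  # input shifted by the full stride of 2
# The result equals the shift-0 output shifted by one: equivariance holds.
print(np.array_equal(out2, np.roll(out0, -1)))  # True
```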

Now modify the example from the previous section:

This clearly preserves both translation invariance and translation equivariance.

In a typical CNN, as long as the early layers maintain translation equivariance, the final classification is generally translation-invariant, so the rest of this article focuses on translation equivariance.

3) Measuring translation equivariance

Ideally, translation equivariance is preserved exactly; but when it is not, how do we measure the degree of violation?

Specifically, the author uses cosine similarity:

The value range is [-1.0, 1.0]; the larger the value, the better the translation equivariance.
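For instance, a minimal cosine-similarity measure (my own helper, not the paper's code) applied to the two pooled outputs from the earlier example, [0, 1, 0, 1] vs. [1, 1, 1, 1]:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened feature maps."""
    a = np.ravel(a).astype(float)
    b = np.ravel(b).astype(float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pooled outputs for shift = 0 and shift = 1 from the earlier example.
print(round(cosine_similarity([0, 1, 0, 1], [1, 1, 1, 1]), 2))  # 0.71
```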

Anti-aliasing

Traditional downsampling methods such as MaxPooling, Average Pooling, and Strided Convolution all break translation equivariance. The textbook remedy is anti-aliasing with a low-pass filter. Average Pooling has anti-aliasing built in, but there was no elegant way to apply it to MaxPooling or Strided Convolution. The author decomposes the pooling operation, makes a small modification, and arrives at an elegant solution.

Taking MaxPooling as an example, it can be decomposed into two equivalent steps, as shown in the following figure:

MaxPool/2 = Max/1 + Subsampling/2

The author proposes inserting an anti-aliasing Blur operation between Max/1 and Subsampling/2:

MaxBlurPool/2 = Max/1 + anti-aliasing Blur/1 + Subsampling/2

The last two steps together are called BlurPool:

BlurPool/2 = anti-aliasing Blur/1 + Subsampling/2

MaxPooling, Average Pooling, and Strided Convolution can all be modified into anti-aliased versions in the same way.

Anti-aliasing effect

Here is the result of MaxPool/2 and MaxBlurPool/2 with shift = 0:

Here is the result of MaxPool/2 and MaxBlurPool/2 with shift = 1:

Between shift = 0 and shift = 1, the MaxPool/2 cosine similarity is 0.71.

Between shift = 0 and shift = 1, the MaxBlurPool/2 cosine similarity is 0.95.
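These two numbers can be reproduced with a short numpy sketch (my own illustration, assuming circular shifts and circular padding, and the Tri-3 blur kernel [1, 2, 1]/4):

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_pool(x, stride=2):
    """MaxPool/2: max over windows of 2, stride 2, circular padding."""
    n = len(x)
    return np.array([max(x[i], x[(i + 1) % n]) for i in range(0, n, stride)])

def max_blur_pool(x, stride=2, kernel=(0.25, 0.5, 0.25)):
    """MaxBlurPool/2 = Max/1 + anti-aliasing Blur/1 + Subsampling/2."""
    n = len(x)
    dense = np.array([max(x[i], x[(i + 1) % n]) for i in range(n)])   # Max/1
    blurred = np.array([sum(kernel[j] * dense[(i + j - 1) % n]
                            for j in range(3)) for i in range(n)])    # Blur/1
    return blurred[::stride]                                          # Subsampling/2

x = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
x1 = np.roll(x, -1)  # input shifted by one pixel

print(round(cos_sim(max_pool(x), max_pool(x1)), 2))            # 0.71
print(round(cos_sim(max_blur_pool(x), max_blur_pool(x1)), 2))  # 0.95
```

Under these assumptions, plain MaxPool/2 gives [0, 1, 0, 1] vs. [1, 1, 1, 1], while MaxBlurPool/2 gives the much closer [0.5, 1, 0.5, 1] vs. [0.75, 0.75, 0.75, 0.75].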

As can be seen, the translation equivariance of BlurPool with anti-aliasing is greatly improved.

Results on datasets

1) Results on ImageNet

The following figure shows accuracy and consistency on the ImageNet dataset after adding anti-aliasing. Consistency is the proportion of inputs whose predicted class stays the same under two different shifts. After adding anti-aliasing, no matter which blur kernel is chosen, both accuracy and consistency clearly increase over the baseline, which is unexpected: one would expect accuracy to be sacrificed, yet it increases instead. The kernels in the figure are:

Rect-2: [1, 1]

Tri-3: [1, 2, 1]

Bin-5: [1, 4, 6, 4, 1]

All kernels must be normalized, i.e., divided by the sum of their elements.
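The normalization is just a division by the element sum, for example:

```python
import numpy as np

# Blur kernels from the paper, normalized so each sums to 1.
for name, k in {"Rect-2": [1, 1],
                "Tri-3": [1, 2, 1],
                "Bin-5": [1, 4, 6, 4, 1]}.items():
    k = np.array(k, dtype=float)
    print(name, k / k.sum())
# Tri-3, for instance, becomes [0.25 0.5 0.25].
```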

2) Measuring translation equivariance of each layer in VGG

The following figure shows, for each feature map in VGG on CIFAR10, the translation-equivariance measure (cosine similarity) for every possible shift. Since CIFAR10 images are 32×32, there are only 32×32 possible shifts, so each result is a 32×32 image. Blue indicates perfect translation equivariance; red indicates a larger violation.

Clearly, translation equivariance is periodic, with period equal to the cumulative downsampling rate; each downsampling step halves the fraction of shifts for which equivariance holds. It is also clear that anti-aliasing significantly improves translation equivariance.

3) Comparison of different kernels on CIFAR10

In the figure below, the author examines accuracy and consistency on CIFAR10 under different network architectures (DenseNet and VGG), different blur kernels, and with and without data augmentation.

It can be clearly seen that data augmentation matters greatly to the non-anti-aliased poolings, and that anti-aliased pooling is especially effective when no data augmentation is used.

Conclusion

Replacing plain pooling with BlurPool effectively improves translation invariance and model accuracy, while the added computation is essentially negligible. It is indeed a very elegant solution, and it can be expected to see large-scale adoption and become a standard component of future models.