
Paper | Octave Convolution (OctConv)

Paper: Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

1. Scale-space Theory

Reference: Wikipedia

If the size/scale of the image object we are dealing with is unknown, then we can apply scale-space theory.

The idea is to represent an image at multiple scales, collectively known as its scale-space representation. The image is smoothed by a series of Gaussian filters of different sizes; in this way, we obtain representations of the image at different scales.

Formulation: Suppose the two-dimensional image is $f(x,y)$, and the two-dimensional Gaussian kernel (a one-parameter family indexed by $t$) is $g(x,y;t)=\frac{1}{2\pi t}e^{-\frac{x^2+y^2}{2t}}$. Then the linear scale space is obtained by convolving the two:

$$L(\cdot,\cdot;t) = g(\cdot,\cdot;t) * f(\cdot,\cdot)$$


The variance $t=\sigma^2$ of the Gaussian filter is called the scale parameter.

Intuitively, structures in the image with scale smaller than $\sqrt{t}$ are smoothed away and become indistinguishable. As a result, the larger $t$ is, the stronger the smoothing.

In practice, we only consider a few discrete values of $t \ge 0$. When $t=0$, the Gaussian filter degenerates into an impulse function, so the result of the convolution is the image itself, without any smoothing.

See the figure below:

In fact, other scale spaces can be constructed, but the linear (Gaussian) scale space is the most widely used because it satisfies many nice properties.

The most important property of the scale-space method is scale invariance, which allows us to process image objects of unknown size.

Finally, it should be noted that the construction of the scale space is often accompanied by downsampling. For example, at scale $t=2$ we halve the resolution, i.e. reduce the area to $1/4$. This is also the approach taken in the OctConv paper.
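
To make this concrete, below is a minimal NumPy/SciPy sketch that builds a small Gaussian scale space and applies the factor-2 downsampling mentioned above; it assumes a grayscale image stored as a 2-D array, and the function names and chosen values of $t$ are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, ts=(0.0, 2.0, 8.0)):
    """Build a discrete Gaussian scale space L(., .; t) for a 2-D image.

    t is the variance of the Gaussian, so sigma = sqrt(t); t = 0 leaves
    the image untouched (the kernel degenerates into an impulse).
    """
    return {t: gaussian_filter(image, sigma=np.sqrt(t)) if t > 0 else image.copy()
            for t in ts}

def downsample_by_2(image):
    """Halve the resolution (1/4 of the area) by averaging 2x2 blocks,
    as is typically done together with the t = 2 level."""
    h, w = (image.shape[0] // 2) * 2, (image.shape[1] // 2) * 2
    return image[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# Example: smooth at t = 2, then drop to half resolution.
img = np.random.rand(64, 64)
low = downsample_by_2(scale_space(img)[2.0])   # shape (32, 32)
```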

2. OctConv

The authors argue that not only do images of the natural world contain high- and low-frequency components, but so do the input and output feature maps (channels) of a convolutional layer. The low-frequency component carries the broad structure, such as the penguin's big white belly. Such low-frequency components are spatially redundant, so computation and storage can be saved when encoding them.

Based on these considerations, the authors propose OctConv as a replacement for the traditional (vanilla) convolution. There are two key steps:

The first step is to obtain a two-scale representation of the input channels (or image), called the Octave feature representation. The high-frequency component is the original channel (or image) without Gaussian filtering; the low-frequency component is the channel (or image) obtained with $t=2$ Gaussian filtering. Since the low-frequency component is redundant, the authors set the height/width of the low-frequency channels to half the height/width of the high-frequency channels. In music, an octave is the interval between one pitch and another at half (or double) its frequency; here, "drop an octave" means halving the spatial size of the channels. What fraction of the channels should be low-frequency? A hyperparameter $\alpha \in [0,1]$ is introduced to denote the proportion of low-frequency channels. In the paper, the input low-frequency ratio $\alpha_{in}$ and the output low-frequency ratio $\alpha_{out}$ are set to the same value.

Figure: the penguin's white belly (low frequency) is redundant (top); comparison between vanilla convolution and OctConv (bottom).
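
As a sketch of this first step, the split below assumes a PyTorch tensor `x` of shape (N, C, H, W), uses the low-frequency ratio $\alpha$ to pick the channel counts, and stands in for the Gaussian smoothing + downsampling with a simple average pooling; the function name and channel ordering are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def to_octave(x, alpha=0.25):
    """Split a feature tensor into the Octave representation {X^H, X^L}.

    x:     (N, C, H, W) feature map (or image batch)
    alpha: fraction of the C channels assigned to the low-frequency group
    Returns X_h of shape (N, C - alpha*C, H, W) and
            X_l of shape (N, alpha*C, H/2, W/2).
    """
    c_low = int(alpha * x.shape[1])
    x_l, x_h = x[:, :c_low], x[:, c_low:]
    # Low-frequency channels live at half resolution; average pooling is a
    # cheap surrogate here for Gaussian smoothing followed by downsampling.
    return x_h, F.avg_pool2d(x_l, kernel_size=2)

x_h, x_l = to_octave(torch.randn(1, 64, 32, 32), alpha=0.25)
# x_h: (1, 48, 32, 32), x_l: (1, 16, 16, 16)
```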

Here is the problem: vanilla convolution cannot be applied directly, because the high- and low-frequency channels have different spatial sizes. But we cannot simply upsample the low-frequency channels back to full size either, because then we have changed nothing and saved neither computation nor memory.

Hence the second step: the authors propose a convolution that operates directly on this representation, namely Octave Convolution.

First, some definitions: let the low- and high-frequency components of the input be $X^L$ and $X^H$ respectively, and the low- and high-frequency components of the convolution output be $Y^L$ and $Y^H$ respectively. In the convolution operation, $W^H$ is responsible for building $Y^H$ and $W^L$ is responsible for building $Y^L$.

We want $W^H$ to handle both $X^H \to Y^H$ (denoted $W^{H\to H}$) and $X^L \to Y^H$ (denoted $W^{L\to H}$), that is, $W^H = [W^{H\to H}, W^{L\to H}]$.

Its structure is shown below on the right. $W^{H\to H}$ is a vanilla convolution, because its input and output have the same spatial size. For the $W^{L\to H}$ part, the convolution runs on the half-resolution low-frequency input and the result is then upsampled to the high-resolution grid, so the overall computation is still reduced. $W^L$ is analogous, except that for $W^{H\to L}$ the high-frequency input is downsampled first.
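
To make the partition of the kernels concrete, here is a small shape check, assuming $\alpha_{in}=\alpha_{out}=\alpha$, $c_{in}$ input channels, $c_{out}$ output channels, and a $k\times k$ kernel; the variable names are illustrative, and the forward pass itself is sketched after the formulas below.

```python
c_in, c_out, k, alpha = 64, 128, 3, 0.25
c_in_l,  c_in_h  = int(alpha * c_in),  c_in  - int(alpha * c_in)    # 16, 48
c_out_l, c_out_h = int(alpha * c_out), c_out - int(alpha * c_out)   # 32, 96

# W^H = [W^{H->H}, W^{L->H}] builds Y^H; W^L = [W^{L->L}, W^{H->L}] builds Y^L.
W_hh = (c_out_h, c_in_h, k, k)  # vanilla conv: full resolution in, full resolution out
W_lh = (c_out_h, c_in_l, k, k)  # conv on X^L, result upsampled to the high-res grid
W_ll = (c_out_l, c_in_l, k, k)  # vanilla conv at half resolution
W_hl = (c_out_l, c_in_h, k, k)  # X^H is average-pooled to half resolution first
```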

Concretely, the two outputs are computed as

$$Y^H = f(X^H; W^{H\to H}) + \mathrm{upsample}\big(f(X^L; W^{L\to H}),\, 2\big)$$

$$Y^L = f(X^L; W^{L\to L}) + f\big(\mathrm{pool}(X^H, 2);\, W^{H\to L}\big)$$

where $f(\cdot\,; W)$ denotes a vanilla convolution with kernel $W$, $\mathrm{upsample}(\cdot, 2)$ doubles the spatial resolution, and $\mathrm{pool}(\cdot, 2)$ halves it.

Implementing the downsampling in $W^{H\to L}$ as a strided convolution would fold the two steps into one, but strided convolution causes misalignment between feature maps; therefore, the authors finally choose average pooling, which yields better-aligned sampling results.

The complete process is shown in the figure on the left.

Looking at the whole pipeline, the filtering and the modified convolutions are inserted without breaking the original CNN framework. It is worth noting that convolution on the low-frequency channels has a larger effective receptive field than vanilla convolution. By adjusting the low-frequency ratio $\alpha$, prediction accuracy can be traded off against computational cost.
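
Putting the pieces together, here is a minimal PyTorch sketch of the forward pass above, under the assumption $\alpha_{in}=\alpha_{out}$, using average pooling for $\mathrm{pool}(\cdot,2)$ and nearest-neighbor interpolation for $\mathrm{upsample}(\cdot,2)$; it is a sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv2d(nn.Module):
    """Sketch of Octave Convolution with alpha_in == alpha_out == alpha."""

    def __init__(self, c_in, c_out, k=3, alpha=0.25, padding=1):
        super().__init__()
        c_in_l,  c_in_h  = int(alpha * c_in),  c_in  - int(alpha * c_in)
        c_out_l, c_out_h = int(alpha * c_out), c_out - int(alpha * c_out)
        self.conv_hh = nn.Conv2d(c_in_h, c_out_h, k, padding=padding)  # W^{H->H}
        self.conv_lh = nn.Conv2d(c_in_l, c_out_h, k, padding=padding)  # W^{L->H}
        self.conv_ll = nn.Conv2d(c_in_l, c_out_l, k, padding=padding)  # W^{L->L}
        self.conv_hl = nn.Conv2d(c_in_h, c_out_l, k, padding=padding)  # W^{H->L}

    def forward(self, x_h, x_l):
        # Y^H = f(X^H; W^{H->H}) + upsample(f(X^L; W^{L->H}), 2)
        y_h = self.conv_hh(x_h) + F.interpolate(self.conv_lh(x_l),
                                                scale_factor=2, mode="nearest")
        # Y^L = f(X^L; W^{L->L}) + f(pool(X^H, 2); W^{H->L})
        y_l = self.conv_ll(x_l) + self.conv_hl(F.avg_pool2d(x_h, kernel_size=2))
        return y_h, y_l

# Example: 64 -> 128 channels on a 32x32 map with alpha = 0.25.
oct_conv = OctConv2d(64, 128)
x_h, x_l = torch.randn(1, 48, 32, 32), torch.randn(1, 16, 16, 16)
y_h, y_l = oct_conv(x_h, x_l)   # shapes: (1, 96, 32, 32) and (1, 32, 16, 16)
```

Because the low-frequency paths operate on feature maps with half the height and width, their multiply-add count drops by roughly a factor of four, which is where the saving in computation and memory comes from.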

Experimental results: under a limited computation budget (and with lower memory consumption), OctConv reaches high image-classification accuracy. See the paper for details.

3. Takeaways

  1. We need to help neural networks learn better. In this paper, the scale-space transform and the Octave convolution separate high- and low-frequency components more cleanly and save computation on the low-frequency part. Another example is BN, which lets the network learn the scale and shift parameters $\gamma$ and $\beta$ on its own to re-center and re-scale its features. Neural networks are impressive self-learners, but we still need to make them faster, leaner, and more powerful.
  2. We need to think more about the nature of human vision. The encoding ability of human vision far outstrips existing technology: the retina has on the order of a hundred million photoreceptors (rods and cones), yet the bandwidth of the eye-to-brain pathway is only about 8 Mbps (carried by roughly a million ganglion cells). The low-frequency redundancy considered in this paper is probably one of the main redundancies removed by human visual coding.