If you want to see the formula directly, you can skip to section 3

1. Why is SPP needed

You need to know why SPP is needed in the first place.

As we all know, a convolutional neural network (CNN) is composed of convolutional layers and fully connected layers. The convolutional layers place no requirement on the size of the input data; the only size requirement comes from the first fully connected layer. For this reason, basically all CNNs require a fixed input size. For example, the famous VGG model requires the input size to be (224×224).

There are two problems with fixed input data sizes:

1. The data obtained in many scenarios are not of fixed size. For example, the aspect ratio of street-view text is basically never fixed.


2. You might say you could crop the image, but cropping is likely to lose important information.

To sum up, SPP was proposed to remove the constraint that a CNN's input image must be of fixed size, so that the aspect ratio and size of the input image can be arbitrary.

2. SPP principle

More detailed principles can be found in the original paper: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

The diagram given in the original paper needs to be viewed from the bottom up:

  • First comes the input image, which can be of any size
  • The image then passes through the convolutional layers; the last convolutional layer outputs the feature maps of this layer, whose size is also arbitrary
  • Next, enter the SPP layer
    • Start with the 16 blue squares on the far left: they indicate that the obtained feature map is divided into 16 parts. The 256 in 16×256 is the number of channels, i.e. each of the 256 channels is divided into 16 parts (not necessarily equal parts, for reasons that will become clear later)
    • The same holds for the 4 small green squares in the middle and the 1 large purple square on the right: the feature map is divided into 4×256 and 1×256 portions respectively

So what is done with each part after the feature map has been divided up? Each part is pooled; SPP usually uses max pooling.

As shown in the figure above, after passing through the SPP layer the feature map is transformed into a 16×256 + 4×256 + 1×256 = 21×256 matrix, which can be flattened into a one-dimensional vector of length 21×256 = 5376 when fed into the fully connected layers, so the input size of the first fully connected layer can be fixed at 5376. This solves the problem of arbitrary input data sizes.

Note that the number of parts above can be set according to the situation; for example, we could also use a 3×3 division, etc., but it is generally recommended to divide according to the paper.
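To make the arithmetic concrete, here is a minimal sketch in plain Python (the levels and channels values simply mirror the 4×4, 2×2, 1×1 pyramid of the figure) of how the fixed output length is obtained:

    levels = [4, 2, 1]   # pyramid levels from the figure: 4x4, 2x2 and 1x1 bins
    channels = 256       # channels of the last convolutional layer
    bins = sum(n * n for n in levels)   # 16 + 4 + 1 = 21
    print(bins * channels)              # 21 * 256 = 5376

Whatever the input image size, the SPP layer always produces this same length, which is exactly what the first fully connected layer needs.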

3. SPP formula

The theory should now be clear, but how is it actually computed? The calculation formula given in the paper is introduced below. Before that, two rounding symbols and the formula for the size of a matrix after pooling are introduced:

1. Prerequisite knowledge

Rounding symbols:

  • ⌊⌋: round down, e.g. ⌊59/60⌋ = 0; sometimes written as floor()

  • ⌈⌉: round up, e.g. ⌈59/60⌉ = 1; sometimes written as ceil()

The size of the matrix after pooling is calculated as follows, where h_in and w_in are the input height and width, k is the kernel size, and s is the stride:

  • Without a stride (i.e. s = 1): (h_in − k + 1) × (w_in − k + 1)
  • With a stride: ⌊(h_in − k)/s + 1⌋ × ⌊(w_in − k)/s + 1⌋
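As a quick sanity check, here is a minimal sketch of the strided formula; pool_output_size is a hypothetical helper, not part of any library:

    import math

    def pool_output_size(h_in, w_in, k, s):
        # floor((input - kernel) / stride) + 1, per dimension
        return (math.floor((h_in - k) / s) + 1,
                math.floor((w_in - k) / s) + 1)

    print(pool_output_size(7, 11, k=3, s=2))  # (3, 5)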

2. The formula

Assume that:

  • The input data size is (c, h_in, w_in), representing the channel number, height, and width respectively
  • The pooled quantity is: (n, n)

Then we have:

  • Kernel size: (⌈h_in/n⌉, ⌈w_in/n⌉)
  • Stride size: (⌊h_in/n⌋, ⌊w_in/n⌋)

We can verify this by assuming that the input size is (c, 7, 11) and the pooled quantity is (2, 2).

Then the kernel size is (⌈7/2⌉, ⌈11/2⌉) = (4, 6), the stride is (⌊7/2⌋, ⌊11/2⌋) = (3, 5), and the pooled matrix size is indeed (2, 2), since ⌊(7 − 4)/3 + 1⌋ = 2 and ⌊(11 − 6)/5 + 1⌋ = 2.
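The same check can be run in PyTorch as a small sketch (the channel count of 10 is an arbitrary choice):

    import math
    import torch
    import torch.nn.functional as F

    h_in, w_in, n = 7, 11, 2
    kernel = (math.ceil(h_in / n), math.ceil(w_in / n))    # (4, 6)
    stride = (math.floor(h_in / n), math.floor(w_in / n))  # (3, 5)

    x = torch.randn(1, 10, h_in, w_in)
    print(F.max_pool2d(x, kernel_size=kernel, stride=stride).shape)
    # torch.Size([1, 10, 2, 2])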

3. Formula modification

Yes, the formula given in the paper has some omissions, so let's illustrate with an example.

Suppose the input data size is the same as above, (c, 7, 11), but the pooled quantity is changed to (4, 4).

Now the kernel size is (⌈7/4⌉, ⌈11/4⌉) = (2, 3) and the stride is (⌊7/4⌋, ⌊11/4⌋) = (1, 2), so the pooled matrix size is actually (6, 5) [a simple way to see the pooled size: 7 = 2 + 1×5, so 6 windows fit along the height; 11 = 3 + 2×4, so 5 windows fit along the width] rather than (4, 4).
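In PyTorch, the mismatch shows up directly (again a sketch, with 10 channels chosen arbitrarily):

    import math
    import torch
    import torch.nn.functional as F

    h_in, w_in, n = 7, 11, 4
    kernel = (math.ceil(h_in / n), math.ceil(w_in / n))    # (2, 3)
    stride = (math.floor(h_in / n), math.floor(w_in / n))  # (1, 2)

    x = torch.randn(1, 10, h_in, w_in)
    print(F.max_pool2d(x, kernel_size=kernel, stride=stride).shape)
    # torch.Size([1, 10, 6, 5]) -- not (4, 4)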

So what’s the problem?

We ignored the padding (I did not see a formula for the padding in the original paper; if there is one and I simply misread it, please point out where it is written, thank you).

It is easy to see that the calculation formulas above contain no padding term. After using SPP a number of times and getting results that differed from what was expected, and after searching various online materials (of which there are very few), I summarize the calculation formulas with padding added as follows.

K_h = ⌈h_in / n⌉
S_h = ⌈h_in / n⌉
P_h = ⌊(K_h × n − h_in + 1) / 2⌋

K_w = ⌈w_in / n⌉
S_w = ⌈w_in / n⌉
P_w = ⌊(K_w × n − w_in + 1) / 2⌋

The size of the pooled matrix is then:

⌊(h_in + 2×P_h − K_h)/S_h + 1⌋ × ⌊(w_in + 2×P_w − K_w)/S_w + 1⌋

where:

  • K_h: the height of the kernel (K_w: its width)
  • S_h: the stride in the height direction (S_w: in the width direction)
  • P_h: the padding in the height direction (P_w: in the width direction); it is multiplied by 2 because padding is added on both sides

Note that the kernel and stride formulas both use ceil(), i.e. rounding up, while the padding uses floor(), i.e. rounding down.

Now check again:

Suppose the input data size is the same as above, (c, 7, 11), and the pooled quantity is (4, 4).

The kernel size is (2, 3) and the stride size is (2, 3), so the padding is (1, 1).

Using the matrix size formula ⌊(h_in + 2×P_h − K_h)/S_h + 1⌋ × ⌊(w_in + 2×P_w − K_w)/S_w + 1⌋, the size of the pooled matrix is indeed (4, 4): ⌊(7 + 2 − 2)/2 + 1⌋ = 4 and ⌊(11 + 2 − 3)/3 + 1⌋ = 4.
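And checked in PyTorch (a sketch under the same assumptions as before):

    import math
    import torch
    import torch.nn.functional as F

    h_in, w_in, n = 7, 11, 4
    k = (math.ceil(h_in / n), math.ceil(w_in / n))   # kernel (2, 3)
    s = (math.ceil(h_in / n), math.ceil(w_in / n))   # stride (2, 3)
    p = (math.floor((k[0] * n - h_in + 1) / 2),      # padding (1, 1)
         math.floor((k[1] * n - w_in + 1) / 2))

    x = torch.randn(1, 10, h_in, w_in)
    print(F.max_pool2d(x, kernel_size=k, stride=s, padding=p).shape)
    # torch.Size([1, 10, 4, 4])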

4. Code Implementation (Python)

Here I use the PyTorch deep learning framework to build an SPP layer with the following code:

#coding=utf-8
import math
import torch
import torch.nn.functional as F

# Build the SPP layer (spatial pyramid pooling layer)
class SPPLayer(torch.nn.Module):
    def __init__(self, num_levels, pool_type='max_pool'):
        super(SPPLayer, self).__init__()
        self.num_levels = num_levels
        self.pool_type = pool_type

    def forward(self, x):
        num, c, h, w = x.size()  # num: number of samples, c: channels, h: height, w: width
        for i in range(self.num_levels):
            level = i + 1
            kernel_size = (math.ceil(h / level), math.ceil(w / level))
            stride = (math.ceil(h / level), math.ceil(w / level))
            pooling = (math.floor((kernel_size[0] * level - h + 1) / 2),
                       math.floor((kernel_size[1] * level - w + 1) / 2))

            # Choose the pooling type
            if self.pool_type == 'max_pool':
                tensor = F.max_pool2d(x, kernel_size=kernel_size, stride=stride,
                                      padding=pooling).view(num, -1)
            else:
                tensor = F.avg_pool2d(x, kernel_size=kernel_size, stride=stride,
                                      padding=pooling).view(num, -1)

            # Flatten each level and concatenate along the feature dimension
            if i == 0:
                x_flatten = tensor.view(num, -1)
            else:
                x_flatten = torch.cat((x_flatten, tensor.view(num, -1)), 1)
        return x_flatten
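A minimal usage sketch (the batch size, channel count, and spatial sizes below are arbitrary choices): inputs with different spatial sizes all come out as vectors of the same length.

    spp = SPPLayer(num_levels=3)        # pyramid levels: 1x1, 2x2, 3x3 bins
    a = torch.randn(2, 256, 13, 13)
    b = torch.randn(2, 256, 7, 11)
    print(spp(a).shape)  # torch.Size([2, 3584])  -> (1 + 4 + 9) * 256
    print(spp(b).shape)  # torch.Size([2, 3584])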

Reference for the above code: sppnet-PyTorch

To prevent the original author from deleting the code, I have forked it; it can also be accessed at the following address: marsggbo/sppnet-Pytorch





Original by MARSGGBO

2018-3-15