VGG takes its name from the Visual Geometry Group, the University of Oxford laboratory where the paper's authors worked. VGG introduced the idea that deep models can be built by repeating simple base blocks.

VGG block

A VGG block is composed as follows: several convolutional layers with a 3×3 window shape and padding of 1 are applied in succession, followed by a max pooling layer with a 2×2 window shape and stride of 2. The convolutional layers keep the height and width of the input unchanged, while the pooling layer halves them.
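This composition rule can be sketched as a small helper function (the name `vgg_block` is illustrative, not part of the PyTorch API):

```python
import torch
import torch.nn as nn

def vgg_block(num_convs, in_channels, out_channels):
    # Several 3x3 convolutions with padding 1 (height and width preserved),
    # each followed by ReLU, then a 2x2 max pooling with stride 2
    # that halves the height and width.
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)
```

Feeding a 32×32 input through such a block yields a 16×16 output with the new channel count, matching the halving behavior described above.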

With each VGG block, the height and width of the input are halved and the number of output channels is doubled. Let the number of output channels be CO, the number of input channels CI, the height and width of the input H and W, the height and width of the convolution kernel KH and KW, the padding 1, and the stride 1. The shape of the convolution kernel is then CO × CI × KH × KW, the output shape is (H − KH + 3) × (W − KW + 3), and the number of multiplications in the cross-correlation is CO × CI × KH × KW × (H − KH + 3) × (W − KW + 3).

With KH = KW = 3 and padding 1, the output keeps the H × W shape, so the number of multiplications is proportional to H × W × CI × CO. VGG's design of halving the height and width while doubling the channels therefore keeps the parameter size and computational cost of most convolutional layers roughly the same: halving H and W divides the cost by four, while doubling both CI and CO multiplies it by four.
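The trade-off can be checked with a short calculation (the helper name `conv_mults` and the example sizes are illustrative):

```python
def conv_mults(c_in, c_out, h, w, k=3):
    # Multiplications of a k x k convolution with padding 1 and stride 1:
    # the output keeps the h x w shape, so the count is c_out*c_in*k*k*h*w.
    return c_out * c_in * k * k * h * w

# Halving the height/width while doubling both channel counts leaves the
# cost unchanged: (2*CI)*(2*CO) * (H/2)*(W/2) == CI*CO*H*W.
before = conv_mults(64, 64, 112, 112)
after = conv_mults(128, 128, 56, 56)
print(before == after)  # the two layers cost the same
```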

VGG network

The VGG network consists of a convolutional module followed by a fully connected module. The convolutional module, built from VGG blocks, first extracts spatial features; the fully connected module then outputs the classification result. The following network uses eight convolutional layers and three fully connected layers, and is therefore known as VGG-11.
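A minimal sketch of this construction, assuming a 1-channel 224×224 input and 10 output classes (the names `make_vgg11` and `conv_arch` are illustrative, and `nn.Flatten` stands in for the custom FlattenLayer in the printout below):

```python
import torch
import torch.nn as nn

# (number of convolutions, output channels) per block;
# 1+1+2+2+2 = 8 convolutional layers, plus 3 linear layers = VGG-11.
conv_arch = [(1, 64), (2, 128), (2, 256), (2, 512), (2, 512)]

def make_vgg11(in_channels=1, num_classes=10):
    blocks = []
    for num_convs, out_channels in conv_arch:
        layers = []
        for _ in range(num_convs):
            layers.append(nn.Conv2d(in_channels, out_channels,
                                    kernel_size=3, padding=1))
            layers.append(nn.ReLU())
            in_channels = out_channels
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        blocks.append(nn.Sequential(*layers))
    # A 224x224 input is halved five times to 7x7, so 512*7*7 = 25088
    # features enter the fully connected part.
    fc = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),
    )
    return nn.Sequential(*blocks, fc)
```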

Sequential(
  (vgg_block1): Sequential(
    (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (vgg_block2): Sequential(
    (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (vgg_block3): Sequential(
    (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (vgg_block4): Sequential(
    (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (vgg_block5): Sequential(
    (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Sequential(
    (0): FlattenLayer()
    (1): Linear(in_features=25088, out_features=4096, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU()
    (6): Dropout(p=0.5, inplace=False)
    (7): Linear(in_features=4096, out_features=10, bias=True)
  )
)

Design principles

In a two-dimensional convolutional layer, the region of the input that an element x of the output array depends on is called the receptive field of x. In Figure 1, the four elements in the shaded part of the input are the receptive field of the shaded element of the output. Denote the 2×2 output in Figure 1 as Y, and consider a deeper convolutional network: cross-correlating Y with another 2×2 convolution kernel produces a single output element z. The receptive field of z on Y then includes all four elements of Y, while its receptive field on the input includes all nine input elements. Thus a deeper convolutional network broadens the receptive field of a single element in the feature map, allowing it to capture larger-scale features of the input.
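The growth of the receptive field under stacking can be computed directly (the function name `receptive_field` is illustrative):

```python
def receptive_field(kernel_sizes):
    # For stride-1 convolutions stacked in sequence, each k x k layer
    # widens the receptive field by k - 1 along each axis.
    r = 1
    for k in kernel_sizes:
        r += k - 1
    return r

# Two stacked 2x2 cross-correlations, as in the Figure 1 example:
# z sees a 3x3 = nine-element region of the original input.
print(receptive_field([2, 2]))  # 3

# Two stacked 3x3 convolutions cover the same region as one 5x5 convolution.
print(receptive_field([3, 3]) == receptive_field([5]))  # True
```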

The receptive field of the network's entire output is the entire input, but as depth increases, the receptive field of each individual element grows. When designing the network, stacked small convolution kernels are preferable to a single large one: two stacked 3×3 layers cover the same 5×5 region as one 5×5 layer, yet use fewer parameters and add an extra nonlinearity, and the greater depth lets the network learn more complex patterns.
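The parameter saving can be verified with a quick count (the helper name `conv_params` and the 64-channel example are illustrative; biases are ignored):

```python
def conv_params(c_in, c_out, k):
    # Weight count of a single k x k convolution layer (bias ignored).
    return c_out * c_in * k * k

c = 64
# Two stacked 3x3 layers keeping c channels: 2 * 9 * c^2 = 18c^2 weights.
stacked = conv_params(c, c, 3) + conv_params(c, c, 3)
# One 5x5 layer with the same receptive field: 25c^2 weights.
single = conv_params(c, c, 5)
print(stacked < single)  # True: 18c^2 < 25c^2
```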