Convolution and pooling

convolution

Convolution only changes the depth of the image (the output depth equals the number of convolution kernels); it does not change the width and height of the image (in VGG, 3×3 convolutions with stride 1 and padding 1 preserve the spatial size).

Input image → convolution kernel → feature map. Each output pixel is a weighted sum of the input pixels inside a local window of the input image, with the weights given by the kernel.

Each window therefore produces one pixel of the feature map. The spatial size of the feature map after a convolution is: out_size = (in_size − F_size + 2P) / S + 1, where F_size is the size of the convolution kernel, P is the padding, and S is the stride.
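As a quick check of this formula, here is a minimal Python sketch (my own illustration, not part of the original text):

```python
# Evaluates out_size = (in_size - F_size + 2P) / S + 1 for a convolution layer.

def conv_output_size(in_size: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial size of the feature map produced by a convolution."""
    return (in_size - kernel_size + 2 * padding) // stride + 1

# A 224x224 input, 3x3 kernel, padding 1, stride 1 keeps the size at 224.
print(conv_output_size(224, kernel_size=3, padding=1, stride=1))  # 224
# A 224x224 input, 7x7 kernel, no padding, stride 2 gives 109.
print(conv_output_size(224, kernel_size=7, padding=0, stride=2))  # 109
```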

pooling

Pooling only changes the width and height of the image, not its depth.

Pooling: Adjacent pixels in the image tend to have similar values, so usually adjacent output pixels of the convolution layer also have similar values, which means that most of the information contained in the output of the convolution layer is redundant.

Pooling: reduce the number of outputs by shrinking the spatial size of the feature map.

Common pooling methods: Max pooling and Mean pooling

In traditional computer vision, a Gaussian blur is usually applied before feature extraction to make the extracted features robust to small translations. For this reason, early CNNs usually adopted mean pooling. Max pooling was later found to work better (the extreme value is usually the feature we care about, and max pooling adds nonlinearity, improving the expressive power of the network) and is also faster, so max pooling became the standard choice in later networks. However, the features extracted by such a CNN are not strictly translation invariant; one remedy is to add a blur operation.
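Below is a small illustrative example (assuming PyTorch; not from the original text) that contrasts the two pooling methods on the same feature map:

```python
# Max pooling vs. mean (average) pooling on a 4x4 feature map, 2x2 window, stride 2.
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [1., 1., 0., 0.],
                  [1., 1., 0., 4.]]).reshape(1, 1, 4, 4)  # (N, C, H, W)

print(F.max_pool2d(x, kernel_size=2, stride=2))  # keeps the strongest response per window
print(F.avg_pool2d(x, kernel_size=2, stride=2))  # keeps the average response per window
```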

The paper reference arxiv.org/abs/1904.11…

2 Background

Back in 1989, Yann LeCun (now a professor at New York University) and his colleagues published work on Convolutional Neural Networks (CNNs). For a long time, CNNs achieved the world's best results on small-scale problems such as handwritten digit recognition, but did not achieve broader success.

In 2012, Alex Krizhevsky and Hinton entered the ILSVRC 2012 competition with AlexNet, the first CNN to successfully apply tricks such as ReLU, Dropout, and LRN. AlexNet also used GPUs to accelerate computation. It carried forward LeNet's ideas, applying the basic principles of CNNs to a deep and wide network, and won ILSVRC 2012 by a wide margin.

In 2014, the VGG network was proposed. Building on AlexNet, it uses smaller convolution kernels and a deeper network, and achieves better results.

VGG was proposed in 2014 by the Visual Geometry Group in the Department of Engineering Science at the University of Oxford. Its main contribution is to show that increasing the depth of the network can, to some extent, improve the network's final performance. VGG has two common configurations, VGG16 and VGG19, which differ only in depth. Compared with AlexNet (2012), an important improvement of VGG is the use of stacks of consecutive 3×3 convolution kernels in place of the large kernels in AlexNet (11×11, 7×7, and 5×5). Two stacked 3×3 convolutions with stride 1 have the same receptive field as a single 5×5 convolution. Stacked small kernels are preferred over one large kernel because the extra layers add nonlinearity, allowing the network to learn more complex functions, while the small kernels use fewer parameters.

The VGG network is a deep CNN with all the usual components of a CNN and is often used to extract image features. It uses 3×3 convolutions to study the effect of increasing depth, pushing the depth to 16-19 weight layers.

It achieved excellent results on the localization and classification tasks, and generalizes well to other datasets.

The paper mainly studies depth: to this end, the other parameters of the architecture are fixed, and the depth of the network is steadily increased by adding more convolutional layers.

The ConvNet architecture is studied with depth as the single controlled variable: the other parameters are fixed, and the depth of the network is steadily increased by adding more convolution layers, which is feasible because very small (3×3) convolution filters are used in all layers.

Ciresan et al. had previously used small convolution filters, but their networks were much shallower than VGG's and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. applied deep ConvNets (11 weight layers) to a street-number recognition task and showed that increased depth led to better performance. A contemporaneous architecture uses 1×1 and 5×5 convolutions in addition to 3×3; however, its network topology is more complex than VGG's, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to cut computation. A 1×1 convolution is essentially a linear projection within the same channel space (with the same number of input and output channels); it adds nonlinearity only through the activation that follows it.
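To make the "1×1 convolution is a linear projection" point concrete, here is a hedged PyTorch sketch (my own, not from the paper) showing that a 1×1 convolution is equivalent to applying the same linear map at every pixel:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)                      # (N, C_in, H, W)
conv1x1 = nn.Conv2d(64, 64, kernel_size=1, bias=False)

# The same projection expressed as a Linear layer applied at every spatial location.
linear = nn.Linear(64, 64, bias=False)
linear.weight.data = conv1x1.weight.data.view(64, 64)

y_conv = conv1x1(x)
y_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(y_conv, y_lin, atol=1e-5))   # True
```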

The paper

VGG network highlights

Compared with AlexNet, the highlights are: 1. replacing one 5×5 convolution kernel with a stack of two 3×3 convolution kernels; 2. replacing one 7×7 convolution kernel with a stack of three 3×3 convolution kernels.

Benefit: Can reduce the number of parameters

Suppose the input and output feature maps both have depth (channel count) C. Parameters needed for one 7×7 convolution kernel: 7 × 7 × C × C = 49C². Parameters needed for three stacked 3×3 convolution kernels: 3 × (3 × 3 × C × C) = 27C².

So one 7×7 convolution uses (49 − 27) / 27 ≈ 81% more parameters than three stacked 3×3 convolutions.
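The counts can be verified with a short PyTorch sketch (my own; C = 64 is an arbitrary choice, biases are ignored):

```python
import torch.nn as nn

C = 64
one_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
three_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)])

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(one_7x7))    # 7*7*C*C     = 49*C*C = 200704
print(n_params(three_3x3))  # 3*(3*3*C*C) = 27*C*C = 110592
```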

Calculation of receptive field

Receptive field: the size of the region in the input that a single unit on the output feature map corresponds to. It is computed layer by layer, from the output backwards: F(i) = (F(i+1) − 1) × stride + ksize, where F(i) is the receptive field of layer i, stride is the stride of layer i, and ksize is its kernel size.

For example, starting from one unit of the final output (F = 1): after one 3×3 conv, F = (1 − 1) × 1 + 3 = 3; after a second 3×3 conv, F = (3 − 1) × 1 + 3 = 5, so two stacked 3×3 kernels cover the same input region as one 5×5 kernel; after a third 3×3 conv, F = (5 − 1) × 1 + 3 = 7, so three stacked 3×3 kernels replace one 7×7 kernel.
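A small helper (my own sketch) that walks this recursion backwards over a list of (kernel size, stride) pairs:

```python
def receptive_field(layers):
    """layers: list of (ksize, stride), ordered from the first layer to the last."""
    rf = 1  # one unit on the final feature map
    for ksize, stride in reversed(layers):
        rf = (rf - 1) * stride + ksize
    return rf

# Two stacked 3x3, stride-1 convolutions cover a 5x5 input region ...
print(receptive_field([(3, 1), (3, 1)]))          # 5
# ... and three of them cover a 7x7 region, matching a single 7x7 kernel.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```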

architecture

1. Fixed-size 224×224 RGB input. 2. The only preprocessing is subtracting the mean RGB value (computed on the training set) from each pixel. 3. Filters with a very small receptive field are used: 3×3 (the smallest size that can capture the notions of left/right, up/down, and centre). 4. 1×1 convolution filters are also used, which can be viewed as a linear transformation of the input channels (followed by a nonlinearity). 5. The convolution stride is fixed at 1 pixel; the spatial padding is chosen so that the spatial resolution is preserved after convolution, i.e. padding of 1 pixel for 3×3 convolutions. 6. Spatial pooling is carried out by five max-pooling layers, performed over a 2×2 pixel window with stride 2. 7. These are followed by three fully connected layers (FC 4096-4096-1000): the first two have 4096 channels each, and the third performs the 1000-way ILSVRC classification (one channel per class). The final layer is soft-max. The configuration of the fully connected layers is the same in all networks.
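Putting the configuration together, here is a compact PyTorch sketch of the VGG16 (configuration D) layout as described above; it is written from this description rather than copied from a reference implementation, and names such as `features`/`classifier` are my own choices:

```python
import torch
import torch.nn as nn

# Numbers are output channels of 3x3 convolutions; 'M' marks a 2x2, stride-2 max pool.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg):
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = make_features(VGG16_CFG)          # 13 conv layers
        self.classifier = nn.Sequential(                  # 3 FC layers
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)       # 224x224 input -> 512x7x7 feature map
        x = torch.flatten(x, 1)    # flatten before the fully connected layers
        return self.classifier(x)  # raw class scores; soft-max is applied outside

print(VGG16()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```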

All hidden layers use the ReLU nonlinearity. The network does not use local response normalization (LRN), because such normalization does not improve performance on the ILSVRC dataset but increases memory consumption and computation time.

The convolution layers are denoted as "conv⟨receptive field size⟩-⟨number of channels⟩". For brevity, the ReLU activation is not shown.

For example, conv3-64 denotes 64 convolution kernels of size 3×3. conv3: kernel size 3×3, stride = 1, padding = 1. maxpool: pool size 2×2, stride = 2.

The configurations differ only in depth: from the 11 weight layers of network A (8 conv. and 3 FC layers) to the 19 weight layers of network E (16 conv. and 3 FC layers).

Network D (VGG16): conv3 layers 2 + 2 + 3 + 3 + 3 = 13, plus 3 FC layers = 16 weight layers.

Max-pooling and soft-max are not counted as weight layers.

Between the last max-pooling layer and the fully connected layers there is a flatten operation, which unfolds the multi-dimensional feature map into a one-dimensional vector for the fully connected layers to process. The first two fully connected layers are each followed by ReLU and Dropout.

The network can be viewed as two parts: everything before the first fully connected layer is the feature-extraction structure, and the three fully connected layers plus soft-max form the classification structure.
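As a usage example of this two-part view, torchvision's VGG16 exposes the feature-extraction part as `model.features` and the classification part as `model.classifier` (untrained weights here, purely for illustration):

```python
import torch
from torchvision.models import vgg16

model = vgg16()                                           # untrained VGG16
feature_map = model.features(torch.randn(1, 3, 224, 224))  # feature-extraction part only
print(feature_map.shape)                                  # torch.Size([1, 512, 7, 7])
```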

Network training

The training procedure generally follows Krizhevsky et al. (2012), except for the sampling of input crops from multi-scale training images. Training is carried out by optimizing the multinomial logistic regression objective using mini-batch gradient descent with momentum (based on back-propagation). The batch size is 256 and the momentum is 0.9. Training is regularized by weight decay (L2 penalty multiplier set to 5·10⁻⁴) and by dropout regularization of the first two fully connected layers (dropout ratio 0.5). The learning rate starts at 0.01 and is decreased whenever the validation-set accuracy stops improving; it is decreased 3 times in total, and training stops after 370,000 iterations (74 epochs).
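A hedged sketch of this optimisation setup in PyTorch (hyper-parameter names follow PyTorch, not the paper; the scheduler's `patience` value is my own choice):

```python
import torch
from torchvision.models import vgg16

model = vgg16()  # dropout 0.5 is already part of the torchvision classifier
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 each time the monitored validation accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=2)

# Inside the training loop (batch size 256 in the paper):
#   loss = criterion(model(images), labels); loss.backward(); optimizer.step()
#   scheduler.step(val_accuracy)  # once per validation pass
```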

It is conjectured that, despite having more parameters and greater depth than AlexNet, the networks required fewer epochs to converge due to (a) the implicit regularization imposed by the greater depth and smaller conv. filter sizes, and (b) the pre-initialization of certain layers.

The initialization of the network weights is important because poor initialization can stall learning due to the instability of the gradient in the deep network.

Network A is shallow enough to be trained with random initialization. When training the deeper architectures, the first four convolution layers and the last three fully connected layers are initialized with the layers of network A (the intermediate layers are initialized randomly).
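A minimal sketch of this pre-initialisation idea (my own; `shallow_net` and `deep_net` are placeholders for a trained network A and a deeper configuration):

```python
import torch.nn as nn

def copy_layer(src: nn.Module, dst: nn.Module) -> None:
    """Copy the weights/biases of one layer into a layer of identical shape."""
    dst.load_state_dict(src.state_dict())

# e.g. transfer the first convolution layer and the final classifier layer,
# leaving all other layers of the deeper network at their random initial values:
# copy_layer(shallow_net.features[0], deep_net.features[0])
# copy_layer(shallow_net.classifier[-1], deep_net.classifier[-1])
```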

Image processing

To obtain the fixed-size 224×224 ConvNet input images, they are randomly cropped from the rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops undergo random horizontal flipping and a random RGB colour shift.

S is the smallest side of the rescaled training image, and the crop size is 224×224. Although the crop size is fixed at 224×224, in principle S can take any value not smaller than 224: for S = 224, the crop captures whole-image statistics, completely spanning the smallest side of the training image; for S much larger than 224, the crop corresponds to a small part of the image, containing a small object or an object part.

Two approaches are used for setting the training scale S. The first is to fix S, which corresponds to single-scale training; two fixed scales are evaluated, S = 256 and S = 384. The network is first trained with S = 256; to speed up training of the S = 384 network, it is initialized with the weights of the S = 256 network and trained with a smaller initial learning rate of 10⁻³.

The second approach is multi-scale training, where each training image is individually rescaled by randomly sampling S from the range [256, 512]. This is beneficial because objects in images can be of different sizes.
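A hedged sketch of this multi-scale augmentation using torchvision transforms (my own; the RGB colour shift is omitted, and the transform should be re-created per image, e.g. inside a dataset's `__getitem__`):

```python
import random
from torchvision import transforms

def multi_scale_train_transform(s_min=256, s_max=512):
    s = random.randint(s_min, s_max)          # training scale S for this image
    return transforms.Compose([
        transforms.Resize(s),                 # shorter side rescaled to S
        transforms.RandomCrop(224),           # fixed-size ConvNet input
        transforms.RandomHorizontalFlip(),
    ])
```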

test

Testing procedure: 1. rescale the image to the test scale Q; 2. dense evaluation: convert the fully connected layers to convolution layers; 3. average-pool the class score map; 4. augment with horizontal flips and average the soft-max scores.

At test time, given a trained ConvNet and an input image, classification proceeds as follows. First, the image is rescaled so that its smallest side equals a predefined size, denoted Q (the test scale); note that Q need not equal the training scale S. The network is then applied densely over the rescaled test image: the fully connected layers are first converted to convolution layers (the first FC layer to a 7×7 convolution layer, the last two FC layers to 1×1 convolution layers), and the resulting fully convolutional network is applied to the whole (uncropped) image. The result is a class score map with a number of channels equal to the number of classes and a variable spatial resolution that depends on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (average-pooled). The test set is also augmented by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores.
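The FC-to-convolution conversion can be sketched as follows (assuming torchvision's VGG16 layout, where the three Linear layers sit at classifier indices 0, 3 and 6; weights here are untrained, so only the shapes are meaningful):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16().eval()
fc1, fc2, fc3 = model.classifier[0], model.classifier[3], model.classifier[6]

# FC1 -> 7x7 convolution, FC2/FC3 -> 1x1 convolutions, reusing the FC weights.
conv_fc1 = nn.Conv2d(512, 4096, kernel_size=7)
conv_fc1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
conv_fc1.bias.data = fc1.bias.data

conv_fc2 = nn.Conv2d(4096, 4096, kernel_size=1)
conv_fc2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
conv_fc2.bias.data = fc2.bias.data

conv_fc3 = nn.Conv2d(4096, 1000, kernel_size=1)
conv_fc3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
conv_fc3.bias.data = fc3.bias.data

fully_conv = nn.Sequential(model.features, conv_fc1, nn.ReLU(inplace=True),
                           conv_fc2, nn.ReLU(inplace=True), conv_fc3)

with torch.no_grad():
    score_map = fully_conv(torch.randn(1, 3, 384, 384))  # larger, uncropped test image
    scores = score_map.mean(dim=(2, 3))                  # average-pool to one score per class
print(score_map.shape, scores.shape)  # torch.Size([1, 1000, 6, 6]) torch.Size([1, 1000])
```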

First, we note that using local response normalization (A-LRN network) does not improve model A, which does not have any normalization layer. Therefore, we do not use normalization in the deeper architecture (B-E).

Secondly, we observed that the classification error decreased with the ConvNet depth: from 11 layers in A to 19 layers in E.

It should be noted that configuration C (which contains three 1×1 conv layers) performs worse than configuration D, which uses 3×3 conv layers throughout the network (D is better than C), despite having the same depth. This shows that while the additional nonlinearity does help (C is better than B), it is also important to capture spatial context with convolutions that have non-trivial receptive fields. At 19 layers the error rate of the architecture saturates, but even deeper models might benefit from larger datasets. It is also shown that a deep net with small filters outperforms a shallow net with larger filters.

Scale jitter at test time leads to better performance. The deepest configurations (D and E) perform best, and training with scale jitter is better than training with a fixed smallest side S. The best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (highlighted in bold in the table above); on the test set, configuration E achieves 7.3% top-5 error. The table above also shows that multi-crop evaluation is slightly better than dense evaluation, and that combining the two is better than either alone (multi-crop and dense evaluation are complementary). Multi-crop uses 150 crops per image (5×5 crops × 2 flips × 3 scales = 150). Overall, deep ConvNets significantly outperform the previous generation of models.

Development analysis

bottleneck

VGG consumes more computing resources and uses more parameters (roughly 140 million), resulting in a larger memory footprint. Most of the parameters come from the first fully connected layer. Moreover, simply increasing the depth of a neural network makes training harder, giving rise to problems such as vanishing gradients and failure to converge.

Future development direction

The introduction of VGG let researchers see how network depth affects results and encouraged them to explore deeper networks. In addition, the intermediate layers of a VGG network extract effective features from the input image, so a trained VGG model is often used alongside the loss function (as a perceptual, feature-space loss) to compensate for the over-smoothing caused by a plain L2 loss.
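For illustration, a minimal sketch (my own, not from the text) of using VGG's intermediate features as a perceptual term next to an L2 loss; the cut-off at `features[:16]` and the weight 0.1 are arbitrary choices:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

vgg_features = vgg16().features[:16].eval()  # frozen feature extractor (pretrained weights in practice)
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_l2_loss(prediction, target, weight=0.1):
    pixel_loss = F.mse_loss(prediction, target)                                  # plain L2 term
    feature_loss = F.mse_loss(vgg_features(prediction), vgg_features(target))    # VGG feature term
    return pixel_loss + weight * feature_loss
```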