
Introduction

A convolutional neural network (CNN) consists of an input layer, convolutional layers, activation functions, pooling layers, and fully connected layers, i.e. INPUT-CONV-RELU-POOL-FC.

(1) Convolutional layer: used for feature extraction, as follows:

The input image is 32×32×3, where 3 is its depth (the R, G, B channels), and the convolutional layer uses a 5×5×3 filter (receptive field). Note that the depth of the filter must be the same as the depth of the input image. Convolving one filter with the input image yields a 28×28×1 feature map; in the figure above, two feature maps are obtained by using two filters.

We usually stack multiple convolutional layers to obtain progressively deeper feature maps, as follows:

The input image values and the corresponding elements of the filter are multiplied and summed, and finally the bias b is added to obtain a feature-map element. As shown in the figure, the first depth slice of filter w0 is multiplied element-wise with the corresponding elements in the blue box of the input image and summed to get 0; the other two depths give 2 and 0, so 0 + 2 + 0 + 1 (the bias b) = 3, which is the first element of the feature map on the right of the figure. After this, the blue box slides across the input image with stride = 2, as follows:
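This per-depth multiply-sum-add-bias step can be sketched in numpy. The values below are illustrative (not the figure's exact numbers), chosen so the three depth slices contribute 0, 2, and 0, and the bias b is 1:

```python
import numpy as np

# One filter position of a convolution: each depth slice of the input patch
# is multiplied element-wise with the matching depth of filter w0, the
# products are summed per depth, and the bias b is added at the end.
patch = np.zeros((3, 3, 3), dtype=int)   # depth x height x width input region
w0 = np.zeros((3, 3, 3), dtype=int)      # filter, same depth as the input

patch[0] = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
w0[0]    = [[1, 0, 0], [0, -1, 0], [0, 0, 0]]   # depth 0 contributes 1 - 1 = 0
patch[1] = [[1, 1, 0], [0, 0, 0], [0, 0, 0]]
w0[1]    = [[1, 1, 0], [0, 0, 0], [0, 0, 0]]    # depth 1 contributes 1 + 1 = 2
# depth 2 is left as zeros, contributing 0

b = 1
per_depth = [int(np.sum(patch[d] * w0[d])) for d in range(3)]
out = sum(per_depth) + b
print(per_depth, out)   # [0, 2, 0] 3
```

Sliding the patch across the whole image with stride 2 repeats this computation at each position to fill in the rest of the feature map.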

As shown above, completing the convolution yields a 3×3×1 feature map. Note also that zero padding adds a border to the image whose elements are all 0; it has no effect on the original input:

F=3 => zero pad with 1

F=5 => zero pad with 2

F=7 => zero pad with 3. The border width is an empirical value; zero padding is added so that the input image and the feature map after convolution have the same spatial size. For example:

The input is 5×5×3, the filter is 3×3×3, and the zero padding is 1. The padded input is then 7×7×3, and the feature map after convolution is 5×5×1 ((7 − 3)/1 + 1 = 5), the same spatial size as the input image.

The size of the feature map is calculated as follows: for input width W, filter size F, zero padding P, and stride S, the output width is (W − F + 2P)/S + 1.
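A minimal helper makes the formula concrete, checked against the two examples above (the function name is mine, for illustration):

```python
def conv_output_size(w: int, f: int, p: int, s: int) -> int:
    """Feature-map side length: (W - F + 2P) / S + 1."""
    assert (w - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (w - f + 2 * p) // s + 1

# Examples from the text:
print(conv_output_size(5, 3, 1, 1))   # 5: padded 5x5 input, 3x3 filter -> same size
print(conv_output_size(32, 5, 0, 1))  # 28: 32x32 input, 5x5 filter -> 28x28 map
print(conv_output_size(7, 3, 0, 2))   # 3: 7x7 input, 3x3 filter, stride 2 -> 3x3 map
```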

Another feature of the convolutional layer is the principle of "weight sharing". The diagram below:

Without this principle, the output would consist of 10 feature maps of 32×32×1, i.e. each feature map has 32 × 32 = 1024 neurons, and each neuron is connected to a 5×5×3 region of the input image, giving 5 × 5 × 3 = 75 connections, i.e. 75 weight parameters, per neuron. The convolutional neural network therefore introduces the "weight sharing" principle: the 75 weight parameters of a feature map are shared by all of its neurons. In this way only 75 × 10 = 750 weight parameters are required, and the bias of each feature map is also shared, so with 10 biases a total of 750 + 10 = 760 parameters are needed.
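The parameter counts above can be reproduced directly (the variable names are mine; the numbers are the text's):

```python
# Ten 5x5x3 filters producing ten 32x32 feature maps, with and without
# weight sharing.
f, c, n_filters = 5, 3, 10        # filter size, input depth, number of filters
out_h = out_w = 32                # each feature map is 32x32 -> 1024 neurons

weights_per_filter = f * f * c              # 75 weights per filter
shared = n_filters * weights_per_filter     # sharing: 750 weights in total
shared_total = shared + n_filters           # + one shared bias per map = 760

# Without sharing, every one of the 1024 neurons per map owns its own 75 weights:
unshared = n_filters * out_h * out_w * weights_per_filter
print(shared_total, unshared)   # 760 768000
```

Weight sharing thus cuts the weight count by a factor of 1024 (one set per feature map instead of one per neuron).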

Supplement:

(1) For a multi-channel image, a 1×1 convolution is in effect a weighted sum of the input image's channels (each channel multiplied by a coefficient), so channels that were originally independent become "connected" together;

(2) Weight sharing applies within each channel of each filter: each depth slice of a filter has its own weights, which are shared across all spatial positions;
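Point (1) can be seen in a short numpy sketch of a 1×1 convolution (shapes and coefficients here are illustrative): each output pixel is a weighted sum of that pixel's channels, so the channels are mixed together.

```python
import numpy as np

np.random.seed(0)
img = np.random.rand(4, 4, 3)          # H x W x C multi-channel input
coeff = np.array([0.5, 0.3, 0.2])      # one 1x1x3 filter = one weight per channel

# 1x1 convolution: collapse the channel axis by a weighted sum.
out = np.tensordot(img, coeff, axes=([2], [0]))   # shape (4, 4)

# Equivalent per-channel view: out == sum_c coeff[c] * img[:, :, c]
check = sum(coeff[c] * img[:, :, c] for c in range(3))
print(out.shape, np.allclose(out, check))   # (4, 4) True
```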

Pooling layer: compresses the input feature map. On the one hand, it makes the feature map smaller and reduces the network's computational complexity; on the other hand, it compresses the features, extracting the main ones, as follows:

There are generally two pooling operations, average pooling (Avg Pooling) and max pooling (Max Pooling), as follows:

Here max pooling uses a 2×2 filter with stride = 2 and takes the maximum value in each region, extracting the main features from the original feature map to obtain the figure on the right.

(Average pooling is less commonly used now; its method is to sum the elements of each 2×2 region and divide by 4 to obtain the main features.) The filter is generally 2×2, at most 3×3, with stride 2, compressing the feature map to 1/4 of its original size.
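A minimal 2×2 max-pooling sketch with stride 2 (the input values are made up for illustration): each 2×2 region of the feature map is reduced to its maximum, shrinking the map to 1/4 of its area.

```python
import numpy as np

def max_pool_2x2(fmap: np.ndarray) -> np.ndarray:
    """Max pooling with a 2x2 window and stride 2 (assumes even H and W)."""
    h, w = fmap.shape
    # Split into 2x2 blocks, then take the max within each block.
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 7],
    [0, 2, 1, 0],
    [8, 3, 4, 2],
])
pooled = max_pool_2x2(fmap)
print(pooled)
# [[6 7]
#  [8 4]]
```

Average pooling would replace `.max(axis=(1, 3))` with `.mean(axis=(1, 3))`, i.e. sum each 2×2 region and divide by 4.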

Note: pooling shrinks the feature map, which may affect the accuracy of the network. Therefore, the depth of the feature maps can be increased to compensate for the reduction (e.g. doubling the depth).

Fully connected layer: flattens and concatenates all features and sends the output values to a classifier (such as a softmax classifier).
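A sketch of this final stage (all sizes here are made up for illustration): flatten the pooled feature maps into a vector, apply a fully connected weight matrix, then softmax to get class probabilities.

```python
import numpy as np

np.random.seed(0)
features = np.random.rand(4, 4, 10)    # final pooled feature maps (H x W x depth)
x = features.reshape(-1)               # flatten into a 160-dim vector

n_classes = 3
W = np.random.randn(n_classes, x.size) * 0.01   # fully connected weights
b = np.zeros(n_classes)                          # biases

logits = W @ x + b
probs = np.exp(logits - logits.max())            # numerically stable softmax
probs /= probs.sum()

print(probs.shape, round(float(probs.sum()), 6))   # (3,) 1.0
```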

The overall structure is roughly as follows:

In addition, in most CNNs the convolutional layers account for a small share of the parameters but a large share of the computation, while the fully connected layers are the opposite. Therefore, when optimizing for computational speed we focus on the convolutional layers, and when pruning parameters and cutting weights the emphasis should be on the fully connected layers.