The wave of artificial intelligence is sweeping the world, and we hear a lot of terms: artificial intelligence, machine learning, deep learning. This post organizes my notes on Li Hongyi’s course; the reference links are given at the end.

10 — Convolutional Neural networks (CNN)

Overview

Neural networks are part of the research field of artificial intelligence. Among the most popular neural networks today are deep convolutional neural networks (CNNs). Shallow convolutional networks also exist, but they are rarely used because of their limited accuracy and expressiveness. Academia and industry no longer make a special distinction between "CNN" and "deep CNN": the term generally refers to deep convolutional networks, whose depth ranges from a few layers to tens or hundreds.

CNNs have achieved great success in many research areas, such as speech recognition, image recognition, image segmentation, and natural language processing. Although the problems addressed in these fields differ, the applications can all be summarized as:

CNNs can automatically learn features from (often large-scale) data and generalize the results to unknown data of the same type.

The network structure

A basic CNN consists of three building blocks: convolution, activation, and pooling. The output of the CNN is a set of feature maps for each image. For image classification, these features are used as the input of a fully connected neural network (FCN), whose fully connected layers complete the mapping from the input image to the label set, that is, classification. The most important part of the whole process is how the network weights are adjusted iteratively from the training data, that is, the backpropagation algorithm.

A CNN has five kinds of layers:

1. The input layer

Common preprocessing methods in the input layer include:

  • Mean subtraction (zero-centering)
  • Normalization
  • PCA/SVD dimension reduction, etc
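
As a rough illustration of the first two preprocessing steps, the sketch below centers and scales a tiny made-up dataset in plain Python. The data values and the helper names `column_means` and `column_stds` are assumptions for illustration, and "normalization" is interpreted here as dividing each feature by its standard deviation, which is only one common choice.

```python
# Toy "dataset": 4 samples with 3 features each (values are made up).
data = [
    [0.0, 50.0, 100.0],
    [10.0, 60.0, 110.0],
    [20.0, 70.0, 120.0],
    [30.0, 80.0, 130.0],
]

def column_means(rows):
    """Mean of every feature (column) over all samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def column_stds(rows, means):
    """Population standard deviation of every feature."""
    n = len(rows)
    return [(sum((v - m) ** 2 for v in col) / n) ** 0.5
            for col, m in zip(zip(*rows), means)]

# Step 1: mean subtraction, so every feature is centered on 0.
means = column_means(data)
centered = [[v - m for v, m in zip(row, means)] for row in data]

# Step 2: normalization, so every feature has unit standard deviation.
stds = column_stds(data, means)
normalized = [[c / s for c, s in zip(row, stds)] for row in centered]
```

In practice the mean and standard deviation would be computed on the training set only and then reused for validation and test data.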

2. The convolution layer

3. The activation layer

Activation is a nonlinear mapping applied to the output of the convolution layer. Without an activation function, the output of each layer is a linear function of the previous layer’s input. It is then easy to show that, no matter how many layers there are, the output is a linear combination of the inputs, which is equivalent to having no hidden layers at all: this is the original perceptron.
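
The collapse of stacked linear layers can be checked numerically. The sketch below uses two small made-up weight matrices (biases are omitted for brevity) and shows that applying them one after the other gives the same result as a single combined matrix, whereas inserting a ReLU between them does not.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    return [max(0.0, a) for a in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # layer 1: 2 inputs -> 3 units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]    # layer 2: 3 units -> 2 outputs
x = [1.0, -2.0]

# Two linear layers without activation ...
two_layers = matvec(W2, matvec(W1, x))
# ... equal one linear layer whose weights are the product W2 * W1.
one_layer = matvec(matmul(W2, W1), x)

# With a nonlinearity in between, the result can no longer be
# reproduced by that single combined matrix.
with_relu = matvec(W2, relu_vec(matvec(W1, x)))
```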

Common activation functions are:

  • Sigmoid
  • Tanh
  • ReLU
  • Leaky ReLU
  • ELU
  • Maxout
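
For reference, the functions in this list can be written as scalar Python functions. This is a minimal sketch using their usual textbook definitions; the default `alpha` values and the scalar form of `maxout` (a maximum over several affine pieces) are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small slope alpha keeps a nonzero gradient for x < 0.
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # Smoothly approaches -alpha for very negative x.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def maxout(x, weights, biases):
    # Maximum over k affine pieces w_i * x + b_i (scalar input for simplicity).
    return max(w * x + b for w, b in zip(weights, biases))
```

For example, `maxout(x, [1.0, -1.0], [0.0, 0.0])` computes the absolute value of `x`, which shows how Maxout can learn piecewise-linear shapes that ReLU alone cannot.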

Suggestions for the activation layer: try ReLU first, because it trains quickly, although the results may not always be good. If ReLU fails, consider Leaky ReLU or Maxout, which resolve most cases. Tanh tends to work well for text and audio processing. Activation functions will get a separate post later.

4. Pooling layer

Pooling: also known as subsampling or downsampling. It is mainly used to reduce the dimensionality of the features, compress the amount of data and the number of parameters, reduce overfitting, and improve the fault tolerance of the model. The main types are:

  • Max Pooling: takes the maximum value within the pooling window. This is the more common choice
  • Average Pooling: takes the average value within the pooling window

5. Fully connected (FC) layer

After several rounds of convolution + activation + pooling, we finally reach the output stage, where the high-quality feature maps learned by the model are passed to the fully connected layer. Before the fully connected layer, if there are too many neurons and the learning capacity is too strong, overfitting may occur. Dropout, which randomly drops some neurons of the network during training, can be introduced to address this. Local response normalization (LRN), data augmentation, and similar operations can also be used to increase robustness. The fully connected layers themselves can be understood as a simple multi-class neural network (such as a BP neural network), whose final output is obtained through the softmax function. With that, the whole model is complete.
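
To make the softmax step concrete, here is a minimal sketch that turns raw class scores into probabilities. The scores are made up, and subtracting the maximum before exponentiating is a standard numerical-stability trick rather than something the course notes prescribe.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])              # hypothetical scores for 3 classes
predicted_class = probs.index(max(probs))     # index of the most likely class
```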

Rasterization: after pooling (subsampling), the image has become a series of feature maps, whereas a multi-layer perceptron accepts a vector as input. The pixels in these feature maps therefore need to be taken out in order and arranged into a single vector.
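
Rasterization can be sketched in a few lines. The example below flattens two made-up 2×2 feature maps into one vector, reading map by map and row by row; the exact ordering is a convention and differs between frameworks and data layouts.

```python
# Two hypothetical 2x2 feature maps produced by the last pooling layer.
feature_maps = [
    [[1, 2],
     [3, 4]],
    [[5, 6],
     [7, 8]],
]

def flatten(maps):
    """Read every value map by map, row by row, into a single vector."""
    return [value for fmap in maps for row in fmap for value in row]

vector = flatten(feature_maps)   # 2 maps * 2 rows * 2 cols = 8 values
```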

These layers are described as follows:


The input layer

Data input layer: the input layer serves as the input of the network. In this layer we preprocess the data and determine the input shape according to the characteristics of the input data. For 2D convolution in Keras, for example, the data needs to be shaped as (samples, channels, rows, cols) when data_format is channels_first, or as (samples, rows, cols, channels) with the default channels_last, where:

  • samples: the number of samples
  • channels: the depth of the image. A grayscale image has depth 1; a color image is composed of the three RGB channels, so its depth is 3
  • rows: the number of rows in the input matrix
  • cols: the number of columns in the input matrix
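
The two layouts can be summarized in a small helper. `input_shape` is a hypothetical function written for this post, not part of Keras; only the strings `channels_first` and `channels_last` correspond to the values of the Keras `data_format` option.

```python
def input_shape(samples, rows, cols, channels, data_format="channels_last"):
    """Return the 4D shape expected by a 2D convolution for a given layout."""
    if data_format == "channels_first":
        return (samples, channels, rows, cols)
    if data_format == "channels_last":
        return (samples, rows, cols, channels)
    raise ValueError("data_format must be 'channels_first' or 'channels_last'")

# 60000 grayscale images of 28x28 pixels (sizes assumed for illustration).
shape_last = input_shape(60000, 28, 28, 1)
shape_first = input_shape(60000, 28, 28, 1, data_format="channels_first")
```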

To find out how to specify the input shape, check the API documentation or source code of the framework you are using; it explains what the input shape should look like, as the Keras Conv2D source code does.


Convolution layer

Before we talk about convolution, we need to know the following terms and what they mean:

  • filters: the number of filters
  • padding: the amount of padding added around the input
  • stride: the step size
  • kernel_width: the width of the convolution kernel
  • kernel_height: the height of the convolution kernel

Here is an example

The green circle marks the input matrix; it has three layers, so channels is 3

The blue circle marks a convolution kernel, whose width and height are the corresponding kernel_width and kernel_height

The red circles mark the filters; each filter contains several convolution kernels, one per input channel

We need to understand the relationship between filter and channel:

  1. The channels of the original input image depend on the image type, for example 3 for RGB;
  2. The out_channels produced by a convolution depend on the number of filters. These out_channels also serve as the in_channels of the next convolution;
  3. The in_channels of a convolution kernel are, as explained in 2, the out_channels of the previous convolution; for the first convolution they are the channels of the sample image in 1.

Basic operation

Convolution multiplies the elements of the input image and the filter at corresponding positions and sums the products, then moves the filter by the specified stride and repeats the same operation, as shown in the figure below

First, the first filter is a 3×3 matrix. Place this filter at the upper-left corner of the image and take the inner product of the 9 filter values with the 9 image values underneath. Both have 1, 1, 1 on the diagonal, so the inner product is 3. Since the stride is 1, we move 1 position to the right and compute again, which gives -1. After sliding the filter over the whole image, from left to right and top to bottom, we get a new feature map. If this is unclear, see the diagram below
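
The sliding-window computation can be sketched in plain Python. The 6×6 binary image below is an assumed input, since the original figure is not reproduced here; it is chosen so that the first two results are 3 and -1, as in the walkthrough above. The function `conv2d` is a hypothetical helper without padding, not a framework API.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; multiply and sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r, c = i * stride, j * stride      # top-left corner of the window
            row.append(sum(image[r + a][c + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        output.append(row)
    return output

image = [                # assumed 6x6 input image
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
kernel = [               # 3x3 filter with 1s on the diagonal
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]

feature_map = conv2d(image, kernel, stride=1)   # 4x4 result
strided_map = conv2d(image, kernel, stride=2)   # 2x2 result
```

With stride 1 the 6×6 image shrinks to a 4×4 feature map, and the value 3 appears wherever the diagonal pattern of the filter matches the image exactly.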

Padding

Padding adds extra values (typically zeros) around the border of the input feature map.

The red part is the padding that was added; setting padding=? specifies how many layers to add around the border.

Why pad? Without padding, every convolution has the following problems:

  1. The feature map gets smaller and smaller (if the network has 100 convolution layers and each layer shrinks the map, you end up with a very small picture)
  2. A pixel at the edge of the input matrix (left, green shadow) is used in the computation only once, while a pixel in the middle (red shadow) is used many times, which means that information at the image corners and edges is lost

Padding is therefore added to stop the feature map from getting smaller and smaller. At the same time, the convolution kernel now processes edge pixels more than once, so edge information is extracted more fully.
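
A minimal sketch of zero padding follows; `zero_pad` is a hypothetical helper and the 2×2 input is made up. Each unit of `pad` adds one ring of zeros around the matrix.

```python
def zero_pad(matrix, pad):
    """Surround the matrix with `pad` layers of zeros on every side."""
    width = len(matrix[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]               # top rows
    for row in matrix:
        padded.append([0] * pad + list(row) + [0] * pad)     # left/right columns
    padded.extend([0] * width for _ in range(pad))           # bottom rows
    return padded

padded = zero_pad([[1, 2],
                   [3, 4]], pad=1)
```

After padding with one layer, a 3×3 kernel with stride 1 visits the former edge pixels several times and the output keeps the size of the unpadded input.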

Final calculation

Given the size of the input data, the size of the convolution kernel, the padding, and the stride, the size of the output feature map can be determined:

out_width = floor((in_width - kernel_width + 2 * padding) / stride) + 1
out_height = floor((in_height - kernel_height + 2 * padding) / stride) + 1

From the width and height of the input feature map, the width and height of the next output feature map can thus be calculated. The number of input channels of the next layer is the number of filters of this convolution.
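
The size calculation can be expressed with the standard output-size formula for convolutions. The helper name `conv_output_size` is an assumption for illustration; the same function is applied once to the width and once to the height.

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    return (in_size - kernel_size + 2 * padding) // stride + 1

no_padding = conv_output_size(6, 3)                 # 6x6 input, 3x3 kernel
same_size = conv_output_size(6, 3, padding=1)       # one ring of zeros keeps the size
strided = conv_output_size(7, 3, stride=2)          # a larger stride shrinks faster
```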


The activation layer

The convolution layer performs multiple convolution operations on the original image and produces a set of linear responses; the nonlinear activation layer then applies a nonlinear mapping to those results.

The most commonly used nonlinear activation function in neural networks is the ReLU function, defined as follows:

f(x)=max(0,x)

That is, keep the values greater than or equal to 0, and all the other values less than 0 are rewritten as 0.

Why do this? As mentioned above, a value in the convolution’s feature map that is closer to 1 indicates a stronger association with the feature, and a value closer to -1 indicates a weaker one. When extracting features, we discard the weakly associated values directly, which reduces the amount of data and simplifies the computation.

ReLU can be written as relu(x) = max(x, 0), or as a piecewise function:

relu(x) = x, if x > 0
relu(x) = 0, if x <= 0

For x greater than 0 the derivative is exactly 1, so the gradient does not decay. Although ReLU alleviates the vanishing gradient problem, it brings another problem: neuron death. When x is less than 0 the function is hard-saturated and its derivative is exactly zero; once the input of a neuron falls into this region, the neuron stops updating its weights. This phenomenon is called neuron death.
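
The effect of the zero derivative can be seen in a single gradient step. The sketch below models one neuron with one weight, a made-up input, and an upstream gradient of 1; `sgd_step` and all numeric values are assumptions for illustration.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0

def sgd_step(w, b, x, upstream_grad, lr=0.1):
    """One gradient step for the neuron relu(w * x + b)."""
    pre_activation = w * x + b
    local_grad = relu_grad(pre_activation)
    grad_w = upstream_grad * local_grad * x    # chain rule
    grad_b = upstream_grad * local_grad
    return w - lr * grad_w, b - lr * grad_b

# Pre-activation is positive (1.1): the gradient flows and the weights change.
alive = sgd_step(w=0.5, b=0.1, x=2.0, upstream_grad=1.0)
# Pre-activation is negative (-1.1): the gradient is 0 and nothing is updated.
dead = sgd_step(w=-0.5, b=-0.1, x=2.0, upstream_grad=1.0)
```

A neuron whose pre-activation is negative for every training sample stays in the second case and never recovers, which is the motivation for variants such as Leaky ReLU.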

ReLU advantages:

  1. It mitigates vanishing gradients (for x > 0 the derivative is exactly 1, so the gradient does not decay)
  2. The calculation is very simple (only a threshold comparison is needed, and the derivative requires almost no computation)
  3. It produces sparsity (outputs below 0 are set to 0, which makes the intermediate outputs of the network sparse and acts somewhat like Dropout, so it can prevent overfitting to some extent)

Pooling layer

After the convolution operation, we obtain feature maps with different values. Although the amount of data is much smaller than in the original image, it is still too large (in deep learning there are often hundreds of thousands of training images). This is where the pooling operation comes in: its main goal is to reduce the amount of data.

There are two types of pooling: Max Pooling and Average Pooling. As the names imply, max pooling takes the maximum value and average pooling takes the average value.

Taking max pooling as an example, the pooling layer slides a window over its input in the same way as the convolution layer. The only difference is that, instead of performing a convolution at each position, it takes the maximum value within the window (average pooling instead averages all the numbers within the window).

The pooling layer determines the size of the output feature map from the size of the convolution output, the pooling window size, the stride, and the padding:

out_size = floor((in_size - pool_size + 2 * padding) / stride) + 1
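
Both pooling variants can be sketched with one function. The 4×4 feature map below is a made-up example, `pool2d` is a hypothetical helper, and padding is left out for simplicity.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size windows (no padding)."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [fmap[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        output.append(row)
    return output

fmap = [                 # assumed 4x4 feature map from a convolution
    [3, -1, -3, -1],
    [-3, 1, 0, -3],
    [-3, -3, 0, 1],
    [3, -2, -2, -1],
]

max_pooled = pool2d(fmap, size=2, stride=2, mode="max")
avg_pooled = pool2d(fmap, size=2, stride=2, mode="avg")
```

A 2×2 window with stride 2 halves both the width and the height, so the 4×4 map becomes 2×2 while keeping the strongest (or the average) response of each region.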


The fully connected layer

After several rounds of convolution + activation + pooling, we finally reach the output stage, where the model has learned high-quality feature maps.

These feature maps are then fed into a fully connected feedforward network.


Calculating the number of parameters

Many people do not know how to calculate the number of parameters in each layer. In a convolution layer, the trainable parameters are mainly those of the filters. What training means:

In the expression h(x) = f(wx + b), the function represented by a neuron, x is the input, w the weight, b the bias, f the activation function, and h(x) the output.

Training a convolutional neural network is the process of repeatedly adjusting the weights w and biases b so that the output h(x) approaches the expected value.

The weights w and biases b correspond to the memory of the neuron.

Reasons for adding a bias:

Without the bias b, the function would have to pass through the origin, which narrows the range of problems it can classify

See the reference on the role of the parameters w and b in neural networks (an explanation of why the bias b is needed)

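So the number of parameters of a convolution layer can be calculated as follows:

parameters = (kernel_width * kernel_height * in_channels + 1) * filters

where the +1 accounts for the bias of each filter.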
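
The parameter count can be checked with a small helper. `conv2d_params` and `dense_params` are hypothetical functions; they assume one bias per filter and one bias per output unit, and the layer sizes in the examples are chosen only for illustration.

```python
def conv2d_params(kernel_width, kernel_height, in_channels, filters):
    """Weights of every kernel plus one bias, for each filter."""
    return (kernel_width * kernel_height * in_channels + 1) * filters

def dense_params(in_units, out_units):
    """One weight per input plus one bias, for each output unit."""
    return (in_units + 1) * out_units

first_conv = conv2d_params(5, 5, in_channels=1, filters=32)    # grayscale input
second_conv = conv2d_params(5, 5, in_channels=32, filters=64)  # fed by 32 feature maps
dense = dense_params(1024, 1000)
```

Note that the count of a convolution layer does not depend on the width or height of the input image, only on the kernel size and the channel counts.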

The following is a network structure and the corresponding information table for each layer

Keras’s CNN example

Input design:

x_train = x_train.reshape(x_train.shape[0], img_x, img_y, 1) #(samples, rows, cols, channels)

Model design:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()  # initialize the model
# convolution layer: 32 filters of size 5x5, stride defaults to 1
model.add(Conv2D(32, kernel_size=(5, 5), activation='relu', input_shape=(img_x, img_y, 1)))
model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))  # pooling layer
model.add(Conv2D(64, kernel_size=(5, 5), activation='relu'))  # convolution layer
model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))  # pooling layer
model.add(Flatten())  # flatten the feature maps into a vector
model.add(Dense(1000, activation='relu'))  # fully connected layer
model.add(Dense(10, activation='softmax'))  # softmax classification output

Below is a diagram of the network structure corresponding to the model code
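
As a cross-check of the model above, the following sketch traces the output shape and parameter count of each layer without needing Keras. It assumes 28×28 grayscale input (img_x = img_y = 28), which the post does not state, and no padding, which matches the Keras default of padding='valid'; `trace_model` and `out_size` are hypothetical helpers.

```python
def out_size(size, window, stride=1):
    """Output size of a convolution or pooling window without padding."""
    return (size - window) // stride + 1

def trace_model(img_x=28, img_y=28):
    """Return (layer name, output shape, parameter count) for each layer."""
    layers = []
    h, w, c = img_x, img_y, 1
    for filters in (32, 64):
        # Conv2D with 5x5 kernels, stride 1
        h, w = out_size(h, 5), out_size(w, 5)
        layers.append(("Conv2D", (h, w, filters), (5 * 5 * c + 1) * filters))
        c = filters
        # MaxPool2D with 2x2 windows, stride 2 (no trainable parameters)
        h, w = out_size(h, 2, 2), out_size(w, 2, 2)
        layers.append(("MaxPool2D", (h, w, c), 0))
    flat = h * w * c
    layers.append(("Flatten", (flat,), 0))
    layers.append(("Dense", (1000,), (flat + 1) * 1000))
    layers.append(("Dense", (10,), (1000 + 1) * 10))
    return layers

summary = trace_model()
total_params = sum(params for _, _, params in summary)
```

Under these assumptions almost all of the parameters sit in the first fully connected layer, which is typical for small CNNs of this shape.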

Reference links:

  • Link: www.zhihu.com/topic/20043…
  • Link: zhuanlan.zhihu.com/p/27908027
  • Link: blog.csdn.net/yjl9122/art…
  • Link: www.cnblogs.com/alexanderku…