Convolutional Neural Network (CNN) related concepts

Problems existing in traditional neural networks

Before talking about convolutional neural networks, we should first look at some problems with traditional neural networks. The figure above is an example of a typical traditional neural network. Suppose the sample images we want to train on are 100×100 pixels. Each image then has 10,000 pixels, so the input layer of a traditional neural network needs 10,000 neurons. If the hidden layer in the middle also has 10,000 neurons, the connections between the two layers alone require 10,000 × 10,000 = 100 million parameters (weights). And this is only a 100×100 image; for larger images the amount of computation in the whole network quickly becomes impractical. In addition, the more weights there are, the more training samples are needed, otherwise the network will overfit. The traditional neural network therefore has the following two problems:

Too many weights, and therefore too much computation
Too many weights, which leads to overfitting unless a large number of samples is available to support them
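The weight count from the 100×100 example can be checked with a few lines of Python. This is only a sketch: the function name `dense_weight_count` is made up for illustration, and biases are ignored, as in the text.

```python
def dense_weight_count(n_inputs: int, n_hidden: int) -> int:
    """Number of weights between two fully connected layers (biases ignored)."""
    return n_inputs * n_hidden


# A 100x100 image flattened into 10,000 input neurons,
# connected to a hidden layer of 10,000 neurons.
n_inputs = 100 * 100
n_hidden = 10_000
print(dense_weight_count(n_inputs, n_hidden))  # 100000000
```

Doubling the image side to 200×200 quadruples the count, which is why the approach does not scale to larger images.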

Convolutional neural network

Convolution

What is convolution?

Before we get to convolutional neural networks, we need to know what convolution is. Taking the inner product (element-wise multiplication followed by summation) of a data window of the image (which changes as the window moves) and a filter matrix (a set of fixed weights; because the weights of each neuron are fixed, the filter can be regarded as constant) is the so-called “convolution” operation, and this is where the convolutional neural network gets its name. Loosely speaking, the red box in the figure below can be understood as a filter, that is, a set of neurons with fixed weights. Multiple filters stacked together form a convolution layer.

Let me give you a concrete example. For example, in the figure below, the left part of the figure is the original input data, the middle part of the figure is the filter, and the right part of the figure is the new two-dimensional output data.

Let’s break this down. The corresponding numbers are multiplied and then added.

The inner product of the filter in the middle and the data window is calculated as follows: 4×0 + 0×0 + 0×0 + 0×0 + 0×1 + 0×1 + 0×0 + 0×1 + (-4)×2 = -8
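The same inner product can be written as a short Python sketch. The 3×3 values below are read off the sum above, assuming the first factor of each product is the filter weight and the second is the pixel in the data window; the function name `window_inner_product` is hypothetical.

```python
def window_inner_product(window, kernel):
    """Multiply matching elements of two equally sized 2D lists and sum them."""
    return sum(
        w * k
        for window_row, kernel_row in zip(window, kernel)
        for w, k in zip(window_row, kernel_row)
    )


# Filter weights (first factor of each product in the text).
kernel = [[4, 0, 0],
          [0, 0, 0],
          [0, 0, -4]]
# Data window (second factor of each product in the text).
window = [[0, 0, 0],
          [0, 1, 1],
          [0, 1, 2]]

print(window_inner_product(window, kernel))  # -8
```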

Convolution on an image

In the calculation shown in the figure below, the input is the data in a certain region of the image (width × height). Its inner product with the filter (a set of neurons with fixed weights) is computed region by region, producing new two-dimensional data.

As shown below:

Specifically, the left side is the image input, and the middle part is the filter (a set of neurons with fixed weights). Different filters produce different output data, such as color intensity or contours. In other words, to extract different features of an image, such as color intensity or outline, you use different filters. To sum up the difference between filters in one sentence: there are a thousand Hamlets for a thousand readers.
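To make the sliding computation concrete, here is a minimal pure-Python sketch that slides a filter over a 2D image and collects the inner products. The image and the two filters are hypothetical values chosen for illustration, not the ones in the figure. As is usual in CNN descriptions, the filter is not flipped before it is applied.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the 2D output."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            total = 0
            # Inner product of the filter and the current data window.
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[top + di][left + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 image: dark left half (0), bright right half (9).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Two hypothetical filters that extract different information.
vertical_edge = [[-1, 1], [-1, 1]]  # responds to left-to-right brightness jumps
brightness = [[1, 1], [1, 1]]       # sums the brightness inside the window

print(conv2d(image, vertical_edge))  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
print(conv2d(image, brightness))     # [[0, 18, 36], [0, 18, 36], [0, 18, 36]]
```

The same input gives two different outputs: the first filter only reacts where the brightness changes, while the second tracks how bright each region is.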

What is a convolutional neural network?

A Convolutional Neural Network (CNN) is a feedforward neural network whose artificial neurons respond to a portion of the surrounding units within their coverage range, and it performs very well on large-scale image processing. Convolutional neural networks are very similar to ordinary neural networks in that both consist of neurons with learnable weights and biases. Each neuron receives some inputs and computes a dot product, the output of the network is a score for each class, and many of the computational tricks from ordinary neural networks still apply. However, a convolutional neural network assumes by default that its input is an image, which allows us to encode image-specific properties into the network structure, making the feedforward function more efficient and greatly reducing the number of parameters.

3D volumes of neurons

A convolutional neural network takes advantage of the fact that its input is an image and arranges its neurons in three dimensions: width, height, and depth (note that depth here is not the depth of the neural network; it describes the arrangement of the neurons). For example, if the input image size is 32×32×3 (RGB), then the input neurons also have dimensions 32×32×3. Here is the illustration:
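As a small illustration of the 32×32×3 example, the input volume can be modeled as nested lists. The helper names `make_volume` and `volume_shape` are hypothetical, and the height × width × depth nesting order is just one possible layout.

```python
def make_volume(width, height, depth, fill=0):
    """Build a volume as nested lists indexed [row][column][channel]."""
    return [[[fill] * depth for _ in range(width)] for _ in range(height)]


def volume_shape(volume):
    """Return (width, height, depth) of a nested-list volume."""
    return len(volume[0]), len(volume), len(volume[0][0])


image = make_volume(32, 32, 3)  # 32x32 pixels, 3 channels (RGB)
width, height, depth = volume_shape(image)
print(width, height, depth)      # 32 32 3
print(width * height * depth)    # 3072 input neurons
```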

An application example of convolutional neural network

In the figure above, what the CNN needs to do is this: given a picture, it is not known whether it shows a car or a horse, or what kind of car it is. The model has to judge what the object in the picture is and output a result: if it is a car, which kind of car.

Let’s go from left to right:

Left:

On the left is the data input layer, which preprocesses the data, for example mean subtraction (centering every dimension of the input data at 0, so that large offsets in the data do not hurt training), normalization (scaling all data to the same range), PCA/whitening, and so on. CNNs usually perform only the mean subtraction step, with the mean computed on the training set.
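Mean subtraction can be sketched as follows. The sketch assumes that the per-dimension mean is computed on the training set and then reused unchanged for any other data, which is the common practice; the sample values and function names are hypothetical.

```python
def feature_means(samples):
    """Mean of every input dimension over a list of samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def subtract_mean(samples, means):
    """Center every dimension by subtracting the given means."""
    return [[value - m for value, m in zip(sample, means)] for sample in samples]


# Hypothetical data: 3 training samples and 1 test sample, 2 dimensions each.
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
test = [[4.0, 25.0]]

means = feature_means(train)        # computed on the training set only
print(means)                        # [3.0, 20.0]
print(subtract_mean(train, means))  # [[-2.0, -10.0], [0.0, 0.0], [2.0, 10.0]]
print(subtract_mean(test, means))   # [[1.0, 5.0]]
```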

In the middle:

CONV: convolution layer, which computes linear products and sums them.
RELU: activation layer; ReLU is a kind of activation function.
POOL: pooling layer; in short, it takes the average or maximum of a region.

On the right:

FC: fully connected layer
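Of the layers listed above, RELU is the simplest to write down: the ReLU activation replaces every negative value with 0 and leaves the rest unchanged. A minimal sketch on a hypothetical 2×2 feature map:

```python
def relu(feature_map):
    """Apply ReLU, max(0, x), to every element of a 2D feature map."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical output of a CONV layer.
conv_output = [[-8, 3],
               [0, -1]]
print(relu(conv_output))  # [[0, 3], [0, 0]]
```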

Local perception and weight sharing in convolutional neural networks (CNN)

Local perception in CNN

In a CNN, the filter (a set of neurons with fixed weights) performs the convolution computation on local input data. After the local data in one data window has been computed, the data window slides on, until all the data has been covered. Several parameters are involved in this process:

Depth: the number of neurons, which is also the number of filters. It determines the depth (thickness) of the output.
Stride: the number of positions the data window moves in each slide, which determines how many slides it takes to reach the edge.
Zero-padding: a number of zeros added around the outer edge so that the window can slide from the initial position to the last position in whole strides. Generally speaking, this makes the total length divisible by the stride.
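How these parameters interact can be sketched with the standard output-size relation (W - F + 2P) / S + 1, where W is the input size, F the filter size, P the zero-padding and S the stride. The formula is not stated in the text above and is brought in here as a well-known result; the function name is hypothetical.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one dimension."""
    span = input_size + 2 * padding - filter_size
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % stride != 0:
        # The window cannot land exactly on the last position.
        raise ValueError("stride does not divide the padded length evenly")
    return span // stride + 1


print(conv_output_size(5, 3, stride=1, padding=0))  # 3
print(conv_output_size(5, 3, stride=2, padding=1))  # 3
print(conv_output_size(7, 3, stride=1, padding=0))  # 5
```

The depth of the output is not part of this formula: it simply equals the number of filters.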

The figure above is a typical example of local perception. The yellow matrix is the filter; the depth is 1, the stride is 1, and the zero-padding is 0. We can see that each filter is convolved with one local data window at a time, which is the so-called local perception mechanism in CNN.

So why local perception?

A filter is like a pair of eyes, and humans have a limited field of view. If you tried to take in the whole world at a glance, receiving all the information at once, you would be exhausted and your brain would not be able to handle it. And even within a local view, the eye is biased toward some of the information over the rest. For example, when looking at a person, attention concentrates on a few regions such as the face, so the weights of those inputs are relatively large.

Weight sharing in CNN

What about weight sharing? Again, take the figure above as an example. As the filter slides, the input keeps changing, but the weights of the filter in the middle (that is, the weights with which each neuron connects to the data window) stay fixed. This is the so-called weight (parameter) sharing mechanism in CNN.

Let’s say someone travels around the world: the information they see changes, but the eyes that collect it do not. By the way, the same local information is perceived differently by the eyes of different people, that is, there are a thousand Hamlets for a thousand readers. Different filters are like different eyes, and different people give different feedback.
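The effect of weight sharing on the number of parameters can be sketched by counting. Because a filter keeps the same weights at every position, a convolution layer needs only one set of weights per filter, regardless of the image size. The function names and the example layer sizes are hypothetical.

```python
def conv_layer_params(filter_size, input_depth, num_filters):
    """Weights plus one bias per filter; independent of image width and height."""
    weights = filter_size * filter_size * input_depth * num_filters
    biases = num_filters
    return weights + biases


def dense_layer_params(n_inputs, n_outputs):
    """Weights plus one bias per output neuron of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


print(conv_layer_params(3, 1, 1))            # 10: one 3x3 filter on one channel
print(conv_layer_params(3, 3, 2))            # 56: two 3x3 filters on 3 channels
print(dense_layer_params(100 * 100, 10_000)) # 100010000
```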

Explain local perception and weight sharing with a GIF

I came across this picture while collecting material. It looked very cool at first sight, and once you understand local perception and weight sharing, it is not hard to understand.

You may also have a question: how is the output result 1 in the figure above calculated? Let’s break down the GIF above and explain the calculation process in detail.

First, the first slide:

In fact, the calculation is similar to wx + b: w corresponds to Filter W0, x corresponds to the different data windows, and b corresponds to Bias b0. In other words, Filter W0 is multiplied element-wise with each data window and summed, and Bias b0 is added at the end to get the output result 1, as shown in the following process:

1×0 + 1×0 + (-1)×0 + (-1)×0 + 0×0 + 1×1 + (-1)×0 + (-1)×0 + 0×1

+

(-1)×0 + 0×0 + (-1)×0 + 0×0 + 0×1 + (-1)×1 + 1×0 + (-1)×0 + 0×2

+

0×0 + 1×0 + 0×0 + 1×0 + 0×2 + 1×0 + 0×0 + (-1)×0 + 1×0

+

1 (the 1 here is Bias b0)

= 1

Then Filter W0 stays fixed, the data window moves 2 steps to the right, and the inner product is calculated again, giving an output result of 0.

Finally, take a different filter, Filter W1, with a different bias, Bias b1, and convolve it with the data windows on the left of the figure to obtain another, different output.
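The first-slide calculation can be reproduced with a short sketch that sums over all three depth slices and then adds the bias. The numbers are read off the three sums in the walkthrough, assuming the first factor of each product is the weight from Filter W0 and the second is the value from the data window; the function name `conv_output_value` is hypothetical.

```python
def conv_output_value(window, filt, bias):
    """Sum of element-wise products over every depth slice, plus the bias."""
    total = bias
    for window_slice, filter_slice in zip(window, filt):
        for window_row, filter_row in zip(window_slice, filter_slice):
            for x, w in zip(window_row, filter_row):
                total += w * x
    return total


# Filter W0: one 3x3 slice of weights per input channel.
filter_w0 = [
    [[1, 1, -1], [-1, 0, 1], [-1, -1, 0]],
    [[-1, 0, -1], [0, 0, -1], [1, -1, 0]],
    [[0, 1, 0], [1, 0, 1], [0, -1, 1]],
]
# Data window at the first position: one 3x3 slice per input channel.
window = [
    [[0, 0, 0], [0, 0, 1], [0, 0, 1]],
    [[0, 0, 0], [0, 1, 1], [0, 0, 2]],
    [[0, 0, 0], [0, 2, 0], [0, 0, 0]],
]
bias_b0 = 1

print(conv_output_value(window, filter_w0, bias_b0))  # 1
```

The three slices contribute 1, -1 and 0 respectively, and adding the bias of 1 gives the output 1.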

Pooling

Pooling, in short, means taking the average or maximum of a region; its purpose is to shrink the feature map. The pooling operation is applied to each depth slice independently, and the window size is generally 2×2. In place of the convolution operation of the convolution layer, the pooling layer generally performs one of the following operations:

Max pooling. Take the maximum of the four points. This is the most common pooling method.
Mean pooling. Take the mean of the four points.
Gaussian pooling. Borrows the approach of Gaussian blur. Not commonly used.
Trainable pooling. A trained function f takes 4 points as input and produces 1 point as output. Not commonly used.
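The two common variants in the list above, max pooling and mean pooling, can be sketched in pure Python as follows. The 4×4 feature map is a hypothetical example, and the function assumes the input size is a multiple of the window size.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or mean pooling over one depth slice (a 2D list)."""
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    output = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the points inside the current window.
            values = [
                feature_map[top + di][left + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(values) if mode == "max" else sum(values) / len(values))
        output.append(row)
    return output


# Hypothetical 4x4 depth slice.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(pool2d(feature_map, mode="max"))   # [[6, 8], [3, 4]]
print(pool2d(feature_map, mode="mean"))  # [[3.25, 5.25], [2.0, 2.0]]
```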

The most common pooling layer uses a 2×2 window with stride 2 and downsamples each depth slice of the input. Each MAX operation is performed on four numbers, as shown below:

The pooling operation leaves the depth unchanged.
If the input size of the pooling layer is not an integer multiple of 2, it is usually zero-padded up to a multiple of 2 and then pooled.
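Both points can be sketched together: zero-pad each depth slice up to a multiple of 2, pool it, and observe that the number of slices does not change. Where the zeros are added is not specified in the text; this sketch assumes they go on the bottom and right edges. All names and values are hypothetical.

```python
def zero_pad_to_multiple(feature_map, multiple=2):
    """Append zero rows and columns until both sizes are multiples of `multiple`."""
    height, width = len(feature_map), len(feature_map[0])
    pad_rows = (-height) % multiple
    pad_cols = (-width) % multiple
    padded = [list(row) + [0] * pad_cols for row in feature_map]
    padded += [[0] * (width + pad_cols) for _ in range(pad_rows)]
    return padded


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a slice whose sizes are even."""
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, len(feature_map[0]), 2)
        ]
        for i in range(0, len(feature_map), 2)
    ]


# Hypothetical input volume with 2 depth slices of size 3x3.
volume = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
pooled = [max_pool_2x2(zero_pad_to_multiple(depth_slice)) for depth_slice in volume]

print(zero_pad_to_multiple(volume[0]))
# [[1, 2, 3, 0], [4, 5, 6, 0], [7, 8, 9, 0], [0, 0, 0, 0]]
print(pooled)       # [[[5, 6], [8, 9]], [[9, 7], [3, 1]]]
print(len(pooled))  # 2, the same depth as the input
```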

Fully connected layer

The fully connected layer and the convolution layer can be converted into each other:

Any convolution layer can be turned into a fully connected layer by writing its weights as one huge matrix. Most of that matrix is zero except for certain blocks (because of local perception), and many of the blocks hold the same weights (because of weight sharing).
Conversely, any fully connected layer can also be turned into a convolution layer. A fully connected layer over an input of a given size is equivalent to a convolution layer whose filter size is exactly the size of the entire input layer, so that each filter produces a single output value.
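Both directions of the conversion can be sketched on a one-dimensional input, which keeps the matrices small enough to read. The 1D simplification, the function names and all values are assumptions made for illustration.

```python
def fc_forward(inputs, weight_rows):
    """Fully connected layer without bias: one dot product per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]


def conv1d(signal, kernel):
    """1D convolution as used in CNNs (no flip), stride 1, no padding."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv1d_as_matrix(kernel, input_len):
    """Weight matrix of the fully connected layer equivalent to conv1d."""
    k = len(kernel)
    return [
        [0] * i + list(kernel) + [0] * (input_len - k - i)
        for i in range(input_len - k + 1)
    ]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]

# Convolution -> fully connected: mostly zeros, the same weights in every row.
matrix = conv1d_as_matrix(kernel, len(signal))
print(matrix)  # [[1, 0, -1, 0, 0], [0, 1, 0, -1, 0], [0, 0, 1, 0, -1]]
print(fc_forward(signal, matrix) == conv1d(signal, kernel))  # True

# Fully connected -> convolution: each filter is as large as the whole input.
fc_weights = [[1, 1, 1, 1, 1], [1, -1, 1, -1, 1]]
fc_out = fc_forward(signal, fc_weights)
conv_out = [conv1d(signal, row)[0] for row in fc_weights]
print(fc_out, conv_out)  # [15, 3] [15, 3]
```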
