
The purpose of CNN

To put it simply, the purpose of a CNN is to extract the features of things according to some model, and then classify, identify, predict, or make decisions based on those features. The most important step in this process is feature extraction: how to extract the features that distinguish things to the greatest extent. If the extracted features cannot distinguish different things, then the feature extraction step is meaningless. Realizing such a model is what the iterative training of a CNN is for.

Characteristics of things

In an image (for example), the characteristics of the object are mainly reflected in the relationship between pixels. For example, we can distinguish a straight line in an image because the pixels on the line are different enough from their neighbors (or the pixels on both sides of the line are different enough) that the “line” can be identified:

The same goes for other features besides straight lines. In CNN, most feature extraction relies on convolution operation.

Convolution and feature extraction

Convolution here is essentially an inner product, and the procedure is very simple: with a set of fixed weights (the convolution kernel), we take the inner product with the pixels of a local block, and the output is one of the extracted features:
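As a minimal sketch of this idea (the NumPy implementation, the tiny image, and the edge kernel below are all illustrative assumptions), sliding a kernel over the image and taking the inner product at every position looks like this:

```python
# A minimal sketch of the "inner product over a local block" idea, using NumPy.
# The kernel values and the tiny 5x5 image below are made up for illustration.
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take the inner product at each position
    ('valid' mode: no padding, so the output shrinks by kernel_size - 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # the inner product
    return out

# A vertical-edge image: left part dark (0), right part bright (1).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# A simple vertical-edge kernel; large responses mark where the "line" is.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

print(conv2d_valid(image, kernel))   # large values appear right along the edge
```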

The reason for choosing convolution

Local perception

To put it simply, the convolution kernel is usually smaller than the input image (if it were the same size, the layer would effectively be fully connected), so the features extracted by convolution focus on local regions, which matches how everyday image processing works. In fact, each neuron does not need to perceive the whole image; it only needs to perceive local information, which is then combined at higher layers to obtain global information.

Parameter sharing

The biggest benefit of parameter sharing is that it greatly reduces the number of parameters to learn, and with it the amount of computation.
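As a rough, back-of-the-envelope illustration (the layer sizes below are assumptions, not taken from any particular network), compare the weight counts of a fully connected layer and a shared-kernel convolution over the same input:

```python
# Back-of-the-envelope parameter counts; the sizes are assumptions for illustration.
H, W = 224, 224          # input image size
K = 3                    # convolution kernel size
C_out = 64               # number of output feature maps / hidden units

# Fully connected: every output unit has its own weight for every input pixel.
fc_params = (H * W) * C_out            # 3,211,264 weights

# Convolution with parameter sharing: one K x K kernel per output map,
# reused at every spatial position.
conv_params = (K * K) * C_out          # 576 weights

print(fc_params, conv_params)
```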

Multiple kernels

Generally, we do not filter the input image with only one convolution kernel, because a single kernel's parameters are fixed and the features it extracts are too one-sided. It is a bit like looking at things objectively: we have to analyze them from multiple perspectives in order to avoid bias as much as possible. Likewise, we need multiple convolution kernels to convolve the input image, as sketched below.
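A minimal sketch of this multi-kernel idea, assuming NumPy and SciPy and using made-up kernels: each kernel produces its own feature map, and the maps are stacked into channels.

```python
# Each kernel yields its own feature map, one "perspective" per kernel.
# scipy's correlate2d does the sliding inner product for us.
import numpy as np
from scipy.signal import correlate2d

image = np.random.rand(8, 8)   # stand-in input image

kernels = [
    np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float),   # vertical edges
    np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=float),   # horizontal edges
    np.full((3, 3), 1.0 / 9.0),                                    # local average
]

# Stack the per-kernel outputs along a channel axis: shape (num_kernels, H-2, W-2).
feature_maps = np.stack([correlate2d(image, k, mode="valid") for k in kernels])
print(feature_maps.shape)   # (3, 6, 6)
```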

Pooling (downsampling)

A pooling layer after convolution is a natural fit: it aggregates the features well and reduces their dimensionality, cutting down computation.

Multilayer convolution

The higher the layer, the more global the extracted features are, because each unit in a higher layer sees (through the layers below it) a larger patch of the original input; the small sketch below illustrates this.
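A small illustrative calculation (the layer configuration is an assumption) of how the receptive field of a single unit grows as convolution and pooling layers are stacked:

```python
# How many input pixels does one unit "see" after several stacked layers?
def receptive_field(layers):
    """layers: list of (kernel_size, stride) from the first layer to the last."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) * current jump
        jump *= s              # stride compounds how far apart sampled inputs are
    return rf

# Three 3x3 convolutions (stride 1) followed by one 2x2 pooling (stride 2):
print(receptive_field([(3, 1), (3, 1), (3, 1), (2, 2)]))  # 8: one unit sees an 8x8 patch
```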

Pooling

Pooling is the sampling or aggregation of a block of data, for example by replacing a region with its maximum (or average) value:

In the pooling example in the figure above, a 10 * 10 region is pooled down to a 1 * 1 region, which greatly reduces sensitivity to small local variations and lowers the computational cost while still preserving the information in the data.
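A minimal max-pooling sketch, assuming NumPy and non-overlapping windows (the input values and window size are made up for illustration); swapping max for mean gives average pooling:

```python
# Replace each non-overlapping window of the input with its maximum value.
import numpy as np

def max_pool2d(x, size):
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "input must be divisible by the window size"
    # Reshape into (rows, size, cols, size) blocks, then take the max inside each block.
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x, 2))
# [[ 5.  7.]
#  [13. 15.]]
```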

Meaning of activation function

Mathematically, the sigmoid activation function maps the input to the range 0 to 1 (tanh maps it to -1 to +1). Besides normalizing the data, the point of this mapping is to keep the data within a controlled range. There are other details as well. For example, sigmoid (and tanh) is most sensitive to small changes of the data around zero (the central point) and largely ignores changes at the extremes, while ReLU helps avoid vanishing gradients. In general, sigmoid (tanh) is used in fully connected layers, while ReLU is used in convolution layers.
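For reference, here are the three activations written out with NumPy (a sketch, not any particular library's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes the input into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes the input into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # passes positives unchanged, zeroes out negatives

x = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(sigmoid(x))  # steepest (most sensitive) around 0, saturates at the extremes
print(relu(x))     # no upper bound, so gradients do not vanish for positive inputs
```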

Or we can look at the activation function from another angle. If we regard each activation as a classification, that is, the input is divided into two categories (0 or 1), then the output of the activation function, a value between 0 and 1, represents the degree to which the input belongs to the "active" class. If we take 0 as inactive and 1 as active, then an output of 0.44 means 44% active.

However, the activation function can also have negative effects on training: it may leave most of the input data activated. For this we have another countermeasure, LRN.

Catalysis and inhibition of LRN

LRN stands for local response normalization. In neuroscience there is a concept called lateral inhibition: an excited neuron does not spread its excitation to its neighbors, but instead reduces their degree of activation. Borrowing from biology (well, almost everything we do borrows from biology, doesn't it?), we use an LRN layer to laterally suppress the output of the activation function. While cleaning up the mess left by the activation function, LRN also highlights a peak within each region, and that peak is exactly the feature we want.

In particular, because ReLU's activation is unbounded, it is all the more necessary to normalize the data with LRN. On large-scale data we tend to care more about the high-frequency features that stand out, so it is well worth using LRN to sharpen the peaks of the data while suppressing their surroundings.
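As a sketch of what an across-channel LRN does (the constants below are the commonly quoted AlexNet-style defaults, used here as assumptions), each channel is divided by a term that grows with the squared activations of its neighboring channels, so a strong response suppresses its neighbors:

```python
import numpy as np

def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Across-channel local response normalization.
    a: feature maps with shape (channels, H, W)."""
    C = a.shape[0]
    out = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        # Denominator grows with the squared activations of neighboring channels.
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

a = np.random.rand(16, 8, 8)   # 16 channels of 8x8 activations (made-up data)
print(lrn(a).shape)            # (16, 8, 8)
```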

The IP layer

Toward the back of many CNNs there is an IP (Inner Product) layer, also called an FC (Fully Connected) layer. I am not sure what significance this fully connected representation has within a CNN. In many papers it is responsible for the final feature extraction just before the Softmax, and some people have also pointed out that a CNN does not have to end with an IP layer.
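At its core, an inner product (fully connected) layer is just a matrix multiplication plus a bias: every output unit is connected to every input value. The sketch below (shapes and initialization are assumptions) shows the idea:

```python
import numpy as np

def inner_product(x, W, b):
    """x: (batch, in_features), W: (in_features, out_features), b: (out_features,)."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256))          # e.g. 4 flattened feature vectors
W = rng.standard_normal((256, 10)) * 0.01  # one weight per (input, output) pair
b = np.zeros(10)

logits = inner_product(x, W, b)            # (4, 10); a Softmax would typically follow
```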

Dropout and abandonment

Knowing what to give up has always been a great philosophy, and biological evolution is full of examples. The job of Dropout during training is to make some weights temporarily inactive: in effect, a random value is drawn for each node in a hidden layer and compared against a threshold, and the nodes that fall below it are discarded for that pass, so their weights do not take part in the update. Besides speeding up computation, the main value of Dropout is preventing overfitting.
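A minimal "inverted dropout" sketch, assuming NumPy and a made-up keep probability: a random draw per unit is compared against a threshold, the losers are zeroed, and the survivors are rescaled so the expected activation stays the same.

```python
import numpy as np

def dropout(x, keep_prob=0.5, training=True):
    if not training:
        return x                          # at test time, all units stay active
    # Keep each unit with probability keep_prob; dropped units contribute nothing.
    mask = (np.random.rand(*x.shape) < keep_prob).astype(x.dtype)
    return x * mask / keep_prob           # rescale so the expected value is unchanged

h = np.random.rand(4, 8)                  # some hidden-layer activations (made-up)
print(dropout(h, keep_prob=0.5))
```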

If there are any mistakes, please point them out. This article is from MkShell (fried fish); reprinted to convey more information, all rights reserved.