1. Structural overview

First, let's consider how a traditional neural network handles images. Using CIFAR-10 again, each image has 32×32×3 = 3,072 features, so with an ordinary network structure every neural unit in the first layer needs 3,072 weights. Networks used for image processing are generally more than 10 layers deep, so the parameters add up to an enormous number, and too many parameters cause overfitting. Moreover, images have characteristics of their own, and we can exploit these characteristics to redesign the traditional network, improving both processing speed and accuracy.

Notice that an image's pixels are composed of three channels. We take advantage of this by arranging the neurons in a three-dimensional volume (width, height, and depth), corresponding to the 32x32x3 shape of the image (taking CIFAR-10 as an example), as shown below:



The red block is the input layer, whose depth is 3; the output layer is a 1x1x10 volume. The meanings of the other layers will be introduced later; for now, just note that each layer is height × width × depth.

2. Layers of a convolutional neural network

A convolutional neural network has three types of layers: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer. Taking a convolutional network for processing CIFAR-10 as an example, a simple network should contain the following layers: [INPUT-CONV-RELU-POOL-FC], namely [input – convolution – activation – pooling – classification score]. Each layer is described as follows (a shape-tracing sketch follows the list):

  • INPUT [32x32x3]: a 32-wide, 32-high image with three color channels.
  • CONV: computes over local regions of the image; if we use 12 filters, the output volume will be [32x32x12].
  • RELU: an elementwise activation layer, max(0, x); the size remains [32x32x12].
  • POOL: downsamples along the (width, height) of the image, reducing those dimensions, for example to [16x16x12].
  • FC (i.e. fully-connected): computes the classification scores, with a final size of [1x1x10]. This layer is fully connected: each unit is connected to every unit in the previous layer.
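
To make those shapes concrete, here is a minimal NumPy sketch. The convolution itself is stubbed out with random values (a stand-in, not a real CONV implementation), so only the volume shapes are meaningful:

    import numpy as np

    x = np.random.randn(32, 32, 3)          # INPUT: a 32x32 image with 3 channels

    # CONV stand-in: 12 filters, zero-padded so width and height are preserved
    conv_out = np.random.randn(32, 32, 12)  # [32x32x12]

    relu_out = np.maximum(0, conv_out)      # RELU: elementwise max(0, x), same shape

    # POOL: 2x2 max pooling halves width and height -> [16x16x12]
    pool_out = relu_out.reshape(16, 2, 16, 2, 12).max(axis=(1, 3))

    # FC: 10 class scores, each unit connected to all 16*16*12 inputs
    W = np.random.randn(10, pool_out.size)  # hypothetical weight matrix
    b = np.zeros(10)
    scores = W @ pool_out.ravel() + b       # [1x1x10]
    print(conv_out.shape, pool_out.shape, scores.shape)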

Note:

1. A ConvNet is a sequence of layers that transform one volume into another (CONV/FC/RELU/POOL are by far the most popular)

2. Each layer takes a 3D volume as input and produces a 3D volume as output (except for the final output layer)

3. Some layers have parameters and others do not (e.g. CONV/FC do, RELU/POOL don’t)

4. Some layers may have hyperparameters and some layers may not (e.g. CONV/FC/POOL do, RELU doesn’t)

Here is an example; since it is hard to draw in three dimensions, each volume's slices are laid out in columns.



The details of each layer are discussed below:

2.1 The convolutional layer

The convolutional layer is the core layer of a convolutional neural network, and it greatly improves computational efficiency.

The convolutional layer consists of many filters, each of which is small and connects to only a small region of the original image at a time. An illustration from UFLDL:



This is the result of one filter sliding continuously across the image.

Let's go a little deeper. The input is a three-dimensional image, so each filter is three-dimensional as well. Assuming our filter is 5x5x3, we get a map of activation values similar to the one shown above, called an activation map. Each value in it is computed as wᵀx + b, where w is the 5x5x3 = 75 weights, and these weights are adjustable.
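
As a small sketch (all values made up), one entry of the activation map is just this inner product:

    import numpy as np

    patch = np.random.randn(5, 5, 3)    # one local region of the input image
    w = np.random.randn(5, 5, 3)        # the 5*5*3 = 75 adjustable weights of one filter
    b = 0.1                             # bias

    activation = np.sum(w * patch) + b  # one number of the activation map: w^T x + b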

We can have multiple filters:



Going even further, there are three hyperparameters that control the sliding:

1. Depth: this is determined by the number of filters.

2. Stride: the interval of each slide. The animation above slides only 1 number at a time, i.e. the stride is 1.

3. Zero-padding: sometimes zeros are used to expand the border of the image as needed. If the padding is 1, the width and height each grow by 2.



Here is a one-dimensional example:



The formula for computing the spatial size of the output is

(W − F + 2P)/S + 1


where W is the size of the input, F is the size of the filter, P is the amount of zero-padding, and S is the stride. In the figure, with zero-padding of 1, a stride of 1 gives an output of 5 numbers, and a stride of 2 gives an output of 3 numbers.
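
A tiny helper (the function name is mine) reproduces the one-dimensional example from the figure:

    def conv_output_size(W, F, S, P):
        """Spatial output size of a convolution: (W - F + 2P)/S + 1."""
        assert (W - F + 2 * P) % S == 0, "stride does not fit the input"
        return (W - F + 2 * P) // S + 1

    print(conv_output_size(5, 3, S=1, P=1))  # 5 outputs at stride 1
    print(conv_output_size(5, 3, S=2, P=1))  # 3 outputs at stride 2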

We haven't yet connected any of this to the concept of a neuron; now we can understand it from a neural point of view:

Each activation value mentioned above is computed by the formula wᵀx + b, which should look familiar: it is the scoring formula of a neuron. So we can regard each activation map as the work of one filter; if there are five filters, there are five different filters all connected to the same local region.

Convolutional neural networks have another important feature: weight sharing. Different neural units (sliding-window positions) of the same filter use the same weights. This greatly reduces the number of weights.
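
To put rough numbers on this, using the CIFAR-10-sized example from earlier (an illustrative comparison, not a benchmark):

    # CONV: 12 shared filters of size 5x5x3 (+1 bias each)
    conv_params = 12 * (5 * 5 * 3 + 1)                # 912
    # Hypothetical unshared alternative: every unit of a 32x32x12 output
    # volume gets its own 32x32x3 weights
    unshared_params = (32 * 32 * 12) * (32 * 32 * 3)  # ~37.7 million
    print(conv_params, unshared_params)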

In this way, each depth slice of a layer shares one set of weights, and the result each filter computes is a convolution (a bias b is added afterwards):



That’s where convolutional neural networks get their name.

The picture below may be misleading; see the animated demo in the official notes (cs231n.github.io/convolution…) to spot the differences and see how the convolution actually works.



Although each filter's weights W are split into three parts here (one per input channel), the neurons still compute the form Wx + b.

– Backpropagation: the backward pass of a convolution is also a convolution (with spatially flipped filters), so the computation is relatively simple.

– 1×1 convolution: some papers use 1×1 convolutions, as first investigated in Network in Network. This still computes meaningful inner products: the input has three channels, so each filter has at least three weights; that is, the filter of the animation above becomes 1x1x3.
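
A sketch of what a 1×1 convolution computes (shapes made up): over the depth dimension it is just a matrix product applied at every pixel:

    import numpy as np

    H, W, K = 4, 4, 5                 # made-up spatial size and filter count
    x = np.random.randn(H, W, 3)      # three input channels
    filters = np.random.randn(3, K)   # K filters of size 1x1x3

    out = (x.reshape(-1, 3) @ filters).reshape(H, W, K)  # inner products over depth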

– Dilated convolutions. Recent work (e.g. the paper by Fisher Yu and Vladlen Koltun) adds one more hyperparameter to the convolutional layer: the dilation. With dilation 0, the convolution computes w[0]x[0] + w[1]x[1] + w[2]x[2]; with dilation 1 it becomes w[0]x[0] + w[1]x[2] + w[2]x[4], i.e. the inputs we look at are spaced 1 apart. This makes it possible to fuse spatial information with fewer layers. For example, if we stack two 3×3 CONV layers, a unit in the second layer has an effective receptive field of 5×5; with dilated convolutions the effective receptive field grows exponentially.
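
Here is a minimal 1D sketch using the indexing convention above (the function name is mine):

    import numpy as np

    def dilated_conv1d(x, w, dilation=0):
        # dilation=0 reads x[i], x[i+1], x[i+2]; dilation=1 reads x[i], x[i+2], x[i+4]
        step = dilation + 1
        span = (len(w) - 1) * step + 1   # effective extent of the filter
        return np.array([sum(w[k] * x[i + k * step] for k in range(len(w)))
                         for i in range(len(x) - span + 1)])

    x = np.arange(10.0)
    w = np.array([1.0, 2.0, 3.0])
    print(dilated_conv1d(x, w, dilation=0))  # w[0]x[i] + w[1]x[i+1] + w[2]x[i+2]
    print(dilated_conv1d(x, w, dilation=1))  # w[0]x[i] + w[1]x[i+2] + w[2]x[i+4]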

 

2.2 The pooling layer

The output of the convolutional layer is still large, and because of the overlapping sliding windows much of the information is redundant. Hence the pooling layer: it divides the convolutional layer's output into non-overlapping regions and takes the maximum of each region, or the average, or the 2-norm, or any other statistic you like. Let's take max pooling as an example:



– Backpropagation: we saw when studying backpropagation that the gradient of a max flows only to the largest input. So the position of the maximum is usually tracked during the forward pass, which makes backpropagation efficient.
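
A forward-pass sketch that records where each maximum came from; the boolean mask is one common way to route gradients in the backward pass (ties would mark more than one position, which a real implementation must handle):

    import numpy as np

    def maxpool_forward(x, size=2):
        H, W, D = x.shape
        out = x.reshape(H // size, size, W // size, size, D).max(axis=(1, 3))
        # Mark which inputs produced the maxima; backprop routes gradients
        # only to these positions.
        mask = x == np.repeat(np.repeat(out, size, axis=0), size, axis=1)
        return out, mask

    out, mask = maxpool_forward(np.random.randn(4, 4, 3))  # out: (2, 2, 3)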

– Getting rid of pooling. Some people argue that pooling is not necessary, as in The All Convolutional Net, and discarding pooling layers has also been found to be important for generative models. It seems likely that future architectures will use pooling layers less and less, or drop them altogether.

2.3 Other layers

  1. Many exploratory normalization layers have been proposed, mimicking inhibition effects observed in the brain, but they have gradually fallen out of use because their benefit proved minimal in practice. For the various types of normalization, see the discussion in Alex Krizhevsky's cuda-convnet library API.
  2. The fully-connected layer is the same as in a regular neural network: each unit connects to all units in the previous layer.

2.4 Converting FC layers to CONV layers

Apart from the connection pattern, the fully-connected layer and the convolutional layer both compute inner products, so they can be converted into each other:

1. Viewing a CONV layer as an FC layer, the equivalent weight matrix is mostly zeros (a sparse matrix), with many blocks of shared weights.

2. An FC layer can be converted to a CONV layer. For example, an FC layer with K=4096 whose input is 7×7×512 corresponds to a convolutional layer with F=7, P=0, S=1, K=4096, and the output is 1×1×4096.

Example: suppose a CNN takes a 224x224x3 image and, after several transformations, some layer outputs 7x7x512; then two FC layers of size 4096 and a final FC layer of size 1000 compute the classification scores. Here is how to convert these three FC layers to CONV:

1. Use a CONV layer with F=7; the output is [1x1x4096].
2. Use a CONV layer with F=1; the output is [1x1x4096].
3. Use a CONV layer with F=1; the output is [1x1x1000].
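
A sketch of the first conversion above, with K shrunk so it runs quickly (the real layer has K=4096; the weights are random stand-ins):

    import numpy as np

    K = 16                                  # the real layer has K=4096
    W_fc = np.random.randn(K, 7 * 7 * 512)  # FC weight matrix over a 7x7x512 input
    W_conv = W_fc.reshape(K, 7, 7, 512)     # K filters with F=7 (P=0, S=1)

    x = np.random.randn(7, 7, 512)
    fc_out = W_fc @ x.ravel()               # FC view of the computation
    conv_out = np.tensordot(W_conv, x, axes=([1, 2, 3], [0, 1, 2]))  # 1x1xK conv output
    print(np.allclose(fc_out, conv_out))    # True: identical results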

Each conversion rewrites the FC parameters in CONV form. If a larger image is then passed into the converted network, the forward pass is very fast. For example, feeding a 384×384 image into the system above produces an output of [12x12x512] before the last three layers, and the converted CONV layers above then produce [6x6x1000] (since (12 − 7)/1 + 1 = 6). So in one pass we get a 6×6 grid of classification scores, which is much faster than evaluating the original network independently at all 36 locations. This is a practical technique. In addition, to evaluate the network at a stride of 16 pixels instead of 32, we can forward the converted network twice, the second time over the image shifted by 16 pixels, thus improving efficiency.

3. Building a convolutional neural network

Below, we use CONV, POOL, FC and RELU to build a convolutional neural network:

3.1 Hierarchy

We build the network according to the following structure:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where N >= 0 (generally N <= 3), M >= 0, and K >= 0 (generally K < 3). One note of caution: we prefer multiple layers of small-sized CONVs over one large one. Why? Compare three 3×3 CONV layers with a single 7×7 CONV layer: both give each output unit a 7×7 receptive field (see the sketch below). However, the 3×3 stack has these advantages: 1. a nonlinear combination of 3 layers is more expressive than a single linear layer; 2. the 3-layer stack of small convolutions has fewer parameters, 3×(3×3) = 27 < 7×7 = 49. One caveat: in backpropagation, we need more memory to store the intermediate results of the extra layers.
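
A quick receptive-field check for stride-1 stacks (the helper name is mine, and the formula only holds for stride 1):

    def stacked_receptive_field(num_layers, F=3):
        # Receptive field of num_layers stacked FxF convs at stride 1
        return F + (num_layers - 1) * (F - 1)

    print(stacked_receptive_field(1, F=7))  # 7: one 7x7 layer
    print(stacked_receptive_field(2))       # 5: two 3x3 layers see 5x5
    print(stacked_receptive_field(3))       # 7: three 3x3 layers see 7x7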

It is noteworthy that Google's Inception architectures and the Residual Networks from Microsoft Research Asia have introduced connection structures considerably more intricate than the pattern above.

3.2 Size of layers

  1. Input layer: the input size is usually a power of 2, e.g. 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), 224 (e.g. common ImageNet ConvNets), 384, 512, etc.
  2. Convolutional layer: generally a small filter, e.g. 3×3 or at most 5×5, with a stride of 1. With appropriate zero-padding, the convolutional layer does not change the spatial size of the input. If a large filter must be used, it usually appears only in the first layer, with zero-padding P = (F − 1)/2.
  3. Pooling layer: A common setup is to use a maximum pooling layer of 2×2, and rarely more than 3×3.
  4. If the stride is greater than 1 or there is no zero-padding, we need to check carefully that all the strides and filters fit together, and that the network is connected evenly and symmetrically.
  5. A step size of 1 performs better and is more compatible with pooling.
  6. Benefits of zero-padding: if zeros are not added, the information at the edges is discarded too quickly.
  7. Consider your computer's memory limitations. For example, passing a 224x224x3 image through three 3×3 CONV layers with 64 filters each and padding 1 takes about 72MB of memory per image (see the estimate after this list). GPUs are often bottlenecked by memory, so a compromise may be necessary, e.g. a 7×7 filter with a stride of 2 (ZF Net), or an 11×11 filter with a stride of 4 (AlexNet).
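
A back-of-the-envelope check of that 72MB figure, assuming float32 and storing both the activations and their gradients:

    acts = 3 * 224 * 224 * 64         # activations in three [224x224x64] volumes
    bytes_per_image = acts * 2 * 4    # x2 for gradients, 4 bytes per float32
    print(bytes_per_image / 1024**2)  # ~73.5 MB per image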

3.3 Case studies

  1. LeNet. The first successful application of a CNN (Yann LeCun, 1990s). It was used to read zip codes, digits, etc.
  2. AlexNet. The first CNN widely used in computer vision (by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton), winner of the ImageNet ILSVRC challenge in 2012. Its structure is similar to LeNet but deeper and larger, with multiple convolutional layers stacked on top of each other.
  3. ZFNet. The ILSVRC 2013 winner (Matthew Zeiler and Rob Fergus), short for Zeiler & Fergus Net. It adjusted AlexNet's structural hyperparameters, expanding the middle convolutional layers and shrinking the filter size and stride of the first layer.
  4. GoogLeNet. The ILSVRC 2014 winner (Szegedy et al. from Google) greatly reduced the number of parameters (from 60M to 4M) by using average pooling instead of fully-connected layers at the top of the ConvNet, eliminating a large number of parameters. It has many follow-up variants such as Inception-v4.
  5. VGGNet. The runner-up in ILSVRC 2014 (Karen Simonyan and Andrew Zisserman) demonstrated the benefits of depth. A pretrained model is available for Caffe. Its drawback is the large number of parameters (140M), which requires a lot of computation, although many of these parameters have since been shown to be removable.
  6. ResNet. (Kaiming He et al.) Winner of ILSVRC 2015. As of May 10, 2016, this is the state-of-the-art model; see also the follow-up Identity Mappings in Deep Residual Networks (published March 2016).

As a concrete accounting of cost, here are the per-layer memory and parameter counts of VGGNet:
    INPUT:     [224x224x3]   memory: 224*224*3   = 150K   weights: 0
    CONV3-64:  [224x224x64]  memory: 224*224*64  = 3.2M   weights: (3*3*3)*64    = 1,728
    CONV3-64:  [224x224x64]  memory: 224*224*64  = 3.2M   weights: (3*3*64)*64   = 36,864
    POOL2:     [112x112x64]  memory: 112*112*64  = 800K   weights: 0
    CONV3-128: [112x112x128] memory: 112*112*128 = 1.6M   weights: (3*3*64)*128  = 73,728
    CONV3-128: [112x112x128] memory: 112*112*128 = 1.6M   weights: (3*3*128)*128 = 147,456
    POOL2:     [56x56x128]   memory: 56*56*128   = 400K   weights: 0
    CONV3-256: [56x56x256]   memory: 56*56*256   = 800K   weights: (3*3*128)*256 = 294,912
    CONV3-256: [56x56x256]   memory: 56*56*256   = 800K   weights: (3*3*256)*256 = 589,824
    CONV3-256: [56x56x256]   memory: 56*56*256   = 800K   weights: (3*3*256)*256 = 589,824
    POOL2:     [28x28x256]   memory: 28*28*256   = 200K   weights: 0
    CONV3-512: [28x28x512]   memory: 28*28*512   = 400K   weights: (3*3*256)*512 = 1,179,648
    CONV3-512: [28x28x512]   memory: 28*28*512   = 400K   weights: (3*3*512)*512 = 2,359,296
    CONV3-512: [28x28x512]   memory: 28*28*512   = 400K   weights: (3*3*512)*512 = 2,359,296
    POOL2:     [14x14x512]   memory: 14*14*512   = 100K   weights: 0
    CONV3-512: [14x14x512]   memory: 14*14*512   = 100K   weights: (3*3*512)*512 = 2,359,296
    CONV3-512: [14x14x512]   memory: 14*14*512   = 100K   weights: (3*3*512)*512 = 2,359,296
    CONV3-512: [14x14x512]   memory: 14*14*512   = 100K   weights: (3*3*512)*512 = 2,359,296
    POOL2:     [7x7x512]     memory: 7*7*512     = 25K    weights: 0
    FC:        [1x1x4096]    memory: 4096                 weights: 7*7*512*4096  = 102,760,448
    FC:        [1x1x4096]    memory: 4096                 weights: 4096*4096     = 16,777,216
    FC:        [1x1x1000]    memory: 1000                 weights: 4096*1000     = 4,096,000

    TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
    TOTAL params: 138M parameters

Notice that most of the memory is used in the first few CONV layers, while most of the parameters sit in the last few FC layers; the first FC layer alone has 100M weights!

3.4 Memory Usage

Memory consumption comes mainly from three sources, as summarized in the sketch below:

1. The large number of activation values and their gradients. At test time, only the current layer's activations need to be kept; discarding the earlier activations in the lower layers greatly reduces the amount of activation storage.

2. The storage of parameters: the weights themselves, their gradients during backpropagation, and the caches used by momentum, Adagrad, or RMSProp. The memory used by the parameters should therefore be estimated as at least 3 times the parameter count.

3. Every network operation keeps various bookkeeping information, such as the current batch of image data. If the estimated memory requirement is too large, reduce the batch size appropriately; after all, the activation values occupy most of the memory.
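
A rough estimator along those lines; the ×2 and ×3 multipliers are just the heuristics described above:

    def training_memory_mb(num_activations, num_params, batch_size, bytes_per_value=4):
        act_bytes = num_activations * 2 * batch_size * bytes_per_value  # values + gradients
        param_bytes = num_params * 3 * bytes_per_value                  # weights + grads + cache
        return (act_bytes + param_bytes) / 1024**2

    # VGGNet-like numbers from the breakdown above: ~24M activations, ~138M parameters
    print(training_memory_mb(24e6, 138e6, batch_size=32))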

Other information

  1. Soumith benchmarks for CONV performance
  2. ConvNetJS CIFAR-10 demo: a browser-based ConvNet demonstration that trains in real time.
  3. Caffe, the popular ConvNets tool
  4. State of the art ResNets in Torch7