
Author: Gao Chengcai, Android development engineer at Tencent. He joined Tencent in April 2016 and is mainly responsible for the Penguin Esports streaming SDK, as well as feature development and technical optimization of the Penguin Esports app. This article was first published in the QQ Member Technical Team column.

This article is a summary of CS231n study notes, supplemented with material from the Deep Learning Book, hands-on TensorFlow practice, and some knowledge of the Caffe framework.

1. Convolutional Neural Networks

1.1 Convolutional neural network and conventional neural network

1.1.1 Similarities

A convolutional network is a kind of neural network specialized for processing data with a grid-like structure. It is very similar to a conventional neural network: both are made up of neurons with learnable weights and biases. Each neuron receives some input, performs an inner product, and then applies an activation function. The whole network is still a single differentiable scoring function: its input is the raw image pixels and its output is a score for each category. The last layer (usually a fully connected layer) still has a loss function (such as SVM or Softmax), and all the tricks and techniques developed for regular neural networks still apply to convolutional neural networks.

1.1.2 Differences

The structure of a convolutional neural network is based on the assumption that the input is an image, and this assumption lets us build special properties into the architecture. These properties make the forward pass more efficient and greatly reduce the number of parameters in the network. A conventional neural network takes a vector as input and transforms it through a series of hidden layers. Each hidden layer consists of a number of neurons, and each neuron is connected to all the neurons in the previous layer; within a single hidden layer, however, the neurons are independent of each other and share no connections.

Conventional neural networks scale poorly to large images. In CIFAR-10, each image is 32x32x3 (32 pixels wide and high, with three color channels), so a single fully connected neuron in the first hidden layer has 32x32x3 = 3,072 weights. This amount still seems manageable, but the fully connected structure clearly does not scale to larger images. For example, for a 200x200x3 image, a single neuron would need 200x200x3 = 120,000 weights, and the network certainly has more than one neuron, so the number of parameters grows very quickly. Clearly this fully connected approach is wasteful, and the huge number of parameters quickly leads to overfitting.

A convolutional neural network, by contrast, exploits the fact that its input consists of images and arranges its structure more sensibly, which gives it a considerable advantage. Unlike a conventional neural network, the neurons in each layer of a convolutional neural network are arranged in three dimensions: width, height and depth (depth here refers to the third dimension of the activation volume, not the depth of the whole network, which refers to the number of layers). As we will see, the neurons in a layer are only connected to a small region of the previous layer instead of being fully connected. Schematic diagrams of a conventional neural network and a convolutional neural network are shown below:

1.2 Convolutional neural network structure

A simple convolutional neural network is a sequence of layers, and every layer transforms one volume of activations to another through a differentiable function. Convolutional neural networks are mainly built from three types of layers: convolutional layers, pooling layers and fully connected layers (the fully connected layer is the same as in conventional neural networks). Stacking these layers produces a complete convolutional neural network, with the structure shown in the figure below:

1.2.1 Convolutional layer

The parameters of a convolutional layer consist of a set of learnable filters. Each filter is small spatially (along width and height), but its depth matches the depth of the input volume. During the forward pass, each filter is slid (more precisely, convolved) across the width and height of the input, and at every position the inner product between the entire filter and the input is computed. As the filter slides over the width and height of the input, it produces a 2-dimensional activation map that gives the response of that filter at every spatial position. Intuitively, the network learns filters that activate when they see certain types of visual features, such as an edge in some orientation or a blotch of color on the first layer, or eventually honeycomb or wheel-like patterns higher up in the network. Each convolutional layer has an entire set of filters (say 12), and each of them produces a separate 2-dimensional activation map. Stacking these activation maps along the depth dimension produces the output volume, as shown on the left of the figure below, where a 32x32 image passes through multiple filters to produce the output:

(1) Local connection

Local connectivity greatly reduces the number of parameters in the network. When dealing with high-dimensional inputs such as images, it is impractical to connect every neuron to all the neurons in the previous layer. Instead, each neuron is connected to only a local region of the input volume. The spatial extent of this connectivity is called the receptive field of the neuron, and its size is a hyperparameter (it is, in fact, the spatial size of the filter). Along the depth axis, the extent of the connectivity is always equal to the depth of the input volume. Note again that we treat the spatial dimensions (width and height) differently from the depth dimension: the connections are local in space (width and height), but always cover the full depth of the input volume.

(2) Spatial arrangement

The connectivity between each neuron in the convolutional layer and the input volume has been explained above, but we have not yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: depth, stride, and zero-padding. We discuss them in turn:

  1. Depth of the output volume

It corresponds to the number of filters used, each of which looks for something different in the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may be activated by edges in different orientations or by blobs of color. We call a set of neurons that all look at the same region of the input and are arranged along the depth direction a depth column. As shown in the figure below, the convolutional layer has 6 filters, so the depth of the output volume is also 6.

  2. Stride

When sliding the filter, the stride must be specified. When the stride is 1, the filter moves 1 pixel at a time. When the stride is 2 (or, uncommonly, 3 or more, which is rarely used in practice), the filter jumps 2 pixels at a time. This produces spatially smaller output volumes. As shown in the figure below, with a stride of 1, a 6x6 input produces a 4x4 output.

  3. Zero-padding

A convolutional filter reduces the spatial size of the data, as shown in the figure below: a 32x32x3 input passed through a 5x5x3 filter produces a 28x28x1 output.

It is convenient to pad the input volume with zeros around its border, and the size of this zero-padding is a hyperparameter. Zero-padding has the nice property that it lets us control the spatial size of the output volume (most commonly it is used to preserve the spatial size of the input volume, so that the input and output have the same width and height). The figure below shows zero-padding being used to keep the width and height of the input and output the same.

The output size of a convolutional layer is computed as follows:

Suppose the size of the input volume is W1 x H1 x D1.

The four hyperparameters of the convolutional layer are: the number of filters K, the filter size F, the stride S, and the amount of zero-padding P.

Then the size of the output volume W2 x H2 x D2 is given by:

W2 = (W1 - F + 2P)/S + 1
H2 = (H1 - F + 2P)/S + 1
D2 = K

A common setting of these hyperparameters is F=3, S=1, P=1. For example, if the input is 7x7, the filter is 3x3, the stride is 1 and the padding is 0, then the output is 5x5. If the stride is 2, the output is 3x3.
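As a quick sanity check, this formula can be evaluated in a few lines of Python (a minimal sketch; the function name conv_output_size is only for illustration):

def conv_output_size(w1, h1, d1, k, f, s, p):
    # Output width/height follow the formula above; the output depth equals the number of filters K.
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2, k

print(conv_output_size(7, 7, 3, 6, 3, 1, 0))    # (5, 5, 6): 7x7 input, 3x3 filter, stride 1, no padding
print(conv_output_size(7, 7, 3, 6, 3, 2, 0))    # (3, 3, 6): same, but stride 2
print(conv_output_size(32, 32, 3, 1, 5, 1, 0))  # (28, 28, 1): the 32x32x3 input and 5x5x3 filter example above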

(3) Parameter sharing

Parameter sharing is used in the convolutional layer to control the number of parameters: all the local connections made by a given filter, across every position of the previous layer, share the same parameters, which again greatly reduces the number of parameters in the network.

It relies on a reasonable assumption: if a feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). Based on this assumption, the number of parameters can be reduced dramatically. In other words, each single 2-dimensional slice along the depth dimension is treated as a depth slice; for example, a volume of size 55x55x96 has 96 depth slices, each of size 55x55. All the neurons in one depth slice use the same weights and bias. During backpropagation, every neuron still computes the gradient with respect to its weights, but the gradients from all neurons in the same depth slice are accumulated into the gradient of the shared weights, so only a single set of weights is updated per slice.

Since all the neurons in a depth slice use the same weight vector, the forward pass of the convolutional layer in each depth slice can be computed as a convolution of the neuron's weights with the input volume (this is where the name "convolutional layer" comes from). This is also why these sets of weights are called filters (or convolution kernels): they are convolved with the input. The figure below shows the convolution process as an animation:
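To make the "weights convolved with the input" picture concrete, here is a minimal NumPy sketch of the forward pass for a single filter, i.e. one depth slice of the output, assuming stride 1 and no zero-padding; it is purely illustrative and not how real frameworks implement convolution:

import numpy as np

def conv_forward_single_filter(x, w, b):
    # x: input volume (H, W, D); w: one filter (F, F, D); b: scalar bias.
    # Returns one 2-D activation map, i.e. one depth slice of the output volume.
    H, W, D = x.shape
    F = w.shape[0]
    out_h, out_w = H - F + 1, W - F + 1      # stride 1, no zero-padding
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Inner product of the shared filter weights with the local region they currently cover.
            out[i, j] = np.sum(x[i:i+F, j:j+F, :] * w) + b
    return out

x = np.random.randn(6, 6, 3)                 # 6x6x3 input volume
w = np.random.randn(3, 3, 3)                 # one 3x3x3 filter, reused at every spatial position
print(conv_forward_single_filter(x, w, 0.0).shape)   # (4, 4), matching (6 - 3)/1 + 1 = 4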

1.2.2 Pooling layer

It is common to periodically insert a pooling layer between successive convolutional layers. Its function is to progressively reduce the spatial size of the representation, which reduces the number of parameters in the network, cuts the amount of computation, and also helps control overfitting. The pooling layer operates independently on every depth slice of the input volume using a MAX operation and resizes it spatially. The most common form is a pooling layer with 2x2 filters applied with a stride of 2, which downsamples every depth slice and discards 75% of the activations. Each MAX operation takes the maximum over 4 numbers (a 2x2 region in a depth slice). The depth dimension remains unchanged.

The output size of a pooling layer is computed as follows:

Suppose the size of the input volume is W1 x H1 x D1.

The pooling layer has two hyperparameters: the spatial extent F and the stride S.

Then the size of the output volume W2 x H2 x D2 is given by:

W2 = (W1 - F)/S + 1
H2 = (H1 - F)/S + 1
D2 = D1

Because the pooling layer computes a fixed function of its input, it introduces no parameters. Zero-padding is rarely used in pooling layers.

In practice, the max pooling layer comes in only two common variants: F=3, S=2, and the more common F=2, S=2. Pooling over larger receptive fields is usually too destructive to the network.

General pooling: besides max pooling, the pooling units can also apply other functions, such as average pooling or L2-norm pooling. Average pooling was historically common but has fallen out of favor, because in practice max pooling has been shown to work better. A diagram of the pooling layer is shown below:

Backpropagation: recall that the backward pass of the max(x, y) operation simply routes the gradient to the input that had the largest value in the forward pass. Therefore, during the forward pass through the pooling layer it is common to record the index of the maximum activation (sometimes called the switches), so that gradient routing is efficient during backpropagation.
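A small NumPy sketch of 2x2 max pooling with stride 2 on one depth slice, recording the argmax "switches" described above so the backward pass can route the gradient only to the winning position (illustrative only):

import numpy as np

def max_pool_2x2_with_switches(x):
    # x: one depth slice of shape (H, W), with H and W even.
    # Returns the pooled map and the positions of the maxima ("switches") for backprop.
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    switches = {}
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            region = x[i:i+2, j:j+2]
            r, c = np.unravel_index(np.argmax(region), region.shape)
            out[i // 2, j // 2] = region[r, c]
            switches[(i // 2, j // 2)] = (i + r, j + c)   # where the gradient is routed back
    return out, switches

pooled, switches = max_pool_2x2_with_switches(np.random.randn(4, 4))
print(pooled.shape)   # (2, 2): width and height halved, 75% of the activations discarded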

1.2.3 Normalization Layer

Many types of normalization layers have been proposed for convolutional network architectures, sometimes with the intention of implementing inhibition schemes observed in the biological brain. However, these layers have gradually fallen out of favor, because in practice their contribution has been shown to be minimal, if any.

1.2.4 Fully connected layer

Neurons in a fully connected layer have full connections to all activations in the previous layer, as in a regular neural network. Their activations can therefore be computed with a matrix multiplication followed by a bias offset.

1.3 Commonly used CNN models

1.3.1 LeNet

This was the first successful application of a convolutional neural network, implemented by Yann LeCun in the 1990s. It was used for handwritten digit recognition and is the "Hello World" of learning neural networks. The network structure is shown below:

C1 layer: a convolutional layer containing 6 convolution kernels of size 5x5, producing 6 feature maps, each of size 32-5+1=28.

S2 layer: a down-sampling layer that uses max pooling with a 2x2 pooling size, producing six 14x14 feature maps.

C3 layer: a convolutional layer containing 16 convolution kernels, still of size 5x5, producing 16 feature maps, each of size 14-5+1=10.

S4 layer: a down-sampling layer, again 2x2 max pooling, producing 16 feature maps of size 5x5.

C5 layer: a convolutional layer that uses 120 convolution kernels of size 5x5 and finally outputs 120 feature maps of size 1x1.

This is followed by the fully connected layers and the final classification.
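The feature-map sizes above can be checked with the output-size formula from section 1.2.1 (a minimal sketch, assuming no zero-padding, as in the layer descriptions above):

size = 32
for layer, f, s in [("C1 conv 5x5", 5, 1), ("S2 pool 2x2", 2, 2),
                    ("C3 conv 5x5", 5, 1), ("S4 pool 2x2", 2, 2),
                    ("C5 conv 5x5", 5, 1)]:
    size = (size - f) // s + 1   # output size of each layer, no padding
    print(layer, "->", size)     # prints 28, 14, 10, 5, 1, matching the sizes listed above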

1.3.2 AlexNet

AlexNet was highly significant: it demonstrated the effectiveness of CNNs with a complex model and made neural networks shine in the field of computer vision. It was implemented by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton, and won the ImageNet ILSVRC competition in 2012, far outperforming the runner-up (16% top-5 error rate vs. 26%). Its structure is very similar to LeNet, but deeper and larger, and it stacks several convolutional layers on top of each other to extract features (previously a single convolutional layer was always immediately followed by a pooling layer). The structure diagram is as follows:

The details of each layer are shown below:

AlexNet input size is 227x227x3.

The first convolutional layer consists of 96 filters of size 11x11 with a stride of 4. The output size is (227-11)/4 + 1 = 55 and the depth is 96. It has approximately 11x11x3x96 ≈ 35,000 parameters.

The second layer is a 3x3 pooling layer with a stride of 2, so its output size is (55-3)/2 + 1 = 27.

This was the first time ReLU was used, or at least the first time ReLU was promoted.

It used a normalization layer (local response normalization) that is not used much anymore.

It used a lot of data augmentation.

1.3.3 ZFNet

The network created by Matthew Zeiler and Rob Fergus won the ILSVRC 2013 competition and is called ZFNet (short for Zeiler & Fergus Net). It improves on AlexNet by tweaking the hyperparameters of the architecture: it enlarges the middle convolutional layers and makes the stride and filter size of the first layer smaller. The adjustments were made based on experiments. The structure diagram is as follows:

1.3.4 VGGNet

VGGNet makes no exotic architectural choices and spends little effort on deciding the number and size of filters for each layer. The whole VGG network uses only 3x3 convolution kernels with a stride of 1 and 2x2 pooling windows with a stride of 2, and these settings are kept throughout. The key design question in VGG is how many times to repeat this pattern of layers; the final configuration repeats the same structure to a depth of 16 weight layers, presumably chosen because it was found to give the best performance. Each image takes roughly 200 MB of memory, and all the parameters together add up to about 140 million.

Why use 3x3 filters? A 3x3 kernel is the smallest receptive field that can still capture the notions of up, down, left, right and center. Several stacked 3x3 convolutional layers also contain more non-linearities than a single convolutional layer with a larger filter. With a stride of 1, two stacked 3x3 filters have an effective receptive field of 5x5, and three stacked 3x3 filters have an effective receptive field of 7x7, so the stack can replace a larger filter size. Several 3x3 convolutional layers also have fewer parameters than a single large filter. Suppose the input and output feature maps of the convolutional layers both have 10 channels. Then three stacked 3x3 convolutional layers contain 3 x (3x3x10x10) = 2,700 parameters, since three 3x3 filters can be regarded as a decomposition of one 7x7 filter (with non-linearities inserted between the layers), whereas a single 7x7 convolutional layer contains 7x7x10x10 = 4,900 parameters. 1x1 filters are used to apply a linear transformation to the input without changing the input and output dimensions, followed by a ReLU non-linearity, which increases the non-linear expressive power of the network.
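These counts can be verified directly (a small sketch; 10 is the number of input and output channels assumed in the example above):

channels = 10                                    # input and output feature maps, as in the text
three_3x3 = 3 * (3 * 3 * channels * channels)    # three stacked 3x3 convolutional layers
one_7x7 = 7 * 7 * channels * channels            # one 7x7 convolutional layer with the same receptive field
print(three_3x3, one_7x7)                        # 2700 vs 4900 weights

# With stride 1, each stacked 3x3 layer grows the receptive field by 2 pixels: 3 -> 5 -> 7.
receptive = 1
for _ in range(3):
    receptive += 2
print(receptive)                                 # 7, the same region one 7x7 filter sees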

The downside of VGGNet is that it takes more computing resources and uses more parameters, as shown below:

1.3.5 GoogLeNet

VGGNet performs well, but has a large number of parameters. Generally speaking, the most direct way to improve network performance is to increase the depth and width of the network, which means a huge number of parameters. However, a huge number of parameters easily leads to overfitting and greatly increases the amount of computation.

It is generally believed that the fundamental way to address both of these drawbacks is to turn fully connected, and even ordinary convolutional, structures into sparse connections. On the one hand, the connectivity of real biological nervous systems is also sparse; on the other hand, the literature shows that for large-scale sparse neural networks, an optimal network can be constructed layer by layer by analyzing the statistical properties of the activations and clustering highly correlated outputs. A "sparsely connected structure" can be understood as using many "small" and "scattered" stackable network modules to learn complex classification tasks. Inception may improve the network's accuracy because it contains multiple kernels at different scales, each of which learns different features; the features learned by the different kernels are then aggregated and passed to the next layer, giving a more comprehensive representation.

A common Inception structure is shown below:

The dimensionality-reduction version of Inception is shown below; it reduces 256 dimensions to 64, cutting the number of parameters:

Why does the VGG network have so many parameters? Simply because it has two fully connected layers of 4,096 units at the end. Szegedy learned this lesson, and in order to compress GoogLeNet's parameter count, he removed the fully connected layers. The complete structure of GoogLeNet is shown below:

1.3.6 ResNet

The residual network, developed by Kaiming He and his colleagues, not only won ImageNet in 2015 but also won quite a few other contests, almost all of the important ones. A famous obstacle in optimizing deep networks is vanishing and exploding gradients, which can be addressed with sensible initialization and a few other techniques. However, as network depth increases, accuracy saturates and then degrades rapidly, a phenomenon called degradation that is widespread in deep networks and shows that not every system is equally easy to optimize.

Simply increasing the number of layers in a plain network does not help. As shown in the figure above, on CIFAR-10 the solid lines are the error rates on the test set and the dashed lines are the error rates on the training set: the deeper network actually has a higher error rate, which is counter-intuitive, since a deeper network has more capacity. The reason is that we are not optimizing the parameters well enough to take advantage of that capacity. The residual network, in contrast, keeps improving its training and test error rates as the network depth increases. Training ResNet took 2-3 weeks on 8 GPUs.

Kaiming He proposed deep residual learning to solve this problem. First, suppose the mapping we want to learn is H(x). From the observations above we realize that finding H(x) directly is not that easy, so instead we learn the residual form of H(x), F(x) = H(x) - x, under the assumption that learning F(x) is easier than learning H(x); then F(x) + x gives us what we want. In a nutshell, this is the structure shown in the figure above, which we call a residual block. Many people are puzzled by the second assumption, namely why F(x) should be easier to learn than H(x); this is not clearly explained in the paper, but the conclusion can be drawn from the experimental results reported later.
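A minimal TensorFlow 1.x sketch of the residual block idea, H(x) = F(x) + x, assuming the input and output shapes match so the identity shortcut can be added directly (the function name residual_block is only for illustration, and batch normalization, used in the real ResNet, is omitted):

import tensorflow as tf

def residual_block(x, channels):
    # Two 3x3 convolutions compute the residual F(x); the identity shortcut x is then added.
    w1 = tf.Variable(tf.truncated_normal([3, 3, channels, channels], stddev=0.1))
    w2 = tf.Variable(tf.truncated_normal([3, 3, channels, channels], stddev=0.1))
    f = tf.nn.relu(tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='SAME'))
    f = tf.nn.conv2d(f, w2, strides=[1, 1, 1, 1], padding='SAME')
    # The addition also gives the gradient a direct path back through the skip connection.
    return tf.nn.relu(f + x)

x = tf.placeholder(tf.float32, [None, 28, 28, 16])
y = residual_block(x, 16)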

ResNet also has these interesting skip connections, as shown in the figure above. During backpropagation in a residual network, in addition to the gradient that propagates back through the weight layers, there are also the skip connections; because they are additive, they spread the gradient out and let it flow back to the earlier parts of the network, so that features very close to the image can still be trained.

From the statistics of depth and error rate for the ImageNet-winning networks shown below, we can see that the networks keep getting deeper and the error rates keep getting lower.

2. TensorFlow in Practice

2.1 Comparison of open source frameworks for deep learning

There are many deep learning frameworks, and we will introduce only two that are widely used:

TensorFlow: TensorFlow is an open-source software library that performs numerical computation using data flow graphs. It is a high-quality codebase, backed by Google's development and maintenance capability, and well architected. As a giant company, Google can invest far more resources in TensorFlow than universities or individual developers, so TensorFlow can be expected to grow rapidly and leave frameworks maintained by universities or individuals far behind. TensorFlow is a relatively high-level machine learning library: users can design neural network structures without writing C++ or CUDA code for efficiency. It also ships higher-level components such as TF.Learn and TF-Slim to help design new networks quickly. Another important feature of TensorFlow is its flexible portability: the same code can be deployed with little modification to PCs, servers, or mobile devices with any number of CPUs or GPUs. Besides convolutional neural networks (CNN) and recurrent neural networks (RNN), TensorFlow also supports deep reinforcement learning and other computationally intensive scientific computing, such as solving partial differential equations.

Caffe: Caffe is a widely used open-source deep learning framework; before TensorFlow appeared it was the most starred deep learning project on GitHub. Caffe's main advantages are: 1. ease of use, since network structures are defined in configuration files and no code is needed to design a network; 2. fast training; and 3. modular components that can easily be extended to new models and learning tasks. However, Caffe was originally designed only for images, without considering text, speech or time-series data, so while its support for convolutional neural networks is very good, its support for time series, RNNs and LSTMs is not sufficient.

2.2 Setting up the TensorFlow Environment

2.2.1 Operating System

TensorFlow runs on Windows, Linux, and macOS. I used 64-bit Ubuntu 16.04. TensorFlow depends on a Python environment; Python 3.5 is used as the base Python version here. Anaconda is recommended as the Python environment because it avoids a number of compatibility issues.

2.2.2 Installing Anaconda

Anaconda is a scientific-computing distribution of Python that bundles hundreds of libraries Python commonly uses, some of which are also dependencies of TensorFlow. The Python version of Anaconda must match the one required by TensorFlow, otherwise there will be problems. Download Anaconda3-4.2.0-Linux-x86_64.sh and run:

bash Anaconda3-4.2.0-Linux-x86_64.sh

After the installation, Anaconda asks whether to add its binary path to .bashrc. It is recommended to accept, so that the python command automatically uses Anaconda's Python 3.5.

2.2.3 Installing TensorFlow

TensorFlow comes in CPU and GPU versions. If your computer has an NVIDIA graphics card, the GPU version is recommended to speed up training. The CPU version is straightforward to install and will not be covered here; the following focuses on installing the GPU version. First run:

lspci | grep -i nvidia

to view your NVIDIA graphics card model, then check developer.nvidia.com/cuda-gpus to see whether your card supports CUDA; only CUDA-capable GPUs can run the TensorFlow GPU version. My computer has a GeForce 940M, which supports CUDA.

(1) Install CUDA and cuDNN

Download the corresponding CUDA version from NVIDIA's official website; the file used here is cuda_8.0.61_375.26_linux.run. The download can be slow, so using a download manager such as Xunlei (Thunder) is recommended. Before installing, the NVIDIA X Server has to be stopped: press Ctrl + Alt + F2 to switch to the Ubuntu console (on some machines Fn + Ctrl + Alt + F2 is required), then run

sudo /etc/init.d/lightdm stop

to stop the X Server. Then run the following commands to install:

chmod u+x cuda_8.0.61_375.26_linux.run
sudo ./cuda_8.0.61_375.26_linux.run

Press Q to skip the license at the beginning, type accept to accept the agreement, and then press Y to install the driver. In the subsequent options, do not install OpenGL, otherwise the system may loop back to the login page. Then press N to skip installing the samples.

cuDNN is NVIDIA's highly optimized implementation of CNN and RNN primitives for deep learning. It uses many advanced low-level techniques and interfaces, so its performance on GPUs is much higher than that of other neural network libraries. Downloading cuDNN from the official website requires registering an NVIDIA account and waiting for approval. Then run

cd /usr/local
sudo tar -xzvf ~/Downloads/cudnn-8.0-linux-x64-v6.0.tgz

CuDNN installation is complete.

(2) Install TensorFlow

Download tensorflow_gpu-1.2.1-cp35-cp35m-linux_x86_64.whl from GitHub, then run the following command to complete the installation:

pip install tensorflow_gpu-1.2.1-cp35-cp35m-linux_x86_64.whl

2.3 Hello World: MNIST Handwritten Digit Recognition

2.3.1 Using a linear model with a Softmax classifier

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print(mnist.train.images.shape, mnist.train.labels.shape)
print(mnist.test.images.shape, mnist.test.labels.shape)
print(mnist.validation.images.shape, mnist.validation.labels.shape)

sess = tf.InteractiveSession()

x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 10])

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

tf.global_variables_initializer().run()
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(1000)
    train_step.run({x: batch_xs, y_: batch_ys})

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(accuracy.eval({x: mnist.test.images, y_: mnist.test.labels}))

2.3.2 Using a CNN model

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
sess = tf.InteractiveSession()

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')


x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
x_image = tf.reshape(x, [-1, 28, 28, 1])

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

tf.global_variables_initializer().run()
for i in range(10000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g" % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

3. Hero Recognition in King of Glory

In the Penguin Esports business scenario, we need to identify which King of Glory hero a streamer is currently playing. This was our first attempt at using deep learning for classification, with Caffe as the framework.

3.1 Design Features

The first step is to prepare the training data, which means designing the features to classify based on the business scenario. We could use the entire game screen, the hero portrait, or the skill buttons as the classification feature, as shown below. The entire game screen contains many irrelevant elements, and the region that actually identifies the hero is small. The hero portrait is a distinctive feature, but each hero has multiple skins, which makes the data hard to prepare and hurts recognition accuracy. The skill-button region is therefore a good choice as the classification feature.

When designing classification features, the features should be as distinctive as possible, and features of different classes should differ as much as possible.

3.2 Collecting Data

Collecting and labeling data is tedious but important work; the same model can behave very differently when trained on different data. We downloaded a 15-minute video for each hero from YouTube and used the ffmpeg command below to crop out the skill-button region and scale it down.

ffmpeg -i videos/1/test.mp4 -r 1 -vf "crop=380:340:885:352,scale=224:224" images/1/test_%4d.png

The image folders of the dataset are shown below. Noise images that do not contain the skill buttons are filtered out and placed in the folder 0, which serves as the background class, i.e. the class for frames where no hero is recognized.

Each class contains roughly 1,000-2,000 images. The larger the dataset the better, of course, and it should cover as many scenarios as possible to increase the generalization ability of the model.

3.3 Processing Data

Once the dataset has been generated, it needs to be converted into a format Caffe can read. First, a small Python script (shown on the left of the figure below) generates the lists of images belonging to the training and test sets; the generated files, shown on the right, contain the image path and class label. The test set is 1/5 of all images.
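A minimal sketch of such a list-generation script, assuming the directory layout images/<label>/xxx.png used above and the train.txt/val.txt output names (both names are assumptions here):

import os

root = "images"   # assumed layout: images/<label>/xxx.png, where <label> is the class id (0 = background)
with open("train.txt", "w") as train_f, open("val.txt", "w") as val_f:
    for label in sorted(os.listdir(root)):
        files = sorted(os.listdir(os.path.join(root, label)))
        for i, name in enumerate(files):
            line = "%s/%s %s\n" % (label, name, label)   # "relative_path label", the format Caffe expects
            # Every 5th image goes to the test list, so the test set is 1/5 of all images.
            (val_f if i % 5 == 0 else train_f).write(line)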

The images are then converted into LMDB format; Caffe provides tools to help with the conversion. The processing script is shown below.
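A minimal sketch of such a conversion, calling Caffe's convert_imageset tool from Python (the caffe_root path, list file names and LMDB output names are assumptions carried over from the sketch above):

import os
import subprocess

caffe_root = "/path/to/caffe"   # assumed location of the Caffe checkout
tool = os.path.join(caffe_root, "build/tools/convert_imageset")
for list_file, lmdb_dir in [("train.txt", "train_lmdb"), ("val.txt", "val_lmdb")]:
    # convert_imageset takes the image root folder, the "path label" list file and the output LMDB directory.
    subprocess.check_call([tool, "--shuffle", "images/", list_file, lmdb_dir])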

3.4 Selecting a model

The six models introduced above are all commonly used image classification models; here we choose GoogLeNet as the network. The Caffe project ships these network definitions in its models folder. Looking at the caffe/models/bvlc_googlenet folder, it contains the files shown below:

3.4.1 solver.prototxt

The solver.prototxt file defines the parameters used to train the model; their specific meanings are shown below:

3.4.2 train_val.prototxt

train_val.prototxt defines the structure of the GoogLeNet network. The network file is long; the first part is the data layer, whose parameters are shown in the figure below:

The structures of convolutional layer, ReLU layer and pooling layer are shown in the figure below:

The structure of LRN layer is shown in the figure below:

The Dropout layer and the fully connected layer are shown below:

3.4.3 deploy.prototxt

The deploy file defines the network structure used for deployment. Most of its content is the same as in the train_val.prototxt file, with some of the training/testing-specific content removed.

3.5 Training Model

With the data and model ready, it is time to train the model. Start training with the following command:

caffe train -solver solver.prototxt

You can add the -snapshot parameter to resume training from a previous snapshot, as shown in the figure below:

When the model converges or the accuracy reaches our requirements, the training can be stopped.

3.6 Trimming the Model

GoogLeNet itself is a large network, and we can trim it to our own needs. The LRN layer in GoogLeNet has little effect and can be removed, and some convolutional layers can also be deleted to reduce the depth of the network.

3.7 Fine-tuning

When the amount of data is relatively small, training all of the network's parameters from scratch tends to overfit, so instead we train only some of the layers. In train_val.prototxt, the output dimensions of the three classifier layers (loss1/classifier, loss2/classifier, loss3/classifier) need to be changed to our own number of classes; when the pre-trained model is loaded, the parameters of these three layers are re-initialized. Then set lr_mult to 0 for all the other layers, so that their parameters stay fixed at the pre-trained values. Download bvlc_googlenet.caffemodel, the parameters Google trained on ImageNet, and run caffe train -solver solver.prototxt -weights bvlc_googlenet.caffemodel.

Tips

Local minima: a useful trick is to reduce the batch size at the beginning of training.

Loss explosion: reduce the learning rate.

Loss never converges: suspect a problem with the dataset or the labels.

Overfitting: use data augmentation, regularization, Dropout, batch normalization, and early stopping.

Weight initialization: generally Xavier or Gaussian initialization.

Fine-tuning: fine-tune an existing model such as GoogLeNet or VGG.

References:

1. This article is mainly compiled from the CS231n course notes; the Chinese translation and the Chinese open course are both highly recommended.

2. The Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville is highly recommended and well worth reading.

3. Zhou Zhihua's Machine Learning is a good introduction to machine learning.

4. TensorFlow实战 (TensorFlow in Practice) is also good.

5. The Stanford deep learning course.

Editor: Li Tao

Related reading

When deep learning meets automatic text summarization

A review of CNN model compression and acceleration algorithms



This article is published by the Tencent Cloud Technology Community with the author's authorization.