On May 18, 2018, Liu Fanping, a senior R&D engineer at Baidu, delivered a speech titled "Deep Learning Model Design Experience Sharing" at "Baidu Deep Learning Open Course · Hangzhou Station: The Road to Rapid Advancement for AI Engineers". As the exclusive video partner, IT Dakashuo (WeChat ID: Itdakashuo) is authorized to release the video with the approval of the host and the speaker.

Word count: 3633 | Reading time: about 10 minutes

The guest's speech video and PPT are available at:
suo.im/5o3jWS

Abstract

This talk shares experience in deep learning model design from four aspects: data preparation, model structure, optimization methods, and computational performance.

Brief introduction to the R&D process

Based on personal experience, the general R&D process can be roughly divided into 8 steps.

  1. Problem analysis to determine requirements

  2. Data analysis to determine the value of existing data (mainly based on characteristics and distribution)

  3. Feature extraction, which determines features according to the value of data and the problem to be solved

  4. Data preparation: Prepare the training set, test set, and validation set after completing the first three steps

  5. Model analysis, which determines the selected model based on the input and output of the problem

  6. Parameter training: iterate the model until it converges

  7. Validation and tuning: evaluate each metric of the model and tune it to its best state

  8. Application launch: deploy the model to provide services, either offline or online

The motivation and goals of model design matter throughout the entire process. They include defining the requirements and the problem, building a mathematical model of the problem, determining the relationship between the data and the problem to be solved, exploring whether the problem can be solved, and ensuring the goals are achievable and assessable.

Experience with data preparation

Sufficient data

For model design, there must first be sufficient data, at two levels. The first is data features, which determine whether the design goal can be achieved at all: the features should have some "causal relationship" with the target, and their distributions should be informative. The second is data volume: the data set should be as large as possible, since DNNs need large amounts of data and easily overfit on small data sets. If possible, it is recommended to try to extend the original data set.

Data preprocessing

Data preprocessing is a headache for many people in the industry, and there are different solutions for different scenarios.

Let me briefly introduce some common approaches. The first is de-meaning, which subtracts the mean of all the data from each sample, centering every dimension of the input data at 0. After de-meaning the features are centered, but they are still on different scales and hard to compare, so normalization is applied: divide the data in each dimension by that dimension's standard deviation. In addition, PCA/whitening is suitable for image processing. Adjacent pixels in an image are highly correlated, which makes convergence slow; PCA removes the correlation between these adjacent features so that convergence is reached faster.
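A minimal sketch of de-meaning and normalization with NumPy, assuming the data is an array of shape (num_samples, num_features); the function and variable names are illustrative:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Zero-center each feature, then scale it by its standard deviation."""
    mean = X.mean(axis=0)                     # per-dimension mean
    X_centered = X - mean                     # de-mean: center every dimension at 0
    std = X_centered.std(axis=0)              # per-dimension standard deviation
    X_normalized = X_centered / (std + eps)   # normalize; eps avoids division by zero
    return X_normalized, mean, std

# Example usage (X_train is a hypothetical (num_samples, num_features) float array):
# X_train_norm, mean, std = standardize(X_train)
# Reuse the same mean and std on the validation and test sets.
```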

Shuffling the data

Each epoch contains many batches. Ordinarily these batches are identical from epoch to epoch, but ideally each epoch should see different batches. Therefore, if conditions permit, shuffle (randomize) the data once per epoch so that every epoch gets different batches.
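A minimal NumPy sketch of per-epoch shuffling; the function name and batch size are illustrative:

```python
import numpy as np

def batches_for_epoch(X, y, batch_size, rng):
    """Yield mini-batches from a fresh random permutation, so every epoch sees different batches."""
    order = rng.permutation(len(X))            # new shuffle for this epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# rng = np.random.default_rng(0)
# for epoch in range(num_epochs):
#     for xb, yb in batches_for_epoch(X_train, y_train, 64, rng):
#         ...  # train on (xb, yb)
```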

Experience with model structure

Hidden layer neuron estimation

A BP neural network adds a hidden layer to the single-layer perceptron, but there is no solid theoretical basis for choosing the number of hidden-layer nodes.

In a multi-layer perceptron, the numbers of nodes in the input and output layers are fixed, while the number of nodes in the hidden layer is not, and it affects the performance of the neural network. We can estimate it with an empirical formula:

h = sqrt(m + n) + a

where h is the number of hidden-layer nodes, m is the number of input-layer nodes, n is the number of output-layer nodes, and a ranges from 1 to 10. The resulting h is therefore a range, and it is generally recommended to take a power of 2 within that range.
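A small illustrative helper for this formula; the function name is hypothetical, and it simply enumerates the range and the powers of 2 that fall inside it:

```python
import math

def hidden_size_candidates(m, n):
    """Empirical range for hidden-layer nodes: h = sqrt(m + n) + a, with a in [1, 10]."""
    low = math.sqrt(m + n) + 1
    high = math.sqrt(m + n) + 10
    # Powers of 2 inside [low, high], as recommended above
    powers_of_two = [2 ** k for k in range(1, 16) if low <= 2 ** k <= high]
    return low, high, powers_of_two

# Example: 64 input nodes and 10 output nodes
# hidden_size_candidates(64, 10)  # roughly (9.6, 18.6, [16])
```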

Weight initialization policy

Reasonable weight initialization can improve performance and speed up training. It is recommended to initialize the weights uniformly within a certain interval.

For linear models, the suggested interval is [-v, v], where v = 1/sqrt(input layer size) and sqrt denotes the square root. The interval for convolutional neural networks follows a similar formula, except that the input layer size is replaced by kernel width * kernel height * input depth.
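A minimal NumPy sketch of this initialization scheme; the layer shapes are illustrative:

```python
import numpy as np

def uniform_init(shape, fan_in, rng):
    """Draw weights uniformly from [-v, v] with v = 1 / sqrt(fan_in)."""
    v = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-v, v, size=shape)

rng = np.random.default_rng(0)

# Linear layer: fan_in is the input layer size
W_linear = uniform_init((784, 256), fan_in=784, rng=rng)

# Convolutional layer: fan_in is kernel_width * kernel_height * input_depth
W_conv = uniform_init((64, 3, 3, 3), fan_in=3 * 3 * 3, rng=rng)
```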

Using activation functions effectively

The Sigmoid and Tanh functions are expensive, saturate easily, and cause vanishing gradients, which can stop backpropagation. In fact, the deeper the network, the more Sigmoid and Tanh should be avoided; simpler and more efficient activation functions such as ReLU and PReLU can be used instead.

ReLU is a very useful nonlinear function that solves many problems. However, because ReLU outputs zero for negative inputs, it blocks backpropagation there, so with a poor initialization, fine-tuning a model with ReLU may produce no fine-tuning effect at all. It is recommended to use PReLU with a very small multiplier (usually 0.1): convergence is faster, and it does not get stuck during initialization the way ReLU can. ELU is also very good, but more expensive to compute.

ReLU is actually a special case of Maxout. Maxout is a learnable activation function that splits the units into groups and takes the maximum of each group as the output. In the example from the slides, with groups of two elements, (5, 7) and (-1, 1), the outputs are 7 and 1.
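A minimal NumPy sketch of Maxout over flat groups, reproducing the example above; the function name is illustrative:

```python
import numpy as np

def maxout(x, group_size):
    """Maxout: split the units into groups and keep the maximum of each group."""
    x = np.asarray(x, dtype=float)
    return x.reshape(-1, group_size).max(axis=1)

# Example from the text: groups of two, (5, 7) and (-1, 1) -> 7 and 1
print(maxout([5, 7, -1, 1], group_size=2))  # [7. 1.]
```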

Validation of model fitting ability

Many people have run into model overfitting, and it causes plenty of problems, but from another point of view it has its uses: you can exploit it to validate a model. Training a complex model on large-scale data takes a lot of time, which drives up development cost.

Before training on the full data set, we can run a fitting check on a small, randomly sampled subset. If the model overfits this subset, we can infer that the network is very likely to converge on the full data.
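A minimal PyTorch sketch of this sanity check; the model, tensors, loss, and optimizer settings are assumptions for illustration:

```python
import torch
import torch.nn as nn

def can_overfit_subset(model, X_small, y_small, steps=200, lr=1e-2):
    """Try to drive the loss near zero on a small random subset.
    If the model overfits this subset, it is likely able to converge on the full data."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(X_small), y_small)
        loss.backward()
        optimizer.step()
    return loss.item()

# Example with hypothetical tensors X_small (N, D) and y_small (N,):
# final_loss = can_overfit_subset(my_model, X_small, y_small)
# A final loss near zero suggests the architecture has enough capacity to converge.
```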

Focus on the design of Loss

The design of the Loss should be reasonable: it should simply and directly reflect the ultimate goal of the model, have a well-behaved gradient, and be solvable. In addition, do not focus so much on Accuracy (the evaluation metric) during training that you neglect the design of the Loss.

Experience with optimization methods

Learning rate optimization method

Adjusting the learning rate can also improve results. The figure in the slides shows Loss curves under different learning rates; clearly, the optimal rate is the one whose Loss curve descends smoothly (the red curve). You can refer to that figure when tuning the learning rate.

Convolution kernel size optimization

When several small convolution kernels are stacked, the receptive field over the original image is the same as with one large kernel, but the number of parameters and the computational cost are greatly reduced. We therefore recommend "small and deep" models and advise against "big and shallow" ones. Small kernels can replace large ones: for example, a 5*5 kernel can be replaced by two 3*3 kernels, and a 7*7 kernel by three 3*3 kernels.
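A minimal PyTorch sketch comparing the two options (the channel count is illustrative); it confirms that two stacked 3*3 kernels cover the same receptive field as one 5*5 kernel with fewer parameters:

```python
import torch.nn as nn

channels = 64

# One 5*5 convolution: 64 * 64 * 5 * 5 = 102,400 weights
big = nn.Conv2d(channels, channels, kernel_size=5, padding=2, bias=False)

# Two stacked 3*3 convolutions cover the same 5*5 receptive field
# with 2 * 64 * 64 * 3 * 3 = 73,728 weights plus an extra nonlinearity.
small_and_deep = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
)

print(sum(p.numel() for p in big.parameters()))             # 102400
print(sum(p.numel() for p in small_and_deep.parameters()))  # 73728
```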

General optimization method selection

The learning rate, the number of training steps, and the batch size are coupled with the choice of optimization method. A common scheme is an adaptive learning rate method (RMSProp, Momentum, Adam, etc.), which updates the learning rate automatically. Alternatively, SGD can be used with a manually chosen learning rate and momentum, with the learning rate decaying over time. In practice, adaptive optimizers tend to converge faster than SGD, but their final performance is usually somewhat worse.

In general, switching from Adam to SGD is good practice for high-performance training. Since the early stage of training is exactly when SGD is most sensitive to parameter tuning and initialization, Adam can be used first, saving time without having to worry about initialization and hyperparameter tuning. After Adam has run for a while, switch to SGD with momentum for the best final performance.

For sparse data, however, it is suggested to use adaptive learning rate methods wherever possible.
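A minimal PyTorch sketch of the Adam-then-SGD switch described above; the epoch counts, learning rates, and data-loader structure are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, num_epochs=60, switch_epoch=10):
    """Start with Adam, then switch to SGD with momentum and a decaying learning rate."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = None
    for epoch in range(num_epochs):
        if epoch == switch_epoch:
            # Switch to SGD + momentum for the final stretch; decay the rate over time.
            optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
            scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(xb), yb)
            loss.backward()
            optimizer.step()
        if scheduler is not None:
            scheduler.step()
```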

Effect visualization

Normally, the visualized weights of a convolutional layer in a convolutional neural network look smooth. The two figures in the slides show the visualized filter weights of the first convolution layer of a neural network. If you see something like the left-hand figure, with noisy, unstructured filters, you should consider where the model design went wrong.
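A minimal sketch of such a visualization with PyTorch and Matplotlib, assuming the first layer is a Conv2d with 3 input channels so each filter can be shown as a small RGB image; the function name and grid layout are illustrative:

```python
import matplotlib.pyplot as plt

def show_first_layer_filters(conv_layer, cols=8):
    """Plot each filter of a first convolutional layer as a small image."""
    weights = conv_layer.weight.detach().cpu()                             # shape: (out_ch, 3, kH, kW)
    weights = (weights - weights.min()) / (weights.max() - weights.min())  # scale to [0, 1]
    rows = (weights.shape[0] + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < weights.shape[0]:
            ax.imshow(weights[i].permute(1, 2, 0).numpy())                 # (kH, kW, 3) for imshow
    plt.show()

# Example: show_first_layer_filters(model.conv1) for a hypothetical model whose first layer is a Conv2d
```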

Experience with computational performance

Computational performance analysis methods

A computing platform has two important metrics: computing power and bandwidth. Computing power is the upper limit of the platform's performance, defined as the number of floating-point operations per second (FLOPS) it can perform at its best. Bandwidth is the amount of memory it can exchange per second at its best, in bytes per second. Dividing computing power by bandwidth gives the platform's upper limit on computational intensity, that is, the maximum number of computations per byte of memory exchanged, in FLOPs/Byte.

A model also has two important metrics: the amount of computation and the amount of memory access. The amount of computation is the number of floating-point operations needed for one complete forward pass on a single input sample, i.e., the model's time complexity, expressed in FLOPs. The amount of memory access is the amount of memory exchanged during one complete forward pass on a single input sample, i.e., the model's space complexity; the data type is usually float32.

Let's look at an example of model computational performance. Suppose we have two 1000*1000 matrices A and B, both of type float32, and we compute C = A*B. This performs about 1000*1000*1000 floating-point multiply-adds, roughly 2 GFLOPs. The process reads matrices A and B and writes matrix C, accessing at least three matrices' worth of memory, about 12 MB.
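The arithmetic in this example can be reproduced directly in plain Python:

```python
# Worked example: C = A * B with A, B of shape (1000, 1000), dtype float32
n = 1000
bytes_per_float32 = 4

flops = 2 * n * n * n                          # one multiply + one add per inner-product term
memory_bytes = 3 * n * n * bytes_per_float32   # read A, read B, write C

print(flops / 1e9)         # 2.0 (GFLOPs)
print(memory_bytes / 1e6)  # 12.0 (MB)
```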

Even such a simple 1000*1000 matrix multiplication costs 12 MB, so you can imagine how many resources a more complex model consumes. With computational performance in mind, deep learning is not always the necessary choice when selecting a model; space and time complexity also have a large influence. For model selection, try linear models and tree-based models first, then the classical models of traditional machine learning if those do not fit, and only then neural network models.

Dropout & Distributed Training

Based on our experience, Dropout is recommended for both fully connected and convolutional layers when the number of nodes in a single layer is greater than 256.
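A minimal PyTorch sketch of this rule of thumb; the layer widths and dropout rate are illustrative:

```python
import torch.nn as nn

# Apply Dropout after layers wider than 256 nodes, per the recommendation above.
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 512 > 256, so add Dropout here
    nn.Linear(512, 128),
    nn.ReLU(),           # 128 <= 256, no Dropout needed
    nn.Linear(128, 10),
)
```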

Among the developers I have met personally, many do not attach much importance to distributed training; most run high-performance graphics cards on a single machine. But distributed training may be more efficient, so it is worth considering a move in that direction.