The wave of artificial intelligence is sweeping the world, and with it come a lot of terms: Artificial Intelligence, Machine Learning, Deep Learning. This article organizes my notes on Li Hongyi's course; the reference links are given at the end of the article.

6- Introduction to deep learning

The three steps of deep learning

  • Step 1: Define the neural network
  • Step 2: Evaluate the model
  • Step 3: Pick the best function

The neural network

The nodes in a neural network are analogous to the neurons in our brain.

A neural network can connect its neurons in many different ways, and each connection pattern produces a different network structure. Each neuron is a logistic regression function with its own weights and bias; the weights and biases are the parameters of the network. How the neurons are connected is designed by hand.


Fully connected feedforward neural network

Concept: "Feedforward" can also be called "forward". In terms of signal flow, once the input enters the network it travels in one direction only: from each layer to the next, until it reaches the output layer. There is no feedback between any two layers, that is, the signal never returns from a later layer to an earlier one.

After an input is fed in, each layer multiplies it by that layer's weights, adds the bias, and passes the result through the chosen activation function to obtain the input of the next layer. This operation is repeated layer by layer until the network produces its output.

A deep learning network is divided into the following layers:

  • Input layer: 1 layer
  • Hidden layers: N layers
  • Output layer: 1 layer

Fully connected: every neuron in Layer 1 is connected to every neuron in Layer 2.

Feedforward: the signal is only ever passed forward, hence the name.

Deep: Deep = many hidden layers.

Why matrix computation is introduced: As the number of layers increases, the error rate tends to decrease, but the amount of computation grows, often to billions of operations. For such a complex structure we cannot compute the values one by one; looping over billions of operations is very inefficient. Instead we write the computation as matrix operations, which are much faster. A further advantage of writing the network as matrix operations is that they can be accelerated on a GPU.
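The layer-by-layer computation described above can be sketched with matrix operations. This is a minimal illustration using NumPy, assuming a sigmoid activation; the function names `sigmoid` and `forward` and the hand-picked 2-2-1 weights are hypothetical and not taken from the course material.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # layers is a list of (W, b) pairs; each layer computes sigmoid(W @ a + b),
    # so a whole layer is one matrix-vector product instead of a loop over neurons.
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

# Hypothetical 2-2-1 network with hand-picked weights and biases.
layers = [
    (np.array([[1.0, -2.0], [-1.0, 1.0]]), np.array([1.0, 0.0])),
    (np.array([[2.0, -1.0]]), np.array([0.0])),
]
x = np.array([1.0, -1.0])
y = forward(x, layers)
```

The same `forward` function works for a whole batch if `x` is replaced by a matrix whose columns are samples and the biases are reshaped to column vectors, which is the form that benefits most from GPU acceleration.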


Essence

The hidden layers perform feature transformation. They replace hand-crafted feature engineering with learned feature extraction, so the output of the last hidden layer is a new set of features (effectively a black-box operation). The output layer takes the output of the last hidden layer (the good features obtained through feature extraction) as its input and produces the final output y through a multi-class classifier, which can be a softmax function.
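As a small illustration of the multi-class output layer, here is a sketch of the softmax function in NumPy. The input values are made up for the example, and subtracting the maximum is a common numerical-stability trick rather than something stated in the notes.

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

# Hypothetical scores produced by the output layer for three classes.
scores = np.array([3.0, 1.0, -3.0])
probs = softmax(scores)
```

The outputs are positive and sum to 1, so they can be read as class probabilities; the largest score gets the largest probability.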

Model evaluation

For the loss, we do not look at a single example only. We compute the loss on every training example and add these up to get the total loss L. The next step is to find the function in the function set that minimizes the total loss L, or equivalently, the set of network parameters $\theta$ that minimizes the total loss L.

An efficient way to compute the gradient of the loss with respect to the parameters of a neural network is backpropagation. Many frameworks can compute it for us, such as TensorFlow, Theano, and PyTorch.

Choose the best function

How do we find the best function and the best set of parameters? We use gradient descent.

Specific process: $\theta$ is the set of parameters containing the weights and biases. Pick a random initial value, then compute the partial derivative of the loss with respect to each parameter. The vector of these partial derivatives, $\nabla L$, is the gradient. Using the gradient, we update the parameters to get new ones, and repeat the process over and over. In the end we obtain a set of parameters that minimizes the value of the loss function.
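The update loop above can be sketched as follows. This is only an illustration on a hypothetical two-parameter quadratic loss whose minimum is known; the learning rate, step count, and the fixed (rather than random) starting point are assumptions made so the result is reproducible.

```python
import numpy as np

TARGET = np.array([3.0, -1.0])  # hypothetical location of the minimum

def loss(theta):
    # Toy convex loss: squared distance to TARGET.
    return np.sum((theta - TARGET) ** 2)

def grad(theta):
    # Vector of partial derivatives of the toy loss.
    return 2.0 * (theta - TARGET)

def gradient_descent(theta0, lr=0.1, steps=100):
    theta = theta0.copy()
    for _ in range(steps):
        # Move against the gradient: theta <- theta - lr * grad(theta).
        theta = theta - lr * grad(theta)
    return theta

theta = gradient_descent(np.array([0.0, 0.0]))
```

In a real network the gradient is not written by hand as in `grad` here; it is obtained with backpropagation.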

Project Setup Steps

Building a deep learning project from scratch can be divided into the following steps:

  • Part one: Launching a deep learning project
  • Part two: Creating a deep learning data set
  • Part three: Designing deep models
  • Part four: Visualizing deep network models and metrics
  • Part five: Debugging deep learning networks
  • Part six: Improving deep learning model performance and network tuning

These six parts will be explained below.

Part one: Launching a deep learning project

Project research: We first study existing products to find their weaknesses.

Theoretical research: Next, we need to learn about related research and open source projects. Many of us read at least dozens of papers and projects before starting to practice. Deep learning (DL) code is concise, but defects in it are hard to detect, and many research papers omit implementation details. Many projects start from an open source implementation that solves a similar problem, so search for open source projects first.


Part two: Creating a deep learning data set

We can use public or custom data sets. Public data sets offer cleaner samples and baseline model performance. If several public data sets are available, select the one whose samples are of the best quality and most relevant to your problem. If there is no public data set for the domain, we can collect the data and create a data set according to the actual needs of the project.

A high-quality data set should have the following characteristics:

  • The categories are balanced
  • There is enough data
  • The data and labels contain high-quality information
  • There are very few errors in the data and labels
  • It is relevant to your problem

Notes:

  • Use public data sets whenever possible;
  • Look for the best sources of high-quality, diverse samples;
  • Analyze errors and filter out samples that are not relevant to the actual problem;
  • Create your samples iteratively;
  • Balance the number of samples in each category;
  • Shuffle the samples before training;
  • Collect enough samples. If there are not enough samples, apply transfer learning.

Part three: Designing deep models

Start with a flexible and simple model: Begin with fewer network layers and customizations, and then do only the necessary hyperparameter tuning. At every step, verify that the loss keeps decreasing, and do not waste time on larger models at first.

Prioritization and incremental design: Break complex problems down into smaller ones and solve them step by step. We will run into many surprises while designing a model, so priority-driven planning works better than a long-term plan that keeps changing. Use shorter, smaller design iterations to keep the project manageable.

Avoid random improvements: Analyze the weaknesses of your model first instead of making random improvements. Haphazard improvements are counterproductive: training costs rise in proportion while the return is minimal.

Constraints: We apply constraints to the network design to make training more efficient. Building a deep learning model is not simply a matter of stacking network layers. Good constraints can make learning more efficient, or even more intelligent.

Design details: Select a deep learning software framework. Transfer learning: many pre-trained models can be used to solve deep learning problems, so we can first try out our ideas on an existing model and see how well they work.

Cost function: Not all cost functions are equal, and the choice affects how hard the model is to train. Some cost functions are fairly standard, but some problem domains need careful consideration.

  1. Classification: cross entropy, hinge loss (SVM)
  2. Regression: mean squared error (MSE)
  3. Object detection or segmentation: Intersection over Union (IoU)
  4. Policy optimization: KL divergence
  5. Word embedding: Noise Contrastive Estimation (NCE)
  6. Word vectors: cosine similarity
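To make one of the measures in the list concrete, here is a minimal sketch of Intersection over Union for two axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function name `iou` are assumptions chosen for the example.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed format).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Two 2x2 boxes that share a 1x1 corner overlap by 1 out of a combined area of 7, so `overlap` is 1/7.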

Metrics: Good metrics help to better compare and adjust models.

Regularization: L1 and L2 regularization are both common, but L2 is more popular in deep learning. What are the advantages of L1? L1 regularization produces sparser parameters, which helps disentangle the underlying representation. Because every non-zero parameter adds a penalty to the cost, L1 favors parameters that are exactly zero, whereas L2 favors many tiny parameters. L1 makes filters cleaner and easier to interpret, which makes it a good choice for feature selection. L1 is also less sensitive to outliers and works better when the data is less clean. Even so, L2 regularization is often still preferred because its solution may be more stable.

Gradients: Always closely monitor whether the gradients vanish or explode. Gradient problems have many possible causes, and they are hard to pin down. Do not jump to learning rate adjustments or model design changes too quickly. Small gradients may simply be caused by programming bugs, such as input data that is not scaled correctly or weights that are all initialized to zero. If other possible causes have been eliminated, apply gradient clipping when gradients explode (especially for NLP). Skip connections are a common technique for alleviating vanishing gradients. In ResNet, the residual module lets the input bypass the current layer and reach the next layer, which gives the gradient a shorter path and makes very deep networks easier to train.
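One common form of gradient clipping rescales all gradients when their combined norm exceeds a threshold. The sketch below assumes clipping by global L2 norm; the function name `clip_by_global_norm` and the example values are illustrative, and frameworks provide their own versions of this operation.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # L2 norm over all gradient arrays taken together.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total <= max_norm:
        return grads
    # Rescale every array by the same factor so the direction is preserved.
    scale = max_norm / total
    return [g * scale for g in grads]

# Hypothetical gradients for two parameter tensors; their global norm is 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```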

Scaling: Scale the input features. We usually scale features to have zero mean and to lie within a specific range, such as [-1, 1]. Improper feature scaling is one of the most common causes of exploding or vanishing gradients. Sometimes we compute the mean and variance from the training data and use them to standardize the data to zero mean and unit variance. When you scale validation or test data, reuse the mean and variance of the training data.
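A minimal sketch of this rule with NumPy: the statistics come from the training data only and are then reused for the test data. The tiny arrays are made-up values, and the small constant added to the standard deviation is an assumption to guard against division by zero.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
test = np.array([[2.0, 500.0]])

# Statistics are computed from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0) + 1e-8  # small constant avoids division by zero

train_scaled = (train - mean) / std
# The test set reuses the training mean and standard deviation.
test_scaled = (test - mean) / std
```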

Dropout: You can apply dropout to layers to regularize the model. After batch normalization became popular in 2015, the popularity of dropout declined. Batch normalization uses the mean and standard deviation to rescale node outputs. This acts like noise and forces the layer to learn more robustly against variations in its input. Since batch normalization also helps with vanishing gradients, it has gradually replaced dropout. The benefit of combining dropout with L2 regularization is domain specific. Typically, we can test dropout during tuning and collect empirical data to justify its benefit.

Activation function: ReLU is the most commonly used nonlinear activation function in DL. If the learning rate is too high, many nodes may end up with an activation value of zero (dead nodes). If changing the learning rate does not help, we can try Leaky ReLU or PReLU. In Leaky ReLU, when x < 0 the output is not 0 but follows a small predefined slope (such as 0.01, or a value set by a hyperparameter). Parametric ReLU (PReLU) goes one step further: each node has a trainable slope.
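The difference between the two activations is easy to see in code. This is a small NumPy sketch; the default slope of 0.01 follows the example value in the text, and the function names are illustrative.

```python
import numpy as np

def relu(x):
    # Negative inputs are set to zero.
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Negative inputs keep a small slope instead of being zeroed.
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, 0.0, 3.0])
relu_out = relu(x)
leaky_out = leaky_relu(x)
```

For PReLU, `slope` would be a trainable parameter per node rather than a fixed constant.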

Split the data set: To test actual performance, we split the data into three parts: 70% for training, 20% for validation, and 10% for testing. Make sure the samples are thoroughly shuffled in each data set and in each batch of training samples. During training, we use the training set to build models with different hyperparameters. We run these models on the validation set and select the one with the highest accuracy. If the test results differ significantly from the validation results, shuffle the data more thoroughly or collect more data.
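A shuffled 70/20/10 split can be sketched by permuting sample indices. The function name `split_indices` and the fixed seed are assumptions made for the example.

```python
import numpy as np

def split_indices(n, train_frac=0.7, val_frac=0.2, seed=0):
    # Shuffle all indices once, then cut the permutation into three parts.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    # Whatever remains after training and validation becomes the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(100)
```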

Custom layers: The built-in layers of deep learning packages are better tested and optimized. If you do want to write a custom layer, you need to:

  1. Test the forward propagation and backpropagation code with non-random data.
  2. Compare the backpropagation results with a naive numerical gradient check.
  3. Add a small constant to denominators, or compute in log space, to avoid NaN values.
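A naive gradient check compares the analytic gradient with a finite-difference estimate on fixed, non-random input. The sketch below uses a toy function in place of a real layer; `f`, `analytic_grad`, and the step size `eps` are assumptions for illustration.

```python
import numpy as np

def f(w):
    # Toy scalar function standing in for a layer followed by a loss.
    return np.sum(w ** 3)

def analytic_grad(w):
    # Hand-derived gradient, playing the role of the backpropagation code.
    return 3.0 * w ** 2

def numerical_grad(func, w, eps=1e-5):
    # Central differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (func(w + step) - func(w - step)) / (2.0 * eps)
    return grad

w = np.array([1.0, -2.0, 0.5])  # fixed, non-random test input
max_diff = np.max(np.abs(numerical_grad(f, w) - analytic_grad(w)))
```

A `max_diff` that is not close to zero points to a bug in the hand-written backward code.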

Non-determinism: One of the challenges of deep learning is reproducibility. Debugging is difficult if the initial model parameters keep changing between sessions. Therefore, we explicitly set the seeds of all random number generators. In our project we set the seeds for Python, NumPy, and TensorFlow. During fine-tuning, we turn off the explicit seeding so that each run produces a different model. To reproduce the results of a model, we checkpoint it and reload it later.
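Seeding can be sketched as below for Python's `random` module and NumPy. The helper name `set_seeds` is hypothetical; TensorFlow is deliberately left out of the runnable code so the example stays self-contained.

```python
import random
import numpy as np

def set_seeds(seed):
    # Seed Python's built-in generator and NumPy's global generator.
    random.seed(seed)
    np.random.seed(seed)
    # With TensorFlow 2.x, tf.random.set_seed(seed) would be called here too.

set_seeds(42)
first = (random.random(), np.random.rand())
set_seeds(42)
second = (random.random(), np.random.rand())
```

After re-seeding with the same value, both generators repeat the same numbers, so `first` and `second` are identical.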

Optimizer: Adam is one of the most popular optimizers in deep learning. It suits many kinds of problems, including models with sparse or noisy gradients. It is easy to tune and gives good results quickly. In fact, the default parameter configuration usually works just fine. The Adam optimizer combines the advantages of AdaGrad and RMSProp. Rather than using one learning rate for all parameters, Adam maintains a learning rate for each parameter and adapts each one independently as training goes on. Adam is a momentum-based algorithm that makes use of the history of the gradient. As a result, gradient descent runs more smoothly, and the parameter oscillation caused by large gradients and high learning rates is suppressed.

Adam optimizer adjustment

Adam has four configurable parameters:

  1. Learning rate (default 0.001);
  2. β1: the exponential decay rate of the first moment estimate (default 0.9);
  3. β2: the exponential decay rate of the second moment estimate (default 0.999), which should be set close to 1 for problems with sparse gradients;
  4. ε: a small value (default 1e-8) that avoids division by zero.

β (momentum) smooths gradient descent by accumulating historical information about the gradient. In the early stages, the default settings usually work fine. Otherwise, the parameter most likely to need changing is the learning rate.
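To show where the four parameters enter, here is a sketch of a single Adam update following the commonly published form of the algorithm, with bias-corrected moment estimates. The function name `adam_step` and the toy quadratic loss are assumptions for illustration; real projects would use the optimizer shipped with their framework.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: exponential moving average of the squared gradient.
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Per-parameter step size: lr scaled by the root of the second moment.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimize sum(theta ** 2), whose gradient is 2 * theta.
theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
```

On the very first step the update has magnitude close to `lr` regardless of the gradient's scale, which is one reason the default learning rate works across many problems.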


Part four: Visualizing deep network models and metrics

When troubleshooting deep neural networks, people tend to jump to conclusions too soon. Before learning how to fix problems, we should first decide what to look for, rather than spending hours tracking a problem down blindly. This section discusses how to visualize deep learning models and performance metrics.

TensorBoard:

It is important to track every action and check the results at every step. With pre-built packages such as TensorBoard, visualizing models and performance metrics is simple, and the payoff is almost immediate.

Data visualization (input and output):

Validate the input and output of the model. Before feeding data to the model, save some training and validation samples for visual inspection.

Metrics (loss & accuracy):

In addition to logging the loss and accuracy regularly, we can record and plot them to analyze their long-term trends. TensorBoard can display both the accuracy and the cross entropy loss.

Plotting the loss helps us adjust the learning rate. Any long-term rise in the loss indicates that the learning rate is too high. If the learning rate is too low, learning is slow.

A learning rate that is too high can also show up as a sudden spike in the loss (possibly caused by a sudden jump in the gradient).

We use accuracy plots to adjust the regularization factors. A big gap between validation accuracy and training accuracy means the model is overfitting. To mitigate overfitting, we increase the regularization factor.


Part five: Debugging deep learning networks

Deep learning problem solving steps

In early development, we face multiple problems at the same time. As mentioned earlier, deep learning training consists of millions of iterations, so bugs are hard to find and training breaks easily. Start simple and make changes gradually. Model optimizations such as regularization can wait until the code is debugged. Check the basic function of the model first:

  1. Set the regularization factor to 0;
  2. Use no other regularization (including dropout);
  3. Use the default settings of the Adam optimizer;
  4. Use ReLU;
  5. No data augmentation;
  6. Use fewer network layers;
  7. Scale the input data, but skip unnecessary preprocessing;
  8. Do not waste time on long training runs or large batch sizes.

Overfitting the model on a small amount of training data is the best way to debug deep learning. If the loss does not decrease within thousands of iterations, debug the code further. Once the model does better than random guessing, you have reached your first milestone. Then modify the model step by step: add network layers and customizations; start training on the full training data; and add regularization to control overfitting, monitoring the accuracy gap between the training and validation sets.

Initialize the hyperparameters

Many hyperparameters matter mainly for model optimization. Turn them off or use their default values at first. Use the Adam optimizer: it is fast, efficient, and has a good default learning rate. Early problems are mostly bugs rather than model design or fine-tuning issues. Before doing any fine-tuning, go through the checklist below; these problems are common and easy to check. If the loss still does not fall, adjust the learning rate. If the loss falls too slowly, increase the learning rate by a factor of 10. If the loss goes up or the gradient explodes, reduce the learning rate by a factor of 10. Repeat this process until the loss decreases steadily. Typical learning rates lie between 1 and 1e-7.

Checklist

Data:

  1. Visualize and inspect the input data (after preprocessing and before feeding it to the model);
  2. Check the accuracy of the input labels (after the data is shuffled);
  3. Do not feed the same batch of data over and over;
  4. Scale the input data appropriately (typically between -1 and 1, with zero mean);
  5. Check the range of the output (e.g., within -1 to 1);
  6. Always use the mean/variance of the training set to rescale the validation/test set;
  7. All input data of the model has the same dimensions;
  8. Assess the overall quality of the data set (whether there are too many outliers or bad samples).

Model:

  1. Model parameters are initialized correctly, and the weights are not all set to 0;
  2. Debug layers whose activations or gradients vanish or explode, working from the rightmost layer to the leftmost;
  3. Debug layers whose weights are mostly 0 or too large;
  4. Check and test the loss function;
  5. For a pre-trained model, the range of the input data should match the range the model was trained with;
  6. Dropout should always be turned off for inference and testing.

Weight initialization

Initializing all weights to zero is one of the most common mistakes; with it, the deep network does not learn anything. Initialize the weights from a Gaussian distribution instead.
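A minimal sketch of Gaussian weight initialization with NumPy. The standard deviation `sqrt(2 / fan_in)` (He-style scaling, often paired with ReLU) is an assumption chosen for the example, since the notes do not state the scale; the function name `init_layer` is likewise hypothetical.

```python
import numpy as np

def init_layer(fan_in, fan_out, rng):
    # Zero-mean Gaussian weights; the sqrt(2 / fan_in) scale is an assumed choice.
    W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
    # Biases can start at zero because the random weights already break symmetry.
    b = np.zeros(fan_out)
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(256, 128, rng)
```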

Scaling and normalization

Scaling and normalization are well understood, yet they remain among the most underestimated problems. A model is easier to train if its input features and node outputs are normalized. If this is done incorrectly, the loss will not decrease whatever the learning rate. We should monitor histograms of the input features and of the node outputs of each layer. Scale the input appropriately. For node outputs, the ideal shape is zero mean with values that are not too large (positive or negative). If that is not the case and the layer has a gradient problem, apply batch normalization to convolution layers and layer normalization to RNN cells.

Loss function

Check and test the correctness of the loss function. The loss of the model must be lower than the loss of random guessing.
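For a classifier trained with cross entropy, the random-guess baseline is easy to compute: a model that assigns equal probability to each of C classes has a loss of ln(C) per sample. The sketch below assumes balanced classes and natural logarithms; the function name is illustrative.

```python
import math

def chance_cross_entropy(num_classes):
    # Cross entropy of a uniform prediction: -log(1 / C) = log(C).
    return -math.log(1.0 / num_classes)

baseline = chance_cross_entropy(10)
```

With 10 classes the baseline is about 2.30, so a training loss that stays near that value suggests the model has not learned anything yet.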

Error analysis

Inspect the cases where the model performs poorly (its errors), visualize them, and use them to improve the model.

Regularization tuning

Keep regularization turned off (letting the model overfit) until the model makes reasonable predictions.

Once the model code is working, the next parameter to adjust is the regularization factor. We increase the volume of training data and then add regularization to narrow the gap between training and validation accuracy. Do not overdo it, because we still want the model to overfit slightly. Closely monitor the data loss and the regularization loss. The regularization loss should not dominate the data loss over the long run. If a large regularization factor cannot close the gap, debug the regularization code or method.

As with the learning rate, we vary the test values on a logarithmic scale, for example by a factor of 10 at first. Note that each regularization factor may be of a completely different order of magnitude, and we may need to adjust these parameters repeatedly.

Multiple loss functions

In the first implementation, avoid using multiple data loss functions. The weight of each loss function may be of a different order of magnitude, which takes some effort to tune. With only one loss function, we only need to worry about the learning rate.

Frozen variables

When we use a pre-trained model, we can freeze the model parameters of specific layers to speed up computation. Be sure to double check that the right variables are frozen.


Part six: Improving deep learning model performance and network tuning

Increasing model capacity

To increase model capacity, we can gradually add layers and nodes to the deep network (DN). Deeper layers produce more complex models. Parameter tuning is more empirical than theoretical. We gradually add layers and nodes with the intention of overfitting the model, because we can tone it down afterwards with regularization. We repeat this iterative process until the improvement in accuracy no longer justifies the extra training and computation cost.

Vanishing gradients are a serious problem for very deep networks. We can mitigate the problem by adding skip connections (similar to the residual connections in ResNet).

Model & data set design changes

Here is a checklist for improving performance:

  1. Analyze the errors (bad predictions) in the validation data set;
  2. Monitor the activations. When they are not zero-centered or not normally distributed, consider batch normalization or layer normalization;
  3. Monitor the percentage of dead nodes;
  4. Use gradient clipping (especially in NLP tasks) to control exploding gradients;
  5. Shuffle the data set (manually or programmatically);
  6. Balance the data set (so that each category has a similar number of samples).

We should closely monitor the histogram of the values entering each activation function. If they differ greatly in scale, gradient descent will be ineffective; apply normalization. If the deep network has a large number of dead nodes, we should track the problem down further. It can be caused by bugs, weight initialization, or vanishing gradients. If none of these is the cause, try one of the advanced ReLU variants, such as Leaky ReLU.

Data set collection & cleaning

If you want to build your own data set, the best advice is to study carefully how you collect your samples. Find the best sources, filter out all data that is not relevant to your problem, and analyze the errors.

Data augmentation

Collecting labeled data is expensive. For images, we can use data augmentation methods such as rotation, random cropping, and shifting to modify existing data and generate more of it. Color distortions, such as changes in hue, saturation, and exposure, can be used as well.

Semi-supervised learning

We can also supplement the training data with unlabeled data. Use the model to classify the unlabeled data, then add the samples with high-confidence predictions to the training data set, using the predicted labels as their labels.
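The selection step can be sketched as follows, assuming the model outputs one row of class probabilities per unlabeled sample. The 0.9 confidence threshold and the function name `select_pseudo_labels` are assumptions for the example.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    # probs has shape (n_samples, n_classes); each row sums to 1.
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = confidence >= threshold
    # Return the indices of confident samples and their predicted labels.
    return np.flatnonzero(keep), labels[keep]

# Hypothetical model outputs for three unlabeled samples.
probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]])
idx, labels = select_pseudo_labels(probs)
```

Only the first and third samples pass the threshold; the uncertain second sample stays out of the training set.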

Tuning

Adjustment of learning rate

Let us briefly review how to adjust the learning rate. In early development, we turn off any non-critical hyperparameters or set them to 0, including regularization. With the Adam optimizer, the default learning rate usually performs very well. If we are confident in our code but the loss does not decrease, we need to adjust the learning rate. Typical learning rates lie between 1 and 1e-7. Reduce the learning rate by a factor of 10 at a time and test over short runs, closely monitoring the loss. If the loss keeps going up, the learning rate is too high. If the loss does not fall, the learning rate is too low; increase it until the loss starts to flatten prematurely.

Hyperparameter adjustment

Once the model design has stabilized, we can tune the model further. The most frequently tuned hyperparameters are:

  1. Mini-batch size;
  2. Learning rate;
  3. Regularization factor;
  4. Layer-specific hyperparameters (such as dropout).

Mini-batch size

Typical batch sizes are 8, 16, 32, or 64. If the batch size is too small, gradient descent will not be smooth, the model will learn slowly, and the loss may oscillate. If the batch size is too large, each training iteration (one round of updates) takes too long and brings diminishing returns. We closely monitor the overall learning speed and the loss. If the loss oscillates sharply, we know the batch size has been reduced too far. The batch size also influences other hyperparameters such as the regularization factor. Once we have settled on a batch size, we usually lock it.

Learning rate & regularization factor

We can use the methods above to further adjust the learning rate and the regularization factor. We monitor the loss to control the learning rate, and we monitor the gap between validation and training accuracy to adjust the regularization factor. Tuning is not a linear process. The hyperparameters are correlated, so we adjust them repeatedly. The learning rate and the regularization factor are highly correlated and sometimes need to be tuned together. Do not start fine-tuning too early: if the design changes, that effort is wasted.

Dropout

Dropout rates are usually between 20% and 50%. Start with 20%. If the model overfits, increase the value.

Other adjustments

Sparsity and activation function

Sparsity in the model parameters simplifies computational optimization and reduces power consumption (which is critical for mobile devices). If desired, we can replace L2 regularization with L1 regularization.

ReLU is the most popular activation function. In some deep learning competitions, people use more advanced ReLU variants to improve accuracy. In some scenarios these variants also reduce the number of dead nodes.

Advanced tuning

Some advanced fine-tuning methods:

  1. Learning rate decay schedule
  2. Momentum
  3. Early stopping

Instead of using a fixed learning rate, we reduce it periodically. The hyperparameters are how often and by how much the learning rate drops. For example, you can multiply the learning rate by 0.95 every 100,000 iterations. To tune these parameters, we monitor the cost to make sure it is falling faster but not flattening prematurely.
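The example schedule above (multiply by 0.95 every 100,000 iterations) can be written as a small step-decay function. The function name `decayed_lr` and the initial rate of 0.001 are assumptions for the example.

```python
def decayed_lr(initial_lr, step, decay_rate=0.95, decay_steps=100_000):
    # The rate is multiplied by decay_rate once per completed block of decay_steps.
    return initial_lr * decay_rate ** (step // decay_steps)

lr_start = decayed_lr(0.001, 0)
lr_later = decayed_lr(0.001, 250_000)  # two full decay periods have passed
```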

Advanced optimizers use momentum to smooth the gradient descent process. The Adam optimizer has two momentum settings, which control the first-order (0.9 by default) and second-order (0.999 by default) momentum. For problem domains with steep gradients, such as NLP, we can increase the momentum slightly.

When the validation error keeps rising, we can stop training early to alleviate overfitting.

However, this is only the idea in outline. In practice, the validation error may rise temporarily and then fall again. We can checkpoint the model periodically and record the corresponding validation error, and then select the best model afterwards.
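One way to tolerate temporary rises in the validation error is a patience counter: training stops only after the error has failed to improve for several checks in a row. The sketch below works on a made-up list of validation errors; `patience=3` and the function name `best_checkpoint` are assumptions.

```python
def best_checkpoint(val_errors, patience=3):
    # Track the best validation error seen so far and the epoch it occurred at.
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` checks: stop training here.
            break
    return best_epoch, best_error

# Hypothetical validation errors recorded at each checkpoint.
val_errors = [0.50, 0.40, 0.42, 0.35, 0.36, 0.37, 0.38, 0.30]
best_epoch, best_error = best_checkpoint(val_errors)
```

The brief rise at epoch 2 is tolerated, but three checks without improvement after epoch 3 end training, and the checkpoint from epoch 3 is the one to keep. The final value in the list is never reached, which illustrates the trade-off that comes with the choice of patience.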

Grid search

Some hyperparameters are highly correlated. We should tune them together using a grid of candidate values on a logarithmic scale. Grid search is very computationally intensive, so for smaller projects it is used only sporadically. We start by tuning coarse-grained values with fewer iterations. In the later fine-tuning phase, we use longer runs and change the values by a factor of 3 (or less).
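A coarse logarithmic grid over two correlated hyperparameters can be sketched with `itertools.product`. The scoring function here is a made-up stand-in for "train briefly and return validation accuracy"; its peak location, the grid ranges, and the helper names are all assumptions.

```python
import itertools
import math

def log_grid(low_exp, high_exp):
    # Powers of ten from 10**low_exp to 10**high_exp inclusive.
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]

def grid_search(score_fn, learning_rates, reg_factors):
    # Evaluate every combination and keep the highest-scoring pair.
    candidates = itertools.product(learning_rates, reg_factors)
    return max(candidates, key=lambda pair: score_fn(*pair))

def toy_score(lr, reg):
    # Hypothetical score that peaks at lr = 1e-3 and reg = 1e-4.
    return -(math.log10(lr) + 3.0) ** 2 - (math.log10(reg) + 4.0) ** 2

best_lr, best_reg = grid_search(toy_score, log_grid(-5, -1), log_grid(-6, -2))
```

A second, finer pass would then search around `best_lr` and `best_reg` with smaller multiplicative steps.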

Model ensembles

In machine learning, we can make predictions by taking votes from many decision trees. This works well because mistakes are often local in nature: the chance that two models make the same mistake is small. In deep learning, training starts from a random guess (as long as the random seed is not set explicitly), so the optimized model is not unique. We can train multiple times and use the validation data set to pick the best-performing model, or we can let multiple models vote and output the combined prediction. This approach requires multiple training sessions, which can be costly. Alternatively, we can train once, save several checkpoints, and pick the best-performing models along the way. With an ensemble, we can base the prediction on:

  1. A "vote" from each model;
  2. A vote weighted by each model's prediction confidence.

Model ensembles are very effective at improving prediction accuracy on some problems and are often used by teams in deep learning competitions.
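Both voting schemes can be sketched in a few lines of NumPy. The array layouts, function names, and example predictions below are assumptions for illustration; summing class probabilities is one simple way to weight votes by confidence.

```python
import numpy as np

def majority_vote(predictions):
    # predictions has shape (n_models, n_samples) and holds class labels.
    return np.array([np.bincount(col).argmax() for col in predictions.T])

def weighted_vote(probs):
    # probs has shape (n_models, n_samples, n_classes); summing probabilities
    # lets confident models count for more than hesitant ones.
    return probs.sum(axis=0).argmax(axis=1)

# Three hypothetical models predicting three samples.
preds = np.array([[0, 1, 1], [0, 1, 0], [1, 1, 0]])
voted = majority_vote(preds)

# One sample: one very confident model against two hesitant ones.
probs = np.array([[[0.90, 0.10]], [[0.40, 0.60]], [[0.45, 0.55]]])
weighted = weighted_vote(probs)
```

In the second example a plain majority would pick class 1 (two votes to one), while the confidence-weighted vote picks class 0.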

Model improvement

In addition to fine-tuning the model, we can also try different variants of the model to improve performance.

7- Back propagation

  • The parameters are $\theta$ (weights and biases)
  • First choose an initial $\theta^0$, evaluate the loss function at $\theta^0$, and compute the partial derivative with respect to each parameter
  • Once this vector of partial derivatives is computed, you can update $\theta$
  • A network may have millions of parameters
  • Backpropagation is an efficient algorithm for computing this gradient vector.

The chain rule

  • Chained influence: x affects y, and y in turn affects z
  • Backpropagation (BP) mainly uses the chain rule

Back propagation

  1. The loss function is defined on a single training sample: it is the error of one sample. In classification, for example, it is the difference between the predicted category and the actual category, written as L for one sample.
  2. The cost function is defined on the whole training set: it is the average of the errors of all samples, that is, the average of the loss function. Whether or not we take the average does not affect the final parameter solution.
  3. The total loss function is defined on the whole training set: it is the sum of the errors of all samples. This is what we normally minimize with backpropagation.

Our goal is to compute $\frac{\partial z}{\partial w}$ (the forward pass) and $\frac{\partial L}{\partial z}$ (the backward pass). Multiplying $\frac{\partial z}{\partial w}$ by $\frac{\partial L}{\partial z}$ gives $\frac{\partial L}{\partial w}$. In this way we obtain the gradient for every parameter in the neural network, update the parameters with gradient descent, and arrive at the function with the smallest loss.
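The product of the forward-pass and backward-pass terms can be checked on a single neuron. This sketch assumes a sigmoid activation and a squared-error loss, neither of which is fixed by the notes, and compares the chain-rule result with a finite-difference estimate; all names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # One neuron: z = w * x + b, a = sigmoid(z), L = 0.5 * (a - y) ** 2.
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    a = sigmoid(w * x + b)
    dz_dw = x                      # forward pass: the input to the weight
    dL_dz = (a - y) * a * (1 - a)  # backward pass: dL/da times da/dz
    return dz_dw * dL_dz           # chain rule: dL/dw = dz/dw * dL/dz

w, b, x, y = 0.5, -0.2, 2.0, 1.0
eps = 1e-6
analytic = grad_w(w, b, x, y)
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2.0 * eps)
```

The two values agree closely, which is the same check described earlier for custom layers, applied here to the chain rule itself.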