New Year’s Day: after a long pause, the dove official account is finally back. This time the topic is deep learning, drawing on the development experience shared by Andrej Karpathy and other experienced practitioners. We first look at the overall architecture of neural networks from a theoretical and technical perspective; this section describes a systematic process for building neural network models.

1. Introduction

There are a lot of pitfalls in training neural networks; contrary to what we might assume, it is not a switch we can simply flip on. In many cases the network is built incorrectly (the data augmentation flips the images but forgets to flip the labels, an autoregressive model takes the very thing it is predicting as an input, the weights or regularization are misconfigured, and so on), yet most of the time it still trains and we cannot tell that anything is wrong. So the most important ingredients for successfully developing a neural network are a complete, systematic process, patience, and attention to detail.

```python
your_data = ...  # plug in your dataset here
model = SuperCrossValidator(SuperDuper.fit, your_data, ResNet50, SGDOptimizer)  # set up your network
```

It looks easy to get started with training neural networks, because many libraries and frameworks let us "solve" a data problem in 20 or 30 lines of code, which creates the false impression that deep learning is plug and play. In reality it is not: once we move away from training a standard ImageNet classifier, neural networks are not an off-the-shelf technology, and if you do not understand how the technique works, you will run into many unexpected failures.

2. Silent failure of neural network training

When we misconfigure ordinary code, we usually get an exception: you plugged in an integer where a string was expected, a function expected 3 arguments, an import failed, a key does not exist, the two lists have different numbers of elements. With neural networks, that is only the beginning. Code can be syntactically correct and still be wrong as part of the whole network, and these problems are hard to find. For example, backpropagation is a leaky abstraction: if you try to ignore how it works, you will not be able to deal with its failure modes, and the networks you build and debug will be much less effective.

For example:

  • Vanishing gradients through the Sigmoid: the nonlinearity can saturate completely and stop learning, so the training loss goes flat and refuses to go down. This can happen when the weight initialization is too large, so the output of the matrix multiplication has a very wide range; the local gradient of the Sigmoid nonlinearity is z*(1-z), which is close to 0 when z saturates, so the gradients flowing to both x and W become (almost) zero.

  • ReLU: the ReLU nonlinearity thresholds neurons at 0. The core of the forward and backward pass of a fully connected layer with ReLU is:

    z = np.maximum(0, np.dot(W, x))  # forward pass
    dW = np.outer(z > 0, x)          # backward pass: local gradient for W

    If a neuron is clamped to zero in the forward pass (i.e. z = 0, it does not fire), then its weights receive a zero gradient. This is the "dying ReLU" problem: if a ReLU neuron is unluckily initialized so that it never fires, or if a large update during training knocks its weights into that regime, the neuron dies permanently. It is like permanent, irreversible brain damage: such a neuron never activates for any example in the entire training set and stays dead forever.

  • Exploding gradients in RNNs: consider this example from CS231n:

The RNN is unrolled over T time steps. When we look at backpropagation, the gradient signal travelling backwards through all the hidden states is multiplied by the same matrix (the recurrence matrix Whh) over and over, interspersed with backprop through the nonlinearity. It is like taking a number a and repeatedly multiplying it by another number b (i.e. a*b*b*b*b*...): the sequence either goes to zero if |b| < 1 or explodes to infinity if |b| > 1. The same thing happens in the backward pass of an RNN, except that b is a matrix rather than a single number, so what matters is its largest singular value. The sketch below illustrates this.
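
Below is a minimal NumPy sketch (illustrative only, not from the original post; the hidden size, number of steps, and scales are made up) of this repeated-multiplication effect: the backward gradient norm collapses or explodes depending on whether the largest singular value of Whh is below or above 1.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, T = 64, 50                                 # hidden size and number of unrolled steps

for scale in (0.5, 1.5):                           # largest singular value of Whh below / above 1
    Whh = rng.standard_normal((hidden, hidden))
    Whh *= scale / np.linalg.norm(Whh, 2)          # rescale so the top singular value equals `scale`
    grad = rng.standard_normal(hidden)             # gradient arriving at the last hidden state
    for _ in range(T):
        grad = Whh.T @ grad                        # backprop through one time step (nonlinearity ignored)
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(grad):.3e}")
```

With scale 0.5 the gradient norm shrinks towards zero (vanishing gradients); with 1.5 it blows up (exploding gradients), which is exactly the RNN failure mode described above.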

Everything can be syntactically correct and the resulting network can still be badly broken, and this is the really annoying part. Perhaps you forgot to flip the labels when you flipped the images left-right during data augmentation; the network may still appear to work, because it learns internally to detect flipped images and then flips its predictions back. Or an autoregressive model accidentally takes the thing it is predicting as an input. Or you meant to clip your gradients but clipped the loss instead, so outlier examples are silently ignored during training (the sketch below shows the difference). If the model we build is wrong, we are lucky when it fails loudly; most of the time it trains anyway, just a bit worse...
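
As a concrete illustration of the gradient-clipping trap mentioned above, here is a small PyTorch sketch (not the author's code; the tiny model and toy data are placeholders). The correct call clips the gradients; the commented-out line is the silent bug in which the loss gets clamped instead, quietly ignoring hard or outlier examples.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
toy_loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(5)]  # stand-in data

for x, y in toy_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    # The silent bug: clamping the loss instead of the gradients drops the
    # learning signal from hard/outlier examples while training still "works".
    # loss = torch.clamp(loss, max=10.0)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # correct: clip the gradients
    optimizer.step()
```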

3. The development recipe

Given the problems above, when we apply neural networks to a new problem we should follow a process: respect its rules, build from simple to complex, make concrete hypotheses about what should happen at each step, and verify them with experiments or visualizations until we find the problem. If instead we pile on untested complexity all at once, it can take a very long time to find what is wrong. The rest of this section walks through the development process.

3.1 Inspect the data

The first step of training a neural network does not involve touching any neural network code at all; it starts with a thorough inspection of the data. This step is critical: a single mistake in data handling can noticeably affect the results. Look for duplicate examples, corrupted images and labels, and class imbalance, and think carefully about how the classification task is defined: whether local or global features of the samples matter, what preprocessing and normalization are appropriate, and how noisy the images are. Once we have a good feel for the data, we can search/filter/sort it (for example by label type), visualize the distributions, and look at the outliers along each axis; outliers almost always reveal data-quality problems or preprocessing bugs. A small sketch of this kind of data pass follows.
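
The sketch below shows what such a pass might look like (purely illustrative: the dataset/train/<label>/*.jpg layout and the use of Pillow are assumptions, not part of the original post). It checks class balance, duplicate files, and corrupted images.

```python
from collections import Counter
from pathlib import Path
import hashlib

from PIL import Image  # assumes Pillow is installed

data_dir = Path("dataset/train")                       # assumed layout: dataset/train/<label>/<image>.jpg
label_counts = Counter(p.parent.name for p in data_dir.glob("*/*.jpg"))
print("label distribution:", label_counts)             # look for class imbalance

seen = {}
for p in data_dir.glob("*/*.jpg"):
    digest = hashlib.md5(p.read_bytes()).hexdigest()
    if digest in seen:
        print("possible duplicate:", p, "==", seen[digest])
    seen[digest] = p
    try:
        Image.open(p).verify()                         # cheap check for corrupted files
    except Exception as err:
        print("corrupted image:", p, err)
```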

3.2 Build a complete training-evaluation framework

Once the data has been examined, the next stage is to build the complete training + evaluation skeleton and verify its reliability through a series of experiments. We can start with a simple model, or a very small network that is hard to get wrong, train it, visualize the losses, accuracy, and model predictions, and run ablation experiments with explicit hypotheses along the way.

Tips and tricks for this stage:

  • Fix the random seed: using a fixed random seed guarantees that running the code twice gives the same result (the first sketch after this list uses this).
  • Simplify: no data augmentation at this stage; it is a regularization strategy we do not need yet.
  • Do not worry about evaluation time: when plotting the test loss, evaluate it over the entire test set rather than over mini-batches, even if that takes longer.
  • Verify the loss at initialization: for example, if the last layer is initialized correctly, the softmax loss at initialization should be about -log(1/n_classes). The same kind of default value can be derived for L2 regression, the Huber loss, and so on.
  • Init well: initialize the final layer weights correctly; setting them properly speeds up convergence and removes the "hockey stick" loss curve, where the network spends the first few iterations basically just learning the bias.
  • Human baseline: in addition to the loss, monitor metrics that are human-interpretable and checkable, and compare against your own (human) accuracy whenever possible. Alternatively, annotate the test data twice, treating one annotation as the prediction and the other as the ground truth.
  • Input-independent baseline: train a baseline that cannot see the input (for example, with all inputs set to zero) and check that the real model does better, i.e. that it actually learns to extract information from the input.
  • Overfit one batch: overfit a single small batch, increasing the model's capacity until you reach the lowest achievable loss (close to zero); this verifies the pipeline end to end (see the first sketch after this list).
  • Verify decreasing training loss: at this stage the model underfits the dataset, so increasing its capacity a little should bring the training loss down; if it does not, something is wrong.
  • Visualize just before the net: visualize the data and the decoded labels exactly as they go into the network, i.e. right before y_hat = model(x) (or before sess.run in TensorFlow); this catches many preprocessing and augmentation bugs.
  • Visualize prediction dynamics: visualize the model's predictions on a fixed test batch throughout training. These predictions give a dynamic picture of how training progresses; if the network wobbles back and forth, it may not fit the data well and is showing instability, and a learning rate that is too high or too low also tends to show up as jitter.
  • Use backprop to chart dependencies: vectorized and broadcasting operations are easy to get wrong, and the resulting bugs are hard to find because the network usually keeps training. One way to debug them is to set the loss to something trivial, such as the sum of all outputs of example i, run the backward pass all the way to the input, and make sure only input i receives a non-zero gradient. The same strategy can verify that an autoregressive model at time t depends only on steps 1..t-1. More generally, gradients tell you what depends on what in your network (see the second sketch after this list).
  • Generalize a special case: when writing model code, first write a very specific version for the case at hand, get it to work, and only then generalize it (add loops, vectorize), making sure you still get the same result.
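
The first sketch below ties together three of the checks above: fixing the random seed, verifying the loss at initialization against -log(1/n_classes), and overfitting a single batch. It is purely illustrative; the linear model, batch of random data, and learning rate are placeholders, not the author's code.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(42)                                  # fixed seed -> repeatable runs

n_classes = 10
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, n_classes))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1, 28, 28)                         # one fixed batch stands in for real data
y = torch.randint(0, n_classes, (32,))

# 1) The loss at init should be close to -log(1/n_classes), about 2.303 for 10 classes.
print("init loss:", loss_fn(model(x), y).item(), "expected ~", -math.log(1 / n_classes))

# 2) A healthy pipeline should be able to drive the loss on this one batch towards zero.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # a larger lr is fine for this quick test
for step in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print("loss after overfitting one batch:", loss.item())
```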
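
The second sketch illustrates the "use backprop to chart dependencies" check (again illustrative; the small MLP and batch size are made up): sum the outputs of example i only, backprop to the input, and verify that only example i receives a non-zero gradient. A stray reshape or broadcast that mixes examples across the batch dimension would fail this test.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(8, 16, requires_grad=True)        # a batch of 8 examples

i = 3
loss = model(x)[i].sum()                          # a trivial loss that depends only on example i
loss.backward()

grad_norm_per_example = x.grad.norm(dim=1)
print(grad_norm_per_example)                      # only row i should be non-zero
assert (grad_norm_per_example[torch.arange(8) != i] == 0).all()
```
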
3.3 Overfit

At this stage we understand the dataset well and have a complete training + evaluation pipeline that we can trust and repeat for any given model, together with baselines to compare against. We are now ready to iterate towards a good model. Finding a good model usually involves two stages: first obtain a model large enough that it can overfit (focus on the training loss), and then regularize it properly (give up some training loss to improve the validation loss).

Tips and tricks for this stage:

  • Picking the model: choosing an appropriate model is the key point, and simpler is usually better. Resist the temptation to use exotic architectures at this stage; the best approach is to find the most related papers and copy-paste their simplest architecture that achieves good performance, then build on top of that. Standing on the shoulders of giants applies here too.
  • Adam is safe: when setting hyperparameters, Adam with a learning rate of 3e-4 is a safe choice, because Adam is much more forgiving of hyperparameters. A well-tuned SGD will almost always end up slightly better than Adam for ConvNets, but its good learning-rate range is much narrower and problem-specific. In the initial stages (and for RNNs and related sequence models) it is usually wise to use Adam (see the sketch after this list).
  • Complexify only one thing at a time: if you have several ideas for improving the model, add them one at a time and make sure each one gives the performance improvement you expect.
  • Do not trust learning rate decay defaults: disable learning rate decay at first (train with a constant learning rate) and tune the schedule by hand at the very end, to avoid the learning rate decaying to 0 prematurely.
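
A minimal sketch of this stage's optimizer setup (illustrative; the linear model is a stand-in): Adam at 3e-4 with a constant learning rate and no decay schedule yet.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)   # the "safe" default suggested above

# Deliberately no learning-rate scheduler at this stage; if one is added later,
# check its defaults so the rate is not silently decayed to ~0 too early, e.g.:
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```
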
3.4 Regularize

At this stage we have a model that fits the training set well, so now we regularize it, giving up some training accuracy in exchange for validation accuracy.

Tips and tricks for this stage:

  • Get more data: the best and preferred way to regularize a model in any practical setting is to add more real training data. It is unwise to spend a lot of engineering time trying to squeeze "juice" out of a small dataset when you could be collecting more data; adding more data is about the only guaranteed way to keep improving the performance of a neural network.
  • Data augmentation: augmenting the real data is the next best way to regularize.
  • Creative augmentation: more creative ways of expanding the dataset also count, for example domain randomization, using simulation to insert data into scenes, and the usual CV tricks such as image flipping.
  • Pretrain: start from a pretrained network whenever possible.
  • Stick with supervised learning: not unsupervised pretraining (at least for now).
  • Smaller input dimensionality: if the dataset is small, any spurious input feature you add is one more opportunity to overfit, so remove inputs that carry little real signal.
  • Smaller model size: constrain the model size and eliminate unnecessary parameters.
  • Decrease the batch size: with batch normalization, a smaller batch size corresponds to somewhat stronger regularization, because the per-batch empirical mean/std are noisier approximations of the full mean/std.
  • Dropout: add dropout, but use it with care, as it does not always play nicely with batch normalization.
  • Weight decay: increase the weight decay penalty.
  • Early stopping: stop training based on the measured validation loss, catching the model just as it is about to overfit (see the sketch after this list).
  • Try a larger model: mentioned last because, combined with early stopping, a larger model will often end up performing better than a smaller one.
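
The sketch below combines a few of the items above (dropout, weight decay, and early stopping on the validation loss) in one toy loop. It is illustrative only; the model, random data, patience value, and checkpoint path are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)  # weight decay penalty
loss_fn = nn.CrossEntropyLoss()

x_train, y_train = torch.randn(512, 100), torch.randint(0, 10, (512,))        # stand-in data
x_val, y_val = torch.randn(128, 100), torch.randint(0, 10, (128,))

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")   # keep the checkpoint just before overfitting
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # early stopping on validation loss
            break
```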

Finally, to gain extra confidence that the model is doing something sensible, visualize the first-layer weights of the network and check that they form meaningful, edge-like filters. If the first layer looks like noise, something may be wrong; similarly, noisy activations inside the hidden layers can hint at problems. A small sketch of this check follows.
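
Here is one way such a check might look (illustrative; it assumes a recent torchvision, downloads pretrained ResNet-18 weights, and uses matplotlib, none of which are part of the original post): plot the first convolutional layer's filters and look for smooth, oriented, edge-like patterns rather than noise.

```python
import matplotlib.pyplot as plt
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")            # any trained ConvNet would do
filters = model.conv1.weight.detach().clone()                           # shape (64, 3, 7, 7)
filters = (filters - filters.min()) / (filters.max() - filters.min())   # rescale to [0, 1] for display

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0))                                        # channels-last for imshow
    ax.axis("off")
plt.show()
```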

3.5 Tune hyperparameters

This step keeps us in the loop with training, exploring architectures and hyperparameters to reach a low validation loss for our model.

Tips and tricks for this stage:

  • Random search over grid search: when tuning several hyperparameters at once, a grid search sounds appealing because it covers all combinations, but random search is usually better. Neural networks are often much more sensitive to some parameters than to others; if parameter a matters but changing b has little effect, we would rather sample a more thoroughly than evaluate it at only a few fixed points (see the sketch after this list).
  • Hyper-parameter optimization: a Bayesian hyperparameter optimization toolbox can also help.
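
A small sketch of random search (illustrative; train_and_eval is a placeholder for your own training run, and the sampling ranges are made up): sampling the learning rate log-uniformly covers the sensitive parameter far better than a handful of fixed grid points.

```python
import random

def sample_config():
    return {
        "lr": 10 ** random.uniform(-5, -2),             # log-uniform over [1e-5, 1e-2]
        "weight_decay": 10 ** random.uniform(-6, -3),
        "dropout": random.uniform(0.0, 0.5),
    }

def train_and_eval(config):
    # Placeholder: replace with a real training run that returns the validation loss.
    return random.random()

results = [(train_and_eval(cfg), cfg) for cfg in (sample_config() for _ in range(30))]
best_loss, best_config = min(results, key=lambda r: r[0])
print("best validation loss:", best_loss, "with config:", best_config)
```
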
3.6 Final results

Once we have found the best hyperparameters and the best model architecture, there are still a few ways to squeeze out extra accuracy:

Tips and tricks for this stage:

  • Ensembles: ensembling several models improves accuracy (see the sketch after this list).
  • Leave it training: when the accuracy of the network seems to have plateaued, try simply letting it train for much longer; networks can keep improving slowly for a surprisingly long time.
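
A toy sketch of ensembling (illustrative; the three linear models stand in for independently trained networks): average the softmax outputs of the ensemble members and take the argmax of the averaged probabilities.

```python
import torch
import torch.nn as nn

models = [nn.Linear(20, 5) for _ in range(3)]    # stand-ins for independently trained models
x = torch.randn(16, 20)                          # a batch of test inputs

with torch.no_grad():
    probs = torch.stack([torch.softmax(m(x), dim=1) for m in models]).mean(dim=0)
pred = probs.argmax(dim=1)                       # ensembled prediction per example
```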

4. Conclusion

Drawing on the experience shared by Andrej Karpathy and other experienced practitioners, this post has summarized the ingredients of successfully building neural networks. I believe this will be of great help as we go on to explore complex models, improve them, and reproduce papers. Mastering the whole system helps us go further and further down the road of "stacking the building blocks".

Recommended reading

  • Differential operator method
  • PyTorch is used to build a neural network model for handwriting recognition
  • PyTorch was used to build neural network models and back-propagation calculations
  • How to optimize model parameters and integrate models
  • TORCHVISION Target detection fine tuning tutorial
  • Principal component analysis (PCA) method steps and code details
  • Neural network coding categorical_embdedder