Paper-reading series on computer vision: Inception-v2 / BN-Inception

Our whole life is about escaping what others expect and finding who we really are. — Silent Confessions

An overview of the paper

Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Review GoogLeNet

The structural improvement in Inception-v2 is to replace the 5×5 convolutional layers of the original Inception-v1 with two stacked 3×3 convolutional layers.

Batch Normalization (BN) is a deep-learning optimization technique proposed by Google in 2015.

It not only accelerates model convergence, but also alleviates the vanishing-gradient problem in deep networks to some extent, making deep network models easier and more stable to train.

During neural network training, the parameters of every layer are updated at each step, which means the input distribution of each subsequent layer keeps changing. This forces careful weight initialization and a relatively low learning rate, and it makes it difficult to train with saturating nonlinearities. The authors call this shifting distribution of internal inputs the internal covariate shift.

Batch Normalization is proposed to address these changes in the data distribution of intermediate layers during training.

Because most neural networks are trained with mini-batch SGD, Batch Normalization also operates on one mini-batch at a time; this is where the "Batch" in Batch Normalization comes from. It is known that if the input data are whitened (zero mean, unit variance, decorrelated), training converges faster.

However, full whitening is too expensive to compute and is not differentiable everywhere, so the authors make two simplifications: first, each dimension is normalized independently; second, the mean and variance are estimated from the current mini-batch.
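A minimal NumPy sketch of these simplifications: each dimension is normalized on its own, using only statistics of the current mini-batch, so no covariance matrix or matrix inverse is needed (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def per_dim_normalize(x, eps=1e-5):
    """Normalize each feature dimension of a mini-batch independently.

    x: array of shape (batch_size, num_features).
    Unlike full whitening, this needs no covariance matrix or matrix inverse.
    """
    mean = x.mean(axis=0)                 # per-dimension mean over the mini-batch
    var = x.var(axis=0)                   # per-dimension variance over the mini-batch
    return (x - mean) / np.sqrt(var + eps)

# Example: a mini-batch of 32 examples with 4 features.
x = np.random.randn(32, 4) * 3.0 + 5.0
x_hat = per_dim_normalize(x)
print(x_hat.mean(axis=0), x_hat.var(axis=0))  # roughly 0 and 1 per dimension
```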

Abstract

  1. Changes in the distribution of layer inputs make training difficult;
  2. A low learning rate and careful initialization can work around this, but training becomes slow;
  3. This shifting distribution is called internal covariate shift (ICS);
  4. The Batch Normalization layer is proposed to solve the problem;
  5. With Batch Normalization, a large learning rate can be used, initialization matters less, and it acts as a regularizer, reducing the need for Dropout;
  6. Results: the best published result for ImageNet classification at the time: 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

BN-Inception network: key points

  • Batch Normalization is introduced. BN has since become a standard component of almost all convolutional neural networks.
  • Each 5×5 convolution kernel is replaced by two stacked 3×3 convolution kernels, which cover the same receptive field with fewer parameters (see the sketch below).
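A small PyTorch sketch comparing the two options (channel sizes are placeholders): two stacked 3×3 convolutions give each output pixel the same 5×5 view of the input, with fewer weights.

```python
import torch.nn as nn

in_ch, out_ch = 64, 64

# Option A: a single 5x5 convolution.
conv5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

# Option B: two stacked 3x3 convolutions; each output pixel still "sees"
# a 5x5 window of the input, but with fewer parameters.
conv3x2 = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(conv5))    # 64*64*5*5 + 64     = 102464
print(num_params(conv3x2))  # 2*(64*64*3*3 + 64) = 73856
```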

Benefits of BN:

  • BN reduces internal covariate shift, improves gradient flow through the network, and speeds up training.
  • BN makes it possible to set higher learning rates.
  • BN regularizes the model.

The paper in detail

Deep learning has greatly advanced the state of the art in vision, speech, and many other areas. Stochastic gradient descent (SGD) has proven to be an effective way to train deep networks, and SGD variants such as Momentum and Adagrad have been used to achieve state-of-the-art performance. SGD optimizes the network parameters θ to minimize the loss.

With SGD, training proceeds in steps, and at each step a mini-batch is considered. The mini-batch is used to approximate the gradient of the loss function with respect to the parameters.
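For reference, this setup can be written as follows (θ are the parameters, ℓ the per-example loss, N the training-set size, m the mini-batch size, α the learning rate):

```latex
% SGD minimizes the loss over the training set
\theta^{*} = \arg\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} \ell(x_i, \theta)

% Each step uses a mini-batch of size m to approximate the gradient
\theta \leftarrow \theta - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}
    \frac{\partial \ell(x_i, \theta)}{\partial \theta}
```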

Using mini-batches has two advantages: the gradient of the loss over a mini-batch is a more stable and accurate estimate of the full-dataset gradient than a single example, and computation over a batch is more efficient than m separate computations thanks to parallelism.

Hyperparameters that must be tuned carefully for SGD: the learning rate and the weight initialization.

Disadvantage: the input of each layer is affected by the parameters of all preceding layers, so small changes in earlier layers can be amplified as the network gets deeper.

Internal Covariate Shift (ICS) is defined as the change in the distribution of network activations caused by the change of network parameters during training. To improve training, we seek to reduce internal covariate shift. By fixing the distribution of layer inputs as training progresses, we expect to increase the training speed. It is well known that network training converges faster if the inputs are whitened, i.e. linearly transformed to have zero mean and unit variance, and decorrelated. Since each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening for the inputs of every layer. By whitening the input of each layer, we would take a step towards a fixed distribution of inputs and thereby eliminate the ill effects of internal covariate shift.

Known from experience: when the input data are whitened, i.e. zero mean and unit variance, training converges faster.

Two options for whitening the activations:

  1. Modify the network directly;
  2. Change the parameters of the optimization algorithm to depend on the network's activation values, so that the outputs remain whitened.

We could consider whitening the activations at every training step or at some interval, either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network's activation values. However, if these modifications are interspersed with the optimization steps, a gradient descent step may try to update the parameters in a way that requires the normalization itself to be updated, which reduces the effect of the gradient step.
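The paper illustrates this with a simple example: a layer adds a learned bias b and then subtracts the mean computed over the data. If the gradient step ignores the dependence of that mean on b, the normalized output (and hence the loss) never changes, while b grows without bound:

```latex
% Layer output with learned bias b, normalized by subtracting the mean over the data
\hat{x} = x - \mathrm{E}[x], \qquad x = u + b

% Gradient step that ignores \partial \mathrm{E}[x] / \partial b:
b \leftarrow b + \Delta b, \qquad \Delta b \propto -\,\partial \ell / \partial \hat{x}

% The normalized output is unchanged, so the loss is unchanged while b keeps growing:
u + (b + \Delta b) - \mathrm{E}[u + (b + \Delta b)] = u + b - \mathrm{E}[u + b]
```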

The expectation and variance here are computed over the training data set. Even if the features are not decorrelated, this normalization still speeds up convergence. The features of each layer are normalized in this way, dimension by dimension. However, simply normalizing the activations changes what the layer can represent; for example, normalizing the inputs of a sigmoid constrains them to the linear regime of the nonlinearity.

Here γ and β are learnable parameters that allow the model to decide automatically whether the original representation should be preserved. The purpose of this pair is to preserve the representational ability of the network.
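For reference, the two transforms described above, written per dimension k as in the paper:

```latex
% Normalize each dimension independently
\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}

% Learnable scale and shift restore representational power
y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}
```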

In the batch setting, each training step would be based on the entire training set, which we would use to normalize the activations. However, this is impractical with stochastic optimization. Therefore, a second simplification is made: since mini-batches are used in stochastic gradient training, each mini-batch produces estimates of the mean and variance of each activation. This way, the statistics used for normalization can fully participate in gradient backpropagation. Note that the use of mini-batches is enabled by computing per-dimension variances rather than a joint covariance; in the joint case, regularization would be required, since the mini-batch size is likely smaller than the number of activations being whitened, which would yield a singular covariance matrix.

BN is specified in Algorithm 1. In this algorithm, ϵ is a constant added to the mini-batch variance for numerical stability.
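A minimal NumPy sketch of the training-time forward pass of Algorithm 1 (the function name and shapes are illustrative; the backward pass through the mini-batch statistics is omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform (Algorithm 1) applied to a mini-batch.

    x:     (batch_size, num_features) activations of the mini-batch
    gamma: (num_features,) learnable scale
    beta:  (num_features,) learnable shift
    """
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize (eps for numerical stability)
    y = gamma * x_hat + beta                 # scale and shift
    return y

x = np.random.randn(16, 8)
y = batch_norm_train(x, gamma=np.ones(8), beta=np.zeros(8))
```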

It is important to note, however, that the BN transform does not process the activation of each training example independently. Instead, BNγ,β(x) depends both on the training example and on the other examples in the mini-batch.
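At inference time this dependence is removed: normalization uses population statistics instead of mini-batch statistics, so the output for a given example is deterministic. A minimal sketch, assuming the common moving-average way of tracking those statistics (the paper's Algorithm 2 averages over training mini-batches; the momentum-style update below is a standard practical variant):

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Deterministic BN at test time using population (running) statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    """Exponential moving averages of the mini-batch statistics, updated during training."""
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var
```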

In traditional deep networks, a high learning rate may result in exploding or vanishing gradients, as well as getting stuck in poor local minima. Batch Normalization helps address these problems. By normalizing activations throughout the network, it prevents small changes in the parameters from amplifying into large, suboptimal changes in activations and gradients; for example, it prevents training from getting stuck in the saturated regimes of nonlinearities.

When training with Batch Normalization, a training example is seen in conjunction with the other examples in its mini-batch, and the network no longer produces deterministic values for a given training example. In our experiments, this effect was found to benefit generalization. Whereas Dropout is typically used to reduce overfitting, in a batch-normalized network it can be removed or reduced in strength.

Experiments

(a) Test accuracy versus the number of training steps for MNIST networks trained with and without Batch Normalization. Batch Normalization helps the network train faster and reach higher accuracy.

(b, c) The evolution of the input distribution to a typical sigmoid over the course of training, shown as its {15, 50, 85}th percentiles. Batch Normalization makes the distribution more stable and reduces internal covariate shift.

Batch Normalization was applied to a new variant of the Inception network (2014), trained on the ImageNet classification task. The network has a large number of convolutional and pooling layers, and a Softmax layer to predict the image class out of 1000 possibilities. The convolutional layers use ReLU as the nonlinearity. The main difference from the original Inception architecture is that the 5×5 convolutional layers are replaced by two consecutive 3×3 convolutional layers with up to 128 filters. The network contains 13.6·10⁶ parameters and has no fully connected layers apart from the topmost Softmax layer.
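A rough PyTorch sketch of the building block this implies, Convolution followed by Batch Normalization and ReLU (channel numbers are placeholders, not the exact BN-Inception configuration):

```python
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """Conv -> BatchNorm -> ReLU, the basic unit of a BN-Inception-style network."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        # The conv bias is redundant because BN's beta already provides a learned shift.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# A 5x5 branch replaced by two 3x3 blocks (the paper uses up to 128 filters here):
branch = nn.Sequential(
    ConvBNReLU(64, 128, kernel_size=3, padding=1),
    ConvBNReLU(128, 128, kernel_size=3, padding=1),
)
```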

Simply adding Batch Normalization to the network does not take full advantage of the approach. Therefore, the network and its training parameters were further changed as follows:

  1. Increase the learning rate (an illustrative training-config sketch follows this list);
  2. Remove Dropout;
  3. Reduce the L2 weight decay by a factor of 5; BN tolerates larger weights, so the constraint on weight size can be relaxed;
  4. Accelerate the learning-rate decay, lowering the learning rate 6 times faster (a figure in the original post motivates the factor of 6);
  5. Remove Local Response Normalization; Inception-v1 applied LRN before entering the Inception modules, but with BN it is no longer needed;
  6. Shuffle the training examples more thoroughly, which acts as a regularizer;
  7. Reduce the photometric distortions, since the BN network trains for fewer steps and should see "real" examples more often.
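An illustrative PyTorch training configuration reflecting these changes; the model and all numeric values are placeholders for the idea, not the paper's exact settings (except the 5× weight-decay reduction and the faster decay, which are named in the list above):

```python
import torch
import torch.nn as nn

# Placeholder model: a batch-normalized network with no Dropout layers.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 1000),
)

lr = 0.045                 # a larger base learning rate than a non-BN baseline would tolerate
weight_decay = 1e-4 / 5    # L2 weight decay reduced by a factor of 5
optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                            momentum=0.9, weight_decay=weight_decay)

# Decay the learning rate more aggressively than the non-BN baseline would.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.94)
```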

An initial comparison of BN-Inception with previous state-of-the-art techniques is performed on the provided validation set of 50,000 images. According to the test server, the BN-Inception ensemble reached 4.82% top-5 error on the ImageNet test set of 100,000 images.

BN-Inception Ensemble: the result of ensembling multiple network models.

To remain compatible with the stochastic optimization methods commonly used in deep network training, normalization is performed per mini-batch, and gradients are back-propagated through the normalization parameters. Batch Normalization adds only two extra parameters per activation and, in doing so, preserves the representational capability of the network. The paper presents an algorithm for constructing, training, and performing inference with batch-normalized networks. The resulting networks can be trained with saturating nonlinearities, are more tolerant of increased learning rates, and often do not require Dropout for regularization.

Adding only two parameters per activation preserves the representational ability of the network.

Advantages:

  1. Networks with saturating activation functions become trainable;
  2. Large learning rates can be used;
  3. Dropout is not required.

Inception + BN + the training adjustments above gives single-model state-of-the-art results; the ensemble gives multi-model state-of-the-art results.

The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity, where matching the first and second moments is more likely to produce a stable distribution. In contrast, prior work applied the standardization layer to the output of the nonlinearity, which results in sparser activations. In our large-scale image classification experiments, we did not observe the nonlinearity inputs to be sparse, with or without Batch Normalization.

Next steps

  1. BN + RNN: RNNs suffer from even more severe ICS;
  2. BN + domain adaptation: only the mean and standard-deviation statistics need to be recomputed on the new domain (a sketch follows).
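A minimal sketch of the second idea, assuming a PyTorch model with BatchNorm layers: re-estimate only the running mean and variance on target-domain data, leaving all learned weights (including γ and β) untouched. This is an assumption-level illustration of the "recompute mean + std" point, not a method from the BN paper itself.

```python
import torch
import torch.nn as nn

def recompute_bn_stats(model, target_loader, device="cpu"):
    """Re-estimate BN running mean/std on a new domain, keeping all weights fixed."""
    # Reset the running statistics of every BN layer.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()

    # Forward passes in train() mode update running_mean / running_var;
    # no backward pass is done, so weights, gamma and beta stay unchanged.
    model.train()
    with torch.no_grad():
        for x, _ in target_loader:
            model(x.to(device))
    model.eval()
    return model
```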