Dropout

Study Notes on Dropout. Main references:

  1. 12 Main Dropout Methods: Mathematical and Visual Explanation for DNNs, CNNs, and RNNs
  2. Random inactivation methods in neural networks

1. Introduction

One of the major challenges in training deep learning models is co-adaptation: neurons become interdependent instead of learning features independently, and some neurons end up with stronger predictive power than others, which can make the model over-reliant on the output of a few individual neurons.

This situation should be avoided: the weights should be spread across a reasonable distribution in order to prevent overfitting. Overfitting is usually countered with regularization, which restrains both co-adaptation and the excessive predictive power of individual neurons. One of the most commonly used regularization methods is Dropout.

Dropout (random inactivation) is a simple and computationally efficient regularization method that can complement L1 regularization, L2 regularization, and max-norm constraints.

This article introduces different Dropout methods, which differ depending on the network architecture (DNN, CNN, or RNN):

  • The standard Dropout method
  • Variants of standard Dropout
  • Dropout methods used on CNNs: standard Dropout does not work well on convolutional layers, because each point of a feature map corresponds to a receptive field, and randomly discarding a single pixel barely reduces the information available to the feature map; the network can still learn the corresponding semantic information from the neighbouring pixels.
  • Dropout methods used on RNNs
  • Other Dropout applications (Monte Carlo Dropout and model compression)

2. Standard Dropout method

The most common Dropout method is standard Dropout, introduced by Hinton et al. in 2012. It is usually referred to simply as “Dropout,” but to distinguish it from the variants below, this article calls it standard Dropout.

The standard Dropout method is mainly used in the training phase to avoid overfitting. A probability p is set to represent the probability that each neuron is dropped in each iteration (a typical value is p = 0.5). Hinton’s paper suggests p = 0.2 for the input layer and p = 0.5 for the hidden layers; no Dropout is applied to the output layer, since its results are exactly what is needed.

Mathematically, each neuron is dropped according to a Bernoulli distribution with parameter p: a mask whose elements are Bernoulli random variables is applied element-wise to the neuron activation vector.

During training, randomly deactivating neurons can be thought of as training a subnetwork of the complete neural network, with each update modifying only the parameters of that subnetwork for the given input data.

In the test phase, Dropout is not used and all neurons are active. To compensate for the extra activations compared to the training phase, the weights are scaled by the probability that a neuron is kept, i.e. 1 - p.
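To make this concrete, here is a minimal sketch of the idea in PyTorch. Note that PyTorch’s nn.Dropout actually uses the “inverted” convention, scaling the kept activations by 1/(1 - p) during training so that nothing needs to change at test time; the sketch below follows that convention, and the function name is purely illustrative.

import torch

def standard_dropout(x, p=0.5, training=True):
    # Minimal sketch of inverted dropout: a Bernoulli mask zeroes each unit with
    # probability p, and kept units are scaled by 1/(1-p) so the test pass needs
    # no rescaling.
    if not training or p == 0.0:
        return x
    mask = torch.bernoulli(torch.full_like(x, 1 - p))  # 1 = keep, 0 = drop
    return x * mask / (1 - p)

x = torch.randn(4, 8)
y_train = standard_dropout(x, p=0.5, training=True)   # random units zeroed, rest scaled
y_test = standard_dropout(x, p=0.5, training=False)   # identity at test time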

In the paper, the authors demonstrate experimentally that the reasons for Dropout effectiveness are twofold:

  1. Dropout breaks the co-adaptation between neurons, making the features extracted by the network more distinct and improving the model’s generalization ability.
  2. From the perspective of the relationships between neurons, Dropout randomly excludes some neurons from the computation, which reduces their interdependence: weight updates no longer rely on the joint action of hidden nodes with fixed relationships, forcing the network to learn more robust features.

The corresponding Dropout implementation in PyTorch is as follows:

>>> import torch
>>> from torch import nn
>>> m = nn.Dropout(p=0.2)
>>> input = torch.randn(20, 16)
>>> output = m(input)

torch.nn.Dropout(p=0.5, inplace=False)

  • p – probability of an element to be zeroed. Default: 0.5
  • inplace – if set to True, will do this operation in-place. Default: False

There is no requirement on the input shape, so it can be applied after a Linear layer as well as after a convolutional layer.


3. The Dropout variants

3.1 DropConnect

DropConnect, introduced by L. Wan et al., does not apply dropout directly to neurons, but to the weights and biases that connect those neurons.

The main difference between Dropout and DropConnect is that the mask is applied to the weights and biases rather than to the neuron activations themselves. Dropout can be used on both convolutional and fully connected layers; DropConnect can only be used on fully connected layers.

In the test phase, the same logic as Dropout could be used, i.e. scaling by the keep probability, but the DropConnect paper takes a different, stochastic approach: the output of a DropConnect layer is approximated by a Gaussian distribution, and samples are drawn at random from that Gaussian.
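To make the difference concrete, here is a minimal sketch of a DropConnect-style fully connected layer in PyTorch. The function name and the training-only masking are illustrative assumptions; the test-time Gaussian sampling described above is not implemented here, and the test branch simply falls back to a plain linear layer.

import torch
import torch.nn.functional as F

def drop_connect_linear(x, weight, bias, p=0.5, training=True):
    # Sketch of DropConnect: a Bernoulli mask is applied to the weights and
    # biases of a fully connected layer instead of to its output activations.
    if not training or p == 0.0:
        return F.linear(x, weight, bias)
    w_mask = torch.bernoulli(torch.full_like(weight, 1 - p))
    b_mask = torch.bernoulli(torch.full_like(bias, 1 - p))
    return F.linear(x, weight * w_mask, bias * b_mask)

layer = torch.nn.Linear(16, 8)
out = drop_connect_linear(torch.randn(4, 16), layer.weight, layer.bias, p=0.5)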

3.2 Standout

Standout, introduced by L. J. Ba and B. Frey, is, like standard Dropout, based on a Bernoulli mask (it is convenient to name the masks by the distribution they follow). The difference is that the drop probability p of a neuron is not constant within the layer: it adapts to the values of the weights.

Any activation function g can be used to compute the drop probability, or even a separate neural network. Similarly, Ws can be a function of W. In the test phase, the activations are balanced by the corresponding keep probabilities.

Here is an example: the greater the weight, the greater the probability that the neuron is dropped, which to some extent limits the high predictive power of certain neurons.
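The following is a rough sketch of the mechanism, not the authors’ implementation: the drop probability of each unit is computed from the same weighted input that produces the activation (g = sigmoid here, with Ws tied to W, both illustrative choices following the description above).

import torch
import torch.nn.functional as F

def standout(x, weight, bias, training=True):
    # Sketch of Standout: the drop probability of each unit is not a fixed
    # constant but a function g of the weighted input.
    pre_act = F.linear(x, weight, bias)
    activation = torch.relu(pre_act)
    drop_prob = torch.sigmoid(pre_act)         # larger weighted input -> larger drop probability
    if training:
        mask = torch.bernoulli(1 - drop_prob)  # 1 = keep
        return activation * mask
    return activation * (1 - drop_prob)        # balance by the keep probability at test time

layer = torch.nn.Linear(16, 8)
out = standout(torch.randn(4, 16), layer.weight, layer.bias)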

3.3 Gaussian Dropout

Fast Dropout, variational Dropout, and Concrete Dropout, to name a few, are methods that interpret Dropout from a Bayesian perspective. Specifically, instead of a Bernoulli mask, they use a mask whose elements are random variables following a Gaussian (normal) distribution.

Using a Gaussian mask instead of a Bernoulli one does not change how Dropout acts on co-adaptation, the predictive power of neurons, or overfitting; what mainly changes is the execution time required in the training phase.

Logically, when neurons are randomly discarded during the forward pass, they receive no updates during backpropagation, which slows down training. With Gaussian Dropout, all neurons remain active in every iteration, so training is not slowed down in this way.

Mathematically, the activations are multiplied by a Gaussian mask centred at 1 (for example with the variance p(1 - p) of the corresponding Bernoulli distribution). Dropout is thus simulated by randomly re-weighting the predictive power of the neurons while keeping all of them active in each iteration. Another practical advantage of this approach is the test phase: no modification is required compared with a model without Dropout.
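A minimal sketch of this multiplicative Gaussian noise is shown below. The noise variance is set to p(1 - p) to match the Bernoulli variance mentioned above; other conventions exist (e.g. p/(1 - p) for the inverted-dropout formulation), so treat the scale as an assumption.

import torch

def gaussian_dropout(x, p=0.5, training=True):
    # Sketch of Gaussian dropout: no unit is zeroed; every unit is multiplied by
    # Gaussian noise centred at 1, and nothing changes at test time.
    if not training or p == 0.0:
        return x
    noise = 1.0 + torch.randn_like(x) * (p * (1 - p)) ** 0.5
    return x * noise

y = gaussian_dropout(torch.randn(4, 8), p=0.5)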

3.4 Pooling Dropout

The problem with images and feature maps is that pixels are highly dependent on their neighbours. Simply put, in a picture of a cat, if a pixel belongs to the cat’s fur, its neighbouring pixels almost certainly do too; there is little difference between them.

This is the limitation of the standard Dropout method on images: randomly dropping individual pixels removes almost no information, because each dropped pixel is nearly identical to its neighbours. It therefore does little against overfitting while adding extra computation time.

Max-pooling Dropout is a Dropout method for CNNs proposed by H. Wu and X. Gu. It applies a Bernoulli mask directly to the elements inside the max-pooling kernel before the pooling operation is performed. Intuitively, this means a high activation can be dropped before the maximum is taken, which limits the high predictive power of certain neurons (see the sketch below). In the test phase, as with the previous methods, the activations are weighted by the keep probability.
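As a rough illustration (not the authors’ implementation), the following sketch masks the activations with a Bernoulli mask just before max pooling, so a strong activation can lose the max to a weaker neighbour with probability p; the function name and parameters are hypothetical.

import torch
import torch.nn.functional as F

def max_pool_dropout(feature_map, p=0.3, kernel_size=2, training=True):
    # Sketch of max-pooling dropout: zero activations inside the pooling regions
    # before the max is taken.
    if training and p > 0.0:
        mask = torch.bernoulli(torch.full_like(feature_map, 1 - p))
        feature_map = feature_map * mask
    return F.max_pool2d(feature_map, kernel_size)

x = torch.randn(1, 16, 32, 32)     # (N, C, H, W)
y = max_pool_dropout(x, p=0.3)     # -> (1, 16, 16, 16)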

3.5 Spatial Dropout

Besides acting inside pooling layers, another option for CNNs is the Spatial Dropout method proposed by J. Tompson et al. Since adjacent pixels are highly correlated, the idea is not to drop individual pixels but to drop entire feature maps (channels). Taking the cat image again as an example: the red channel might be removed, forcing the network to rely on the blue and green channels, and in the next iteration other feature maps are randomly dropped instead.

In general, ordinary Dropout is rarely applied to convolutional layers, and when it is, the effect is often poor. The likely reason is that the activations of a convolutional layer are spatially correlated, so information can still flow through the network after Dropout. Spatial Dropout, by randomly dropping whole channels of the feature map, reduces the interdependence between channels.

In the training phase, a Bernoulli mask is applied per feature map with drop probability p. In the test phase there is no dropout; the activations are simply weighted by 1 - p.

The corresponding Spatial Dropout implementation in PyTorch is as follows:

torch.nn.Dropout2d(p=0.5, inplace=False)

  • p (float, optional) – the probability of an element to be zeroed. Default: 0.5
  • inplace (bool, optional) – if set to True, will do this operation in-place. Default: False

There are certain requirements for input and output:

  • input shape: (N, C, H, W)
  • output shape: (N, C, H, W)
>>> m = nn.Dropout2d(p=0.2)
>>> input = torch.randn(20, 16, 32, 32)
>>> output = m(input)

In addition, there is a corresponding torch.nn.Dropout3d module for 3D feature maps.

3.6 Cutout

The Cutout method proposed by T. DeVries and G. W. Taylor takes another approach: it applies a Bernoulli mask to a whole region of the image rather than to individual pixels.

Taking the cat image as an example again: the method limits overfitting by forcing the network to generalize over the hidden region. If, say, the cat’s head is cut out, the CNN is forced to learn the less obvious attributes that describe a cat.

The experimental results in the paper show that Cutout improves the robustness and overall performance of neural networks, and the method can be combined with other regularization techniques. However, the appropriate patch size is strongly dataset-dependent, so some experimentation on the patch length is needed when using Cutout.
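A minimal, illustrative sketch of a Cutout-style transform is shown below. The function name and the fixed patch_length are assumptions; real implementations often also let the patch partially fall outside the image.

import torch

def cutout(img, patch_length=8):
    # Sketch of Cutout: zero out one randomly placed square patch per image.
    _, h, w = img.shape                       # (C, H, W)
    cy = torch.randint(0, h, (1,)).item()
    cx = torch.randint(0, w, (1,)).item()
    y1, y2 = max(0, cy - patch_length // 2), min(h, cy + patch_length // 2)
    x1, x2 = max(0, cx - patch_length // 2), min(w, cx + patch_length // 2)
    img = img.clone()
    img[:, y1:y2, x1:x2] = 0.0
    return img

out = cutout(torch.randn(3, 32, 32), patch_length=8)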

3.7 Max-Drop

The Max-Drop method proposed by S. Park and N. Kwak is a mixture of Pooling Dropout and Gaussian Dropout: dropout is performed at the max-pooling layer, but using the Gaussian (Bayesian-style) mask instead of a Bernoulli one.

In their paper, they show that this method can be as effective as Spatial Dropout. In addition, all neurons remain active in every iteration, which limits the slowdown during the training phase. Their results were obtained with µ = 0.02 and σ² = 0.05.

3.8 RNNDrop

The methods above are mainly applied to DNNs and CNNs, but RNNs are another commonly used deep architecture, so researchers have studied how to use Dropout on them. Applying Dropout naively to an RNN is risky: the purpose of an RNN is to preserve memory of events over long periods, and the noise produced by traditional Dropout prevents the model from retaining that long-term memory. The following methods are designed to preserve it.

RNNDrop, proposed by T. Moon et al., is the simplest method: a Bernoulli mask is applied only to the hidden/cell state, and the mask is kept fixed for the whole sequence. This is called per-sequence sampling: a new random mask is created for each sequence, and it then remains unchanged across all time steps, so the elements that are dropped stay dropped and the elements that are kept stay kept for the entire sequence (see the sketch below).
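A rough sketch of per-sequence sampling is given below. RNNDrop as published applies the mask to the LSTM cell state; this sketch uses a plain RNNCell hidden state for brevity, and all names and sizes are illustrative.

import torch

# One Bernoulli mask per sequence, reused at every time step (per-sequence sampling).
hidden_size, steps, p = 8, 5, 0.25
cell = torch.nn.RNNCell(4, hidden_size)                      # plain RNN cell for brevity
mask = torch.bernoulli(torch.full((1, hidden_size), 1 - p))  # fixed for the whole sequence
h = torch.zeros(1, hidden_size)
for x_t in torch.randn(steps, 1, 4):                         # (time, batch, input)
    h = cell(x_t, h) * mask                                  # same units dropped at every step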

3.9 Recurrent Dropout

Recurrent Dropout, proposed by S. Semeniuta et al., is an interesting variant: the cell state is left untouched, and dropout is applied only to the part that updates the cell state. In each iteration, the Bernoulli mask prevents some elements from contributing to the long-term memory, but the memory that has already been stored is not altered.
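Below is a minimal sketch of an LSTM step with recurrent dropout: the mask touches only the candidate update g_t, never the stored cell state. The class and parameter names are illustrative, not taken from the paper.

import torch

class RecurrentDropoutLSTMCell(torch.nn.Module):
    # Sketch: dropout (prob p) is applied only to the candidate update g_t,
    # so the cell-state path c_{t-1} -> c_t itself is never zeroed.
    def __init__(self, input_size, hidden_size, p=0.25):
        super().__init__()
        self.p = p
        self.gates = torch.nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h, c):
        z = self.gates(torch.cat([x_t, h], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        if self.training and self.p > 0.0:
            g = g * torch.bernoulli(torch.full_like(g, 1 - self.p))  # drop the update only
        c = f * c + i * g                 # the stored long-term memory is preserved intact
        h = o * torch.tanh(c)
        return h, c

cell = RecurrentDropoutLSTMCell(4, 8)
h = c = torch.zeros(1, 8)
for x_t in torch.randn(5, 1, 4):
    h, c = cell(x_t, h, c)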

3.10 Variational RNN dropout

Variational RNN Dropout, introduced by Y. Gal and Z. Ghahramani, applies Dropout before the internal gates, using the same mask across the whole sequence. In this way dropout also acts on the different parts of the LSTM, which is simple and effective.

3.11 Monte Carlo Dropout

Dropout methods can also provide an indicator of model uncertainty: because a model with dropout effectively has a different architecture in each iteration, the same input produces different outputs, i.e. the output has variance.

If the network generalizes well and co-adaptation is limited, the prediction is spread across the whole model, which results in a lower output variance when the same input is fed through repeatedly.

Studying this variance gives an idea of the confidence that can be assigned to the model. This can be seen in the method of Y. Gal and Z. Ghahramani.
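The following is a minimal sketch of Monte Carlo dropout at prediction time: dropout is kept active by leaving the model in train mode, several stochastic forward passes are run, and the spread of the predictions is used as an uncertainty estimate. The function name and the small example model are illustrative assumptions.

import torch

def mc_dropout_predict(model, x, n_samples=20):
    # Sketch of Monte Carlo dropout: stochastic forward passes at test time,
    # returning the mean prediction and its standard deviation.
    model.train()                      # keeps nn.Dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),
    torch.nn.Linear(32, 1),
)
mean, std = mc_dropout_predict(model, torch.randn(4, 16))

Note that model.train() also affects layers such as batch normalization; in a real network one would typically switch only the dropout layers into train mode.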

3.12 Model compression

K. Neklyudov et al. proposed a method for pruning DNNs and CNNs using variational Dropout, i.e. compressing the model. Intuitively, by applying Dropout randomly one can observe whether a given neuron actually contributes to the prediction; based on this observation, the number of parameters can be reduced by deleting the neurons that are ineffective for prediction.

3.13 Stochastic Depth

Stochastic Depth was proposed before DenseNet and is applied to ResNets: a portion of the residual blocks is randomly deactivated. The operation is similar to Dropout, i.e. some Res blocks are randomly dropped during training, while all of them are used in the test phase. The network is therefore shallow during training (randomly dropping a Res block is equivalent to skipping some layers) and deep at test time, which reduces training time and improves performance.

For more details, see the article: Convolutional Neural Network Learning Route (11) | Stochastic Depth (stochastic-depth networks).
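A minimal sketch of the idea is shown below: during training each residual branch is skipped with probability 1 - survival_prob, and at test time every branch is used but scaled by its survival probability. The class name, the survival_prob value, and the use of linear layers in place of a convolutional block are illustrative assumptions.

import torch

class StochasticDepthBlock(torch.nn.Module):
    # Sketch of Stochastic Depth around one residual branch.
    def __init__(self, block, survival_prob=0.8):
        super().__init__()
        self.block = block
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.block(x)   # block active this iteration
            return x                        # block skipped: identity only
        return x + self.survival_prob * self.block(x)

res_branch = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 16))
layer = StochasticDepthBlock(res_branch, survival_prob=0.8)
out = layer(torch.randn(4, 16))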

3.14 DropBlock

This method is similar to Cutout: the idea is to randomly deactivate contiguous spatial blocks on each feature map.

DropBlock takes three important parameters:

  • block_size – controls the size of the dropped block
  • gamma – controls how many activation units (block centres) to drop
  • keep_prob – the keep probability, as in Dropout; units are deactivated with probability 1 - keep_prob

Experiments show that a block_size of 7×7 works best, and that it is better to let keep_prob decay gradually from 1 to the target value over the course of training. A minimal sketch is given below.
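This sketch follows the commonly cited formulation: seed points are sampled with rate gamma, each seed is grown into a block_size × block_size square via max pooling, and the surviving activations are rescaled. Sampling seeds over the full map and the global renormalization are simplifications, and the function name is illustrative.

import torch
import torch.nn.functional as F

def drop_block(x, block_size=7, keep_prob=0.9, training=True):
    # Sketch of DropBlock on a (N, C, H, W) feature map.
    if not training or keep_prob >= 1.0:
        return x
    n, c, h, w = x.shape
    # gamma from the DropBlock paper, relating keep_prob and block_size
    gamma = (1 - keep_prob) / (block_size ** 2) * (h * w) / ((h - block_size + 1) * (w - block_size + 1))
    seeds = torch.bernoulli(torch.full((n, c, h, w), gamma, device=x.device))
    # grow each seed into a block_size x block_size dropped square
    block_mask = 1 - F.max_pool2d(seeds, kernel_size=block_size, stride=1, padding=block_size // 2)
    if block_mask.shape[-2:] != x.shape[-2:]:     # even block sizes overshoot by one pixel
        block_mask = block_mask[..., :h, :w]
    # rescale the surviving activations to keep the expected magnitude
    return x * block_mask * block_mask.numel() / block_mask.sum().clamp(min=1.0)

y = drop_block(torch.randn(2, 16, 32, 32), block_size=7, keep_prob=0.9)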