Optimizing gradient descent

Translator's note: if you are comfortable reading English, the original article is strongly recommended, since errors may have crept into the translation because of my limited understanding. In addition, due to the length of the original text, the translation is split into two parts: this post mainly summarizes the gradient descent optimization algorithms, and the next part will summarize parallel and distributed stochastic gradient descent as well as additional optimization strategies.

Gradient descent is one of the most popular optimization algorithms and by far the most common way to optimize neural networks. At the same time, every good deep learning library contains implementations of various algorithms for optimizing gradient descent (for example, the documentation of Lasagne, Caffe, and Keras). However, these algorithms are usually packaged as black-box optimizers, which makes it hard to get a practical explanation of their strengths and weaknesses.

The goal of this blog post is to give readers an intuition for the behaviour of the different gradient descent optimization algorithms and hopefully help put them to use. We first look at the different variants of gradient descent. Then we briefly summarize the challenges encountered during training. Next, we introduce the most commonly used optimization algorithms, show their motivation for solving these challenges, and how this motivation leads to their update rules. We will also briefly review algorithms and architectures for optimizing gradient descent in parallel and distributed settings. Finally, we discuss additional strategies that help optimize gradient descent.

Gradient descent is a way to minimize an objective function $J(\theta)$ parameterized by the model parameters $\theta \in \mathbb{R}^d$ by updating the parameters in the direction opposite to the gradient of the objective function $\nabla_\theta J(\theta)$ with respect to the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In plain English, we follow the slope of the surface created by the objective function downhill until we reach a valley. If you are not familiar with gradient descent, you can refer to this introduction to optimizing neural networks.

Different versions of gradient descent

There are three different versions of gradient descent, and the difference between them is how much data we use to calculate the gradient of the objective function. Depending on the amount of data, there is a trade-off between the accuracy of parameter updates and the time it takes to update them.

Batch gradient descent

The most basic variant, known as batch gradient descent, uses the entire training dataset to compute the gradient of the objective function with respect to the parameters $\theta$:

$\theta = \theta - \eta \cdot \nabla_\theta J( \theta)$

Batch gradient descent is slow because we need to compute the gradient over the whole dataset to perform a single update, and it is also intractable for datasets that do not fit in memory. Batch gradient descent also does not allow us to update the model online, i.e. with new examples arriving at run time.

In code, batch gradient descent takes the following form:

for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset with respect to the parameter vector params. Note that the latest deep learning libraries provide automatic differentiation, which computes the gradient with respect to the parameters efficiently. If you derive the gradients yourself, it is a good idea to do a gradient check. (Here are some tips for checking gradients properly; a small illustrative sketch follows below.)
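As a rough illustration (not part of the original article), a gradient check compares a hand-derived gradient against a centered finite-difference estimate; the toy loss function and helper names below are hypothetical:

import numpy as np

def toy_loss(params):
    # Stand-in for any differentiable loss_function.
    return 0.5 * np.sum(params ** 2)

def toy_gradient(params):
    # Hand-derived gradient of toy_loss.
    return params

def gradient_check(params, eps=1e-5):
    numeric = np.zeros_like(params)
    for i in range(params.size):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[i] += eps
        p_minus[i] -= eps
        # Centered finite difference: (J(p + eps) - J(p - eps)) / (2 * eps)
        numeric[i] = (toy_loss(p_plus) - toy_loss(p_minus)) / (2 * eps)
    analytic = toy_gradient(params)
    # The relative error should be tiny (e.g. below 1e-7) if the analytic gradient is correct.
    return np.max(np.abs(numeric - analytic) / (np.abs(numeric) + np.abs(analytic) + 1e-12))

print(gradient_check(np.random.randn(5)))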

Stochastic gradient descent

In contrast, stochastic gradient descent (SGD) performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$:

$\theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i)}; y^{(i)})$

For large datasets, batch gradient descent performs redundant computations, since it recomputes gradients for similar examples before each parameter update. SGD removes this redundancy by performing one update at a time. As a result, it is usually much faster and can also be used for online learning.

Frequent updates of SGD with a relatively large variance lead to drastic fluctuations in the objective function, as shown in Figure 1.

While batch gradient descent converges to the minimum of the basin the parameters start in, SGD's fluctuations, on the one hand, make it possible to jump out of local minima and reach new and potentially better ones. On the other hand, because SGD keeps fluctuating, this ultimately complicates convergence to the exact minimum. However, it has been shown that when we slowly decrease the learning rate, SGD exhibits the same convergence behaviour as batch gradient descent, converging to a local or the global minimum for non-convex and convex optimization respectively.

The SGD code fragment simply adds a loop over the training examples and estimates the gradient with respect to each example. Note that we shuffle the training data at every epoch, which will be explained later:

for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of training examples:

$\theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i:i+n)}; y^{(i:i+n)})$

This way, it reduces the variance of the parameter updates, which leads to more stable convergence. In addition, it can make use of the highly optimized matrix operations found in state-of-the-art deep learning libraries, which makes computing the gradient over a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but they can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually also used when mini-batches are employed. Note: in the modifications of SGD later in this article, we omit the parameters $x^{(i:i+n)}; y^{(i:i+n)}$ for simplicity.

In code, instead of iterating over individual examples, we now iterate over mini-batches of size 50:

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

Challenges

However, vanilla mini-batch gradient descent does not guarantee good convergence, and it raises a few challenges that need to be addressed:

  • Choosing an appropriate learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
  • Learning rate schedules [11] try to adjust the learning rate automatically during training, e.g. by annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in the objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to the characteristics of the dataset [10] (a minimal schedule sketch follows after this list).
  • In addition, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but rather perform a larger update for rarely occurring features.
  • Another key challenge is minimizing the highly non-convex error functions common for neural networks while avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. [19] argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
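As a minimal sketch of the kind of pre-defined schedule mentioned above, here is a hypothetical step decay, reusing the placeholder helpers from the snippets above (the decay factor and interval are arbitrary choices, not taken from the article):

initial_learning_rate = 0.1
for epoch in range(nb_epochs):
    # Halve the learning rate every 10 epochs, regardless of how training is going;
    # the schedule is fixed in advance and cannot adapt to the dataset.
    learning_rate = initial_learning_rate * (0.5 ** (epoch // 10))
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad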

Gradient descent optimization algorithms

Next, we outline some algorithms that are widely used by the deep learning community to deal with the aforementioned challenges. We will not discuss algorithms that are infeasible to compute in practice for high-dimensional datasets, e.g. second-order methods such as Newton's method.

Momentum

SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine, as shown in Figure 2, while only making hesitant progress towards the local optimum.

Momentum [2] is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as shown in Figure 3. It does this by adding a fraction $\gamma$ of the update vector of the previous time step to the current update vector:

$v_t = \gamma v_{t-1} + \eta \nabla_\theta J( \theta)$

$\theta = \theta - v_t$

Note: some implementations exchange the signs in these equations. The momentum term $\gamma$ is usually set to 0.9 or a similar value.

Essentially, when using momentum, we are pushing a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity under air resistance, i.e. $\gamma < 1$). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
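A minimal sketch of the momentum update, written in the same pseudocode style as the snippets above (evaluate_gradient, loss_function, data, params, and nb_epochs are the same placeholders):

v = np.zeros_like(params)          # v_0 = 0
gamma, learning_rate = 0.9, 0.01   # typical momentum term and learning rate
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    v = gamma * v + learning_rate * params_grad   # v_t = gamma * v_{t-1} + eta * gradient
    params = params - v                           # theta = theta - v_t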

Nesterov accelerated gradient

However, a ball that simply rolls down a hill, blindly following the slope, is not satisfactory. We would like a smarter ball, one that has a notion of where it is going, so that it knows to slow down before the slope goes uphill again.

Nesterov accelerated gradient (NAG) [7] is a way to give the momentum term this kind of foresight. We know that we will use the momentum term $\gamma v_{t-1}$ to move the parameters $\theta$. Computing $\theta - \gamma v_{t-1}$ therefore gives an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea of where the parameters are going to be. We can now effectively look ahead by calculating the gradient not with respect to the current parameters but with respect to the approximate future position of the parameters:

$v_t = \gamma v_{t-1} + \eta \nabla_\theta J( \theta - \gamma v_{t-1} )$

$\theta = \theta - v_t$

Again, we set the momentum term to a value of around 0.9. While momentum first computes the current gradient (the small blue arrow in Figure 4) and then takes a big jump in the direction of the updated accumulated gradient (the big blue arrow), NAG first takes a big jump in the direction of the previously accumulated gradient (the brown arrow), then measures the gradient and makes a correction (the green arrow). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly improved the performance of RNNs on a number of tasks [8].

(Figure source: G. Hinton's lecture 6c)

Refer to this article for an alternative explanation of the intuition behind NAG, while Ilya Sutskever gives a more detailed overview in his doctoral thesis [9].
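A minimal sketch of NAG in the same pseudocode style; the only change from the momentum sketch above is that the gradient is evaluated at the look-ahead position:

v = np.zeros_like(params)
gamma, learning_rate = 0.9, 0.01
for i in range(nb_epochs):
    # Evaluate the gradient at the approximate future position theta - gamma * v_{t-1}
    params_grad = evaluate_gradient(loss_function, data, params - gamma * v)
    v = gamma * v + learning_rate * params_grad
    params = params - v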

Now that we are able to adapt our updates to the slope of the error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter, performing larger or smaller updates depending on their importance.

Adagrad

Adagrad [3] is a gradient-based optimization algorithm that adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. It is therefore well suited for dealing with sparse data. Dean et al. [4] found that Adagrad greatly improved the robustness of SGD and used it to train large-scale neural networks at Google, including the network that learned to recognize cats in YouTube videos. Moreover, Pennington et al. [5] used Adagrad to train the GloVe word embeddings, as infrequent words require much larger updates than frequent ones.

Previously, we updated all parameters $\theta$ at once using the same learning rate $\eta$ for every parameter $\theta_i$. Adagrad instead uses a different learning rate for every parameter $\theta_i$ at every time step $t$, so we first show Adagrad's per-parameter update and then vectorize it. For brevity, let $g_{t, i}$ denote the gradient of the objective function with respect to the parameter $\theta_i$ at time step $t$:

$g_{t, i} = \nabla_\theta J( \theta_{t, i} )$

The SGD update for every parameter $\theta_i$ at each time step $t$ then becomes:

$\theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i}$

In its update rule, Adagrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients that have been computed for $\theta_i$:

$\theta_{t+1, i} = \theta_{t, i} - \dfrac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i}$

Here $G_{t} \in \mathbb{R}^{d \times d}$ is a diagonal matrix where each diagonal element $i, i$ is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$, while $\epsilon$ is a smoothing term that prevents division by zero (typically on the order of $10^{-8}$). Interestingly, without the square root operation, the algorithm performs much worse.

Since $G_{t}$ contains the sum of the squares of the past gradients with respect to all parameters $\theta$ along its diagonal, we can now vectorize our implementation by performing a matrix-vector product $\odot$ between $G_{t}$ and $g_{t}$:

$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}$

The main advantage of Adagrad is that there is no need to manually adjust the learning rate. Most implementations use a default value of 0.01 and leave it at that.

Adagrad's main drawback is its accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become so small that the algorithm can no longer acquire additional knowledge. The following algorithm aims to resolve this flaw.
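A minimal sketch of Adagrad in the same pseudocode style, keeping only the diagonal of $G_t$ as a vector of accumulated squared gradients:

learning_rate, eps = 0.01, 1e-8
grad_squared_sum = np.zeros_like(params)   # diagonal of G_t
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    grad_squared_sum += params_grad ** 2   # accumulate squared gradients (keeps growing)
    params = params - learning_rate * params_grad / np.sqrt(grad_squared_sum + eps)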

Adadelta

Adadelta [6] is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $w$.

Instead of inefficiently storing the $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E[g^2]_t$ at time step $t$ then depends (as a fraction $\gamma$, similarly to the momentum term) only on the previous average and the current gradient:

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g^2_t$

We set $\gamma$ to a value similar to the momentum term, around 0.9. For clarity, we now rewrite the vanilla SGD update in terms of the parameter update vector $\Delta \theta_t$:

$\Delta \theta_t = - \eta \cdot g_{t, i}$

$\theta_{t+1} = \theta_t + \Delta \theta_t $

The parameter update vector of Adagrad that we derived previously thus takes the form:

$ \Delta \theta_t = - \dfrac{\eta}{\sqrt{G_{t} + \epsilon}} \odot g_{t}$

We now simply replace the diagonal matrix $G_{t}$ with the decaying average over past squared gradients $E[g^2]_t$:

$\Delta \theta_t = - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t} $

Since the denominator is just the root mean squared (RMS) error criterion of the gradient, we can replace it with the criterion shorthand:

$ \Delta \theta_t = - \dfrac{\eta}{RMS[g]_{t}} g_t$

The authors note that the units in this update (as well as in SGD, momentum, or Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To achieve this, they first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates:

$E[\Delta \theta^2]_t = \gamma E[\Delta \theta^2]_{t-1} + (1 - \gamma) \Delta \theta^2_t$

The root mean squared error of the parameter updates is thus:

$RMS[\Delta \theta]_{t} = \sqrt{E[\Delta \theta^2]_t + \epsilon} $

Since $RMS[\Delta \theta]_{t}$ is unknown, we approximate it with the RMS of the parameter updates up to the previous time step. Replacing the learning rate $\eta$ in the previous update rule with $RMS[\Delta \theta]_{t-1}$ finally yields the Adadelta update rule:

$ \Delta \theta_t = - \dfrac{RMS[\Delta \theta]_{t-1}}{RMS[g]_{t}} g_{t}$

$\theta_{t+1} = \theta_t + \Delta \theta_t$

With Adadelta, we don’t even need to set a default learning rate, as it has been eliminated in the update rules.
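A minimal sketch of Adadelta in the same pseudocode style (E_g2 and E_dx2 track the decaying averages of squared gradients and squared parameter updates; the placeholder helpers are the same as above):

gamma, eps = 0.9, 1e-6
E_g2 = np.zeros_like(params)    # E[g^2]_t
E_dx2 = np.zeros_like(params)   # E[Delta theta^2]_t
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    E_g2 = gamma * E_g2 + (1 - gamma) * params_grad ** 2
    # The RMS of previous parameter updates replaces the learning rate
    delta = - np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * params_grad
    E_dx2 = gamma * E_dx2 + (1 - gamma) * delta ** 2
    params = params + delta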

RMSprop

RMSprop is an unpublished adaptive learning rate approach presented by Geoff Hinton in a Coursera lecture.

RMSprop and Adadelta were developed independently around the same time, both stemming from the need to resolve Adagrad's radically diminishing learning rates. In fact, RMSprop is identical to the first update vector of Adadelta that we derived above:

$E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g^2_t$

$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t}$

RMSprop likewise divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting $\gamma$ to 0.9, while a good default value for the learning rate $\eta$ is 0.001.
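A minimal sketch of RMSprop in the same pseudocode style, using the suggested default values:

gamma, learning_rate, eps = 0.9, 0.001, 1e-8
E_g2 = np.zeros_like(params)
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    E_g2 = gamma * E_g2 + (1 - gamma) * params_grad ** 2   # decaying average of squared gradients
    params = params - learning_rate * params_grad / np.sqrt(E_g2 + eps)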

Adam

Adaptive Moment Estimation (Adam) [15] is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. Because $m_t$ and $v_t$ are initialized as vectors of zeros, Adam's authors observe that they are biased towards zero, especially during the initial time steps and especially when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1).

They counteract these biases by computing bias-corrected first and second moment estimates:

$\hat{m}_t = \dfrac{m_t}{1 - \beta^t_1} $

$\hat{v}_t = \dfrac{v_t}{1 - \beta^t_2} $

Then, just as with Adadelta and RMSprop, they use these values to update the parameters, which yields Adam's update rule:

$\theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$

$\beta_1$ defaults to 0.9, $\beta_2$ defaults to 0.999, and $\epsilon$ defaults to $10^{-8}$. Based on their experiments, the authors show that Adam works well in practice and compares favorably to other adaptive learning-rate algorithms.
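A minimal sketch of Adam in the same pseudocode style, with the defaults from the text above plus a learning rate of 0.001 (the learning rate value is an added assumption, not stated in the text); the bias correction uses the 1-based step counter t:

beta1, beta2, learning_rate, eps = 0.9, 0.999, 0.001, 1e-8
m = np.zeros_like(params)   # first moment estimate
v = np.zeros_like(params)   # second moment estimate
for t in range(1, nb_epochs + 1):
    params_grad = evaluate_gradient(loss_function, data, params)
    m = beta1 * m + (1 - beta1) * params_grad
    v = beta2 * v + (1 - beta2) * params_grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    params = params - learning_rate * m_hat / (np.sqrt(v_hat) + eps)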

Tags: Optimization, Gradient Descent, SGD, Mini-Batch, Neural Network