Directory:

  1. Introduction
  2. Stochastic gradient descent
  3. Standard gradient descent method
  4. Batch gradient descent method
  5. Momentum gradient descent
  6. Nesterov Momentum gradient descent method
  7. AdaGrad gradient descent method
  8. Adam gradient descent method
  9. Comparison with other unconstrained optimization algorithms

1 Introduction

Gradient descent is a commonly used algorithm in machine learning, but it is not a machine learning algorithm itself; it is an optimization algorithm. Its basic idea is to approach the optimal solution step by step, with each step taken along the direction of the (negative) gradient. The essence of machine learning is to "feed" data to a model so that the model keeps learning, and that learning process is a process of continuous optimization by gradient descent. The most common example today is the deep neural network, which uses back-propagation to compute gradients and repeatedly updates the model parameters until convergence, thereby optimizing the model. Take the simplest linear model as an example:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n,$$

with the squared-error loss over the $m$ training samples

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$

Among them, $x^{(i)}$ and $y^{(i)}$ denote the $i$-th sample and its label, so it is not difficult to get

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)},$$

namely the update rule

$$\theta_j := \theta_j - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j},$$

where $\alpha$ is the learning rate. The common variants of gradient descent include:
Stochastic gradient descent (SGD)
Batch gradient descent method
Momentum gradient descent
Nesterov Momentum gradient descent method
AdaGrad gradient descent method
RMSprop gradient descent method
Adam gradient descent method

[Figure: Illustration of gradient descent]
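As a minimal sketch of the basic update rule above (the function names and the toy example are illustrative, not from the original article), the gradient descent loop looks like this in Python:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, n_iters=100):
    """Generic gradient descent: repeatedly step against the gradient direction."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - alpha * grad(theta)   # theta := theta - alpha * dJ/dtheta
    return theta

# Example: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0]))  # -> approximately [3.]
```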

2 Stochastic gradient descent

The so-called stochastic gradient descent method means that a single sample pair $(x^{(i)}, y^{(i)})$ is randomly selected from the sample set each time to update $\theta$; the update formula is:

$$\theta_j := \theta_j - \alpha\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

The learning rate $\alpha$ of stochastic gradient descent should not be set too large; otherwise the iterates easily "oscillate" near the optimal solution without ever getting closer to it.
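A minimal sketch, assuming the same squared-error linear model as in the introduction (function names are illustrative):

```python
import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: update theta with one random sample at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):        # visit samples in random order
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient for single sample i
            theta -= alpha * grad
    return theta
```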


3 Standard gradient descent method

The standard (full-batch) gradient descent rule is to compute the gradient of the loss summed over the entire sample set and then update the parameters:

$$\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
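A sketch of one full-batch step under the same linear-model assumption (the helper name is illustrative):

```python
import numpy as np

def standard_gd_step(theta, X, y, alpha=0.01):
    """One standard (full-batch) update: average the gradient over all m samples."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    return theta - alpha * grad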

4 Batch gradient descent method

Each iteration randomly selects $n$ (batch_size) samples from the training set and updates with their averaged gradient:

$$\theta_j := \theta_j - \alpha\,\frac{1}{n}\sum_{i \in \mathcal{B}}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)},$$

where $\mathcal{B}$ is the randomly sampled mini-batch of size $n$.
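A sketch of one mini-batch step (the helper name and defaults are assumptions, not from the article):

```python
import numpy as np

def minibatch_step(theta, X, y, alpha=0.01, batch_size=32, rng=None):
    """One mini-batch update: average the gradient over batch_size random samples."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch_size
    return theta - alpha * grad
```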

5 Momentum gradient descent method

This algorithm is also called the momentum gradient descent method. The basic idea is that minimizing the loss function can be viewed as a ball rolling down the loss surface (the surface formed by the loss values in the coordinate system) from some point until it reaches the lowest point of the surface. The gradient of the loss function can be regarded as the force applied to the ball; the force changes the ball's velocity, and the velocity changes the ball's position. According to the mechanical law $F = ma$, the gradient is proportional to the acceleration, and the acceleration changes the velocity, so the following update process is obtained:

$$v := \gamma v + \alpha\,\nabla_\theta J(\theta)$$

$$\theta := \theta - v$$

Among them, $\gamma$ is the momentum coefficient; its value can be determined by trial and error and is usually set to 0.9 in practice.

In this optimization method, the direction of optimization does not change immediately; instead, the direction at each step is a weighted combination of the previous update direction and the newly computed gradient. In other words, the optimization direction changes gradually through accumulation, and the more consistent the accumulated gradients are, the larger the step becomes. The advantage of this method is that when gradients are computed from different training samples, the component along the consistently optimal direction keeps being reinforced, so many oscillations are damped out.
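A minimal sketch of the momentum update, mirroring the equations above (names are illustrative):

```python
def momentum_step(theta, v, grad, alpha=0.01, gamma=0.9):
    """Momentum update: the velocity v accumulates a decaying sum of past gradients."""
    v = gamma * v + alpha * grad   # gamma is the momentum coefficient (typically 0.9)
    return theta - v, v
```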

6 Nesterov Momentum gradient descent method

Nesterov Momentum gradient descent is an improvement of the Momentum gradient descent method. The update process is as follows:

$$v := \gamma v + \alpha\,\nabla_\theta J(\theta - \gamma v)$$

$$\theta := \theta - v$$

In Momentum gradient descent, $\gamma v$ has already been computed, so one can "look ahead one more step": instead of evaluating the gradient at the current position $\theta$, evaluate it at the look-ahead position $\theta - \gamma v$. This position is not exact, but it is better than the current position $\theta$.
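A minimal sketch, assuming a callable `grad_fn` that evaluates the gradient at an arbitrary point (names are illustrative):

```python
def nesterov_step(theta, v, grad_fn, alpha=0.01, gamma=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point theta - gamma*v."""
    v = gamma * v + alpha * grad_fn(theta - gamma * v)
    return theta - v, v
```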

7 AdaGrad gradient descent method

In gradient descent, the choice of learning rate is very important, because it determines the size of the "stride" taken at each step of the optimization. Too large a stride leads to oscillations near the optimal solution, while too small a stride makes it easy to get stuck in a local optimum. In addition, different parameters call for different learning rates: some parameters may already be close to optimal and only need fine-tuning with a relatively small learning rate, while others still need to move substantially. In this scenario, AdaGrad adapts a separate learning rate to each parameter. The update process is as follows:

$$r := r + \left(\nabla_\theta J(\theta)\right)^2, \qquad \theta := \theta - \frac{\alpha}{\sqrt{r} + \epsilon}\,\nabla_\theta J(\theta),$$

where the square, square root, and division are element-wise, $r$ accumulates the squared gradients of each parameter, and $\epsilon$ is a small constant that prevents division by zero.
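A minimal sketch of the element-wise AdaGrad update (names and defaults are illustrative):

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha=0.01, eps=1e-8):
    """AdaGrad: each parameter's effective step size shrinks with its own gradient history."""
    r = r + grad ** 2   # accumulate squared gradients per parameter
    return theta - alpha * grad / (np.sqrt(r) + eps), r
```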
A drawback of AdaGrad is that the accumulated squared gradient $r$ only grows, so the effective learning rate keeps shrinking and may become vanishingly small before convergence. The core idea of RMSprop is therefore to use an exponentially decaying moving average of the squared gradients, discarding the history of the distant past.
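A minimal sketch of the RMSprop variant, where the AdaGrad accumulator is replaced by a decaying average with coefficient $\rho$ (commonly 0.9); names and defaults are illustrative:

```python
import numpy as np

def rmsprop_step(theta, r, grad, alpha=0.001, rho=0.9, eps=1e-8):
    """RMSprop: exponentially decaying average of squared gradients instead of a full sum."""
    r = rho * r + (1 - rho) * grad ** 2
    return theta - alpha * grad / (np.sqrt(r) + eps), r
```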

8 Adam gradient descent method

Adam considers both the gradient and the square of the gradient, and combines the advantages of AdaGrad and RMSprop: it dynamically adjusts the learning rate of each parameter according to first- and second-moment estimates of the gradient.

$$m_t := \beta_1 m_{t-1} + (1-\beta_1)\,g_t$$

$$v_t := \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

$$\hat{m}_t := \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t := \frac{v_t}{1-\beta_2^t}$$

$$\theta := \theta - \frac{\alpha}{\sqrt{\hat{v}_t}+\epsilon}\,\hat{m}_t$$

Among them, $m_t$ is the first-moment estimate (mean) of the gradient $g_t$ and $v_t$ is the second-moment estimate (uncentered variance); $\beta_1$ is generally set to 0.9, $\beta_2$ to 0.999, and $\epsilon$ to a small value such as $10^{-8}$.

Like AdaDelta and RMSprop, this method stores an exponentially decaying average of past squared gradients, and in addition it keeps an exponentially decaying average of past gradients $m_t$, which plays a role similar to momentum.
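A minimal sketch of one Adam step, following the equations above (names and defaults are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v
```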

9 Comparison with other unconstrained optimization algorithms

Unconstrained optimization algorithms in machine learning include, besides gradient descent, the least squares method mentioned above as well as Newton's method and quasi-Newton methods. Compared with least squares, gradient descent requires choosing a step size, while least squares does not. Gradient descent is an iterative solution, whereas least squares gives an analytical solution. If the sample size is not very large and an analytical solution exists, least squares has an advantage over gradient descent and is fast to compute. However, if the sample size is large, least squares becomes difficult or slow because it requires inverting a very large matrix, and the iterative gradient descent method is more advantageous. Both gradient descent and Newton/quasi-Newton methods are iterative, but gradient descent uses only first-order gradient information, while Newton/quasi-Newton methods use the inverse (or pseudo-inverse) of the second-order Hessian matrix. Comparatively, Newton/quasi-Newton methods converge faster, but each iteration takes longer than that of gradient descent.

Read the original article: https://mp.weixin.qq.com/s/qtnACz8Oqq5ziNy61zbovw