An overview of the gradient descent algorithm

Gradient descent algorithms are widely used in machine learning. Whether in linear regression or logistic regression, their purpose is to iteratively find the parameter values at which the objective function reaches its minimum, i.e. to minimize the loss function.

Principle of gradient descent

The valley problem

Gradient descent, put simply, is a way of finding the lowest point. Finding that lowest point is like walking down into a valley: at every step we hope to move toward the bottom, and the question is how to choose the path that leads there.

Now suppose that we cannot see the bottom directly; we can only see a small part of our surroundings, and we have to find our way to the bottom step by step.

So, taking the current position as a reference, find the steepest downhill direction (the direction of the tangent) and take a step in that descending direction; then find the steepest direction again and take another step, and so on, until the lowest point is reached.
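As a minimal sketch of this "walking downhill" idea: the single-variable function f, the starting position, and the step size below are illustrative assumptions, not part of the article.

# A minimal sketch of walking downhill on a single-variable function.
# The function f, the start point and the step size are illustrative assumptions.
def f(x):
    return (x - 3) ** 2          # a simple valley with its bottom at x = 3

def f_prime(x):
    return 2 * (x - 3)           # slope (steepest direction) at the current position

x = 10.0                         # starting position on the hillside
step = 0.1                       # how far to move each time
for _ in range(100):
    x = x - step * f_prime(x)    # step against the slope, i.e. downhill
print(x)                         # approaches 3, the bottom of the valley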

The gradient

The gradient is in fact the generalization of the derivative to functions of several variables:

The gradient collects the derivative with respect to each variable into a single object, written inside angle brackets ⟨ ⟩ to indicate that it is a vector: for a function f(x1, ..., xn), the gradient is ∇f = ⟨∂f/∂x1, ..., ∂f/∂xn⟩.

  • In a single-variable function, the gradient is simply the derivative of the function, i.e. the slope of the tangent line.
  • In a multivariable function, the gradient is a vector. The direction of this vector is the direction of the gradient, i.e. the direction in which the function increases fastest; moving in the opposite (negative gradient) direction therefore decreases the function fastest.
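As a concrete illustration of the gradient as a vector of partial derivatives, here is a small sketch; the two-variable function g and the numerical differentiation used here are illustrative assumptions, not taken from the article.

# Numerical gradient of a two-variable function (example function chosen for illustration).
def g(x, y):
    return x ** 2 + 3 * y ** 2

def numerical_gradient(x, y, h=1e-6):
    dg_dx = (g(x + h, y) - g(x - h, y)) / (2 * h)   # partial derivative with respect to x
    dg_dy = (g(x, y + h) - g(x, y - h)) / (2 * h)   # partial derivative with respect to y
    return (dg_dx, dg_dy)                            # the gradient vector <dg/dx, dg/dy>

print(numerical_gradient(1.0, 2.0))   # roughly (2.0, 12.0); the negative of this vector
                                      # points in the direction of fastest decrease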

Gradient descent

A loss function generally involves two kinds of parameters: the weight w, which scales the input signal, and the bias b, which adjusts the offset between the function's output and the true value.

Gradient descent continually adjusts the weight w and the bias b so that the loss becomes as small as possible.

The direction of the gradient vector tells us which way to descend, but not how far to move at each step.

This requires a new concept: the learning rate α. The weight is then updated by

ωi+1 = ωi − α · ∂L/∂ω

where ωi is the current weight value and ωi+1 is the updated weight value; the bias b is updated in the same way. In gradient descent, this update is repeated many times until the loss function converges.

α should not be chosen too large, or we may step over the lowest point; nor too small, or the descent becomes very slow.
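The effect of α can be seen in a small sketch; the quadratic function and the three α values below are assumptions chosen only for illustration.

# Effect of the learning rate alpha on f(x) = x**2 (illustrative example).
def grad(x):
    return 2 * x                      # derivative of x**2

for alpha in (0.01, 0.1, 1.1):        # too small, reasonable, too large
    x = 5.0
    for _ in range(50):
        x = x - alpha * grad(x)       # gradient descent update
    print(alpha, x)                   # 0.01 descends slowly, 0.1 converges near 0,
                                      # 1.1 overshoots and diverges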

Gradient descent process

1. Loop over all the training samples

(1) For the i-th training sample, compute the gradient of the loss function with respect to the weight ω and with respect to the bias b. This gives a weight gradient and a bias gradient for every training sample.

(2) Sum the weight gradients of all the training samples.

(3) Sum the bias gradients of all the training samples.

2. Update the weight and the bias

(1) From the sums obtained in steps (2) and (3) above, compute the mean gradient of the weight and the mean gradient of the bias over all the samples.

(2) Update the weight and the bias with these mean gradients:

ω = ω − α · (mean gradient of the weight)
b = b − α · (mean gradient of the bias)

Repeat this process until the loss function converges.
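The loop below is a minimal sketch of the procedure above for a model y = w·x + b with a squared-error loss; the toy data, the initial values, and the number of iterations are illustrative assumptions, while the matrix-based demo in the next section is the article's actual implementation.

# Sketch of batch gradient descent for y = w*x + b with squared-error loss.
# Toy data and initial values are illustrative assumptions.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]             # generated from y = 2x + 1
w, b, alpha = 0.0, 0.0, 0.05

for _ in range(2000):
    grad_w_sum, grad_b_sum = 0.0, 0.0
    for x, y in zip(xs, ys):          # step 1: gradient of the loss for each sample
        error = (w * x + b) - y
        grad_w_sum += error * x       # step (2): accumulate the weight gradients
        grad_b_sum += error           # step (3): accumulate the bias gradients
    m = len(xs)
    w -= alpha * grad_w_sum / m       # step 2: update with the mean gradients
    b -= alpha * grad_b_sum / m

print(w, b)                           # approaches w = 2, b = 1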

Gradient descent demo

1. Define data sets

from numpy import *

m = 20                                   # number of sample points
X0 = ones((m, 1))                        # column of ones, for the intercept term
X1 = arange(1, m + 1).reshape(m, 1)      # x coordinates of the samples: 1, 2, ..., m
X = hstack((X0, X1))                     # sample data, one row per sample
# y coordinates of the samples
Y = array([3, 4, 5, 5, 2, 4, 7, 8, 11, 8,
           12, 11, 13, 13, 16, 17, 18, 17, 19, 21]).reshape(m, 1)
alpha = 0.01                             # learning rate

The reshape() function restructures the original array into a two-dimensional array with m rows and 1 column.

2. Cost function and gradient function

def cost_function(theta, X, Y):
    diff = dot(X, theta) - Y                       # dot() multiplies the arrays as matrices
    return (1 / (2 * m)) * dot(diff.transpose(), diff)

def gradient_function(theta, X, Y):
    diff = dot(X, theta) - Y
    return (1 / m) * dot(X.transpose(), diff)      # gradient of the cost with respect to theta

3. Gradient descent calculation

def gradient_descent(X, Y, alpha):
    theta = array([1, 1]).reshape(2, 1)            # initial parameter values
    gradient = gradient_function(theta, X, Y)
    while not all(abs(gradient) <= 1e-5):          # stop once every component is tiny
        theta = theta - alpha * gradient           # step in the negative gradient direction
        gradient = gradient_function(theta, X, Y)
    return theta

optimal = gradient_descent(X, Y, alpha)
print('optimal:', optimal)
print('cost function:', cost_function(optimal, X, Y)[0][0])

When every component of the gradient is smaller than 1e-5 in absolute value, the bottom has essentially been reached. Further iterations would change very little, so the loop exits and the iteration ends.

4. Draw pictures

import matplotlib.pyplot as plt

def plot(X, Y, theta):
    ax = plt.subplot(111)
    ax.scatter(X, Y, s=30, c="red")                # the sample points
    y = theta[0] + theta[1] * X                    # the fitted line y = theta0 + theta1 * x
    ax.plot(X, y)
    plt.show()

plot(X1, Y, optimal)

5. Result

A linear fit of the sample data is obtained.


Reference:

https://zhuanlan.zhihu.com/p/68468520

https://blog.csdn.net/qq_41800366/article/details/86583789

https://www.w3cschool.cn/tensorflow_python/