If you have any questions about this blog, please leave a comment.

1. Directional derivative

Directional derivative: analogous to the partial derivative, which is the rate of change of a function along a coordinate axis, the directional derivative is the rate of change of the function along the direction of a given ray.

Theorem: if the function $f(x,y)$ is differentiable at the point $P_0(x_0,y_0)$, then the directional derivative of the function at that point exists in any direction $l$, and

$$\left. \frac{\partial f}{\partial l} \right|_{(x_0,y_0)} = f_x(x_0,y_0)\cos\alpha + f_y(x_0,y_0)\cos\beta \tag{1}$$

where $\cos\alpha$ and $\cos\beta$ are the direction cosines of the direction $l$.
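
As a quick sanity check of formula (1), the sketch below compares the formula against a small finite-difference step along $l$; the function $f(x,y)=x^2+y^2$ and the 45° direction are assumptions chosen only for illustration:

import numpy as np

# minimal numeric check of formula (1); f(x, y) = x**2 + y**2 is an assumed example
def f(x, y):
    return x**2 + y**2

x0, y0 = 1.0, 2.0
angle = np.pi / 4                            # direction l at 45 degrees to the x-axis
cos_a, cos_b = np.cos(angle), np.sin(angle)  # direction cosines of l (cos(beta) = sin(alpha) in 2D)

# partial derivatives of f at (x0, y0): f_x = 2*x0, f_y = 2*y0
by_formula = 2*x0*cos_a + 2*y0*cos_b         # formula (1)

# finite-difference approximation of the rate of change along l
h = 1e-6
numeric = (f(x0 + h*cos_a, y0 + h*cos_b) - f(x0, y0)) / h

print(by_formula, numeric)                   # both are about 4.2426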

2. The gradient

Gradient: the gradient is a vector. It points in the direction along which the directional derivative of the function at that point attains its maximum value; in other words, the function changes fastest along this direction (the direction of the gradient), and the maximum rate of change is the magnitude of the gradient.

In calculus, we take the partial derivative of a multivariable function with respect to each parameter and collect these partial derivatives into a vector; that vector is the gradient.

Definition: let the binary function $z=f(x,y)$ have continuous first-order partial derivatives on a plane region $D$. The vector $\left\lbrace \frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right\rbrace = f_x(x,y)\vec i + f_y(x,y)\vec j$ is called the gradient of $z=f(x,y)$ at the point $P(x,y)$, written $\mathrm{grad}\,f(x,y)$ or $\nabla f(x,y)$, that is:


$$\mathrm{grad}\,f(x,y) = \nabla f(x,y)=\left\lbrace \frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right\rbrace = f_x(x,y)\vec i + f_y(x,y)\vec j \tag{2}$$

Note that the gradient is defined for functions of several variables. If you want to carry the idea over to a function of one variable, think of it this way: the gradient at a point is a scalar equal to the derivative at that point.
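
To make the "fastest change" claim concrete, here is a small sketch (again using the assumed example $f(x,y)=x^2+y^2$) that evaluates the directional derivative in every direction and confirms the maximum occurs along the gradient:

import numpy as np

# illustrative check that the directional derivative is largest along the gradient;
# f(x, y) = x**2 + y**2 is an assumed example function
x0, y0 = 1.0, 2.0
grad = np.array([2*x0, 2*y0])        # gradient from formula (2): (f_x, f_y)

angles = np.linspace(0, 2*np.pi, 3600, endpoint=False)
dir_derivs = grad[0]*np.cos(angles) + grad[1]*np.sin(angles)  # formula (1) in every direction

best_angle = angles[np.argmax(dir_derivs)]
print(np.degrees(best_angle))                    # about 63.4 degrees
print(np.degrees(np.arctan2(grad[1], grad[0])))  # direction of the gradient: also about 63.4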

3. Mathematical derivation of gradient descent algorithm

The gradient descent algorithm targets minimization problems; its purpose is to drive the objective function down to a minimum along the fastest path.

A popular interpretation is to picture walking down a hill: at your current position you take one small step in the steepest downhill direction, then at your new position you again take a small step in the steepest direction, and so on, step by step, until you feel you have reached the foot of the hill.

That makes intuitive sense, but it is only a metaphor for readers unfamiliar with the gradient descent algorithm. In the end the idea has to be realized in an algorithm and in code, so what is the concrete procedure? You have to do the math.

The algorithm operates on the loss function (also known as the objective function, cost function, or error function) in order to find the weights ($w$) and bias ($b$) that minimize it.

Let the loss function be $L(x,w,b,y)$. In order to find the optimal $w$ and $b$, we abstract it into the form $f(w,b)$ for convenience of calculation, where $f = L$ but the notation is different. Let $\theta=(w,b)^T$, where $T$ denotes transpose, so $\theta$ is a two-dimensional column vector.

Then the loss function is $f(\theta)$. Carrying out a first-order Taylor expansion on it gives:


$$f(\theta) \approx f(\theta_0) + (\theta - \theta_0)\cdot\nabla f(\theta_0) \tag{3}$$

The reason we use a first-order Taylor expansion is that it lets us "replace the curve with a straight line," which in mathematical terms is called a local linear approximation: over a very small interval the line and the curve almost coincide, so instead of doing calculations that are hard on the curve we can do them on the line.
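
A quick numeric illustration of this local linear approximation, with an assumed one-parameter toy loss $f(\theta)=\theta^2$: close to $\theta_0$ the first-order expansion nearly matches the true value, and the match degrades as the step grows.

# local linear approximation f(t) ~ f(t0) + (t - t0)*f'(t0); f(t) = t**2 is an assumed example
f = lambda t: t**2
df = lambda t: 2*t

t0 = 3.0
for step in (0.001, 0.1, 1.0):
    t = t0 + step
    approx = f(t0) + (t - t0) * df(t0)   # first-order Taylor expansion, as in formula (3)
    print(step, f(t), approx)            # the two values agree closely only for small steps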

$\theta-\theta_0$ is this small interval. It can be written as $\Delta\theta$, but it is still a vector, which we decompose into a magnitude times a unit vector, that is, a length and a direction:


$$\Delta\theta = \theta - \theta_0 = \rho\,\vec v \tag{4}$$

where $\rho$ is the magnitude of $\Delta\theta$ and $\vec v$ is the unit vector giving its direction.

Then the first-order Taylor expansion of the loss function $f(\theta)$ can be written as:


$$f(\theta) \approx f(\theta_0) + \rho\,\vec v\cdot\nabla f(\theta_0) \tag{5}$$

Here $f(\theta_0)$ is the current value of the loss function and $f(\theta)$ is the value after the update. As stated above, our purpose is to find the weights ($w$) and bias ($b$) that minimize the loss function, so every update must ensure that $f(\theta) < f(\theta_0)$, that is:


$$f(\theta) - f(\theta_0) \approx \rho\,\vec v\cdot\nabla f(\theta_0) < 0 \tag{6}$$

Since $\rho > 0$ (it is a magnitude), it follows that:


$$\vec v\cdot\nabla f(\theta_0) < 0 \tag{7}$$

Note here that $\vec v$ is a vector and $\nabla f(\theta_0)$, the gradient of the function $f(w,b)$ at the point $(w_0,b_0)$, is also a vector. For the dot product of two vectors to be less than 0, the angle between them has to be greater than 90 degrees. Recall that our goal is to drive the objective function down to its minimum along the fastest path. Since we want the fastest descent, the gap between $f(\theta)$ and $f(\theta_0)$, i.e. $|f(\theta)-f(\theta_0)|$, should be as large as possible; by formula (6), $f(\theta)-f(\theta_0)\approx\rho\,\vec v\cdot\nabla f(\theta_0) < 0$, so the smaller this dot product, the better. Going back to the dot product of $\vec v$ and $\nabla f(\theta_0)$: it is smallest when the angle between them is 180°, which is exactly why the independent variable is updated in the direction opposite to the gradient. In that case the result is:


$$\vec v\cdot\nabla f(\theta_0) = -\|\vec v\|\,\|\nabla f(\theta_0)\| = -\|\nabla f(\theta_0)\| \tag{8}$$

Therefore $\vec v$ can be written as:


$$\vec v=-\frac{\nabla f(\theta_0)}{\|\nabla f(\theta_0)\|} \tag{9}$$

Substituting this into $\theta-\theta_0=\rho\,\vec v$ (formula (4)) gives:


$$\theta=\theta_0-\rho\,\frac{\nabla f(\theta_0)}{\|\nabla f(\theta_0)\|} \tag{10}$$

Because $\rho$ and $\frac{1}{\|\nabla f(\theta_0)\|}$ are both scalars, we can set $\alpha = \frac{\rho}{\|\nabla f(\theta_0)\|}$, and the formula above (the update expression for $\theta$ in the gradient descent algorithm) becomes:


$$\theta=\theta_0-\alpha\nabla f(\theta_0) =\theta_0-\alpha \frac{\partial}{\partial \theta_0}f(\theta) \tag{11}$$

$\alpha$ is what we call the learning rate.

In addition, this formula is more formally written as an update rule:


$$\theta \leftarrow \theta_0-\alpha \frac{\partial}{\partial \theta_0}f(\theta) \tag{12}$$

This update rule is what you ultimately write in your code, but you have to compute the partial derivative according to the concrete form of your $f$ (the loss function) and then translate it into code.
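
A minimal sketch of update rule (12) on a simple convex function; the toy loss $f(\theta)=(\theta-5)^2$ and the settings below are assumptions chosen so the minimizer is known to be 5:

# minimal gradient descent loop implementing update rule (12);
# f(theta) = (theta - 5)**2 is an assumed toy loss, its derivative is 2*(theta - 5)
grad_f = lambda theta: 2 * (theta - 5.0)

theta = 0.0          # initial value theta_0
alpha = 0.1          # learning rate
for _ in range(100):
    theta = theta - alpha * grad_f(theta)   # theta <- theta_0 - alpha * df/d(theta_0)

print(theta)         # close to 5.0, the minimizer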

When actually using the algorithm, the formula is sometimes written in component form. Substituting $\theta=(w,b)^T$ and $\theta_0=(w_0,b_0)^T$ into the formula above gives:


$$(w,b)^T=(w_0,b_0)^T-\alpha \left( \frac{\partial f}{\partial w_0}, \frac{\partial f}{\partial b_0} \right)^T \tag{13}$$

That is:


$$\begin{aligned} w&=w_0-\alpha \frac{\partial f}{\partial w_0} \\ b&=b_0-\alpha \frac{\partial f}{\partial b_0} \end{aligned} \tag{14}$$

The two forms mean the same thing, they are just expressed differently. In practice the vector form is usually used rather than the component form.
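
A small sketch showing that the vector update (12) and the component update (14) produce the same numbers; the current parameters and partial derivative values below are made up for illustration:

import numpy as np

# illustrative current parameters and partial derivatives
w0, b0 = 0.4, -0.2
df_dw, df_db = 1.5, -0.8
alpha = 0.1

# component form, formula (14)
w = w0 - alpha * df_dw
b = b0 - alpha * df_db

# vector form, formula (12)
theta = np.array([w0, b0]) - alpha * np.array([df_dw, df_db])

print(w, b)      # 0.25 -0.12
print(theta)     # [ 0.25 -0.12]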

So that’s the mathematical derivation of updating the independent variables of the gradient descent algorithm, so when do you stop updating, you would say of course when you find the weight and the bias that minimize the loss function, yeah that’s what we described in the purpose of the algorithm. However, how to know that the minimum value of the loss function is found? In practical application, we don’t really iterate to the minimum value of the loss function, but approach the minimum value as close as possible within the acceptable range of accuracy and training time, and trade off the resource consumption and accuracy requirements. Specific closing conditions are usually:

  1. $\|\nabla f(\theta_0)\|$ is small enough; in other words, the loss function is no longer decreasing significantly. The value of the loss function itself should also be checked; otherwise the initial parameters and training data should be re-examined. In actual programming, the loss value may not be judged directly; for performance reasons it may be judged indirectly (for example via the error value).
  2. The number of iterations reaches a predetermined value.
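
A sketch of how these two stopping conditions might look in code; the toy gradient, the tolerance tol, and the starting point are assumptions for illustration:

import numpy as np

# sketch of the two stopping conditions; grad_f, tol, and the starting point are assumed
grad_f = lambda theta: 2 * (theta - np.array([1.0, -2.0]))   # toy gradient with minimizer (1, -2)

theta = np.zeros(2)
alpha, tol, max_iter = 0.1, 1e-6, 10000
for it in range(max_iter):                  # condition 2: predetermined number of iterations
    g = grad_f(theta)
    if np.linalg.norm(g) < tol:             # condition 1: ||grad f(theta_0)|| small enough
        break
    theta = theta - alpha * g

print(it, theta)                            # stops well before max_iter, theta near (1, -2)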

This is the core idea of the gradient descent algorithm. In practice it is realized as specific algorithms: the gradient descent family includes Batch Gradient Descent (BGD, also known as steepest descent), Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent (MBGD). They use the same update rule but adopt different strategies for feeding in the data.

4. Batch gradient descent code implementation

4.1 Selection of loss function and model

The mean squared error (MSE) is chosen as the loss function, and its expression is:


$$L(\theta; x,y) = \frac{1}{M} \sum^M_{i=1} (\hat y(x_i,\theta) - y_i)^2 \tag{15}$$

where:

$\hat y$ — the prediction function,

$x, y$ — the input values and label values of the training data, respectively,

$\theta$ — the vector composed of $w$ and $b$,

$M$ — the number of training samples.
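
Formula (15) in numpy form, as a small sketch; the prediction and label arrays are made-up illustration data:

import numpy as np

# mean squared error, formula (15); y_hat and y_true are illustrative arrays
y_hat  = np.array([1.1, 1.9, 3.2])
y_true = np.array([1.0, 2.0, 3.0])

M = len(y_true)
mse = np.sum((y_hat - y_true) ** 2) / M
print(mse)   # 0.02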

The prediction model (function) is a linear regression model, expressed as:


$$\hat y(x,\theta)=\sum_{j=0}^n\theta_jx_j \tag{16}$$

where:

$x$ — an input data point to predict on,

$\theta$ — the learned weights ($w$) and bias ($b$),

$n$ — the dimension of the input data.
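
In formula (16) the bias can be folded into the sum by fixing $x_0 = 1$, so $\theta_0$ plays the role of $b$; this is a common convention rather than something spelled out in the formula. A small sketch with assumed numbers:

import numpy as np

# formula (16): y_hat = sum_j theta_j * x_j; the bias b is carried as theta[0]
# by prepending x_0 = 1 (a common convention); the numbers are illustrative
theta = np.array([0.5, 2.0, -1.0])     # [b, w1, w2]
x_raw = np.array([3.0, 4.0])           # one input point
x = np.concatenate(([1.0], x_raw))     # prepend x_0 = 1

y_hat = np.dot(theta, x)
print(y_hat)    # 0.5 + 2*3 - 1*4 = 2.5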

4.2 From formula to code

The primary objective is to find the weights ($w$) and bias ($b$) that minimize the loss function. Have we found them? We have: their update formula is formula (12), so the core of the code is to implement formula (12). The important part of this formula is $\frac{\partial}{\partial \theta_0}f(\theta)$, the partial derivative of the loss function with respect to $\theta$. Our loss function is given by formula (15), i.e. $f(\theta) = L(\theta; x, y)$. Taking the partial derivative of formula (15) with respect to $\theta$ gives:


$$\begin{aligned} \frac{\partial}{\partial \theta_0}f(\theta) &= \frac{\partial}{\partial \theta_0}(\hat y(x,\theta_0) - y)^2 \\ &= 2(\hat y(x,\theta_0) - y)\cdot\frac{\partial}{\partial \theta_0}(\hat y(x,\theta_0) - y) \\ &= 2(\hat y(x,\theta_0) - y)\cdot\frac{\partial}{\partial \theta_0}\left(\sum_{j=0}^n \theta_j x_j - y\right) \\ &= 2(\hat y(x,\theta_0) - y)\cdot x \end{aligned} \tag{17}$$

Substituting formula (17) into formula (12) gives:


$$\theta \leftarrow \theta_0-\alpha\cdot 2\cdot(\hat y(x,\theta_0) - y)\cdot x \tag{18}$$

Here $\hat y$ is the prediction function, $\theta_0$ is the current value, $\theta$ is the next updated value, $x, y$ are the input values and label values of the training data, and $\alpha$ is the learning rate. Since 2 is a constant and $\alpha$ is a scalar, the 2 can be absorbed into $\alpha$; the resulting $\alpha$ is actually $\frac{2\rho}{\|\nabla f(\theta_0)\|}$. This shows that the learning rate expresses the step size of each gradient descent iteration. In practice its value is usually set by hand or by a specific strategy rather than computed from the original formula.
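
Since the learning rate acts as the step size, its choice matters in practice. A minimal sketch of this, on the assumed toy loss $f(\theta)=\theta^2$ with made-up rates: a small rate converges slowly, a moderate one converges quickly, and a too-large one diverges.

# effect of the learning rate on the toy loss f(theta) = theta**2 (an assumed example)
grad_f = lambda theta: 2 * theta

for alpha in (0.01, 0.1, 1.1):
    theta = 1.0
    for _ in range(50):
        theta = theta - alpha * grad_f(theta)
    print(alpha, theta)   # 0.01: converges slowly; 0.1: near 0; 1.1: diverges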

Absorbing the factor 2 into $\alpha$, formula (18) can therefore be written as:


$$\theta \leftarrow \theta_0-\alpha\cdot(\hat y(x,\theta_0) - y)\cdot x \tag{19}$$

This is the formula that is finally written into the program.

In order to facilitate programming, the formula is decomposed into:


$$hypothesis = \hat y(x,\theta_0) \tag{20}$$

$$error = \hat y(x,\theta_0) - y = hypothesis - y \tag{21}$$

$$gradient = (\hat y(x,\theta_0) - y)\cdot x = (error \cdot x)/m \tag{22}$$

$$\theta = \theta_0 - \alpha\cdot gradient \tag{23}$$

The calculation of the gradient is where the three members of the gradient descent family differ.

Batch gradient descent (BGD): the gradient computed in each iteration uses the whole data set ($m=M$); in other words, every gradient computation uses all data points and then takes the mean.

Stochastic gradient descent (SGD): the gradient computed in each iteration uses a single data point ($m=1$) randomly selected from the whole data set.

Mini-batch gradient descent (MBGD): the gradient computed in each iteration uses a small batch of data ($1 < m < M$) selected from the whole data set.

Following formulas (20), (21), (22), and (23), we can write the code implementing batch gradient descent:

import numpy as np

def batchGradientDescent(x, y, theta, alpha, m, maxInteration):
    """x: input data; y: labels; theta: vector of w and b; alpha: learning rate;
    m: number of samples; maxInteration: maximum number of iterations"""
    x_train = x.transpose()  # transpose of the input matrix
    for i in range(0, maxInteration):
        # prediction
        hypothesis = np.dot(x, theta)
        # prediction error
        error = hypothesis - y
        # gradient, formula (22)
        gradient = np.dot(x_train, error) / m
        # update theta, formula (23)
        theta = theta - alpha * gradient
    return theta
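A usage sketch for batchGradientDescent on synthetic linear data; the true weights [2, -3], the data generation, and the settings below are assumptions made for the demo:

import numpy as np

# synthetic data whose answer is known in advance (illustrative)
np.random.seed(0)
m = 100
X = np.random.rand(m, 2)
y = X.dot(np.array([2.0, -3.0]))   # noise-free targets

theta = batchGradientDescent(X, y, np.zeros(2), alpha=0.5, m=m, maxInteration=2000)
print(theta)   # should approach [2, -3]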

Similarly, the stochastic gradient descent code is:

import random
import numpy as np

def stochasticGradientDescent(x, y, theta, alpha, maxInteration):
    """x: input data; y: labels; theta: vector of w and b; alpha: learning rate;
    maxInteration: maximum number of iterations"""
    # index list over all samples
    data = list(range(len(y)))
    for i in range(0, maxInteration):
        # prediction
        hypothesis = np.dot(x, theta)
        # prediction error
        error = hypothesis - y
        # pick one random sample
        index = random.sample(data, 1)  # select one random index from the list data
        index1 = index[0]
        # gradient from that single sample
        gradient = error[index1] * x[index1]
        # update theta
        theta = theta - alpha * gradient
    return theta

Mini-batch gradient descent:

def miniBatchGradientDescent(x, y, theta, alpha, m, batch_size, epochs):
    """x: input data; y: labels; theta: vector of w and b; alpha: learning rate;
    m: number of samples in the data set; batch_size: number of samples per batch;
    epochs: number of passes over the data set"""
    for epoch in range(epochs):
        # generate the index list
        index_all = np.arange(m)
        # iterate batch by batch
        for batch in range(m // batch_size):
            # indices of the current batch
            index_list = index_all[batch*batch_size : batch*batch_size+batch_size]
            # get the batch data
            x_batch = x[index_list]
            y_batch = y[index_list]
            # prediction
            hypothesis = np.dot(x_batch, theta)
            # prediction error
            error = hypothesis - y_batch
            # gradient averaged over the batch
            gradient = np.dot(x_batch.T, error) / batch_size
            # update theta
            theta = theta - alpha * gradient
    return theta

5. Summary and other explanations

Gradient descent operation steps:

  1. Initialize the weights and biases with random values

  2. Pass the input into the network and get the output value (predicted value)

  3. Calculate the error between the predicted value and the true value (label value)

  4. For each error-generating neuron, adjust the corresponding value to reduce the error

  5. Repeat the iterations until the optimal values of the network weights and biases are obtained

Batch gradient descent (BGD): each iteration computes the gradient using the entire data set. Every update moves in the right direction, and convergence to an extreme point is guaranteed: convex functions converge to the global extremum, while non-convex functions may converge to a local extremum. The drawbacks are that training takes too long and consumes a lot of memory.

Stochastic gradient descent (SGD): each iteration computes the gradient from one data point randomly selected from the whole data set, so each iteration is very fast. But convergence is oscillating and unstable: the iterate fluctuates near the optimal solution, and it is hard to judge whether it has converged.

Mini-batch gradient descent (MBGD): a compromise between BGD and SGD. BGD uses the whole data set each time and converges too slowly, while SGD uses only one data point each time and, although it converges quickly, oscillates severely. MBGD is the compromise: it uses n data points each time, converges faster and more stably than SGD, does not oscillate much near the optimal solution, and can even reach a better solution than BGD.

Choosing the batch size as a power of 2 makes full use of matrix operations, so a good value is usually selected among powers of 2, for example 16, 32, 64, 128, 256, and so on.

In addition, there is also mini-batch stochastic gradient descent: the same as mini-batch gradient descent, except that batches are not taken in the original input order; the data indices are shuffled first, and batches are then selected. The code is as follows:

def mini_batch_stochastic_gradient_descent(x, y, theta, alpha, m, batch_size, epochs):
    """x: input data; y: labels; theta: vector of w and b; alpha: learning rate;
    m: number of samples in the data set; batch_size: number of samples per batch;
    epochs: number of passes over the data set"""
    for epoch in range(epochs):
        # generate the index list
        data_index = np.arange(m)
        # shuffle the sample order
        np.random.shuffle(data_index)
        # iterate batch by batch
        for batch in range(m // batch_size):
            # indices of the current batch
            batch_index = data_index[batch*batch_size : batch*batch_size+batch_size]
            # get the batch data
            x_batch = x[batch_index]
            y_batch = y[batch_index]
            # prediction
            hypothesis = np.dot(x_batch, theta)
            # prediction error
            error = hypothesis - y_batch
            # gradient averaged over the batch
            gradient = np.dot(x_batch.T, error) / batch_size
            # update theta
            theta = theta - alpha * gradient
    return theta
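A usage sketch for mini_batch_stochastic_gradient_descent with the same kind of synthetic data as above; the data and settings are again assumptions made for the demo:

import numpy as np

np.random.seed(1)
m = 256
X = np.random.rand(m, 2)
y = X.dot(np.array([2.0, -3.0]))      # illustrative noise-free targets

theta = mini_batch_stochastic_gradient_descent(
    X, y, np.zeros(2), alpha=0.5, m=m, batch_size=32, epochs=200)
print(theta)   # should approach [2, -3]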
