Gradient descent

We have a way to measure how well a hypothesis function fits the data. Now we need to estimate the parameters of the hypothesis function, and that is where gradient descent comes in. Imagine that we plot the cost as a function of the parameters θ0 and θ1. We are not plotting x and y per se, but rather the range of parameters of our hypothesis function and the cost of choosing a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. Each point on the chart is the value of the cost function for our hypothesis with those particular θ parameters. The following figure illustrates this setup. When the cost function sits at the bottom of the pit in the graph, that is, when its value is at its minimum, we know we have succeeded. The red arrow marks this lowest point in the diagram.
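As a rough sketch of how such a surface can be computed (the training data here is made up purely for illustration), the snippet below evaluates the squared error cost on a grid of (θ0, θ1) values:

    import numpy as np

    # Toy training data (made up for illustration)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 2.5, 3.5, 4.0])
    m = len(x)

    # Candidate parameter values for the two horizontal axes
    theta0_vals = np.linspace(-10, 10, 100)
    theta1_vals = np.linspace(-1, 4, 100)

    # J[i, j] is the cost of choosing (theta0_vals[i], theta1_vals[j])
    J = np.zeros((len(theta0_vals), len(theta1_vals)))
    for i, t0 in enumerate(theta0_vals):
        for j, t1 in enumerate(theta1_vals):
            h = t0 + t1 * x                           # hypothesis h(x)
            J[i, j] = np.sum((h - y) ** 2) / (2 * m)  # squared error cost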

The way we do this is by taking the derivative of our cost function. The slope of the tangent line at a point is the derivative at that point, and it gives us a direction to move in: we step down the cost function in the direction of steepest descent. The size of each step is controlled by the parameter α, called the learning rate.

The gradient descent algorithm is defined as follows:

    repeat until convergence {
        θj := θj − α · (∂/∂θj) J(θ0, θ1)    (simultaneously for j = 0 and j = 1)
    }

We keep repeating this update to the parameter θj until the algorithm converges, where:

  • := denotes assignment: the value on the right is computed and assigned to the parameter on the left
  • α is the learning rate; it controls how large a step we take down the hill
  • (∂/∂θj) J(θ0, θ1) is the derivative term; it gives the direction of the step

The tricky part of implementing gradient descent is that the parameters must be updated simultaneously: at each iteration, the new values of both θ0 and θ1 are computed from the old values before either is overwritten. The following image further illustrates the correct process; by contrast, an implementation that updates θ0 first and then uses the new θ0 to update θ1 is incorrect, because the updates are not synchronized.
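A minimal sketch of the difference in code, using a made-up cost function whose partial derivatives are dJ_dtheta0 and dJ_dtheta1 (hypothetical helpers defined only so the snippet runs):

    # Partial derivatives of a toy cost J(t0, t1) = (t0 - 1)**2 + (t1 - 2)**2,
    # defined here only so the example is self-contained.
    def dJ_dtheta0(t0, t1):
        return 2 * (t0 - 1)

    def dJ_dtheta1(t0, t1):
        return 2 * (t1 - 2)

    alpha = 0.1
    theta0, theta1 = 0.0, 0.0

    # Correct: compute both new values from the OLD parameters,
    # then assign them together (a simultaneous update).
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0, theta1 = temp0, temp1

    # Incorrect (shown commented out): theta0 is overwritten before theta1
    # is updated, so the second derivative sees a mix of new and old values.
    # theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    # theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)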

Gradient descent intuition

When the cost function J has only a single parameter θ1, the differential term becomes (d/dθ1) J(θ1), and the update rule is:

    θ1 := θ1 − α · (d/dθ1) J(θ1)

Gradient descent initializes θ1 at some point, as shown in the figure below. The figure also shows that θ1 increases when the slope is negative and decreases when the slope is positive, so each update moves θ1 toward the minimum.

As a side note, we should choose the learning rate α so that gradient descent converges in a reasonable amount of time: failing to converge, or taking too long to reach a minimum, means our step size is wrong, as shown below.

There is also no need to decrease α over time. As we approach a local minimum, the derivative gets smaller in magnitude (at the local minimum itself it is exactly zero), so gradient descent automatically takes smaller steps. That is simply how gradient descent works, so there is really no need to decrease α any further.

This also explains why gradient descent can reach a local optimum even though the learning rate α is kept fixed.

To spell out the intuition: suppose there is a cost function J of θ1 that I want to minimize, and I initialize the algorithm at some point. Each time gradient descent takes a step forward, it lands at a new point that is less steep, because the closer it gets to the minimum, the closer the derivative is to zero. So as I approach the optimum, the derivative gets a little smaller with every step, and the next step is naturally a little smaller, until the algorithm settles at the minimum. This is the gradient descent algorithm, and you can use it to minimize any cost function J.
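A small sketch of this behavior with a made-up one-parameter cost J(θ1) = θ1², whose derivative is 2·θ1: even with a fixed α, the printed step sizes shrink on their own as θ1 approaches the minimum at zero:

    alpha = 0.1
    theta1 = 5.0

    def dJ(t1):
        # Derivative of the toy cost J(theta1) = theta1 ** 2
        return 2 * t1

    for i in range(10):
        step = alpha * dJ(theta1)
        theta1 -= step
        print(f"iteration {i}: theta1 = {theta1:.4f}, step = {step:.4f}")

    # The step shrinks every iteration even though alpha never changes,
    # because the derivative itself approaches zero near the minimum.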

Linear regression gradient descent

We now have the three pieces discussed earlier: the linear regression model with its linear hypothesis, the squared error cost function, and the gradient descent algorithm:

    Hypothesis:     h(x) = θ0 + θ1 · x
    Cost function:  J(θ0, θ1) = (1 / (2m)) · Σ (h(x^(i)) − y^(i))²    (sum over i = 1, …, m)
    Update rule:    θj := θj − α · (∂/∂θj) J(θ0, θ1)

Next, we use gradient descent to minimize the squared error cost function, so we need to work out the partial derivative terms that appear in the update rule. The calculation gives:

    (∂/∂θ0) J(θ0, θ1) = (1/m) · Σ (h(x^(i)) − y^(i))
    (∂/∂θ1) J(θ0, θ1) = (1/m) · Σ (h(x^(i)) − y^(i)) · x^(i)

These derivative terms are simply the slopes of the cost function J. Substituting them back into the gradient descent method gives gradient descent for linear regression, which repeats the updates in braces until convergence:

    repeat until convergence {
        θ0 := θ0 − α · (1/m) · Σ (h(x^(i)) − y^(i))
        θ1 := θ1 − α · (1/m) · Σ (h(x^(i)) − y^(i)) · x^(i)
    }

So this is just gradient descent applied to the original cost function J. Because the sums run over all m training examples, every step of this method looks at the entire batch of training data; for that reason it is called batch gradient descent.

Note that although gradient descent is in general susceptible to local minima, the linear regression problem posed here has a single global optimum and no other local optima, so gradient descent always converges to the global minimum (assuming the learning rate α is not too large). In fact, J is a convex quadratic function. The figure below shows an example of gradient descent run to minimize such a quadratic function: the ellipses are the contours of the quadratic function, and the x marks (connected by straight lines) trace the trajectory of gradient descent, initialized at (48, 30), through the successive values of θ as it converges to the minimum.
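To make the whole procedure concrete, here is a minimal sketch of batch gradient descent for linear regression; the training data, learning rate, and iteration count are arbitrary choices for illustration:

    import numpy as np

    # Toy training set (made up for illustration)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.5, 2.1, 2.9, 4.2, 5.1])
    m = len(x)

    alpha = 0.01                       # learning rate
    theta0, theta1 = 0.0, 0.0

    for _ in range(2000):
        h = theta0 + theta1 * x        # hypothesis on all m examples
        error = h - y
        # Batch gradient descent: each step sums over the entire training set
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        # Simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

    print(theta0, theta1)              # approaches the least-squares fit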