How the gradient descent algorithm works in machine learning

Author | NIKIL_REDDY Compiled by | VK Source | Analytics Vidhya

Introduction

The gradient descent algorithm is one of the most commonly used machine learning algorithms in industry, but it confuses many newcomers.

If you’re new to machine learning, the math behind gradient descent isn’t easy. In this article, my goal is to help you understand the intuition behind gradient descent.

We will quickly cover the role of the cost function, how gradient descent works, and how to choose the learning rate.

What is the cost function

A cost function measures the performance of a model on given data. It quantifies the error between the predicted values and the expected values and expresses it as a single real number.

After choosing initial values for the parameters, we calculate the cost function. To reduce it, the gradient descent algorithm is used to update the parameters. Here's how it looks mathematically:

![Cost function formula](qiniu.aihubs.net/90857Screen… (41)_LI.jpg)
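For illustration, here is a minimal sketch of one common choice of cost function, the mean squared error used in linear regression (the exact formula in the figure above may differ; the variable names here are assumptions for illustration):

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error cost for a linear model with predictions X @ theta."""
    predictions = X @ theta          # predicted values for the current parameters
    errors = predictions - y         # difference between predicted and expected values
    return np.mean(errors ** 2) / 2  # a single real number summarizing the error
```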

What is gradient descent

Suppose you're playing a game in which the player starts at the top of a mountain and is asked to reach a lake at the mountain's lowest point. To make it harder, the player is blindfolded. So, how do you think they can get to the lake?

Before you read on, take a moment to think about it.

The best way is to feel the ground nearby and find the direction in which it slopes downward. From that position, take a step downhill and repeat the process until you reach the lowest point.

![Descending a mountain](qiniu.aihubs.net/70205gd mountain.jpg)

Gradient descent is an iterative optimization algorithm for finding a local minimum of a function.

To find a local minimum of a function with gradient descent, we take steps in the direction of the negative gradient (opposite to the gradient) of the function at the current point. If we instead step in the direction of the gradient, we approach a local maximum of the function; that procedure is called gradient ascent.

Gradient descent was first proposed by Cauchy in 1847. It is also known as steepest descent.

The goal of the gradient descent algorithm is to minimize a given function (such as the cost function). To achieve this, it iteratively performs two steps:

  1. Compute the gradient, i.e., the first derivative of the function at the current point

  2. Take a step in the direction opposite to the gradient

![Gradient descent update step](qiniu.aihubs.net/36152Screen… (43).png)

Alpha is called the learning rate, a tuning parameter in the optimization process. It determines the step size.
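As a sketch of these two steps in Python (the toy function and its gradient below are assumptions chosen only to illustrate the update rule, not the original post's example):

```python
def gradient_descent_step(theta, grad_fn, alpha):
    grad = grad_fn(theta)        # step 1: gradient (first derivative) at the current point
    return theta - alpha * grad  # step 2: step in the direction opposite to the gradient

# Toy example: minimize f(theta) = (theta - 3) ** 2, whose gradient is 2 * (theta - 3).
theta = 0.0
for _ in range(50):
    theta = gradient_descent_step(theta, lambda t: 2 * (t - 3), alpha=0.1)
print(theta)  # close to 3, the minimizer
```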

Plotting the gradient descent algorithm

When we have a single parameter (θ), we can plot the cost on the Y-axis and θ on the X-axis. If we have two parameters, we can make a three-dimensional plot with the cost on one axis and the two parameters on the other two axes.

It can also be visualized using contour lines. A contour plot shows the three-dimensional surface in two dimensions, with the parameters along both axes and the cost represented by the contour lines. The value of the cost increases as the contours move away from the center.
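A rough sketch of such a contour plot with matplotlib, using an assumed toy quadratic cost over two parameters (not the cost surface from the original post):

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed toy cost surface over two parameters (theta0, theta1).
theta0 = np.linspace(-4, 4, 200)
theta1 = np.linspace(-4, 4, 200)
T0, T1 = np.meshgrid(theta0, theta1)
cost = T0 ** 2 + 3 * T1 ** 2

plt.contour(T0, T1, cost, levels=20)  # contour lines of the cost
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.title("Cost surface viewed as contours")
plt.show()
```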

Alpha, the learning rate

We now have a direction to move in; next we must decide the size of the step to take.

The learning rate must be chosen carefully to reach the local minimum.

  • If the learning rate is too high, we may overshoot the minimum and never reach it

  • If the learning rate is too low, training may take too long

A) The learning rate is optimal and the model converges to the minimum

B) The learning rate is too small; it takes more time but still converges to the minimum

C) The learning rate is higher than the optimal value; it overshoots but still converges (1/C < η < 2/C)

D) The learning rate is very large; it overshoots, moves away from the minimum, and learning performance degrades

Note: As we move towards the local minimum, the gradient becomes smaller, so the step size decreases as well. As a result, the learning rate (alpha) can remain constant throughout the optimization and does not need to be changed iteratively.
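The following toy demo (an assumed example for illustration, not from the original post) minimizes f(x) = x² with different constant learning rates and roughly reproduces the behaviours described in cases A to D:

```python
def run(alpha, x=5.0, steps=25):
    """Gradient descent on f(x) = x**2 (gradient 2x) with a constant learning rate."""
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

for alpha in (0.1, 0.01, 0.9, 1.1):
    print(f"alpha={alpha}: x after 25 steps = {run(alpha):.4f}")
# alpha=0.1 converges quickly; alpha=0.01 converges slowly;
# alpha=0.9 oscillates but converges; alpha=1.1 diverges away from the minimum.
```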

Local minimum

A cost function can have many local minima. Gradient descent can land in any of them, depending on the initial point (that is, the initial parameters θ) and the learning rate. Therefore, the optimization can converge to different points for different starting points and learning rates.
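A small demo (an assumed toy function, not from the original post) showing that different starting points can lead gradient descent to different minima of the same function:

```python
def minimize(x, steps=200, alpha=0.05):
    """Gradient descent on f(x) = x**4 - 2*x**2, which has minima at x = -1 and x = +1."""
    for _ in range(steps):
        grad = 4 * x ** 3 - 4 * x  # derivative of the toy cost
        x = x - alpha * grad
    return x

print(minimize(x=2.0))   # converges to roughly +1
print(minimize(x=-2.0))  # converges to roughly -1
```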

Gradient descent Python code implementation
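Below is a minimal, self-contained sketch of batch gradient descent for simple linear regression; the synthetic data, variable names, and hyperparameters are assumptions for illustration rather than the original post's code:

```python
import numpy as np

# Synthetic data: y roughly follows 4 + 3 * x plus noise (assumed for illustration).
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(0, 0.5, 100)

X = np.column_stack([np.ones_like(x), x])  # add a bias column
theta = np.zeros(2)                        # initial parameters
alpha = 0.1                                # learning rate
m = len(y)

for _ in range(1000):
    predictions = X @ theta
    gradient = (X.T @ (predictions - y)) / m  # gradient of the MSE cost
    theta = theta - alpha * gradient          # step opposite to the gradient

print(theta)  # approximately [4, 3]
```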

End notes

Once we tune the learning rate (alpha) and find a good value for it, we iterate until we converge to the local minimum.

Original link: www.analyticsvidhya.com/blog/2020/1…

Welcome to panchuangai blog: panchuang.net/

Sklearn123.com/

Welcome to docs.panchuang.net/