Gradient descent is an iterative optimization algorithm for finding a minimum of a function. To find a local minimum, the procedure starts at a random point and repeatedly moves in the direction opposite to the gradient (or an approximate gradient) of the function at the current point.

Gradient descent can be used to find optimal parameters for linear regression and logistic regression; we will deal with SVMs and neural networks later. In many models, such as logistic regression or SVM, the optimization criterion is convex. A convex function has only one minimum, the global one. In contrast, the optimization criteria of neural networks are non-convex. However, even finding a local minimum is sufficient for many practical problems.

Let’s see how gradient descent works.

Gradient descent

In this section, we show how to solve a linear regression problem using gradient descent [1]. We use Python code to illustrate the description, and we also use diagrams to show how the solution changes over several iterations of gradient descent. Here, we have a dataset with a single feature. Even so, the optimization criterion still has two parameters: w and b. Extending to multi-dimensional training data is straightforward: for two-dimensional data we have w(1), w(2), and b; for three-dimensional data we have w(1), w(2), w(3), and b; and so on.

For a concrete example, we use a real-world dataset (available in the wiki of this book). The data includes two columns: how much each company spends on radio advertising per year, and how many units it sells per year. We want to build a regression model that predicts unit sales from a company's radio advertising spending. Each row in the dataset represents one company.

We have 200 companies, so we have 200 training examples of the form (x_i, y_i) = (spending, sales). The full sample is plotted in Figure 4.1.

FIG. 4.1 Raw data

Note: The y-axis corresponds to unit sales (the quantity we want to predict) and the x-axis corresponds to our feature, the amount spent on radio advertising (in millions of dollars).

To recap, the linear regression model has the form f(x) = wx + b. We do not know the optimal values of w and b, so we have to learn them from the data. Specifically, we look for the values of w and b that minimize the mean squared error:

l = (1/N) Σ_{i=1..N} (y_i − (wx_i + b))²

Gradient descent begins by calculating the partial derivative of the criterion with respect to each parameter:

∂l/∂w = (1/N) Σ_{i=1..N} −2x_i(y_i − (wx_i + b))
∂l/∂b = (1/N) Σ_{i=1..N} −2(y_i − (wx_i + b))

(4.1)

To find the partial derivative of (y_i − (wx_i + b))² with respect to w, we use the chain rule. Here, f = f₂(f₁) is a composite function, where f₁ = y_i − (wx_i + b) and f₂ = f₁². To take the partial derivative of f with respect to w, we first take the partial derivative of f with respect to f₁, which equals 2(y_i − (wx_i + b)), and then multiply by the partial derivative of y_i − (wx_i + b) with respect to w, which equals −x_i. Putting it all together gives ∂l/∂w = (1/N) Σ_{i=1..N} −2x_i(y_i − (wx_i + b)). Proceeding the same way for b gives ∂l/∂b = (1/N) Σ_{i=1..N} −2(y_i − (wx_i + b)).

Gradient descent proceeds in epochs. In each epoch, the entire training set is used to update every parameter. In the first epoch, we initialize [2] the parameters w ← 0 and b ← 0; the partial derivatives ∂l/∂w and ∂l/∂b are then equal to (1/N) Σ_{i=1..N} −2x_i y_i and (1/N) Σ_{i=1..N} −2y_i, respectively. At each epoch, we update w and b using the partial derivatives. The size of the update is controlled by the learning rate α.

w ← w − α (∂l/∂w)
b ← b − α (∂l/∂b)

(4.2)

We subtract (rather than add) the partial derivative from the parameter value because the derivative indicates how fast the function grows. If the derivative is positive at some point [3], the function is increasing there. Because we want to minimize the objective function, a positive derivative means the parameter should move in the opposite direction (to the left along the axis). When the derivative is negative (the function is decreasing), the parameter should move to the right, where the function decreases further.
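This sign rule is easy to check on a one-dimensional example; the function f(x) = x² and the starting points below are illustrative choices, not from the book:

```python
def f(x):
    return x ** 2          # convex, with its minimum at x = 0

def df(x):
    return 2 * x           # derivative of x^2

alpha = 0.1
x = 3.0                    # derivative is positive here, so the step moves left
x_next = x - alpha * df(x)
print(x_next)              # 3.0 - 0.1 * 6.0 = 2.4, closer to the minimum

x = -3.0                   # derivative is negative here, so the step moves right
x_next = x - alpha * df(x)
print(x_next)              # -3.0 - 0.1 * (-6.0) = -2.4, again closer to 0
```

In both cases the subtraction moves the parameter toward the minimum, which is exactly why Equation 4.2 subtracts the scaled gradient.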

In the next epoch, we recompute the partial derivatives using Equation 4.1 and the new values of w and b, and repeat until convergence. Typically many epochs are needed; once the values of w and b stop changing much from one epoch to the next, we can stop.

It’s hard to imagine a machine learning engineer who doesn’t like the Python programming language at all. So if readers are still waiting for the right time to start learning the language, now is the time. Let’s take a look at how gradient descent is implemented in Python.

In each epoch, the parameters w and b are updated using the gradients computed over the entire training set.
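As a sketch consistent with Equation 4.1 (the parameter names spendings and sales, matching the dataset's two columns, are assumptions), the update function can be written as:

```python
def update_w_and_b(spendings, sales, w, b, alpha):
    """One epoch of gradient descent for the model f(x) = wx + b."""
    dl_dw = 0.0
    dl_db = 0.0
    N = len(spendings)
    # Accumulate the partial derivatives from Equation 4.1.
    for i in range(N):
        dl_dw += -2 * spendings[i] * (sales[i] - (w * spendings[i] + b))
        dl_db += -2 * (sales[i] - (w * spendings[i] + b))
    # Move each parameter in the direction opposite to its gradient.
    w = w - (1 / float(N)) * dl_dw * alpha
    b = b - (1 / float(N)) * dl_db * alpha
    return w, b
```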

A train function then repeats this update for many epochs in a for loop, periodically logging the mean squared error, which is computed by an avg_loss helper.
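Again as a sketch rather than the book's exact listing (the logging interval of 400 epochs is an assumption, and update_w_and_b is repeated here so the snippet runs on its own):

```python
def update_w_and_b(spendings, sales, w, b, alpha):
    # One gradient descent epoch, as defined earlier.
    dl_dw, dl_db, N = 0.0, 0.0, len(spendings)
    for i in range(N):
        dl_dw += -2 * spendings[i] * (sales[i] - (w * spendings[i] + b))
        dl_db += -2 * (sales[i] - (w * spendings[i] + b))
    return w - (1 / float(N)) * dl_dw * alpha, b - (1 / float(N)) * dl_db * alpha

def avg_loss(spendings, sales, w, b):
    """Mean squared error of the model f(x) = wx + b on the data."""
    N = len(spendings)
    total_error = 0.0
    for i in range(N):
        total_error += (sales[i] - (w * spendings[i] + b)) ** 2
    return total_error / float(N)

def train(spendings, sales, w, b, alpha, epochs):
    """Run gradient descent for the given number of epochs."""
    for e in range(epochs):
        w, b = update_w_and_b(spendings, sales, w, b, alpha)
        if e % 400 == 0:  # log progress every 400 epochs (arbitrary choice)
            print("epoch:", e, "loss:", avg_loss(spendings, sales, w, b))
    return w, b
```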

If we set α = 0.001, initialize w = 0 and b = 0, and run the train function for 1,500 epochs, we can follow the loss values it prints at the logged epochs.

We can see that the average loss decreases as the train function runs through the epochs. The evolution of the regression line over the training epochs is shown in Figure 4.2.

FIG. 4.2 The regression line after successive epochs of gradient descent

Finally, once we have found good values of the parameters w and b, making a prediction requires only the model itself, f(x) = wx + b.
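A minimal version of that prediction function (the name predict matches how it is referred to here; the signature is an assumption):

```python
def predict(x, w, b):
    """Predict unit sales for radio spending x with learned parameters."""
    return w * x + b
```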

Calling predict with the learned parameters on the example spending value used in the book, the output should be 13.97.
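The original dataset's values are not reproduced here, so the following end-to-end sketch runs the same pipeline on invented data (the slope 1.9, intercept 0.5, and noise-free points are purely illustrative):

```python
def update_w_and_b(spendings, sales, w, b, alpha):
    # One gradient descent epoch for f(x) = wx + b (Equation 4.1).
    dl_dw, dl_db, N = 0.0, 0.0, len(spendings)
    for i in range(N):
        dl_dw += -2 * spendings[i] * (sales[i] - (w * spendings[i] + b))
        dl_db += -2 * (sales[i] - (w * spendings[i] + b))
    return w - (1 / float(N)) * dl_dw * alpha, b - (1 / float(N)) * dl_db * alpha

def predict(x, w, b):
    return w * x + b

# Invented data following sales = 1.9 * spending + 0.5.
spendings = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [1.9 * x + 0.5 for x in spendings]

w, b = 0.0, 0.0
for _ in range(20000):
    w, b = update_w_and_b(spendings, sales, w, b, alpha=0.01)

print(round(predict(6.0, w, b), 2))  # close to 1.9 * 6 + 0.5 = 11.9
```

With enough epochs, the learned w and b recover the slope and intercept used to generate the data, and predictions on new spending values are accurate.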

Gradient descent is sensitive to the choice of the learning rate α, and it is also slow on large datasets. Fortunately, computer scientists have come up with several important improvements to the original algorithm.

Minibatch stochastic gradient descent (minibatch SGD) is a modified version that speeds up the computation by approximating the gradient using small batches (subsets) of the training samples. SGD itself has many "upgrades." Adagrad, for example, adapts the learning rate α separately for each parameter based on the history of its gradients: α is decreased for parameters with consistently large gradients and increased for parameters with small ones. Momentum is a method that accelerates SGD by steering it in the relevant direction and damping oscillations. SGD variants such as RMSprop and Adam are also commonly used when training neural networks.
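A minimal minibatch SGD sketch for the same linear model (the batch size of 2 and the per-epoch shuffling scheme are illustrative choices, not prescribed by the text):

```python
import random

def minibatch_sgd(xs, ys, alpha=0.01, epochs=500, batch_size=2, seed=0):
    """Fit f(x) = wx + b by SGD over shuffled minibatches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the samples in a new order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            n = len(batch)
            # Gradient of the mean squared error on this batch only.
            dl_dw = sum(-2 * xs[i] * (ys[i] - (w * xs[i] + b)) for i in batch) / n
            dl_db = sum(-2 * (ys[i] - (w * xs[i] + b)) for i in batch) / n
            w -= alpha * dl_dw
            b -= alpha * dl_db
    return w, b
```

Because each update looks at only a few examples, many more updates fit into the same amount of computation, which is what makes SGD attractive on large datasets.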

It is important to note that gradient descent and its variants are not machine learning algorithms. They are simply solvers of minimization problems, provided that the function being minimized has a gradient (at most points of its domain).

This article is excerpted from Introduction to Machine Learning

 

This book teaches the essential knowledge and skills of machine learning in short chapters and concise language. The book consists of 11 chapters and a glossary, covering basic concepts, notation and definitions, algorithms, basic practice, neural networks and deep learning, problems and solutions, advanced techniques, unsupervised learning, and other forms of learning. It addresses many core concepts and methods, including supervised and unsupervised learning, support vector machines, neural networks, ensemble learning, gradient descent, cluster analysis, dimensionality reduction, autoencoders, transfer learning, reinforcement learning, feature engineering, and hyperparameter tuning. A detailed glossary of terms is given at the end of the book.

This book will help readers understand how machine learning works and provide a foundation for further understanding and research into the complex problems in the field. This book is suitable for software practitioners who want to learn and master machine learning, data scientists who want to apply machine learning techniques, and general readers who want to learn about machine learning.