Machine learning has become increasingly popular and is gradually being applied in all kinds of industries. How do you learn machine learning as a programmer? In this series of articles, I will walk you through the basics of machine learning based on my own learning experience.

I was originally an embedded development engineer. I began teaching myself machine learning in 2018, and I now work mainly on computer vision with deep learning. When I was first introduced to machine learning, I also struggled with how to get started. After searching through a lot of material, I found the Machine Learning course on Coursera. After finishing the whole course, I finally grasped the basics of machine learning, which laid a foundation for learning deep learning later on. So in this series of articles, I will walk you through what machine learning is, based on that course's lessons and exercises. Today we're going to talk about linear regression. The goal of this article is to give everyone a preliminary understanding of linear regression, so I will not go into the derivation of the underlying principles here.

1. Introduction to linear regression

What is linear regression? Let’s use housing price data as an example to describe what linear regression is. Take a look at the data in the table below.

Area (㎡)    Price (ten thousand yuan)
210          460
141          232
153          315
85           178
…            …

By observing the data above and applying everyday common sense, we can conclude that there is some relationship between area and price. We are not yet sure what that relationship is, but we can first express it with the formula:

H = \theta_0 + \theta_1 x    (1)

Where H is the price and x is the area. So if we determine θ_0 and θ_1, we can determine the relationship between area and price. H and x are both known data from the table, and a large amount of such data can be used to determine these two values, so that the relationship between area and price can be represented by a linear function. The process of determining these two values is called linear regression.
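
For readers who prefer code, Formula 1 is just a one-line function. Below is a minimal Python sketch; the function name predict_price and the parameter values are my own illustrative choices, roughly eyeballed from the table rather than learned from it.

```python
def predict_price(area, theta0, theta1):
    """Formula 1: H = theta_0 + theta_1 * x, where x is the area in square metres."""
    return theta0 + theta1 * area

# Parameters guessed by eye from the table (not yet trained):
print(predict_price(210, 0.0, 2.2))   # about 462, close to the listed 460 (ten thousand yuan)
```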

We also know that housing prices are related not only to area but possibly to many other factors as well. In the table below, several other features have been added besides area.

Area (㎡)    Bedrooms    Floors    Age of home (years)    Price (ten thousand yuan)
210          5           1         45                     460
141          3           2         40                     232
153          3           2         30                     315
85           2           1         36                     178

At this point, Formula 1 is updated to the following formula:

H = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n    (2)

If the values of θ_0 to θ_n are initialized to random values, the house price H calculated by Formula 2 will differ from the real value. If we adjust the parameters θ_0 to θ_n so that H gets as close as possible to the real price, then we can determine the relationship between the features and the price, so that the approximate price of a house can be predicted from its features.
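
With several features, Formula 2 is most naturally computed as a dot product. Here is a small illustrative sketch; the feature values come from the first row of the table above, while the parameter vector theta is a placeholder, not a trained result.

```python
import numpy as np

# One sample from the table: area, bedrooms, floors, age
x = np.array([1.0, 210.0, 5.0, 1.0, 45.0])      # leading 1 multiplies theta_0
theta = np.array([0.5, 2.0, 3.0, 1.0, -0.2])    # placeholder values, not yet learned

# Formula 2: H = theta_0 + theta_1*x_1 + ... + theta_n*x_n
H = x @ theta
print(H)   # a price prediction in ten thousand yuan; meaningless until theta is trained
```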

2. Loss function

Since there is a certain error between the housing price H calculated by Formula 2 and the real value, we first need a formula to describe this error. It is called the loss function and is represented by J, defined as follows:

J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2    (3)

In Formula 3 above, θ denotes the unknown parameters to be solved for, m is the number of samples, and x and y are the known quantities from the table above. So how do we calculate θ? Here we use gradient descent, a method commonly used in machine learning.
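
As a quick illustration, Formula 3 can be computed in a few lines of NumPy. This is only a sketch; the names compute_cost, X, y, and theta are mine, and X is assumed to already contain a leading column of ones so that theta[0] plays the role of θ_0.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Loss function J(theta) from Formula 3 (squared error with a 1/(2m) factor)."""
    m = len(y)                      # number of samples
    h = X @ theta                   # predictions h(x^(i)) for every sample i
    return np.sum((h - y) ** 2) / (2 * m)
```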

3. Gradient descent

Gradient descent reduces the loss function by iteratively updating the value of θ as follows:

repeat {

    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n)    (4)

}

In Formula 4, α represents the learning rate, which is usually a real number less than 1, such as 0.1 or 0.01, and needs to be adjusted according to the actual situation. θ_j represents the parameter to be updated in this iteration. The term ∂J/∂θ_j is the partial derivative of J with respect to θ_j, that is, the gradient at the current value of θ_j. The gradient is a vector, so it has a direction: it points in the direction in which J changes fastest, and its magnitude is largest in that direction. Put simply, it measures how much the current value of θ_j contributes to the error. So θ_j can be updated iteratively by subtracting the learning rate times this gradient from the current θ_j. The updated θ_j produces a smaller loss than before. After a number of iterations, the loss function converges to a small value and the iteration can be stopped. This process is also called training. Note that this is only a brief introduction to gradient descent, that is, the training process. In practice there are still many issues to consider, such as local optima, over-fitting, and under-fitting, but these all have solutions, which will be covered in later articles.
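
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression, i.e. Formula 4 applied with the gradient of the loss in Formula 3. The function name and the default values of alpha and num_iters are my own choices for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Fit linear regression parameters with batch gradient descent.

    X: (m, n) feature matrix, y: (m,) vector of target values.
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a column of ones for theta_0
    theta = np.zeros(Xb.shape[1])          # initialize all parameters to zero

    for _ in range(num_iters):
        h = Xb @ theta                     # current predictions
        grad = Xb.T @ (h - y) / m          # partial derivatives of J w.r.t. each theta_j
        theta -= alpha * grad              # Formula 4: update all theta_j simultaneously
    return theta
```

In practice the learning rate usually has to be tuned for this loop to converge, but the structure of the update is exactly the one in Formula 4.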

Let me give you a simple example of gradient descent to help you understand. Suppose we have a slightly simplified loss function as follows:

J = \theta^2    (5)

It’s easy to see that this is a parabola opening upward, and it has only one lowest point, J = 0, which is reached when θ is 0.

  • First, assume that θ has an initial value of 1.

    \theta^0 = 1, \qquad J = (\theta^0)^2 = 1

  • Then do a gradient descent iteration, where we choose the learning rate α = 0.4.


    \theta^1 = \theta^0 - \alpha \frac{\partial}{\partial \theta^0} J

    \theta^1 = 1 - 0.4 \times 2\theta^0 = 1 - 0.8 = 0.2

    J = (\theta^1)^2 = 0.04

  • Continue iterating with the updated value of θ^1.


    \theta^2 = \theta^1 - \alpha \frac{\partial}{\partial \theta^1} J

    \theta^2 = 0.2 - 0.4 \times 2\theta^1 = 0.2 - 0.4 \times 0.4 = 0.04

    J = (\theta^2)^2 = 0.0016

    \theta^3 = \theta^2 - \alpha \frac{\partial}{\partial \theta^2} J

    \theta^3 = 0.04 - 0.4 \times 2\theta^2 = 0.04 - 0.4 \times 0.08 = 0.008

    J = (\theta^3)^2 = 6.4 \times 10^{-5}

    \theta^4 = \theta^3 - \alpha \frac{\partial}{\partial \theta^3} J

    \theta^4 = 0.008 - 0.4 \times 2\theta^3 = 0.008 - 0.4 \times 0.016 = 0.0016

    J = (\theta^4)^2 = 2.56 \times 10^{-6}

As you can see, by the fourth iteration our loss function J is already very small, and as we continue to iterate J will approach zero. At this point, we can stop training and take θ = 0.0016 as a suitable parameter. The gradient descent process is roughly shown in the figure below.
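
The four iterations above can be reproduced with a short loop. This is a sketch under the same assumptions as the example: J = θ², initial θ = 1, and α = 0.4.

```python
theta = 1.0    # initial value theta^0
alpha = 0.4    # learning rate

for i in range(1, 5):
    grad = 2 * theta              # dJ/dtheta for J = theta^2
    theta -= alpha * grad         # gradient descent update
    print(i, theta, theta ** 2)   # theta: 0.2, 0.04, 0.008, 0.0016; J shrinks toward 0
```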

4. Summary

We started with the data and briefly described what linear regression is, covering both the simple linear regression formula and the multiple linear regression formula. We then used the linear regression formula to construct a loss function, which is needed during training, and training in turn requires a large amount of data. Taking the housing price data as an example, we can calculate the θ parameters through gradient descent to describe the relationship between the housing features and the housing price, thus completing the process of linear regression.