Recommended background knowledge: matrix calculus and a little programming experience.

1. Introduction

The previous articles introduced two binary classification algorithms: the perceptron algorithm and the pocket algorithm. These algorithms solve classification problems, but many real-world problems instead ask for a specific numerical answer, for example predicting housing prices in an area, deciding how large a credit line a bank should give someone, or choosing which stocks to buy or sell today. In machine learning this type of problem is known as a regression problem. Regression analysis is a statistical method for studying the relationship between groups of variables, and it is also widely used in machine learning. This article introduces one such model, linear regression [1].

2. Model Introduction

Before introducing the model, let's look at an example. Suppose a region has recorded the average monthly income of people with different numbers of years of work experience (simulated data), as shown in the following table:

| Years of experience | Average monthly salary |
| --- | --- |
| 1 year | 1598 yuan |
| 2 years | 3898 yuan |
| 3 years | 6220 yuan |
| 4 years | 7799 yuan |
| 5 years | 10510 yuan |

From the data in the table above, we can plot the local average monthly income against years of experience:

From the plot it is intuitively clear that the points lie on or near a straight line, so the local average income appears to have a roughly linear relationship with years of experience. If we can find that line, we can then predict, for example, the average income of people with six years of experience. The process of finding such a line is called linear regression analysis.

Definition: given some random sample points {x1, y1}, {x2, y2}, ..., find a hyperplane (a straight line for one variable, a plane for two variables) that fits these sample points. The linear equation is as follows: (1) the general form of the hyperplane equation; (2) as in the perceptron algorithm, b is treated as the 0th component of w, so the hyperplane equation simplifies to the dot product of the two vectors w and x:


$$
\begin{aligned}
y &= b+w_{1} x_{1}+w_{2} x_{2}+\cdots+w_{M} x_{M} \\
  &= w^{T} x
\end{aligned}
$$

With one variable this is a straight line, with two variables a plane, and with more variables a hyperplane.
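As a quick illustration of the model, here is a minimal NumPy sketch in which the bias b is stored as w[0] and a constant 1 is prepended to the feature vector (the numeric values are the coefficients fitted later in this article, used here purely for illustration):

```python
import numpy as np

# Hypothesis: y = w^T x, with the bias b stored as w[0]
# and a constant 1 prepended to the feature vector.
w = np.array([-512.5, 2172.5])   # [b, w1] -- illustrative values
x = np.array([1.0, 3.0])         # [1, x1] -- e.g. 3 years of experience
y = w.dot(x)                     # predicted monthly salary
print(y)                         # 6005.0
```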

3. Algorithm Steps

So how do we find the best hyperplane among a set of sample points? In the example of years of experience versus average monthly salary above, it seems that many different lines could be drawn to fit the points.

Consider the two lines above: which one intuitively fits better? The dashed segments in the figure show that each sample point is farther from line B than from line A, so line A clearly fits better than line B. This suggests that goodness of fit can be judged by the distance between the sample points and the line. Suppose there are N sample points {x, y}, each with M independent variables {x1, x2, ...}. The sum of the distances between all sample points and the hyperplane can then be defined as the cost function for fitting these points, where w is an M-dimensional column vector, x is also an M-dimensional column vector, and y is a real number:


$$
\operatorname{Cost}(w)=\sum_{i=1}^{N}\left|w^{T} x_{i}-y_{i}\right|
$$

Since the function contains an absolute value, we rewrite it as a square, which in geometry corresponds to the (squared) Euclidean distance [2]:


$$
\operatorname{Cost}(w)=\sum_{i=1}^{N}\left(w^{T} x_{i}-y_{i}\right)^{2}
$$

Intuitively, we only need to minimize the value of the cost function, that is, the sum of squared Euclidean distances from all sample points to the hyperplane; the w that achieves this minimum is the weight vector of the hyperplane.


$$
w=\underset{w}{\operatorname{argmin}}\left(\sum_{i=1}^{N}\left(w^{T} x_{i}-y_{i}\right)^{2}\right)
$$

The argmin [3] function returns the argument at which the expression in parentheses attains its minimum.
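A minimal sketch of the squared-error cost defined above (the function name, variable names, and toy data here are my own, for illustration only):

```python
import numpy as np

def cost(w, X, y):
    """Sum of squared residuals: sum_i (w^T x_i - y_i)^2."""
    residuals = X.dot(w) - y
    return np.sum(residuals ** 2)

# Toy data: each row of X is a sample, with a leading 1 for the bias term.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

print(cost(np.array([0.0, 2.0]), X, y))  # 0.0  -- perfect fit
print(cost(np.array([0.0, 1.0]), X, y))  # 14.0 -- worse fit, larger cost
```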

The method that solves for w by minimizing the function above is called the least squares method. Since the cost function is a convex function [4], by the properties of convex functions any local minimum is also the global minimum, so the optimal analytic solution for w can be obtained directly, where X is an N × M matrix and y is an N-dimensional column vector:


$$
w=\left(X^{T} X\right)^{-1} X^{T} y
$$

$$
X=\left[\begin{array}{c} x_{1}^{T} \\ x_{2}^{T} \\ \vdots \\ x_{N}^{T} \end{array}\right]=\left[\begin{array}{cccc} X_{11} & X_{12} & \cdots & X_{1M} \\ X_{21} & X_{22} & \cdots & X_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{NM} \end{array}\right] \qquad y=\left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{array}\right)
$$
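A direct NumPy translation of this closed form might look like the sketch below (synthetic data, my own variable names). Solving the linear system X^T X w = X^T y with np.linalg.solve avoids forming the inverse explicitly, which is generally the more stable choice numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 3
# Leading column of ones plays the role of the bias term w0.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])
true_w = np.array([1.0, 2.0, -3.0])
y = X.dot(true_w) + 0.01 * rng.normal(size=N)   # small noise

# w = (X^T X)^{-1} X^T y, computed by solving X^T X w = X^T y
w = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(w)  # approximately [1, 2, -3]
```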

4. Proof of Principle

First, we show that the cost function of linear regression is convex. A convex function is a real-valued function f defined on a convex subset C of a vector space such that, for any two vectors x1 and x2 in C, the following inequality holds:


$$
f\left(\frac{x_{1}+x_{2}}{2}\right) \leq \frac{f\left(x_{1}\right)+f\left(x_{2}\right)}{2}
$$

The left-hand side of the inequality:


$$
\operatorname{Cost}\left(\frac{w_{1}+w_{2}}{2}\right)=\sum_{i=1}^{N}\left[\left(\frac{w_{1}+w_{2}}{2}\right)^{T} x_{i}-y_{i}\right]^{2}
$$

The right-hand side of the inequality:


$$
\frac{\operatorname{Cost}\left(w_{1}\right)+\operatorname{Cost}\left(w_{2}\right)}{2}=\frac{\sum_{i=1}^{N}\left(w_{1}^{T} x_{i}-y_{i}\right)^{2}+\sum_{i=1}^{N}\left(w_{2}^{T} x_{i}-y_{i}\right)^{2}}{2}
$$

(1) Multiply the left-hand side of the inequality by 2. (2) Move the 2 inside the summation and write the expression in parentheses as three terms. (3) Expand the square into six terms:


$$
\begin{aligned}
2 \operatorname{Cost}\left(\frac{w_{1}+w_{2}}{2}\right) &=2 \sum_{i=1}^{N}\left[\left(\frac{w_{1}+w_{2}}{2}\right)^{T} x_{i}-y_{i}\right]^{2} & (1) \\
&=\sum_{i=1}^{N} 2\left(\frac{w_{1}^{T} x_{i}}{2}+\frac{w_{2}^{T} x_{i}}{2}-y_{i}\right)^{2} & (2) \\
&=\sum_{i=1}^{N}\left(\frac{w_{1}^{T} x_{i} x_{i}^{T} w_{1}}{2}+\frac{w_{2}^{T} x_{i} x_{i}^{T} w_{2}}{2}+w_{1}^{T} x_{i} w_{2}^{T} x_{i}-2 w_{1}^{T} x_{i} y_{i}-2 w_{2}^{T} x_{i} y_{i}+2 y_{i}^{2}\right) & (3)
\end{aligned}
$$

(1) Multiply the right-hand side of the inequality by 2 as well. (2) Combine the two sums into a single sum. (3) Expand the squared terms in parentheses. (4) Rearrange so the terms line up with the expansion above:


$$
\begin{aligned}
\operatorname{Cost}\left(w_{1}\right)+\operatorname{Cost}\left(w_{2}\right) &=\sum_{i=1}^{N}\left(w_{1}^{T} x_{i}-y_{i}\right)^{2}+\sum_{i=1}^{N}\left(w_{2}^{T} x_{i}-y_{i}\right)^{2} & (1) \\
&=\sum_{i=1}^{N}\left[\left(w_{1}^{T} x_{i}-y_{i}\right)^{2}+\left(w_{2}^{T} x_{i}-y_{i}\right)^{2}\right] & (2) \\
&=\sum_{i=1}^{N}\left(w_{1}^{T} x_{i} x_{i}^{T} w_{1}-2 w_{1}^{T} x_{i} y_{i}+y_{i}^{2}+w_{2}^{T} x_{i} x_{i}^{T} w_{2}-2 w_{2}^{T} x_{i} y_{i}+y_{i}^{2}\right) & (3) \\
&=\sum_{i=1}^{N}\left(w_{1}^{T} x_{i} x_{i}^{T} w_{1}+w_{2}^{T} x_{i} x_{i}^{T} w_{2}-2 w_{1}^{T} x_{i} y_{i}-2 w_{2}^{T} x_{i} y_{i}+2 y_{i}^{2}\right) & (4)
\end{aligned}
$$

Subtract the (doubled) left-hand side from the (doubled) right-hand side and call the difference Δ; we now only need to show that this difference is greater than or equal to zero. (1) Comparing the two expansions, the last three terms are identical and cancel when subtracting. (2) Factor out 1/2. (3) Looking inside the parentheses, the expression is a perfect square:


$$
\begin{aligned}
\Delta &=\sum_{i=1}^{N}\left(\frac{w_{1}^{T} x_{i} x_{i}^{T} w_{1}}{2}+\frac{w_{2}^{T} x_{i} x_{i}^{T} w_{2}}{2}-w_{1}^{T} x_{i} w_{2}^{T} x_{i}\right) & (1) \\
&=\frac{1}{2} \sum_{i=1}^{N}\left(w_{1}^{T} x_{i} x_{i}^{T} w_{1}+w_{2}^{T} x_{i} x_{i}^{T} w_{2}-2 w_{1}^{T} x_{i} w_{2}^{T} x_{i}\right) & (2) \\
&=\frac{1}{2} \sum_{i=1}^{N}\left(w_{1}^{T} x_{i}-w_{2}^{T} x_{i}\right)^{2} & (3)
\end{aligned}
$$

Since the difference between the right-hand side and the left-hand side is a sum of perfect squares, the result must be greater than or equal to zero over the real numbers, which proves that the cost function is convex.
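The midpoint-convexity inequality can also be spot-checked numerically. The sketch below (random data and my own names, purely illustrative) draws random pairs w1, w2 and verifies that Cost((w1+w2)/2) ≤ (Cost(w1)+Cost(w2))/2 in every case:

```python
import numpy as np

def cost(w, X, y):
    r = X.dot(w) - y
    return np.sum(r ** 2)

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)

for _ in range(1000):
    w1, w2 = rng.normal(size=4), rng.normal(size=4)
    lhs = cost((w1 + w2) / 2, X, y)
    rhs = (cost(w1, X, y) + cost(w2, X, y)) / 2
    assert lhs <= rhs + 1e-9  # midpoint convexity holds for every draw
```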

Analytic solution of the linear regression cost function: (1) the cost function of linear regression, (2) which can be rewritten as the product of two N-dimensional vectors; (3) using A to denote the N-dimensional row vector, the cost function is then A times the transpose of A:


$$
\begin{aligned}
\operatorname{Cost}(w) &=\sum_{i=1}^{N}\left(w^{T} x_{i}-y_{i}\right)^{2} & (1) \\
&=\left(w^{T} x_{1}-y_{1} \; \cdots \; w^{T} x_{N}-y_{N}\right)\left(\begin{array}{c} w^{T} x_{1}-y_{1} \\ \vdots \\ w^{T} x_{N}-y_{N} \end{array}\right) & (2) \\
&=A A^{T} & (3)
\end{aligned}
$$

(1) Definition of the vector A. (2) A can be written as the difference of two N-dimensional row vectors. (3) The first row vector can be written as w transposed (an M-dimensional row vector) multiplied by an M × N matrix (recall that each x_i is an M-dimensional column vector). (4) Define the N × M matrix X, whose N rows correspond to the N samples and whose M columns correspond to the M variables, and the N-dimensional column vector y, whose N components correspond to the N samples. The matrix formed by placing the column vectors x_i side by side is then the transpose of X, and the row formed by the real numbers y_i is the transpose of the column vector y:


$$
\begin{aligned}
A &=\left(w^{T} x_{1}-y_{1} \; \cdots \; w^{T} x_{N}-y_{N}\right) & (1) \\
&=\left(w^{T} x_{1} \; \cdots \; w^{T} x_{N}\right)-\left(y_{1} \; \cdots \; y_{N}\right) & (2) \\
&=w^{T}\left(x_{1} \; \cdots \; x_{N}\right)-\left(y_{1} \; \cdots \; y_{N}\right) & (3) \\
&=w^{T} X^{T}-y^{T} & (4)
\end{aligned}
$$

(1) The cost function as the product of the two vectors above. (2) Substitute the simplified form of A into the cost function. (3) By the properties of the transpose [5], rewrite the second factor. (4) Expand the product. (5) Observe that the two middle terms are transposes of each other; since each evaluates to a real number they must be equal, so they can be combined:


$$
\begin{aligned}
\operatorname{Cost}(w) &=A A^{T} & (1) \\
&=\left(w^{T} X^{T}-y^{T}\right)\left(w^{T} X^{T}-y^{T}\right)^{T} & (2) \\
&=\left(w^{T} X^{T}-y^{T}\right)(X w-y) & (3) \\
&=w^{T} X^{T} X w-w^{T} X^{T} y-y^{T} X w+y^{T} y & (4) \\
&=w^{T} X^{T} X w-2 w^{T} X^{T} y+y^{T} y & (5)
\end{aligned}
$$

(1) Take the partial derivative of the cost function with respect to w. By the vector derivative formulas, only the first two terms depend on w; the last term is a constant. Because the cost function is convex, the cost is minimized when this partial derivative equals the zero vector. (2) Move the second term to the other side, divide by 2, and multiply both sides on the left by the inverse of X^T X; the inverse times X^T X is the identity matrix, leaving only the vector w:


$$
\begin{aligned}
\frac{\partial \operatorname{Cost}(w)}{\partial w} &=2 X^{T} X w-2 X^{T} y=0 & (1) \\
w &=\left(X^{T} X\right)^{-1} X^{T} y & (2)
\end{aligned}
$$

This gives the analytic solution for w. Note that when the M × M matrix in parentheses is not a full-rank matrix [6], it has no inverse; the underlying cause is multicollinearity [7] among the independent variables X. The next article introduces multicollinearity in linear regression and how to handle it. The part of the formula below that precedes the vector y is called the pseudo-inverse of the matrix X, which can be computed via singular value decomposition [8] (SVD).


$$
w=\underbrace{\left(X^{T} X\right)^{-1} X^{T}}_{X^{+}} y
$$
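A sketch of how the pseudo-inverse could be built from the SVD X = U Σ V^T and compared against NumPy's np.linalg.pinv (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))

# Reduced SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# X^+ = V @ diag(1/s) @ U^T (inverting only the non-zero singular values)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(X_pinv, np.linalg.pinv(X)))  # True
```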

Recall that X is an N × M matrix and y is an N-dimensional column vector. In the example of years of experience and average monthly salary, X is a 5 × 2 matrix (the first column of ones represents the bias term) and y is a 5-dimensional column vector, so w comes out as a 2-dimensional column vector and the fitted line for this example is y = 2172.5x − 512.5.


$$
X = \begin{bmatrix} 1 & 1\\ 1 & 2\\ 1 & 3\\ 1 & 4\\ 1 & 5 \end{bmatrix} \qquad y = \begin{pmatrix} 1598\\ 3898\\ 6220\\ 7799\\ 10510 \end{pmatrix}
$$

$$
w=\left(X^{T} X\right)^{-1} X^{T} y=\begin{pmatrix} -512.5 \\ 2172.5 \end{pmatrix}
$$
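The worked example above can be reproduced directly in NumPy (a quick sanity check, not part of the original derivation):

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

# w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)                # [-512.5  2172.5]
print(w[0] + w[1] * 6)  # 12522.5 -- predicted salary for 6 years of experience
```

Using this w, the fitted line would predict an average monthly salary of about 12522.5 yuan for six years of experience.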

Geometric interpretation of the analytic solution of the linear regression cost function. The matrix form of the linear equation is given below, where y is an N-dimensional column vector, X is an N × M matrix, and w is an M-dimensional column vector:


$$
y = X w
$$

Now consider the figure below. The gray hyperplane S is spanned by the M black N-dimensional column vectors of X. The red N-dimensional vector y is the vector of actual sample values. In the example of years of experience and average monthly salary, x1 = (1, 1, 1, 1, 1), x2 = (1, 2, 3, 4, 5), and y = (1598, 3898, 6220, 7799, 10510). We want to find the linear combination y' of the columns of X that minimizes the length of y − y', shown as the purple vector. As the figure shows, y − y' is shortest when it is a normal vector of the hyperplane, which is equivalent to y − y' being perpendicular to every column vector of X.

Geometric interpretation

(1) y − y' is perpendicular to every column vector of X; note that the result is the M-dimensional zero vector. (2) Substitute the linear equation for y'. (3) Expand the parentheses. (4) Rearrange and solve for w:


$$
\begin{array}{ll}
X^{T}\left(y-y^{\prime}\right)=0 & (1) \\
X^{T}(y-X w)=0 & (2) \\
X^{T} y-X^{T} X w=0 & (3) \\
w=\left(X^{T} X\right)^{-1} X^{T} y & (4)
\end{array}
$$

As you can see, the geometric interpretation gives the same result as the derivation by differentiation.
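The orthogonality condition can also be checked numerically: with the fitted w from the salary example, X^T (y − Xw) should come out as (numerically) the zero vector. A small sketch, reusing the matrices defined above:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

w = np.linalg.pinv(X) @ y
residual = y - X @ w
print(X.T @ residual)  # close to [0, 0]: the residual is orthogonal to every column of X
```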

5. Code Implementation

Using Python to implement the linear regression algorithm:

```python
import numpy as np

def linear(X, y):
    """Linear regression. Args: X - training data, y - target values. Returns: w - weight coefficients."""
    # np.linalg.pinv computes the pseudo-inverse of the matrix directly
    return np.linalg.pinv(X).dot(y)
```
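Calling it on the salary data from the example above (reusing the linear function defined here):

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

w = linear(X, y)
print(w)  # approximately [-512.5, 2172.5]
```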

6. Third-Party Library Implementation

Implementation with scikit-learn [9]:

```python
from sklearn.linear_model import LinearRegression

# Initialize the linear regressor
lin = LinearRegression(fit_intercept=False)
# Fit the linear model
lin.fit(X, y)
# Weight coefficients
w = lin.coef_
```
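A complete runnable version with the salary data might look like this (fit_intercept=False is used because the bias is already included as the first column of X, matching the snippet above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

lin = LinearRegression(fit_intercept=False)
lin.fit(X, y)
print(lin.coef_)              # approximately [-512.5, 2172.5]
print(lin.predict([[1, 6]]))  # approximately [12522.5]
```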

7. Animation Demonstration

The animation shows how the cost function changes as the linear equation takes different weight coefficients: the cost first decreases to a minimum and then gradually increases again.
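The same behaviour can be reproduced by sweeping one weight while holding the other fixed and plotting the cost. A rough sketch, assuming matplotlib is available (not the original animation code):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

# Sweep the slope w1 while keeping the intercept fixed at its optimal value.
slopes = np.linspace(1000, 3500, 200)
costs = [np.sum((X @ np.array([-512.5, s]) - y) ** 2) for s in slopes]

plt.plot(slopes, costs)  # U-shaped curve with its minimum near w1 = 2172.5
plt.xlabel("slope w1")
plt.ylabel("Cost(w)")
plt.show()
```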

8. Mind Map

9. References

  1. en.wikipedia.org/wiki/Linear…
  2. en.wikipedia.org/wiki/Euclid…
  3. en.wikipedia.org/wiki/Arg_ma…
  4. en.wikipedia.org/wiki/Convex…
  5. en.wikipedia.org/wiki/Transp…
  6. en.wikipedia.org/wiki/Rank_(…
  7. en.wikipedia.org/wiki/Multic…
  8. en.wikipedia.org/wiki/Singul…
  9. scikit-learn.org/stable/modu…

The full demo can be found here

Note: this article strives to be accurate and easy to understand, but the author is a beginner with limited experience; if there are any errors or omissions, readers are welcome to point them out in the comments.