Linear regression

In a nutshell, a linear model makes a prediction by computing a weighted sum of the input features, plus a constant called the bias term (or intercept term), as shown in the following formula:


$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

In this formula:

  • $\hat{y}$ is the predicted value
  • $n$ is the number of features
  • $x_i$ is the value of the $i$-th feature
  • $\theta_j$ is the $j$-th model parameter (including the bias term $\theta_0$ and the feature weights $\theta_1, \theta_2, \dots, \theta_n$)
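The same prediction can be written compactly as the dot product of the parameter vector $\theta$ and the feature vector $x$ extended with $x_0 = 1$. A minimal NumPy sketch with made-up parameter values, purely for illustration:

import numpy as np

# Illustrative (made-up) parameters for a model with three features
theta = np.array([4.0, 3.0, -2.0, 0.5])   # theta_0, theta_1, theta_2, theta_3
x = np.array([1.0, 1.2, 0.7, 3.1])        # x_0 = 1 (bias input), then x_1, x_2, x_3
y_hat = theta.dot(x)                      # theta_0 + theta_1*x_1 + theta_2*x_2 + theta_3*x_3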

Training a linear regression model means setting the model parameters so that the model best fits the training set, so we first need a way to measure how well (or poorly) the model fits the training data. The most common performance measure for regression models is the root mean square error (RMSE), so training a linear regression model amounts to finding the value of $\theta$ that minimizes the RMSE. In practice, however, it is simpler to minimize the mean square error (MSE) instead: the square root is a monotonically increasing function, so the $\theta$ that minimizes the MSE also minimizes the RMSE.

On the training set $X$, the MSE of a linear regression hypothesis $h_\theta$ is computed with the following formula:


$MSE(X, h_\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)^2$
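As a sketch of what this cost function computes, here is one way to evaluate it in NumPy (the helper name mse is an assumption; X_b is the training matrix with an added $x_0 = 1$ column, as in the code later in this section):

def mse(theta, X_b, y):
    # errors[i] = theta^T x^(i) - y^(i), computed for every training instance at once
    errors = X_b.dot(theta) - y
    return (errors ** 2).mean()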

To find the value of $\theta$ that minimizes the cost function, there is a closed-form solution (a mathematical equation that gives the result directly), called the Normal Equation:


$\hat{\theta} = (X^T X)^{-1} X^T y$

In this equation:

  • $\hat{\theta}$ is the value of $\theta$ that minimizes the cost function
  • $y$ is the vector of target values, from $y^{(1)}$ to $y^{(m)}$

We generate some random linear data to test this formula:

import numpy as np

X = 2 * np.random.rand(100, 1)            # 100 instances, one feature, uniform in [0, 2)
y = 4 + 3 * X + np.random.randn(100, 1)   # linear target plus Gaussian noise

This produces a noisy but roughly linear data set.
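A minimal Matplotlib sketch to visualize it (assuming the X and y arrays from the snippet above):

import matplotlib.pyplot as plt

plt.plot(X, y, "b.")       # one blue dot per training instance (x1, y)
plt.xlabel("$x_1$")
plt.ylabel("$y$")
plt.show()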

Now let's compute $\hat{\theta}$ using the Normal Equation. The inv() function in NumPy's linear algebra module (np.linalg) inverts the matrix, and the dot() method computes the matrix product:

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

The function that actually generated the data is $y = 4 + 3x_1 + \text{Gaussian noise}$. The Normal Equation gives the following result:

>>> theta_best
array([[4.21509616],
       [2.77011339]])

This is quite close to the expected $\theta_0 = 4$ (we got 4.21509616) and $\theta_1 = 3$ (we got 2.77011339); the noise makes it impossible to recover the exact parameters of the original function. We can now plot the model's predictions:
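A minimal sketch of how these predictions can be computed and plotted (assuming theta_best, X, and y from above; two points at $x_1 = 0$ and $x_1 = 2$ are enough to draw the fitted line):

import matplotlib.pyplot as plt

X_new = np.array([[0], [2]])              # the two extreme x1 values
X_new_b = np.c_[np.ones((2, 1)), X_new]   # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)       # predictions of the fitted model

plt.plot(X_new, y_predict, "r-", label="Predictions")
plt.plot(X, y, "b.")                      # training data
plt.legend()
plt.show()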

Once the model is trained, predictions are fast: the computational complexity is linear in both the number of instances to predict on and the number of features, so making predictions on twice as many instances (or with twice as many features) takes roughly twice as long.
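For training, a commonly used alternative to inverting $X^T X$ explicitly is to call a least-squares solver, which is more numerically robust and also works when $X^T X$ is singular. A minimal sketch, assuming the X_b and y arrays defined above:

# np.linalg.lstsq solves the least-squares problem directly (via SVD) and
# returns the same theta as the Normal Equation when X^T X is invertible.
theta_best_svd, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)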