
Principle of linear regression

As you can see, this is a two-dimensional set of data points. How do we fit these scattered points well with a straight line? To put it bluntly: we want the line to pass as close to the scattered points as possible.

Objective function (cost function)

To make these points close to the fitted line, we need a mathematical criterion to minimize:
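A standard choice of objective here is the least-squares cost (mean squared error): for a fitted line $h_\theta(x) = \theta_0 + \theta_1 x$ and $m$ data points, minimize

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$$

The smaller $J$ is, the closer the points are to the fitted line.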

Gradient descent method

When we discussed regression earlier, we obtained the minimum by setting the derivative to zero, but that closed-form approach requires the data matrix to be invertible. In practice we usually use gradient descent, which repeatedly steps in the direction opposite the gradient. For details see this article (www.jianshu.com/p/96566542b… ). Tip: that article explains gradient ascent, which works the same way as gradient descent but with the sign reversed.
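As a minimal sketch of the idea (the toy data, learning rate, and iteration count below are illustrative, not from the original article), gradient descent for a one-variable linear fit looks like this:

```python
import numpy as np

# Toy data: points near the line y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0 + np.array([0.1, -0.1, 0.05, -0.05, 0.0])

w, b = 0.0, 0.0   # parameters of the line y_hat = w*x + b
lr = 0.05         # learning rate (step size)
m = len(x)

for _ in range(5000):
    y_hat = w * x + b
    # Gradients of the mean squared error cost w.r.t. w and b
    dw = (2.0 / m) * np.sum((y_hat - y) * x)
    db = (2.0 / m) * np.sum(y_hat - y)
    # Step opposite the gradient
    w -= lr * dw
    b -= lr * db
```

After enough iterations, `w` and `b` converge to values close to the true slope 3 and intercept 2.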

Actual combat – housing price forecast

Data import

`load_boston` from `sklearn.datasets` is used to import the Boston housing price dataset.

from sklearn.datasets import load_boston
boston = load_boston()

The `DESCR` attribute gives a detailed description of the dataset: it has 14 columns, with the first 13 columns holding feature data and the last column holding the label.

print(boston.DESCR)

`boston.data` and `boston.target` store the features and labels, respectively:

Split the dataset

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=2)
Data preprocessing

The plain linear regression model is too simple and prone to underfitting. We can add polynomial features to help the linear model fit the data better. In sklearn, polynomial features are added with `PolynomialFeatures` in the `preprocessing` module. Its important parameters are:

  • degree: the degree of the polynomial features. The default is 2.
  • include_bias: defaults to True, adding a bias column of ones that serves as an intercept term in linear models. We choose False here because LinearRegression lets you decide separately whether to fit an intercept.
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
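To see concretely what `PolynomialFeatures` generates, here is a tiny example (the input values are made up for illustration). With two features x1 and x2, degree 2, and no bias column, the output columns are x1, x2, x1², x1·x2, x2²:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
# One sample with two features: x1 = 2, x2 = 3
X = np.array([[2.0, 3.0]])
X_poly = poly.fit_transform(X)
print(X_poly)  # → [[2. 3. 4. 6. 9.]]
```

So 13 original features expand into many more columns, giving the linear model extra flexibility.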
Model training and evaluation

The linear model uses the `LinearRegression` class in the `sklearn.linear_model` module. Common parameters are as follows:

  • fit_intercept: defaults to True; whether to fit an intercept term.
  • normalize: whether to normalize the data before fitting. The default is False.

Simple linear regression

from sklearn.linear_model import LinearRegression

model2 = LinearRegression(normalize=True)
model2.fit(X_train, y_train)
model2.score(X_test, y_test)

# result
# 0.77872098747725804

Polynomial linear regression

model3 = LinearRegression(normalize=True)
model3.fit(X_train_poly, y_train)
model3.score(X_test_poly, y_test)

# result
# 0.895848854203947
Conclusion

Increasing the polynomial degree can fit the training set very well, but too high a degree easily causes overfitting, so the model no longer performs well on the test set. This is what is often called poor model generalization.
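The overfitting effect can be sketched with plain numpy (the toy data and degrees below are illustrative, not the Boston dataset): a degree-4 polynomial passes through all five noisy training points exactly, yet predicts a held-out point much worse than a simple line.

```python
import numpy as np

# Five training points drawn from y = 2x + 1 with fixed "noise"
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
noise = np.array([0.5, -0.3, 0.2, -0.4, 0.1])
y_train = 2.0 * x_train + 1.0 + noise

# Degree-1 fit vs degree-4 fit (degree 4 interpolates all 5 points)
p1 = np.polyfit(x_train, y_train, 1)
p4 = np.polyfit(x_train, y_train, 4)

# Worst-case error on the training set
train_err1 = np.max(np.abs(np.polyval(p1, x_train) - y_train))
train_err4 = np.max(np.abs(np.polyval(p4, x_train) - y_train))

# Error at a held-out point x = 5, where the true value is 11
test_err1 = abs(np.polyval(p1, 5.0) - 11.0)
test_err4 = abs(np.polyval(p4, 5.0) - 11.0)
```

Here `train_err4` is essentially zero while `test_err4` is far larger than `test_err1`: the flexible model memorizes the training noise instead of generalizing.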