This is the fourth day of my participation in Gwen Challenge

As mentioned earlier, supervised learning is machine learning on labeled data. We give the machine a set of questions whose answers are already known (the training set), and during learning the machine fits a model to that data. When we then give the machine a new question, it applies the rules captured by the fitted model to compute an answer.

For example, we first tell the machine the heights that correspond to different ages and genders. Once it has been trained on a large amount of such data, we can give it an age and a gender, and it will use the trained model to calculate the corresponding height.

There are two main tasks in supervised learning, namely regression and classification.

  • Regression problems
  • Classification problems

Regression is the task of predicting a specific continuous value, while classification assigns each item to one of a set of discrete categories.
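The difference between the two tasks shows up in the type of the target. The sketch below is only an illustration: it reuses the years-of-service and salary figures from the example in this article, the 50000 threshold for the class label is a hypothetical choice made just for this sketch, and LogisticRegression stands in as one common scikit-learn classifier that the article itself does not cover.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Years of service and salary (the example data used in this article)
X = np.array([[1.1], [1.3], [1.5], [2.0], [2.2], [2.9],
              [3.0], [3.2], [3.2], [3.7], [3.9]])
salary = np.array([39343, 46205, 37731, 43525, 39891, 56642,
                   60150, 54445, 64445, 57189, 63218], dtype=float)

# Regression: the target is a continuous number (the salary itself)
reg = LinearRegression().fit(X, salary)
salary_pred = reg.predict([[2.5]])[0]

# Classification: the target is a discrete label.
# The 50000 threshold is a hypothetical choice for this sketch.
is_high = (salary >= 50000).astype(int)
clf = LogisticRegression().fit(X, is_high)
label_pred = clf.predict([[2.5]])[0]

print("regression output:", salary_pred)      # a real number
print("classification output:", label_pred)   # either 0 or 1
```

The regressor returns a number on a continuous scale, while the classifier can only return one of the labels it saw during training.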

We'll start today with one of the regression problems: simple linear regression.

Simple linear regression involves one independent variable (x) and one dependent variable (y). The relationship between the two is modeled by a straight line, the regression line.

It can be expressed as:


$$y = a + bx$$

Here, a is the estimated vertical intercept of the line and b is its estimated slope.

However, the predicted value usually differs from the actual value. If the predicted value for $x_1$ is $y_1$ and the actual value is $y$, we call the difference between $y$ and $y_1$ the residual.


$$\text{error} = y - y_1$$
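To make the residual concrete, here is a minimal sketch that evaluates a candidate line at three points taken from the salary example in this article. The values of a and b are arbitrary guesses chosen for illustration, not fitted coefficients.

```python
import numpy as np

# First three (years of service, salary) pairs from the article's example data
x = np.array([1.1, 1.3, 1.5])
y = np.array([39343.0, 46205.0, 37731.0])

# Hypothetical candidate line: a and b are guesses, not fitted values
a, b = 30000.0, 8000.0

y_1 = a + b * x        # predicted values from y = a + b*x
residuals = y - y_1    # error = y - y_1, one residual per point

# Sum of squared residuals for this candidate line
sse = float(np.sum(residuals ** 2))

print(residuals)
print(sse)
```

For this guess the residuals come out at roughly 543, 5805 and -4269. A different choice of (a, b) gives different residuals, which is what the fitting step exploits.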

Our goal is to make the predicted values as close to the actual values as possible, that is, to make the residuals as small as possible. In other words, the pair (a, b) that minimizes the sum of the squared residuals over the training data gives the simple linear regression model that fits the data best in the least-squares sense.
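For a single independent variable, the minimizing pair has a well-known closed form: b is the sum of (x - mean(x)) * (y - mean(y)) divided by the sum of (x - mean(x))^2, and a = mean(y) - b * mean(x). The sketch below applies it to the article's example data with plain NumPy. It is meant only to show what "minimizing the sum of squared residuals" amounts to; the helper name sse is introduced here for illustration.

```python
import numpy as np

# Years of service and salary (the example data used in this article)
x = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7, 3.9])
y = np.array([39343, 46205, 37731, 43525, 39891, 56642,
              60150, 54445, 64445, 57189, 63218], dtype=float)

# Closed-form least-squares estimates for y = a + b*x
x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean


def sse(a_val, b_val):
    """Sum of squared residuals for the line y = a_val + b_val * x."""
    return float(np.sum((y - (a_val + b_val * x)) ** 2))


print("intercept a:", a)
print("slope b:", b)
print("sum of squared residuals:", sse(a, b))
```

Nudging a or b away from these values can only increase sse, which is a quick way to convince yourself that the pair is the minimizer.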

We can use the LinearRegression class from sklearn.linear_model to solve simple linear regression problems.

Taking the relationship between the length of service and salary as an example, we have the following data:

Year Salary
1.1 39343
1.3 46205
1.5 37731
2 43525
2.2 39891
2.9 56642
3 60150
3.2 54445
3.2 64445
3.7 57189
3.9 63218

First, in the data preprocessing step, the data is split into a training set (X_train, Y_train) and a test set (X_test, Y_test).
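A minimal sketch of that split, assuming the table above is typed in directly rather than read from a file. The test_size and random_state values match the ones used in the complete code at the end of this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Independent variable as a 2-D array of shape (n_samples, 1),
# which is the layout scikit-learn estimators expect
X = np.array([[1.1], [1.3], [1.5], [2.0], [2.2], [2.9],
              [3.0], [3.2], [3.2], [3.7], [3.9]])
# Dependent variable as a 1-D array
Y = np.array([39343, 46205, 37731, 43525, 39891, 56642,
              60150, 54445, 64445, 57189, 63218], dtype=float)

# Hold out roughly one third of the rows for testing;
# random_state=0 makes the shuffle reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=1/3, random_state=0)

print(X_train.shape, X_test.shape)
```

Keeping X two-dimensional matters: passing a 1-D array to fit raises an error in scikit-learn.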

# Fit simple linear regression to the training set
from sklearn.linear_model import LinearRegression
# Create a regressor
regressor = LinearRegression()
# Fit the regressor
regressor.fit(X_train, Y_train)

Once the regressor has been fitted, it can predict results for the test set:

# Predict results by regression
y_pred = regressor.predict(X_test)

y_pred holds the values predicted by simple linear regression.
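The fitted regressor also exposes the a and b of the regression line through its intercept_ and coef_ attributes. The sketch below, which assumes the example table is typed in directly instead of being read from data.csv, checks that the predictions are exactly a + b * x and computes the residuals on the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Years of service and salary (the example data used in this article)
X = np.array([[1.1], [1.3], [1.5], [2.0], [2.2], [2.9],
              [3.0], [3.2], [3.2], [3.7], [3.9]])
Y = np.array([39343, 46205, 37731, 43525, 39891, 56642,
              60150, 54445, 64445, 57189, 63218], dtype=float)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=1/3, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, Y_train)
y_pred = regressor.predict(X_test)

a = regressor.intercept_   # estimated vertical intercept
b = regressor.coef_[0]     # estimated slope (one feature, so one coefficient)

# The predictions are the line y = a + b*x evaluated at the test inputs
manual_pred = a + b * X_test[:, 0]

# Residuals on the test set: actual value minus predicted value
residuals = Y_test - y_pred

print("a =", a, "b =", b)
print("residuals:", residuals)
```

Large residuals on the test set, relative to those on the training set, would suggest that the line does not generalize well to unseen data.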

You can view the predicted results more intuitively by plotting them:

# Plot the test set
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, regressor.predict(X_test), color='blue')
plt.title('Salary Vs Experience - Test Set')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

That completes the basic workflow for simple linear regression; we will look at other regression problems later. The complete code is as follows:

# Import the base libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the data
dataset = pd.read_csv('data.csv')

# Set the independent variable column (X) and the dependent variable column (Y)
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 1].values

# Generate the training set and the test set
from sklearn.model_selection import train_test_split
# random_state=0 fixes the shuffle so the split is reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)

# Fit simple linear regression to the training set
from sklearn.linear_model import LinearRegression
# Create a regressor
regressor = LinearRegression()
# Fit the regressor
regressor.fit(X_train, Y_train)

# Predict test set results
y_pred = regressor.predict(X_test)

# Visual comparison
# draw the training set
plt.scatter(X_train,Y_train,color='red')
plt.plot(X_train,regressor.predict(X_train),color='blue')
plt.title('Salary Vs Experience - Training Set')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

# Plot the test set
plt.scatter(X_test, Y_test, color='red')
plt.plot(X_test, regressor.predict(X_test), color='blue')
plt.title('Salary Vs Experience - Test Set')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()