Scikit-learn is a simple and efficient data analysis library that implements a large number of machine learning algorithms, ships with many built-in public data sets, and is well documented.

1. Classifying iris flowers with the KNN algorithm

Iris is perhaps the best-known data set in the pattern recognition literature. It contains three classes of 50 instances each, and each class refers to a type of iris plant. One class is linearly separable from the other two, which are not linearly separable from each other.

Iris data set features:

Number of attributes: 4 numeric predictive attributes, plus the class

Attribute information:
Sepal length (cm)
Sepal width (cm)
Petal length (cm)
Petal width (cm)
Category:
Iris-Setosa
Iris-Versicolour
Iris-Virginica
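
The properties listed above can be checked directly on the object returned by load_iris. The following is a minimal sketch, assuming a recent scikit-learn release; the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix: one row per sample, one column per numeric attribute
print(iris.data.shape)
# Attribute names, in column order
print(iris.feature_names)
# Class names, indexed by the integer labels stored in iris.target
print(iris.target_names)

# Number of samples for each class label (0, 1, 2)
class_counts = np.bincount(iris.target)
print(class_counts)
```

The integer labels 0, 1 and 2 in iris.target index into iris.target_names, which is how the numeric predictions later in this article map back to iris types.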

Step 1: Use load_iris from Scikit-learn to load the data set, then print the first two rows of feature values and the class labels.

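import numpy as np
from sklearn import datasets
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
iris_X = iris.data    # feature values
iris_y = iris.target  # class labels

print(iris_X[:2, :])  # first two rows of feature values
print(iris_y)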

Running results:

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]]

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Step 2: Split the data into a training set and a test set in a 7:3 ratio. train_test_split shuffles the samples before splitting, so the printed y_train is no longer in class order.

X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.3)
print(y_train)

Running results:

[0 0 2 1 1 1 0 0 0 0 0 1 2 1 1 0 0 0 1 2 2 2 2 2 1 0 0 1 0 0 0 1 2 2 2 2 2 2 2 0 0 0 1 1 2 2 2 1 0 0 0 1 2 1 2 1 1 0 0 2 2 0 0 0 0 1 2 0 1 0 0 0 1 2 2 2 1 0 0 0 1 1 1 1 2 1 2 0 0 1 2 2 2 2 2 0 1 0 0 0 1]
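
Because the shuffle is random, every run produces a different split. If a reproducible split is wanted, one option is to pass random_state, and stratify keeps the class proportions equal in both parts. The sketch below assumes a recent scikit-learn; the seed value 42 is arbitrary and not taken from the original example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris_X, iris_y = iris.data, iris.target

# random_state fixes the shuffle (42 is an arbitrary seed);
# stratify=iris_y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    iris_X, iris_y, test_size=0.3, random_state=42, stratify=iris_y
)

# Sizes of the two parts and the per-class counts in each
print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```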

Step 3: Train the classifier and test it. knn.predict(X_test) returns the predicted classes for the test-set features, which are then compared with the actual test-set labels.

knn=KNeighborsClassifier()
knn.fit(X_train,y_train)
print(knn.predict(X_test))
print(y_test)

Running results:

[2 0 1 0 0 0 0 1 2 2 2 2 2 1 0 0 0 1 2 2 0 1 2 1 2 1 2 2 0 0 1 0 0 0 1 2 2 1 0 0 2]

[2 0 1 0 0 0 0 1 2 2 2 2 2 1 0 0 0 1 2 2 2 0 1 2 1 1 1 2 2 1 0 0 0 0 0 1 2 2 1 0 0 2]

The predicted classes are close to the actual ones, but a few samples are still misclassified.
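
Comparing two printed arrays by eye is error-prone, so the amount of error can also be measured as an accuracy score. This is a minimal sketch, assuming a recent scikit-learn; the seed random_state=0 is an arbitrary choice, so the split and the resulting number differ from the run shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0  # arbitrary seed
)

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Fraction of test samples whose predicted class equals the true class
accuracy = accuracy_score(y_test, y_pred)
# Positions in the test set where the prediction is wrong
wrong = np.flatnonzero(y_pred != y_test)

print("Accuracy: %.3f" % accuracy)
print("Misclassified test indices:", wrong)
```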


2. Predicting diabetes progression with Scikit-learn linear regression

The diabetes dataset covers 442 patients with diabetes. For each patient, ten baseline variables were recorded (age, sex, body mass index, mean blood pressure, and six blood serum measurements), together with the response of interest: a quantitative measure of disease progression one year after baseline.

Linear regression: given each sample in the data set together with its correct answer, a model function h (the hypothesis) is trained on the training data. The goal is to find the parameters that minimize the residual sum of squares.
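
For a single feature, the parameters that minimize the residual sum of squares have a closed form. The sketch below illustrates this with NumPy on a tiny synthetic data set; the numbers are invented for illustration only, and the names w and b for the slope and intercept are assumptions, not notation from this article.

```python
import numpy as np

# Tiny synthetic data set, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares solution for h(x) = w * x + b
x_mean, y_mean = x.mean(), y.mean()
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - w * x_mean

# Residual sum of squares of the fitted line
residuals = y - (w * x + b)
rss = np.sum(residuals ** 2)

print("w = %.3f, b = %.3f, RSS = %.4f" % (w, b, rss))
```

Any other choice of w or b gives a larger residual sum of squares on this data, which is what "training" means for this model.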

In this example, only one feature of the diabetes dataset (column index 2, body mass index) is used, so that the regression can be shown in a two-dimensional plot. The example fits the line that minimizes the residual sum of squares between the predicted values and the correct answers in the training data. Finally, the coefficient, the mean squared error, and the variance score (the coefficient of determination, R^2) are computed on the test data.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature of the dataset (column index 2)
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Divide features into training and test data sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Divide the targets into training and test data sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create a linear regression object
regr = linear_model.LinearRegression()

# Train the model with the training data set
regr.fit(diabetes_X_train, diabetes_y_train)

# Use test data sets to make predictions
diabetes_y_pred = regr.predict(diabetes_X_test)

# Coefficient of regression equation
print('Coefficients: \n', regr.coef_)
# mean square error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Variance score (coefficient of determination, R^2): 1 represents perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Draw a scatter plot of the test data set
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
# Draw the fit
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Running results:

Coefficients:
 [938.23786125]
Mean squared error: 2548.07
Variance score: 0.47
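
The two metrics can also be reproduced by hand, which shows what they measure. The following is a minimal sketch that assumes the same feature column and the same last-20-samples test split as the listing above; the names mse_manual, ss_res, ss_tot and r2_manual are illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]  # column index 2 only
y = diabetes.target

# Same split as above: the last 20 samples form the test set
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

# Mean squared error: the average squared residual on the test set
mse_manual = np.mean((y_test - y_pred) ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE: %.2f (sklearn: %.2f)" % (mse_manual, mean_squared_error(y_test, y_pred)))
print("R^2: %.2f (sklearn: %.2f)" % (r2_manual, r2_score(y_test, y_pred)))
```

An R^2 of 1 would mean the line predicts every test target exactly, while 0 means it does no better than always predicting the mean of the test targets.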

