• 1  Introduction to Scikit-learn
  • 1.1  Classification
  • 1.2  Regression
  • 1.3  Clustering
  • 1.4  Dimensionality reduction
  • 1.5  Model selection
  • 1.6  Preprocessing
• 2  Scikit-learn machine learning steps
  • 2.1  Import common libraries
  • 2.2  Load the data
  • 2.3  Divide training set and test set
  • 2.4  Data preprocessing
  • 2.5  Standardization
    • 2.5.1  Normalization
    • 2.5.2  Binarization
    • 2.5.3  Encoding categorical features
    • 2.5.4  Imputing missing values
    • 2.5.5  Generating polynomial features
  • 2.6  Create a model estimator
    • 2.6.1  Supervised learning
    • 2.6.2  Unsupervised learning
  • 2.7  Fitting the data
    • 2.7.1  Supervised learning
    • 2.7.2  Unsupervised learning
  • 2.8  Prediction
    • 2.8.1  Supervised learning
    • 2.8.2  Unsupervised learning
  • 2.9  Evaluating model performance
    • 2.9.1  Classification metrics
    • 2.9.2  Regression metrics
    • 2.9.3  Clustering metrics
    • 2.9.4  Cross validation
  • 2.10  Model tuning
    • 2.10.1  Grid search
    • 2.10.2  Randomized parameter optimization

Introduction to Scikit-learn

Scikit-learn is an open source Python library that implements machine learning, preprocessing, cross-validation, and visualization algorithms through a unified interface.

Scikit-learn: scikit-learn.org

Machine learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody and reusable in a variety of contexts
  • Built on NumPy, SciPy, and Matplotlib
  • Open source and commercially usable under the BSD license

Classification

Identify which category an object belongs to.

Applications: spam detection, image recognition.

Algorithms: SVM, nearest neighbors, random forest, ...
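
As a minimal sketch of a classification task, the snippet below trains a random forest on the bundled iris dataset and scores it on held-out samples. The dataset, the `n_estimators=100` setting, and the `random_state` values are illustrative assumptions, not choices the overview prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load features and integer class labels (three iris species)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the classifier on the training split only
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predict a category for each unseen sample and measure accuracy
predicted = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```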

Regression

Predict a continuous-valued attribute associated with an object.

Applications: drug response, stock prices.

Algorithms: SVR, ridge regression, lasso, ...
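
A regression task can be sketched the same way. The example below fits ridge regression to synthetic data generated from the line y = 3x + 2 plus noise; the data, the noise level, and `alpha=1.0` are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: y = 3x + 2 with Gaussian noise (illustrative only)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)

# Ridge regression is linear least squares with an L2 penalty
reg = Ridge(alpha=1.0)
reg.fit(X, y)

# Predict a continuous value for a new input
prediction = reg.predict([[4.0]])
print(reg.coef_, reg.intercept_, prediction)
```

The learned slope and intercept should land close to the values used to generate the data.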

Clustering

Automatically group similar objects into sets.

Applications: customer segmentation, grouping experimental results.

Algorithms: K-means, spectral clustering, mean shift, ...
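
The following sketch groups unlabeled points with K-means. The blob data from `make_blobs` and the choice of three clusters are assumptions made for the demonstration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 150 two-dimensional points around 3 centers; true labels are discarded
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Fit K-means and assign each point to one of the 3 clusters
k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = k_means.fit_predict(X)
print(k_means.cluster_centers_)
```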

Dimensionality reduction

Reduce the number of random variables to consider.

Applications: visualization, improved efficiency.

Algorithms: PCA, feature selection, non-negative matrix factorization.
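
For instance, PCA can project the four iris features onto two components, which is enough for a 2D scatter plot. This is a minimal sketch; the dataset and `n_components=2` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Keep the two directions that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance retained by the two components
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, explained)
```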

Model selection

Compare, validate, and choose parameters and models.

Goal: improve accuracy through parameter tuning.

Modules: grid search, cross validation, metrics.

Preprocessing

Feature extraction and normalization.

Applications: transforming input data (such as text) for use with machine learning algorithms.

Modules: preprocessing, feature extraction.
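
To make the text case concrete, here is a small sketch that turns three made-up documents into a numeric count matrix with `CountVectorizer`. The sample sentences are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents (hypothetical sample data)
docs = ["free prize inside", "meeting at noon", "free meeting"]

# Learn the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs)  # sparse matrix: documents x words

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_counts.toarray())
```

Each row of the resulting matrix can be passed to an estimator like any other numeric feature vector.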

Scikit-learn machine learning steps

# import sklearn
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load data
iris = datasets.load_iris()

# Divide training set and test set
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Data preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# create model
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
# Model fitting
knn.fit(X_train, y_train)

# prediction
y_pred = knn.predict(X_test)
# Evaluate accuracy on the test set
accuracy_score(y_test, y_pred)

Import common libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Load the data

Scikit-learn works with data stored as NumPy arrays or SciPy sparse matrices. It also accepts other data types that can be converted to numeric arrays, such as a pandas DataFrame.

# 11 samples with 5 random features each
X = np.random.random((11, 5))
y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'F'])
# Zero out every value below 0.7
X[X < 0.7] = 0
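
As noted above, a pandas DataFrame is also accepted as input. The sketch below passes DataFrame columns straight to an estimator; the column names and values are invented for the example.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical table with two numeric features and a label column
df = pd.DataFrame({
    "height": [1.60, 1.82, 1.75, 1.55],
    "weight": [55.0, 80.0, 72.0, 50.0],
    "label": ["F", "M", "M", "F"],
})

X_df = df[["height", "weight"]]  # feature columns stay a DataFrame
y_series = df["label"]           # target column is a Series

# Estimators convert the DataFrame to a numeric array internally
knn_df = KNeighborsClassifier(n_neighbors=1).fit(X_df, y_series)
predicted_labels = knn_df.predict(X_df)

# An explicit conversion is also available when a plain array is needed
X_array = X_df.to_numpy()
print(X_array.shape, predicted_labels)
```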

Divide training set and test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Data preprocessing

Standardization

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)

Normalization

from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
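
The two transforms are easy to confuse. `StandardScaler` works per feature (each column ends up with zero mean and unit variance), while `Normalizer` works per sample (each row is rescaled to unit L2 norm by default). A small sketch on a made-up array, `X_demo`, shows the difference:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data: 3 samples, 2 features on very different scales
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# Column-wise: every feature gets mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X_demo)
print(standardized.mean(axis=0), standardized.std(axis=0))

# Row-wise: every sample gets Euclidean length 1
normalized = Normalizer().fit_transform(X_demo)
print(np.linalg.norm(normalized, axis=1))
```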

Binarization

from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)

Encoding categorical features

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)

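Imputing missing values

# Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
# Treat zeros as missing and replace them with the column mean
imp = SimpleImputer(missing_values=0, strategy='mean')
imp.fit_transform(X_train)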
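
A worked example may help show what mean imputation does. The sketch below uses `SimpleImputer` from `sklearn.impute` (the imputation class in recent scikit-learn releases) on a made-up array in which `NaN` marks the missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value in the first column
X_demo = np.array([[1.0, 2.0],
                   [np.nan, 3.0],
                   [7.0, 6.0]])

# Replace each NaN with the mean of the observed values in its column
imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp_demo.fit_transform(X_demo)

# The gap in column 0 becomes (1.0 + 7.0) / 2 = 4.0
print(X_filled)
```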

Generating polynomial features

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)
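
To see what the expansion produces, consider a single made-up sample with two features and degree 2 (a lower degree than the snippet above, chosen to keep the output readable). The output columns are the bias term, the original features, and all degree-2 products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with features x0 = 2 and x1 = 3
X_demo = np.array([[2.0, 3.0]])

# Degree-2 expansion: [1, x0, x1, x0^2, x0*x1, x1^2]
poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform(X_demo)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```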

Create a model estimator

Supervised learning

# Linear regression
from sklearn.linear_model import LinearRegression
# The normalize parameter was removed in scikit-learn 1.2;
# scale the features with StandardScaler beforehand if needed
lr = LinearRegression()
# Support vector machine (SVM)
from sklearn.svm import SVC
svc = SVC(kernel='linear')
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
# KNN
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised learning

# Principal component analysis (PCA)
from sklearn.decomposition import PCA
# A float n_components keeps enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
# K-means
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

Fitting the data

Supervised learning

lr.fit(X, y)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

Unsupervised learning

k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)

Prediction

Supervised learning

# Predict labels for two new random samples
y_pred = svc.predict(np.random.random((2, 5)))
# Predict target values
y_pred = lr.predict(X_test)
# Estimate class probabilities
y_pred = knn.predict_proba(X_test)

Unsupervised learning

y_pred = k_means.predict(X_test)

Evaluating model performance

Classification metrics

# Accuracy
knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
y_pred = knn.predict(X_test)  # class labels, not probabilities
accuracy_score(y_test, y_pred)
# Classification report: precision, recall, F1-score, and support
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

Regression metrics

# Mean absolute error
from sklearn.metrics import mean_absolute_error
y_true = [3, 0.5, 2]  # y_pred must hold one prediction per entry of y_true
mean_absolute_error(y_true, y_pred)
# Mean squared error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
# R² score
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)
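
The three metrics are easier to compare on a self-contained example. The sketch below uses four made-up true values and predictions, so the results can be checked by hand.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

# Absolute errors are 0.5, 0.5, 0, 1, so the mean is 0.5
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# Squared errors are 0.25, 0.25, 0, 1, so the mean is 0.375
mse = mean_squared_error(y_true_demo, y_pred_demo)

# R² = 1 - SS_res / SS_tot = 1 - 1.5 / 29.1875, roughly 0.949
r2 = r2_score(y_true_demo, y_pred_demo)
print(mae, mse, r2)
```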

Clustering metrics

# Adjusted Rand index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)
# Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)
# V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)
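
These clustering metrics compare groupings rather than label values, so a clustering that matches the true grouping scores 1.0 even when the cluster IDs are swapped. A minimal sketch with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

# Same grouping, but the cluster IDs 0 and 1 are swapped (hypothetical labels)
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]

ari = adjusted_rand_score(labels_true, labels_pred)
homogeneity = homogeneity_score(labels_true, labels_pred)
v_measure = v_measure_score(labels_true, labels_pred)

# All three report a perfect match
print(ari, homogeneity, v_measure)
```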

Cross validation

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))

Model tuning

Grid search

# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1, 3),
          "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn,
                    param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)

Randomized parameter optimization

from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1, 5),
          "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn,
                             param_distributions=params,
                             n_iter=8,  # the grid holds only 4 x 2 = 8 combinations
                             random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)