• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/4…
  • Article address: www.showmeai.tech/article-det…
  • Statement: all rights reserved; for reprinting please contact the platform and the author and indicate the source
  • Bookmark ShowMeAI for more highlights

Introduction

In the previous article, SKLearn Introduction and Simple Application Cases, we covered the basic modules and usage of the SKLearn toolkit. In this article we go further and explain SKLearn's advanced and core content. SKLearn has six task modules, as shown in the figure below: classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

  • SKLearn website: scikit-learn.org/stable/
  • For a quick start with SKLearn, we also recommend the ShowMeAI article and cheat sheet AI Quick Modeling Tools | Scikit-learn Usage Guide

Because of SKLearn's high-level encapsulation, classification models, regression models, clustering and dimensionality-reduction models, preprocessors, and so on are all called estimators. Just as in Python everything is an object, in SKLearn everything is an estimator.

In this article, we will explain the scikit-learn library in more depth and try to cover all aspects of applying it. The content includes:

  • ① Machine learning basics: the definition of machine learning and its four elements (data, task, performance metric, and algorithm/model), and how machine learning concepts map onto SKLearn.
  • ② SKLearn overview: API design principles and SKLearn's characteristics (consistency, inspectability, standard classes, composability, and sensible defaults), plus SKLearn's built-in data and storage formats.
  • ③ SKLearn's three core APIs: estimator, predictor, and transformer. This section is very important; these core APIs are what we mainly use in practice.
  • ④ SKLearn advanced APIs: Pipeline estimators, Ensemble estimators, Multiclass and Multioutput estimators, and Model Selection estimators.

1. Introduction to machine learning

For this section, we strongly recommend reading the ShowMeAI articles Illustrated Machine Learning | Basic Knowledge and Illustrated Machine Learning | Model Evaluation Methods and Metrics, where this content is explained in detail.

1.1 Definition and constituent elements

What is machine learning? In the words of Tom Mitchell's classic definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.


According to the above definition of machine learning, machine learning consists of four elements:

  • Data
  • Task
  • Performance metric
  • Algorithm (model)

1.2 Data

Data is the carrier of information. Data can be divided in the following ways:

  • From the dimension of "data type": structured data and unstructured data.

    • Structured data is logically expressed and stored in two-dimensional table structures.
    • Unstructured data has no predefined data model and cannot be represented by a two-dimensional database table; it includes images, text, audio, and video.
  • From the dimension of "data expression form": raw data and processed data.

  • From the dimension of "statistical nature of data": in-sample data and out-of-sample data.

Neural networks generally work better on unstructured data; see ShowMeAI's image-modeling examples in Python Machine Learning Algorithms in Practice.

Machine learning models more often use structured data, i.e. two-dimensional data tables. Here we take the iris dataset as an example, as shown in the figure below.

The following terms should be clear before you dive into machine learning:

  • Each row of records (here, the measurements of a single iris) is called a "sample".
  • Each column reflecting a property of the sample, such as Sepal Length or Petal Length, is called a "feature".
  • The value of a feature, such as 5.1 or 3.5 for "sample 1", is called a "feature value".
  • Information about the sample's outcome, such as Setosa or Versicolor, is called the "label" (class label).
  • A sample together with its label is called an "instance", i.e. instance = (features, label).
  • The process of learning models from data is called “learning” or “training.”
  • In training data, each sample is called a “training instance” and the entire set is called a “training set”.

1.3 Task

According to the learning task type (whether the training data has labels), machine learning can be divided into several categories:

  • Supervised Learning (labeled)
  • Unsupervised learning (no labels)
  • Semi-supervised learning (with partial labels)
  • Reinforcement learning (delayed labels/rewards)

The following diagram illustrates the relationship between different types of machine learning.

1.4 Performance Measurement

The most common error functions for regression and classification tasks, as well as some useful performance metrics, are shown below. For details, see the ShowMeAI article Illustrated Machine Learning | Model Evaluation Methods and Metrics.

2. SKLearn data

SKLearn, a toolkit for general machine learning modeling, consists of six task modules and a data import module:

  • Supervised learning: classification tasks
  • Supervised learning: regression tasks
  • Unsupervised learning: clustering tasks
  • Unsupervised learning: dimensionality reduction tasks
  • Model selection task
  • Data preprocessing task
  • Data import module

Start by looking at the SKLearn default data format and the built-in data set.

2.1 SKLearn Default data format

The data that SKLearn models can use directly comes in two forms:

  • Dense data: a two-dimensional NumPy array (ndarray); most data comes in this format.
  • Sparse data: a SciPy sparse matrix (scipy.sparse), used e.g. for word counts in text analysis (with a 100,000-word vocabulary) or for one-hot encodings, where the matrix is mostly zeros; an ndarray would waste too much memory in these cases.
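As a quick illustration of the two formats, here is a minimal sketch (not from the original article; the data is made up) that stores the same mostly-zero matrix densely with NumPy and sparsely with SciPy:

# Minimal sketch (illustrative only): the same one-hot style matrix stored as a
# dense NumPy array and as a SciPy CSR sparse matrix. Both can be fed to SKLearn.
import numpy as np
from scipy import sparse

dense = np.eye(5)                       # 5x5 matrix, mostly zeros
sparse_csr = sparse.csr_matrix(dense)   # only the non-zero entries are stored

print(dense.nbytes)                     # bytes used by the dense array
print(sparse_csr.data.nbytes)           # bytes used by the stored non-zero values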

2.2 Built-in data set

SKLearn has many built-in datasets for users to use.

For example, the iris data set used in the Python machine learning algorithm practice in the previous article contains four features (sepal length/width and petal length/width) and three categories.

We can import datasets directly from SKLearn with the following code (the code can be run in an online Jupyter environment):

# import tool library
from sklearn.datasets import load_iris    
iris = load_iris()

# Data is stored in "dictionary" format, see what keys iris has.
iris.keys()

The output is as follows:

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

Read the data set’s information:

# Print the shape of the iris data, the feature names, the target shape and names, and show the first five samples.
n_samples, n_features = iris.data.shape    
print((n_samples, n_features))    
print(iris.feature_names)    
print(iris.target.shape)    
print(iris.target_names)
iris.data[0:5]

The output is as follows:

(150, 4)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
(150,)
['setosa' 'versicolor' 'virginica']
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

Construct Dataframe format dataset:

# merge X and y into Dataframe format data
import pandas as pd
import seaborn as sns
iris_data = pd.DataFrame( iris.data,     
                          columns=iris.feature_names )    
iris_data['species'] = iris.target_names[iris.target]    
iris_data.head(3).append(iris_data.tail(3))

The output is as follows:

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

Let's use Seaborn for some exploratory analysis and see how the data is distributed; pairwise feature plots are used here. See ShowMeAI's Seaborn Tools and Data Visualization tutorial for details on how to use Seaborn.

# Use Seaborn's Pairplot to see the relationship between two features
sns.pairplot( iris_data, hue='species', palette='husl' )

2.3 Dataset import methods

The iris dataset mentioned earlier is loaded with load_iris. In fact, SKLearn has three ways of importing data:

  • Packaged data: for small datasets, use sklearn.datasets.load_*
  • Downloadable data: for large datasets, use sklearn.datasets.fetch_*
  • Randomly generated data: for quick demonstrations, use sklearn.datasets.make_*

The asterisk * stands for the specific dataset name. In a Jupyter IDE you can type the prefix and press TAB to auto-complete and browse the available options:

  • datasets.load_
  • datasets.fetch_
  • datasets.make_

For example, we call load_iris

from sklearn import datasets
datasets.load_iris

The output is as follows:

<function sklearn.datasets.base.load_iris(return_X_y=False)>

We call load_digits to load the handwritten digit image dataset:

digits = datasets.load_digits()
digits.keys()

Output:

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

Let’s look at an example of fetching data via fetch:

# California Housing Data Set
california_housing = datasets.fetch_california_housing()    
california_housing.keys()

Output:

dict_keys(['data', 'target', 'feature_names', 'DESCR'])
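The examples above cover load_* and fetch_*. As a minimal sketch (assumed example, not from the original article), the make_* functions generate synthetic data for quick demonstrations:

# Minimal sketch (assumed example): generate a small synthetic dataset with make_blobs.
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)
print(X.shape, y.shape)   # (200, 4) (200,)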

3. SKLearn core API

We mentioned earlier that in SKLearn everything is an estimator. Estimator is a very abstract term; loosely speaking, we can think of it as a model (for regression, classification, clustering, or dimensionality reduction) or a processing step (preprocessing, grid-search cross-validation).

The three APIs in this section are all estimators:

  • An estimator has the fitting capability (fit).
  • A predictor is an estimator with predicting capability (predict).
  • A transformer is an estimator with transforming capability (transform).

3.1 Estimator

Any object that can estimate some parameters based on a data set is called an estimator and has two core points:

  • ① Need to input data.
  • ② Parameters can be estimated.

Estimators are first created and then fitted.

  • Create estimator: you need to set a set of hyperparameters, such as

    • the hyperparameter normalize=True in linear regression
    • the hyperparameter n_clusters=5 in K-Means
  • Fit estimator: a training set is required

    • the code pattern in supervised learning is model.fit(X_train, y_train)
    • the code pattern in unsupervised learning is model.fit(X_train)

After fitting, the parameters learned by the model can be accessed, such as the feature coefficients coef_ in linear regression or the cluster labels in K-Means, as shown below (the details can be found on each model's page in the SKLearn documentation).

  • model.coef_
  • model.labels_

Let’s look at specific examples of “linear regression” for supervised learning and “k-means clustering” for unsupervised learning.

(1) Linear regression

First import LinearRegression from sklearn.linear_model, create a model object named model, and set the hyperparameter normalize to True (normalizing each feature makes the fit more stable and faster). Note that in recent SKLearn versions the normalize parameter has been removed; the same effect can be obtained by scaling the features beforehand.

from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)
model

Output:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

After the estimator is created, all hyperparameters (such as normalize=True) will be displayed, and any unset hyperparameters will use their default values.

Create your own simple data set (data points on a line) and briefly explain the features in the estimator.

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(10)    
y = 2 * x + 1    
plt.plot( x, y, 'o' )

In the data we generated, x is one-dimensional. We make a small adjustment and use np.newaxis to add a dimension, turning [1, 2, 3] into [[1], [2], [3]], which meets SKLearn's input requirements. X and y are then fed into fit() to fit the parameters of the linear model.

X = x[:, np.newaxis]    
model.fit( X, y )

The output is:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

The fitted estimator looks the same as the one we created, but we can now access the parameters learned from the data through attributes whose names end with an underscore, as shown in the following code.

print( model.coef_ )    
print( model.intercept_ )
# output result
# [2]
# 0.9999999999999982

(2) K-Means

Now let's look at a clustering example. First import KMeans from sklearn.cluster and initialize a model object named model, setting the hyperparameter n_clusters to 3 (for demonstration, since we know the iris dataset has 3 classes; in practice you can try different values of n_clusters).

Although iris data contains label Y, we will not use this information in unsupervised clustering.

from sklearn.cluster import KMeans    
model = KMeans( n_clusters=3 )    
model

The output is:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

The iris dataset contains four features (sepal length, sepal width, petal length, and petal width). Since we want to visualize the result, we simply select two features (sepal length and sepal width) for clustering and plotting.

Note that X = iris.data[:, 0:2] in the following code extracts these two feature dimensions.

from sklearn.datasets import load_iris    
iris = load_iris()
X = iris.data[:,0:2]    
model.fit(X)

The output is:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

The fitted estimator looks the same as the one we created, but we can now access the parameters learned from the data through attributes whose names end with an underscore, as shown in the following code.

print( model.cluster_centers_, '\n')    
print( model.labels_, '\n' )    
print( model.inertia_, '\n')    
print(iris.target)
[[5.77358491 2.69245283]
 [6.81276596 3.07446809]
 [5.006      3.428     ]] 

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 1 1 1
 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1
 1 0] 

37.05070212765958 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Here are the parameters of the KMeans model:

  • model.cluster_centers_: the cluster centers; with three clusters there are three center coordinates.
  • model.labels_: the cluster label assigned to each sample.
  • model.inertia_: the sum of squared distances from all points to their corresponding cluster center (the smaller the better).
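As a minimal sketch (not from the original article), the fitted KMeans model above can also assign new samples to the nearest learned cluster center; note that the cluster indices themselves are arbitrary:

# Minimal sketch (assumed example): assign new points (sepal length, sepal width)
# to the nearest cluster center of the model fitted above.
import numpy as np

new_points = np.array([[5.0, 3.4],
                       [6.9, 3.1]])
print(model.predict(new_points))   # cluster indices, e.g. [2 1] (the numbering is arbitrary)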

summary

Although the examples above are Linear Regression with supervised learning and KMeans with unsupervised learning, you can actually replace them with other models, such as Logistic Regression with supervised learning and DBSCAN with unsupervised learning. They are both “estimators” and therefore have fit() methods.

The generic pseudocode for using them is as follows:

# Supervised learning
from sklearn.xxx import SomeModel
# XXX can be linear_model or ensemble, etc
model = SomeModel( hyperparameter )
model.fit( X, y )

# Unsupervised learning
from sklearn.xxx import SomeModel
# XXX can be cluster, decomposition, etc
model = SomeModel( hyperparameter )
model.fit( X )

3.2 Predictor

The predictor is an extension of the estimator and has the function of predicting data.

The most common predictor is the predict() function:

  • model.predict(X_test): Evaluate the model’s performance on new data.
  • model.predict(X_train): Verify the model’s performance on old data.

To evaluate on new data, we first split the data 80:20 into a training set (X_train, y_train) and a test set (X_test, y_test), fit() the model on the training set, and then predict() on the test set.

from sklearn.datasets import load_iris    
iris = load_iris()
from sklearn.model_selection import train_test_split    
X_train, X_test, y_train, y_test = train_test_split( iris['data'],     
                    iris['target'],     
                    test_size=0.2 )    
print( 'The size of X_train is ', X_train.shape )    
print( 'The size of y_train is ', y_train.shape )    
print( 'The size of X_test is ', X_test.shape )    
print( 'The size of y_test is ', y_test.shape )
The size of X_train is  (120, 4)
The size of y_train is  (120,)
The size of X_test is  (30, 4)
The size of y_test is  (30,)

predict & predict_proba

For classification problems, we often want not only the predicted class but also the predicted probability of each class. predict() gives the former and predict_proba() the latter.

# Here `model` is assumed to be a classifier fitted on (X_train, y_train),
# e.g. LogisticRegression; LinearRegression and KMeans have no predict_proba().
y_pred = model.predict( X_test )
p_pred = model.predict_proba( X_test )
print( y_test, '\n' )
print( y_pred, '\n' )
print( p_pred )

score & decision_function

There are two additional functions you can use in the predictor. In the classification problem:

  • score() returns the classification accuracy.
  • decision_function() returns the score of each sample under each class.
print( model.score( X_test, y_test ) )
print( np.sum(y_pred==y_test)/len(y_test) )
decision_score = model.decision_function( X_test )
print( decision_score )

summary

Not every estimator has predict_proba() and decision_function(); check the official documentation for the specific model (for example, RandomForestClassifier has no decision_function() method).

The generic pseudocode for using them is as follows:

# Supervised learning
from sklearn.xxx import SomeModel
# XXX can be linear_model or ensemble, etc
model = SomeModel( hyperparameter )
model.fit( X, y )
y_pred = model.predict( X_new )
s = model.score( X_new )

# Unsupervised learning
from sklearn.xxx import SomeModel
# XXX can be cluster, decomposition, etc
model = SomeModel( hyperparameter )
model.fit( X )
idx_pred = model.predict( X_new )
s = model.score( X_new )

3.3 Transformer

A transformer is a kind of estimator and also has the fitting capability. Whereas a predictor predicts after fitting, a transformer transforms after fitting. The core patterns are:

  • Predictor: fit + predict
  • Transformer: fit + transform

This section introduces two types of transformers:

  • Encoding categorical variables as numerical variables
  • Normalizing or standardizing numerical variables

(1) Categorical variable encoding

1) LabelEncoder & OrdinalEncoder

LabelEncoder and OrdinalEncoder can both convert characters to numbers, but:

  • LabelEncoder's input is one-dimensional, such as a 1D ndarray
  • OrdinalEncoder's input is two-dimensional, such as a DataFrame

First define the list enc to encode and the list dec to decode.

enc = ['red', 'blue', 'yellow', 'red']    
dec = ['blue', 'blue', 'red']

# Import LabelEncoder from sklearn.preprocessing and create a transformer named LE without setting any hyperparameters.
from sklearn.preprocessing import LabelEncoder    
LE = LabelEncoder()    
print(LE.fit(enc))    
print( LE.classes_ )    
print( LE.transform(dec) )

LabelEncoder()
['blue' 'red' 'yellow']
[0 0 1]

In addition to LabelEncoder, OrdinalEncoder can also complete coding. The following code looks like this:

from sklearn.preprocessing import OrdinalEncoder    
OE = OrdinalEncoder()    
enc_DF = pd.DataFrame(enc)    
dec_DF = pd.DataFrame(dec)    
print( OE.fit(enc_DF) )    
print( OE.categories_ )    
print( OE.transform(dec_DF) )
OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
[array(['blue', 'red', 'yellow'], dtype=object)]
[[0.]
 [0.]
 [1.]]

The problem with this kind of encoding is that it introduces an artificial ordering between categories. For example, the three colors are essentially equal; there is no size relationship between them.

One-hot encoding, another way of encoding categorical data, solves this problem. Read on.

2) OneHotEncoder

One-hot encoding simply represents an integer category as a binary vector; the figure shows the one-hot encoding of the colors. The OneHotEncoder transformer accepts two types of input:

  • ① a one-dimensional array already encoded with LabelEncoder
  • ② a DataFrame

First, take the one-dimensional array of integers produced by LabelEncoder, reshape it (with reshape(-1, 1)) into a two-dimensional array, and use it as the OneHotEncoder input.

from sklearn.preprocessing import OneHotEncoder    
OHE = OneHotEncoder()    
num = LE.fit_transform( enc )    
print( num )    
OHE_y = OHE.fit_transform( num.reshape(-1, 1) )    
OHE_y

[1 0 2 1]

The output is:

<4x3 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in Compressed Sparse Row format>

The above results are explained as follows:

  • print(num) outputs the integer encoding result [1 0 2 1].
  • The output of the last line is a "sparse matrix". Since in practice there are usually many categories, one-hot results are returned as sparse matrices to save memory.

To see what’s in this matrix, use the toarray() function.

OHE_y.toarray()

The output is:

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

Second, the OneHotEncoder input can also be a DataFrame directly.

OHE = OneHotEncoder()    
OHE.fit_transform( enc_DF ).toarray()

The output is:

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

(2) Feature scaling

One of the most important transformations of data is feature scaling. Models such as logistic regression and neural networks are sensitive to differences in feature scale.

Specifically, there are two transformation methods for a certain feature:

  • Standardization: subtract the feature's mean from each value, then divide by the feature's standard deviation.
  • Normalization: subtract the feature's minimum from each value, then divide by the difference between the feature's maximum and minimum.
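As a quick check of these two formulas, here is a minimal sketch (not from the original article) that computes them by hand with NumPy and compares the results with SKLearn's scalers:

# Minimal sketch (assumed example): the two scaling formulas written out by hand,
# compared against SKLearn's StandardScaler and MinMaxScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [10.0]])

standardized = (x - x.mean()) / x.std()            # (value - mean) / std
normalized = (x - x.min()) / (x.max() - x.min())   # (value - min) / (max - min)

print(np.allclose(standardized, StandardScaler().fit_transform(x)))  # True
print(np.allclose(normalized, MinMaxScaler().fit_transform(x)))      # True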

1) MinMaxScaler

As shown in the figure above, MinMaxScaler scales data according to the maximum and minimum values of features.

from sklearn.preprocessing import MinMaxScaler    
X = np.array( [0, 0.5, 1, 1.5, 2, 100] )    
X_scale = MinMaxScaler().fit_transform( X.reshape(-1, 1) )    
X_scale

The output is:

array([[0.   ],
       [0.005],
       [0.01 ],
       [0.015],
       [0.02 ],
       [1.   ]])

2) StandardScaler

StandardScaler rescales each feature to have zero mean and unit variance, bringing its distribution closer to a standard normal distribution.

from sklearn.preprocessing import StandardScaler    
X_scale = StandardScaler().fit_transform( X.reshape(-1, 1) )    
X_scale

The output is:

array([[-0.47424487],
       [-0.46069502],
       [-0.44714517],
       [-0.43359531],
       [-0.42004546],
       [ 2.23572584]])

Note: fit() should only be applied to the training set. To transform the test set, reuse the transformer already fitted on the training set; do not fit (or fit_transform) on the test set separately. Otherwise the transformation rules of the training and test sets would be inconsistent, and what the model has learned would be invalidated.
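A minimal sketch of this rule (assumed example, not from the original article):

# Minimal sketch (assumed example): fit the scaler on the training data only,
# then reuse the same fitted scaler to transform the test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

scaler = StandardScaler().fit(X_train)       # learn mean/std from the training set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuse the training-set statistics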

4. Advanced API

In this section we introduce SKLearn's "advanced APIs", namely five kinds of meta-estimators: Ensemble, Multiclass and Multilabel, Multioutput, Model Selection, and Pipeline.

  • ensemble.BaggingClassifier
  • ensemble.VotingClassifier
  • multiclass.OneVsOneClassifier
  • multiclass.OneVsRestClassifier
  • multioutput.MultiOutputClassifier
  • model_selection.GridSearchCV
  • model_selection.RandomizedSearchCV
  • pipeline.Pipeline

4.1 Ensemble estimator

As shown in the figure above, an ensemble classifier collects the predicted class from each sub-classifier and uses "majority voting" to get the final prediction; an ensemble regressor averages the predictions of its sub-regressors.

The most commonly used Ensemble estimators are arranged as follows:

  • AdaBoostClassifier: AdaBoost (boosting) classifier
  • AdaBoostRegressor: AdaBoost (boosting) regressor
  • BaggingClassifier: Bagging classifier
  • BaggingRegressor: Bagging regressor
  • GradientBoostingClassifier: Gradient boosting classifier
  • GradientBoostingRegressor: Gradient boosting regressor
  • RandomForestClassifier: Random forest classifier
  • RandomForestRegressor: Random forest regressor
  • VotingClassifier: Voting classifier
  • VotingRegressor: Voting regressor

We use the iris data and take the following estimators as examples:

  • RandomForestClassifier, which contains homogeneous estimators
  • VotingClassifier, which contains heterogeneous estimators

First, the data was divided into 80:20 training sets and test sets, and metrics were introduced to calculate various performance indicators.

from sklearn.datasets import load_iris    
iris = load_iris()
from sklearn.model_selection import train_test_split    
from sklearn import metrics    
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], test_size=0.2)

(1) RandomForestClassifier

RandomForestClassifier controls the number of base estimators via the n_estimators hyperparameter; here there are four decision trees (the forest is composed of trees). In addition, the maximum depth of each tree is 5 (max_depth=5).

from sklearn.ensemble import RandomForestClassifier    
RF = RandomForestClassifier( n_estimators=4, max_depth=5 )    
RF.fit( X_train, y_train )

The output is:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=4,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Like base estimators, meta-estimators also have fit(). Now let's look at how many estimators the random forest contains and what they are.

print( RF.n_estimators )    
RF.estimators_

The output is:

4
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                        max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=705712365, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                        max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=1026568399, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                        max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=1987322366, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                        max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort=False,
                        random_state=1210538094, splitter='best')]

accuracy_score from metrics is used to compute the accuracy on the training and test sets.

print ( "RF - Accuracy (Train): %.4g" %     
        metrics.accuracy_score(y_train, RF.predict(X_train)) )    
print ( "RF - Accuracy (Test): %.4g" %     
        metrics.accuracy_score(y_test, RF.predict(X_test)) )
RF - Accuracy (Train): 1
RF - Accuracy (Test): 0.9667

(2) VotingClassifier

Unlike random forest, which consists of the homogeneous classifier "decision tree", a voting classifier consists of several heterogeneous classifiers. Below we build an ensemble model from logistic regression, random forest, and Gaussian naive Bayes (GNB) classifiers using VotingClassifier.

RandomForestClassifier's base classifiers can only be decision trees, so only the number of trees needs to be specified via the n_estimators hyperparameter, whereas VotingClassifier requires each heterogeneous base classifier to be passed in explicitly.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

LR = LogisticRegression( solver='lbfgs', multi_class='multinomial' )
RF = RandomForestClassifier( n_estimators=5 )
GNB = GaussianNB()

Ensemble = VotingClassifier( estimators=[('lr', LR), ('rf', RF), ('gnb', GNB)], voting='hard' )
Ensemble.fit( X_train, y_train )

The results are as follows:

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None,
                 dual=False, fit_intercept=True, intercept_scaling=1,
                 max_iter=100, multi_class='multinomial', n_jobs=None,
                 penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
                 verbose=0, warm_start=False)),
                 ('rf', ...e, verbose=0, warm_start=False),
                 ('gnb', GaussianNB(priors=None, var_smoothing=1e-09))],
                 flatten_transform=None, n_jobs=None, voting='hard', weights=None)

Look at how many estimators the Ensemble meta-estimator contains and what they are.

print( len(Ensemble.estimators_) )        
Ensemble.estimators_

The results are as follows:

3
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, max_iter=100, multi_class='multinomial',
                    n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
                    tol=0.0001, verbose=0, warm_start=False),
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                    max_depth=None, max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=None,
                    oob_score=False, random_state=None, verbose=0,
                    warm_start=False),
 GaussianNB(priors=None, var_smoothing=1e-09)]

Comparing the performance of the meta-estimator and its three components, the following performance is shown:

# fitting
LR.fit( X_train, y_train )        
RF.fit( X_train, y_train )        
GNB.fit( X_train, y_train )
# Evaluate effectiveness
print ( "LR - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, LR.predict(X_train)) )
print ( "RF - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, RF.predict(X_train)) )
print ( "GNB - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, GNB.predict(X_train)) )
print ( "Ensemble - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, Ensemble.predict(X_train)) )
print ( "LR - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, LR.predict(X_test)) )
print ( "RF - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, RF.predict(X_test)) )
print ( "GNB - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, GNB.predict(X_test)) )
print ( "Ensemble - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, Ensemble.predict(X_test)) )

LR - Accuracy (Train): 0.975
RF - Accuracy (Train): 0.9833
GNB - Accuracy (Train): 0.95
Ensemble - Accuracy (Train): 0.9833
LR - Accuracy (Test): 1
RF - Accuracy (Test): 1
GNB - Accuracy (Test): 1
Ensemble - Accuracy (Test): 1

4.2 Multiclass estimator

sklearn.multiclass handles multi-class and multi-label classification. Next we'll use the handwritten digits dataset as sample data, first splitting it 80:20 into training and test sets.

# import data
from sklearn.datasets import load_digits                 
digits = load_digits()        
digits.keys()

The output is as follows:

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

Here we slice the data set:

# Data set segmentation
X_train, X_test, y_train, y_test = train_test_split( digits['data'], digits['target'], test_size=0.2 )
                
print( 'The size of X_train is ', X_train.shape )        
print( 'The size of y_train is ', y_train.shape )        
print( 'The size of X_test is ', X_test.shape )        
print( 'The size of y_test is ', y_test.shape )

The output is as follows

The size of X_train is (1437, 64)
The size of y_train is (1437,)
The size of X_test is (360, 64)
The size of y_test is (360,)

There are 1437 images in the training set and 360 images in the test set. Each photo contains 8×8 pixels, and we flatten the 2-dimensional 8×8 to 1-dimensional 64 using the flatten operation.

Take a look at the first 100 images in the training set and their corresponding labels (below). The resolution is very low, but the digits are basically recognizable.

fig, axes = plt.subplots( 10, 10, figsize=(8, 8) )
fig.subplots_adjust( hspace=0.1, wspace=0.1 )
for i, ax in enumerate( axes.flat ):
    ax.imshow( X_train[i,:].reshape(8, 8), cmap='binary', interpolation='nearest' )
    ax.text( 0.05, 0.05, str(y_train[i]),
             transform=ax.transAxes, color='blue' )
    ax.set_xticks([])
    ax.set_yticks([])

(1) Multi-category classification

Handwritten digits have ten classes, 0 to 9. What if you only have binary classifiers (such as support vector machines) at hand? We can adopt the following strategies:

  • One vs One (OvO): one classifier handles the digits 0 and 1, one handles 0 and 2, one handles 1 and 2, and so on. N classes require N(N-1)/2 classifiers.
  • One vs All (OvA, also called One vs Rest): train 10 binary classifiers, each for one digit; the first separates "1" from "not 1", the second separates "2" from "not 2", and so on. N classes require N classifiers.

1) OneVsOneClassifier

Consider a concrete three-class problem, e.g. weather that can be sunny, cloudy, or rainy. With OvO, the three binary classifiers are f1, f2, and f3:

  • F1 is responsible for distinguishing between orange and green samples
  • F2 is responsible for distinguishing between orange and purple samples
  • F3 is responsible for distinguishing between green and purple samples

In the example below, f1 and f2 both predict orange while f3 predicts purple; by majority vote the combined prediction is orange, as shown in the figure below.

Back to the numeric classification problem, the code and result are as follows:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
ovo_lr = OneVsOneClassifier( LogisticRegression(solver='lbfgs', max_iter=200) )
ovo_lr.fit( X_train, y_train )
OneVsOneClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                   dual=False, fit_intercept=True, intercept_scaling=1,
                   max_iter=200, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False),
                   n_jobs=None)

For 10 classes there are 10×9/2 = 45 OvO classifiers in total.

print( len(ovo_lr.estimators_) )        
ovo_lr.estimators_

The results are as follows:

45
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, max_iter=200, multi_class='warn',
                    n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
                    tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, max_iter=200, multi_class='warn',
                    n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
                    tol=0.0001, verbose=0, warm_start=False),
 ...]

The training set is classified completely correctly, and the test accuracy is about 98%.

print(" OvO LR-accuracy (Train): %4.g" % metrics.accuracy_score(y_train, ovo_Ir.predict(X_train)) )
print ( "OvO LR - Accuracy (Test): %4.g" % metrics.accuracy_score(y_test, ovo_lr.predict(X_test}) )
Copy the code
OvO LR-accuracy (Train): 1 OvO LR-accuracy (Test): 0.9806Copy the code

2) OneVsRestClassifier

In OvA, the data is divided into "one class" and "all the others":

  • Figure 1. One = orange, others = green and purple
  • Figure 2. One = green, others = orange and purple
  • Figure 3. One = purple, others = orange and green

The three-class problem is decomposed into three binary problems, corresponding to classifiers f1, f2, and f3.

  • f1 predicts the negative class, i.e. green or purple
  • f2 predicts the negative class, i.e. orange or purple
  • f3 predicts the positive class, i.e. purple

All three classifiers point to purple, so by majority rule the combined prediction is purple, i.e. cloudy.

Back to the numeric sorting problem, the code and result are as follows:

from sklearn.multiclass import OneVsRestClassifier
ova_lr = OneVsRestClassifier( LogisticRegression(solver='lbfgs', max_iter=800) )
ova_lr.fit( X_train, y_train )
OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                    dual=False, fit_intercept=True, intercept_scaling=1,
                    max_iter=800, multi_class='warn', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
                    n_jobs=None)

There are 10 OvA classifiers for 10 classes.

print( len(ova_lr.estimators_) )        
ova_lr.estimators_

The results are as follows:

10
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, max_iter=800, multi_class='warn',
                    n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
                    tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, max_iter=800, multi_class='warn',
                    n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
                    tol=0.0001, verbose=0, warm_start=False),
 ...]

The accuracy of the training set is almost 100%, and the accuracy of the test set is 96%. The code and result are as follows:

print(" OvA LR-accuracy (Train): %4.g" % metrics.accuracy_score(y_train, ova_Ir.predict(X_train)) )
print ( "OvA LR - Accuracy (Test): %4.g" % metrics.accuracy_score(y_test, ova_lr.predict(X_test}) )
Copy the code
OvA LR-accuracy (Train): 6.9993 OvA LR-accuracy (Test}: 6.9639Copy the code

(2) Multi-label classification

So far, every sample has always been assigned to exactly one class. In some cases you might want the classifier to output multiple labels for a sample. In the self-driving example in the picture below, cars and signs are identified, but there are no traffic lights or people.

Object recognition is a complex deep learning problem that we won’t go into here. Let’s start with a simpler example. In the case of handwritten numbers, we specifically designed two labels for each number:

  • Label 1: even or odd
  • Label 2: less than or equal to 4, or greater than 4

We build the multi-label array y_train_multilabel with the following code (OneVsRestClassifier can also be used for multi-label classification):

from sklearn.multiclass import OneVsRestClassifier                 
y_train_multilabel = np.c_[ y_train%2==0, y_train<=4 ]        
print(y_train_multilabel)

[[ True  True]
 [False False]
 [False False]
 ...
 [False False]
 [False False]
 [False False]]

The first and second pictures in the training set are numbers 4 and 5, corresponding to the above two labels. The results are as follows:

  • [True True] : 4 is an even number, less than or equal to 4
  • [False False] : 5 is not even, greater than 4

We will use y_train_multilabel to train the model this time. The following code

ova_ml = OneVsRestClassifier( LogisticRegression(solver='lbfgs', max_iter=800) )
ova_ml.fit( X_train, y_train_multilabel )
OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                    dual=False, fit_intercept=True, intercept_scaling=1,
                    max_iter=800, multi_class='warn', n_jobs=None, penalty='l2',
                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
                    n_jobs=None)

There are two estimators, each corresponding to a label.

print( len(ova_ml.estimators_) )        
ova_ml.estimators_

The running results are as follows:

2
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, max_iter=800, multi_class='warn',
                    n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
                    tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, max_iter=800, multi_class='warn',
                    n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
                    tol=0.0001, verbose=0, warm_start=False)]

Let's look at the first 100 images in the test set.

fig, axes = plt.subplots( 10, 10, figsize=(8, 8) )
fig.subplots_adjust( hspace=0.1, wspace=0.1 )

for i, ax in enumerate( axes.flat ):
    ax.imshow( X_test[i,:].reshape(8, 8), cmap='binary', interpolation='nearest')
    ax.text( 0.05, 0.05, str(y_test[i]), transform=ax.transAxes, color='blue')
    ax.set_xticks([])
    ax.set_yticks([])

The first image is the number 2, which is even (label 1 is true) and less than or equal to 4(label 2 is true).

print( y_test[:1] )
print( ova_ml.predict(X_test[:1, :]) )
[2]
[[1 1]]

4.3 Multioutput estimator

sklearn.multioutput can handle multi-output classification.

Multi-output classification is a generalization of multi-label classification, where each label can be multi-category (more than two categories). One example is to predict the pixel value of each pixel (label) of an image (256 categories ranging from 0 to 255).

There are two Multioutput estimators:

  • MultiOutputRegressor: multiple output regression
  • MultiOutputClassifier: Multi-output classification

Here we will focus only on multi-output classification.

(1) MultiOutputClassifier

First introduce the MultiOutputClassifier and RandomForestClassifier.

from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

For the handwritten digits, we again design multiple labels for each digit, and this time each label has more than two categories:

  • Label 1: less than or equal to 4, between 4 and 7, or greater than or equal to 7 (3 classes)
  • Label 2: the digit itself (10 classes)

The code is as follows:

y_train_1st = y_train.copy()
y_train_1st[ y_train<=4 ] = 0
y_train_1st[ np.logical_and(y_train>4, y_train<7) ] = 1
y_train_1st[ y_train>=7 ] = 2

y_train_multioutput = np.c_[ y_train_1st, y_train ]
y_train_multioutput

# Run results
array([[0, 4],
       [1, 5],
       [2, 7],
       ...,
       [1, 5],
       [2, 9],
       [2, 9]])

A random forest with 100 decision trees is used to solve this multi-output classification problem.

MO = MultiOutputClassifier( RandomForestClassifier(n_estimators=100) )
MO.fit( X_train, y_train_multioutput )
MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                      class_weight=None, criterion='gini', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False),
                      n_jobs=None)

Take a look at the model’s predictions on the first five photos of the test set.

MO.predict( X_test[:5, :] )
array([[0, 2],[0, 2],[0, 0],[2, 9],[1, 5]])

The first column of this ndarray is the label-1 category and the second column is the label-2 category. The prediction says the five images show the digits 2, 2, 0, 9, and 5 (label 2); the first three digits 2, 2, 0 are less than or equal to 4 (label 1 = 0), the fourth digit 9 is greater than or equal to 7 (label 1 = 2), and the fifth digit 5 is between 4 and 7 (label 1 = 1).

Look at real tags.

y_test_1st = y_test.copy()        
y_test_1st[ y_test<=4 ] = 0        
y_test_1st[ np.logical_and(y_test>4, y_test<7)] =1        
y_test_1st[ y_test>=7 ] = 2                 
y_test_multioutput = np.c_[ y_test_1st, y_test ]                 
y_test_multioutput[:5]
array([[0, 2],[0, 2],[0, 0],[2, 9],[1, 5]])

Compared with the true labels, the model's predictions match very well.

4.4 Model Selection estimator

Model Selection is very important in machine learning. It is mainly used to evaluate Model performance. Common Model Selection estimators are as follows:

  • cross_validate: Evaluate the results of cross validation.
  • learning_curve: Build and plot the learning curve.
  • GridSearchCV: Search the best hyperparameter from the hyperparameter candidate grid using cross validation.
  • RandomizedSearchCV: Uses cross validation to search for the best hyperparameter from a random set of hyperparameters.

Here we focus only on the two hyperparameter-tuning estimators, GridSearchCV and RandomizedSearchCV. First, let's review cross validation (for a more detailed explanation, see the ShowMeAI article Illustrated Machine Learning | Model Evaluation Methods and Metrics).

(1) Cross validation

K-fold cross validation divides the entire dataset randomly into K equal parts (folds), each containing approximately m/K samples (m is the total number of samples).

Each time, K-1 of the folds are used as the training set to fit the parameters and the remaining fold is used as the validation set for evaluation. Because all K folds are traversed in turn, the procedure is called cross validation, as shown below.
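As a minimal sketch (assumed example, not from the original article), cross_validate from the list above runs exactly this procedure:

# Minimal sketch (assumed example): 5-fold cross validation of a random forest
# on the digits data with cross_validate.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_digits(return_X_y=True)
cv_results = cross_validate(RandomForestClassifier(n_estimators=20), X, y, cv=5)
print(cv_results['test_score'])         # one validation score per fold
print(cv_results['test_score'].mean())  # average cross-validation score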

The figure below illustrates the two hyperparameter-tuning estimators: "grid search" and "random search".

Grid search: parameter 1 takes values in [1, 10, 100, 1000] and parameter 2 in [0.01, 0.1, 1, 10]; note the values are not evenly spaced. The model is evaluated on all 16 hyperparameter combinations, and the combination with the smallest cross-validation error is selected.

Random search: hyperparameter combinations are sampled from specified distributions, so the number of evaluations can be chosen independently of the number of parameters; for example, log(parameter 1) uniform over [0, 3] and log(parameter 2) uniform over [-2, 1].

The application mode and reference code are as follows:

from time import time
from scipy.stats import randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = digits.data, digits.target
RFC = RandomForestClassifier(n_estimators=20)

# Randomized Search
param_dist = {  "max_depth": [3, 5],
                "max_features": randint(1, 11),
                "min_samples_split": randint(2, 11),
                "criterion": ["gini", "entropy"]}
n_iter_search = 20
random_search = RandomizedSearchCV( RFC, param_distributions=param_dist, n_iter=n_iter_search, cv=5 )
start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate parameter settings." % ((time() - start), n_iter_search))
print( random_search.best_params_ )
print( random_search.best_score_ )

# Grid Search
param_grid = {  "max_depth": [3, 5],
                "max_features": [1, 3, 10],
                "min_samples_split": [2, 3, 10],
                "criterion": ["gini", "entropy"]}
grid_search = GridSearchCV( RFC, param_grid=param_grid, cv=5 )
start = time()
grid_search.fit(X, y)

print("\nGridSearchCV took %.2f seconds for %d candidate parameter settings." % (time() - start, len(grid_search.cv_results_['params'])))
print( grid_search.best_params_ )
print( grid_search.best_score_ )

The following output is displayed:

RandomizedSearchCV took 3.73 seconds for 20 candidate parameter settings.
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 6, 'min_samples_split': 4}
0.8898163606010017

GridSearchCV took 2.30 seconds for 36 candidate parameter settings.
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 10, 'min_samples_split': 10}
0.8414023372287145

Here’s an explanation of the code:

  • The first five lines import the required tool libraries.
  • The next two lines prepare the data X and y and create a random forest model with 20 decision trees.
  • The candidate parameter distributions (for random search) and the parameter grid (for grid search) are built for the random forest hyperparameters: maximum tree depth, maximum number of features, minimum number of samples required to split, and split criterion.
  • The random search is then run.
  • Finally, the grid search is run.

Run results:

  • The first line prints how many candidate parameter settings were evaluated and how long the search took.
  • The second line outputs the best combination of hyperparameters.
  • The third line outputs the highest score.

In this example, the random search found a better-scoring set of hyperparameters than the grid search while evaluating fewer candidate settings.

4.5 Pipeline estimator

Pipeline estimator is also called Pipeline, which consists of various estimators in series (Pipeline) or parallel (FeatureUnion). It can be a real productivity booster if used properly.

(1) Pipeline

Pipeline connects several estimators together in order, such as feature extraction → dimension reduction → fitting → prediction

A Pipeline takes on the capability of its last estimator (a quick sketch of the predictor case follows below):

  • If the last estimator is a predictor, the Pipeline is a predictor.
  • If the last estimator is a transformer, the Pipeline is a transformer.
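As a minimal sketch (assumed example, not from the original article), here is a Pipeline whose last step is a classifier, so the whole Pipeline acts as a predictor:

# Minimal sketch (assumed example): scale -> reduce dimensions -> classify.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf_pipe = Pipeline([
    ('scale', StandardScaler()),                       # transformer
    ('reduce', PCA(n_components=2)),                   # transformer
    ('classify', LogisticRegression(max_iter=200))])   # final predictor

clf_pipe.fit(X, y)
print(clf_pipe.predict(X[:3]))   # the pipeline itself has predict()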

Here is a simple example of using Pipeline to complete the two steps "fill in missing values → normalize". We start by building data X that contains NaN missing values.

X = np.array([[50, 40, 30, 5, 7, 10, 9, np.NaN, 12],
              [1.68, 1.83, 1.77, np.NaN, 1.9, 1.65, 1.88, np.NaN, 1.75]])
X = np.transpose(X)

We build the Pipeline with the following process components:

  • the transformer SimpleImputer to fill in missing values;
  • the normalization transformer MinMaxScaler.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([
                ('impute', SimpleImputer(missing_values=np.nan, strategy='mean')),
                ('normalize', MinMaxScaler())])

The Pipeline is created by passing Pipeline() a list of (name, estimator) tuples. In this example, SimpleImputer is named impute and MinMaxScaler is named normalize.

Because the last estimator is a transformer, the Pipeline is also a transformer. Now let's run it: the missing values are filled and both columns are normalized.

X_proc = pipe.fit_transform( X )

To verify the pipeline's result, we can run the two transformers in sequence; the result is the same.

X_impute = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform( X )
X_impute

# Run results
array([[50.   ,  1.68 ],
       [40.   ,  1.83 ],
       [30.   ,  1.77 ],
       [ 5.   ,  1.78 ],
       [ 7.   ,  1.9  ],
       [10.   ,  1.65 ],
       [ 9.   ,  1.88 ],
       [20.375,  1.78 ],
       [12.   ,  1.75 ]])
X_normalize = MinMaxScaler().fit_transform( X_impute )
X_normalize

The results

array([[1.        , 0.12      ],
       [0.77777778, 0.72      ],
       [0.55555556, 0.48      ],
       [0.        , 0.52      ],
       [0.04444444, 1.        ],
       [0.11111111, 0.        ],
       [0.08888889, 0.92      ],
       [0.34166667, 0.52      ],
       [0.15555556, 0.4       ]])

(2) FeatureUnion

If we want to run several estimators in parallel on the same data and concatenate their outputs, we can use FeatureUnion. In the following example, we first create a DataFrame with the following characteristics:

  • The first two columns, IQ and temper, are categorical variables.
  • The last two columns, income and height, are numerical variables.
  • There are missing values in every column.

d = { 'IQ' : ['high','avg','avg','low','high','avg','high','high',None],
      'temper' : ['good',None,'good','bad','bad','bad','bad',None,'bad'],
      'income' : [50,40,30,5,7,10,9,np.NaN,12],
      'height' : [1.68,1.83,1.77,np.NaN,1.9,1.65,1.88,np.NaN,1.75]}

X = pd.DataFrame( d )
X

The results are as follows:

We now follow these steps to clean the data.

  • For categorical variables: select the columns → fill missing values with the most frequent value → one-hot encode
  • For numerical variables: select the columns → fill missing values with the mean → normalize

The above two steps are done in parallel.

Let’s start by defining a class called DataFrameSelector that gets columns from a DataFrame.

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector( BaseEstimator, TransformerMixin ):
        def __init__( self, attribute_names ):
                self.attribute_names = attribute_names
        def fit( self, X, y=None ):
                return self
        def transform( self, X ):
                return X[self.attribute_names].values

In the transform function, we take the input DataFrame X and return the values of the specified columns.

Next, create the pipeline full_pipe, which runs two pipelines in parallel:

  • categorical_pipe handles the categorical variables

    • DataFrameSelector selects the categorical columns
    • SimpleImputer fills None with the most frequent value
    • OneHotEncoder encodes them and returns a non-sparse matrix
  • numeric_pipe handles the numerical variables

    • DataFrameSelector selects the numerical columns
    • SimpleImputer fills NaN with the mean
    • MinMaxScaler normalizes the values

The code is as follows:

from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

categorical_features = ['IQ', 'temper']
numeric_features = ['income', 'height']

categorical_pipe = Pipeline([
        ('select', DataFrameSelector(categorical_features)),
        ('impute', SimpleImputer(missing_values=None, strategy='most_frequent')),
        ('one_hot_encode', OneHotEncoder(sparse=False))])

numeric_pipe = Pipeline([
        ('select', DataFrameSelector(numeric_features)),
        ('impute', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('normalize', MinMaxScaler())])

full_pipe = FeatureUnion( transformer_list=[
        ('numeric_pipe', numeric_pipe),
        ('categorical_pipe', categorical_pipe)])

We print the result as follows:

X_proc = full_pipe.fit_transform( X )        
print( X_proc )
[[1.         0.12       0.         1.         0.         0.         1.        ]
 [0.77777778 0.72       1.         0.         0.         1.         0.        ]
 [0.55555556 0.48       1.         0.         0.         0.         1.        ]
 [0.         0.52       0.         0.         1.         1.         0.        ]
 [0.04444444 1.         0.         1.         0.         1.         0.        ]
 [0.11111111 0.         1.         0.         0.         1.         0.        ]
 [0.08888889 0.92       0.         1.         0.         1.         0.        ]
 [0.34166667 0.52       0.         1.         0.         1.         0.        ]
 [0.15555556 0.4        0.         1.         0.         1.         0.        ]]

5. Summary

Now let’s summarize the sklearn library application knowledge explained above.

5.1 Five SKLearn Principles

SKLearn’s main API is designed to follow five principles

(1) Consistency

The interfaces of all objects are consistent and simple. In the estimator:

  • Create: model = Constructor(hyperparam)
  • Fit:
    • Supervised learning: model.fit(X_train, y_train)
    • Unsupervised learning: model.fit(X_train)

In the predictor

  • Predict labels in supervised learning: y_pred = model.predict(X_test)
  • Predict patterns (e.g. cluster indices) in unsupervised learning: idx_pred = model.predict(X_test)

In the transformer

  • Create: trm = Constructor(hyperparam)
  • Fit: trm.fit(X_train)
  • Transform: X_trm = trm.transform(X_train)

(2) Inspectability

All specified hyperparameters and learned parameters of an estimator can be accessed directly as instance attributes to inspect their values. The difference is that hyperparameter names do not end with an underscore _, while learned parameter names do (a quick sketch follows the list below). Examples:

  • General form: model.hyperparameter
  • Example: SVC.kernel
  • General form: model.parameter_
  • Example: SVC.support_vectors_
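A minimal sketch (assumed example, not from the original article) of this naming convention, using the SVC examples above:

# Minimal sketch (assumed example): hyperparameters (no trailing underscore) are
# set by us; learned parameters (trailing underscore) appear only after fit().
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svc = SVC(kernel='rbf')
print(svc.kernel)                   # hyperparameter, available before fitting
svc.fit(X, y)
print(svc.support_vectors_.shape)   # learned parameter, available after fitting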

(3) Standard classes

SKLearn models accept datasets only in the form of a "NumPy array" or a "SciPy sparse matrix", and hyperparameter values can only be "strings" or "numbers"; other classes are not accepted.

(4) Composability

Modules can be reused and combined, either "chained together" or run "side by side", as in the two forms of a pipeline (see the sketch below):

  • a sequence of transformers
  • a sequence of transformers + a final estimator
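A minimal sketch (assumed example, not from the original article) of the two forms, using make_pipeline:

# Minimal sketch (assumed example): the two composition forms.
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Form 1: transformers only -> the pipeline is a transformer
transform_only = make_pipeline(SimpleImputer(strategy='mean'), MinMaxScaler())

# Form 2: transformers + a final estimator -> the pipeline is a predictor
transform_and_predict = make_pipeline(MinMaxScaler(), LogisticRegression())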

(5) Sensible defaults

SKLearn provides reasonable defaults for most hyperparameters, greatly reducing the difficulty of modeling.

5.2 SKLearn Framework Process

The skLearn modeling application process framework is as follows:

(1) Determine the task

Is it “supervised” classification or regression? Or “unsupervised” clustering or dimensionality reduction? Once you’ve done that, you basically know which models to use in Sklearn.

(2) Data preprocessing

This step is the most tedious: you need to handle missing values and outliers, encode categorical variables, normalize or standardize numerical variables, and so on. But with the Pipeline tool, all of this becomes simple and efficient.

(3) Training and evaluation

This step is the easiest: fit() is used for training, and predict() / score() are used for evaluation.

(4) Select the model

Run GridSearchCV and RandomizedSearchCV from the model_selection module and select the hyperparameter combination (i.e. the model) with the highest score. A compact end-to-end sketch follows below.
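Putting the four steps together, here is a minimal end-to-end sketch (assumed example, not from the original article):

# Minimal sketch (assumed example): task = classification, preprocessing in a
# Pipeline, training/evaluation, and model selection with GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=200))])
param_grid = {'clf__C': [0.1, 1, 10]}   # step-name__parameter syntax

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))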

References

  • Illustrated Machine Learning Algorithms | From Beginner to Master series
  • SKLearn website: scikit-learn.org/stable/
  • AI Quick Modeling Tools | Scikit-learn Usage Guide

ShowMeAI recommended tutorial series

  • Illustrated Python programming: From beginner to Master series of tutorials
  • Illustrated Data Analysis: From beginner to master series of tutorials
  • The mathematical Basics of AI: From beginner to Master series of tutorials
  • Illustrated Big Data Technology: From beginner to master
  • Illustrated Machine learning algorithms: Beginner to Master series of tutorials
  • Machine learning: Teach you how to play machine learning series

Related articles recommended

  • Application practice of Python machine learning algorithm
  • SKLearn introduction and simple application cases
  • SKLearn most complete application guide
  • XGBoost modeling applications in detail
  • LightGBM modeling applications in detail
  • Python Machine Learning Integrated Project – E-commerce sales estimates
  • Python Machine Learning Integrated Project — E-commerce Sales Estimation
  • Machine learning feature engineering most complete interpretation
  • Application of Featuretools
  • AutoML Automatic machine learning modeling