
Why Pipeline?

In day-to-day machine learning project development, a workflow typically involves data scaling, feature combination, and model fitting. And as the problem grows more complex, the algorithms and models applied become more complicated.

Combining these data processing steps into a single algorithm chain lets the whole machine learning workflow be completed more efficiently, without neglecting any details.

What is a Pipeline?

In sklearn, a pipeline is a processing chain consisting of a series of data transformation steps, optionally followed by a model to be fitted (if present, the model must be the last step of the pipeline).

What are the benefits of Pipeline?

Sklearn Pipeline has the following advantages:

  1. Convenience and encapsulation: calling fit and predict once trains and predicts with every algorithm model in the pipeline.
  2. Joint parameter selection: grid search can tune the parameters of all the estimators in the pipeline at once.
  3. Security: the transformers and the predictor are trained on the same samples, and the pipeline helps prevent statistics from the test data from leaking into the cross-validated training model (see the sketch after this list).
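
A minimal sketch of the security benefit, using the iris data set for illustration: when the pipeline is passed to cross_val_score, the scaler inside it is re-fitted on each training fold only, so statistics from the held-out fold never leak into training.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC())])

# Scaling is fitted inside each fold, never on the held-out data
print(cross_val_score(pipe, X, y, cv=5))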

The principle of Pipeline

Pipeline can concatenate many algorithm models to form a typical machine learning problem workflow.

The Pipeline processing mechanism is like putting all the models into one pipe and processing the data through them in turn to obtain the final result.

For example, model 1 could be a data standardization step, model 2 a feature selection or feature extraction model, and model 3 a classifier or prediction model (three models are not required; use whatever the task actually needs).

In sklearn, all estimators in a Pipeline except the last one must be transformers; the last estimator can be of any type (transformer, classifier, regressor). If the last estimator is a classifier, the entire pipeline can be used as a classifier; if the last estimator is a regressor, the pipeline can be used as a regressor.

Remark:

  • Estimator: in sklearn, every machine learning algorithm model is called an estimator.
  • Transformer: an estimator that transforms data, such as a standardizer. The output of a transformer can be fed into another transformer or estimator as input (a minimal custom transformer is sketched below).
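
To make the remark concrete, the following is a minimal sketch of a custom transformer following the usual sklearn fit/transform convention; ClipOutliers is a hypothetical example class, not something from the library.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips values into [low, high]."""
    def __init__(self, low=-3.0, high=3.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        return self  # nothing to learn, but fit must return self

    def transform(self, X):
        return np.clip(X, self.low, self.high)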

The steps of a complete sample Pipeline in sklearn are as follows:

  1. Data preprocessing, such as missing-value handling
  2. Data standardization
  3. Dimensionality reduction
  4. Feature selection
  5. Classification, prediction, or clustering algorithm (estimator)

In fact, when the pipeline's fit method is called, the first N-1 transformers process the features in turn and pass the result to the final estimator for training. The pipeline then exposes all the methods of that last estimator.
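
A sketch of that full chain; the step names and parameter values below are illustrative choices, not anything sklearn prescribes.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC

full_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),  # 1. missing-value handling
    ('scale', StandardScaler()),                 # 2. standardization
    ('pca', PCA(n_components=3)),                # 3. dimensionality reduction
    ('select', SelectKBest(k=2)),                # 4. feature selection
    ('svc', SVC()),                              # 5. classifier (final estimator)
])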

Using Pipeline in sklearn

sklearn.pipeline.Pipeline(steps, memory=None, verbose=False)

Parameter description:

  • steps: a list of (key, value) tuples, where key is the name you give the step and value is an estimator object.
  • memory: used only when the intermediate transformers of the Pipeline need to be cached; the default is None (a sketch follows).
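
A minimal sketch of the memory parameter, assuming a temporary directory as the cache location (a joblib.Memory object can also be passed):

from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cache_dir = mkdtemp()  # illustrative cache directory
pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC())], memory=cache_dir)

With caching enabled, repeated fits of the same transformers, for example during a grid search, can reuse the cached results instead of recomputing them.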

Methods of the Pipeline class

Pipeline methods execute the corresponding method of each learner. If a learner does not have the method, an error is raised.

Suppose the Pipeline has n learners:

  • transform: executes the transform method of each learner in turn.
  • fit: executes fit and transform on the first N-1 learners in turn, then executes fit on the Nth (last) learner.
  • predict: executes the predict method of the Nth learner.
  • score: executes the score method of the Nth learner.
  • set_params: sets the parameters of the learners; a step's parameter is addressed as <step name>__<parameter name> (see the sketch after this list).
  • get_params: gets the parameters of the learners.
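
A short sketch of the parameter methods, showing the <step name>__<parameter name> addressing:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC())])
pipe.set_params(svc__C=10.0)        # set the C parameter of the 'svc' step
print(pipe.get_params()['svc__C'])  # 10.0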

Pipeline examples

Modular feature transformation

Step description:

  • Step 1: First, use StandardScaler to standardize each column of the data set (transformer).
  • Step 2: Then use PCA principal component analysis for feature dimensionality reduction (transformer).
  • Step 3: Finally, classify with an SVC model (estimator).

The result of training is a model that can be used directly for prediction with pipe.predict(X). During prediction the data passes through every transformation step from the beginning, so no extra preprocessing code is needed. The model's accuracy on the training set X can be obtained with pipe.score(X, Y).

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris=load_iris()
pipe=Pipeline([('sc', StandardScaler()),('pca',PCA()),('svc',SVC())])

# In ('sc', StandardScaler()), 'sc' is the custom step name and StandardScaler() is the transformer that performs the standardization task

pipe.fit(iris.data, iris.target)

# prediction
print(pipe.predict(iris.data))

# Evaluate the score of the model (accuracy)
print(pipe.score(iris.data, iris.target))

Running results:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
0.9733333333333334

In addition, we can use the make_pipeline function, a shorthand for the Pipeline class. It simply takes an instance of each step, with no names required: the lowercase class name is used automatically as the step name.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB

make_pipeline(StandardScaler(), GaussianNB())

Running results:

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gaussiannb', GaussianNB())])

Automatic Grid Search

Pipeline can also combine GridSearch to select parameters.

Step description:

  • Step 1: First, use TfidfVectorizer for feature extraction.
  • Step 2: Then, classify using SVC model.
  • Step 3: Finally, perform a grid search with cross validation using GridSearchCV.

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np

# Download data
news = fetch_20newsgroups(subset='all')

# Shard the data set
X_train,X_test,y_train,y_test = train_test_split(news.data[:3000],news.target[:3000],test_size=0.25,random_state=33)

# Text feature extraction
vec = TfidfVectorizer()
X_count_train = vec.fit_transform(X_train)
X_count_test = vec.transform(X_test)

# Simplify the system building process with a pipeline, connecting text feature extraction and the classifier model in series
clf = Pipeline([('vect',TfidfVectorizer(stop_words='english')), ('svc',SVC())])

# After the pipeline's feature processing and SVC model training, the trained classifier clf is obtained directly


parameters = {
    'svc__gamma': np.logspace(-2, 1, 4),
    'svc__C': np.logspace(-1, 1, 3),
    'vect__analyzer': ['word']
}

# GridSearchCV uses cross validation for grid search; n_jobs=-1 means using all of the computer's CPU cores
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)
gs.fit(X_train,y_train)

# Get the best hyperparameter and the best score for model evaluation by cross validation
print(gs.best_params_, '-', gs.best_score_)

# Evaluate the score of the best model (accuracy)
print (gs.score(X_test, y_test))

Running results:

Fitting 3 folds for each of 12 candidates, totalling 36 fits
{'svc__C': 10.0, 'svc__gamma': 0.1, 'vect__analyzer': 'word'} - 0.7888888888888889
0.8226666666666667

As you can see, each key in the parameters variable carries a prefix, which is the step name defined in the Pipeline. Combining the two keeps the code very concise.
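
One convenient way (a workflow suggestion, not a requirement) to discover the valid prefixed names is to list the pipeline's parameters:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

clf = Pipeline([('vect', TfidfVectorizer()), ('svc', SVC())])

# Every tunable step parameter appears with its step-name prefix
print([name for name in clf.get_params() if '__' in name])
# includes e.g. 'vect__analyzer', 'svc__C', 'svc__gamma', ...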

Grid Search

Grid search is a tuning method: it exhaustively searches a list of parameter values, training a model for each combination to find the optimal parameters. Its main disadvantage is that it is time-consuming; the more parameters and candidate values, the longer it takes. Under normal circumstances, first search a coarse, wide range, then refine around the best result, as sketched below.
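
A sketch of this coarse-to-fine strategy with illustrative ranges, using the iris data set: search a wide logarithmic grid first, then refine around the best value found.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Coarse pass: a wide logarithmic range
coarse = GridSearchCV(SVC(), {'C': np.logspace(-3, 3, 7)}, cv=3).fit(X, y)
best_C = coarse.best_params_['C']

# Fine pass: a narrower grid centred on the coarse optimum
fine = GridSearchCV(SVC(), {'C': np.linspace(best_C / 2, best_C * 2, 5)}, cv=3).fit(X, y)
print(fine.best_params_)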

Characteristics of the combination

FeatureUnion combines multiple transformers into a single new transformer object. During the fitting process, each transformer fits the data independently, in parallel, and the output feature matrices are arranged side by side in a large matrix.

A FeatureUnion object accepts a list of Transformer objects. FeatureUnion can be used in combination with Pipeline.

Note:

FeatureUnion cannot check whether two transformers produce identical features; it simply concatenates the originally separate feature vectors. It is up to the caller to ensure that the transformers produce distinct feature outputs.

Step description:

  • Step 1: Process features separately with StandardScaler and FunctionTransformer (two transformers).
  • Step 2: Then combine the StandardScaler-processed features with the FunctionTransformer-processed features.

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from numpy import log1p
from sklearn.datasets import load_iris

# Load the data set (also used in the earlier example)
iris = load_iris()

# Raw matrix data shape
print(iris.data.shape)


step1=('Standar', StandardScaler())
step2=('ToLog', FunctionTransformer(log1p))

steps=FeatureUnion(transformer_list=[step1,step2])
print(steps)
data=steps.fit_transform(iris.data)

# Shape of matrix data after feature combination
print(data.shape)

# View the first row of data
print(data[0])

Running results:

(150, 4)
FeatureUnion(transformer_list=[('Standar', StandardScaler()),
                               ('ToLog', FunctionTransformer(func=<ufunc 'log1p'>))])
(150, 8)
[-0.90068117  1.01900435 -1.34022653 -1.3154443   1.80828877  1.5040774   0.87546874  0.18232156]
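
Finally, as mentioned above, FeatureUnion can be combined with Pipeline. A minimal sketch, reusing the same two transformers and feeding the combined features into a classifier:

from numpy import log1p
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.svm import SVC

iris = load_iris()
union = FeatureUnion(transformer_list=[('Standar', StandardScaler()),
                                       ('ToLog', FunctionTransformer(log1p))])

# The union doubles the feature count (4 -> 8) before classification
model = Pipeline([('features', union), ('svc', SVC())])
model.fit(iris.data, iris.target)
print(model.score(iris.data, iris.target))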
