1/ What is a pipeline

A typical machine learning build consists of several steps:

1. Source data ETL
2. Data preprocessing
3. Feature selection
4. Model training and validation

These four steps can be abstracted into a multi-step pipelined workflow, running from the initial data collection to the final result we need. It is therefore natural to model these steps together and simplify them into a single pipelined workflow. For Spark users, pipelined machine learning is more efficient and easier to use than modeling each step independently. The pipeline mechanism is used in machine learning because the fitted parameter set can be reused on new data sets (such as a test set).

**Streaming workflows with pipelines**

Note: the pipeline mechanism is more of an innovation in programming technique than in algorithms.
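As a quick illustration of that reuse, here is a minimal sketch (not from the original article; it assumes scikit-learn's bundled breast cancer dataset and the make_pipeline helper) showing that fit() learns every step's parameters on the training set, and the same fitted pipeline is then applied unchanged to the test set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)            # every step is fitted on the training set only
print(pipe.score(X_test, y_test))     # the fitted parameters are reused on the test set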

2/ Example

<1> Load the data set

import pandas as pd
from sklearn.model_selection import train_test_split  # divide into training and test sets
from sklearn.preprocessing import LabelEncoder  # encode categorical labels as integers
 
df = pd.read_csv('xxxx', header=None)
# Breast Cancer Wisconsin dataset
 
X = df.values[:, 2:]
y = df.values[:, 1]
# y is a character label
# Use the LabelEncoder class to convert it to integers starting at 0
encoder = LabelEncoder()
y = encoder.fit_transform(y)
# >>> encoder.transform(['M', 'B'])
# array([1, 0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

<2> Design the flow of the algorithm

The steps that can be put in a Pipeline may include:

- Feature standardization, which can serve as the first step
- Since this is a classification task, a classifier is indispensable and naturally comes last
- Additional steps can be inserted in between, such as dimensionality reduction (PCA)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
 
from sklearn.pipeline import Pipeline
 
pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))
                    ]
                   )
pipe_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))
 
# Test accuracy: 0.947
A Pipeline object accepts a list of two-element tuples, e.g. [(a, b), (aa, bb), (aaa, bbb)]. The first element of each tuple is an arbitrary identifier string that names that step within the Pipeline object; the second element is the scikit-learn transformer or estimator for that step.
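Those identifier strings are how individual steps are addressed later. For example, a short sketch continuing with the pipe_lr object above (named_steps and the "step__parameter" convention are standard scikit-learn API):

# The step names let you reach into the fitted pipeline.
print(pipe_lr.named_steps['pca'].explained_variance_ratio_)  # the fitted PCA step

# The same names address step parameters, e.g. when tuning,
# via the '<step name>__<parameter>' convention.
pipe_lr.set_params(clf__C=0.1)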

<3> Analysis of Pipeline execution process

The intermediate steps of a Pipeline must be scikit-learn compatible transformers, and the last step is an estimator. In the code above, the StandardScaler and PCA transformers make up the intermediate steps, and LogisticRegression acts as the final estimator. When we call pipe_lr.fit(X_train, y_train), StandardScaler first runs its fit and transform methods on the training set, and the transformed data is passed to the next step in the Pipeline object, namely PCA(). Like StandardScaler, PCA runs fit and transform, and finally the transformed data is passed to LogisticRegression. The whole process is shown in the figure below:
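In other words, the single fit call above behaves roughly like the following hand-written sequence (a sketch of the equivalent steps, reusing the imports and the train/test split from the earlier code; it is not the Pipeline class's actual internal implementation):

# Roughly what pipe_lr.fit(X_train, y_train) does, step by step.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)        # fit + transform on the raw training data

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)   # fit + transform on the scaled data

clf = LogisticRegression(random_state=1)
clf.fit(X_train_pca, y_train)                  # the final estimator only calls fit

# pipe_lr.score(X_test, y_test) then only *transforms* the test data
# (no refitting) before scoring it with the estimator:
X_test_pca = pca.transform(sc.transform(X_test))
print(clf.score(X_test_pca, y_test))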