Pipeline in sklearn

Introduction

Pipelines are used to implement streaming workflows.

The Pipeline mechanism (also translated as "pipeline learner") is rooted in a common need of machine learning algorithms: reusing a set of fitted parameters on new data sets (such as test sets). Using a pipeline can significantly reduce the amount of code. Overall, it is a very practical and enjoyable technique.
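
The reuse of fitted parameters on new data can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (make_classification), and the step names "scale" and "clf" are arbitrary choices, not names taken from the original article. The manual version and the pipeline version should give identical predictions, because both apply the scaling parameters learned on the training set to the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (assumption, not the article's data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Manual version: every fitted object has to be carried over to the test set by hand
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline version: one fit call, one predict call
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# The scaler inside the pipeline was fitted on the training data only
same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_predictions)
```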

Note: The pipeline mechanism is more an innovation in programming technique than in algorithms.

The usual steps of a general pipeline learner are: data normalization learner => feature selection learner => learner that performs prediction. All learners except the last one must provide a transform method, which is used to transform the data.
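
A minimal sketch of this three-stage chain is given below, assuming synthetic data and hypothetical step names ("normalize", "select", "predict"). It also checks which steps expose a transform method, and shows that a pipeline with a predictor in a non-final position is rejected with a TypeError (depending on the scikit-learn version, this happens at construction or at fit time, so both are wrapped in the same try block).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=150, n_features=10, n_informative=4,
                           random_state=0)

# normalization => feature selection => prediction
pipe = Pipeline([
    ("normalize", StandardScaler()),
    ("select", SelectKBest(f_classif, k=4)),
    ("predict", LogisticRegression()),
])
pipe.fit(X, y)

# Only the last step is allowed to lack a transform method
has_transform = [hasattr(step, "transform") for _, step in pipe.steps]
print(has_transform)

# A predictor placed before the last step cannot transform the data
try:
    bad = Pipeline([("predict", LogisticRegression()),
                    ("normalize", StandardScaler())])
    bad.fit(X, y)
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```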

Common methods and properties

Sklearn official documentation

Parameters

steps: a list of (name, transform) tuples (implementing fit/transform), chained in the order in which they are listed, the last object being an estimator.
memory: caching parameter, an instance of sklearn.externals.joblib.Memory or a string, optional (default=None). It is used to cache the fitted transformers of the pipeline.
Attribute named_steps: a Bunch object, a dictionary-like read-only attribute with attribute access, used to reach any step by the name the user gave it. The keys are the step names and the values are the steps. Alternatively, a step can be obtained as named_steps.<step name>.
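
The sketch below exercises these parameters and the named_steps attribute. It assumes synthetic data, hypothetical step names ("pca", "clf"), and a temporary directory path passed as the memory string; none of these come from the original article.

```python
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

with tempfile.TemporaryDirectory() as cache_dir:
    # memory is given as a string: the path of the caching directory
    pipe = Pipeline(
        steps=[("pca", PCA(n_components=2)), ("clf", LogisticRegression())],
        memory=cache_dir,
    )
    pipe.fit(X, y)

    # Step names, in the order in which the steps were listed
    step_names = [name for name, _ in pipe.steps]

    # Two equivalent ways of reaching a step through named_steps
    by_key = pipe.named_steps["pca"]
    by_attr = pipe.named_steps.pca
    n_components = by_key.n_components_

print(step_names, by_key is by_attr, n_components)
```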

Methods

A Pipeline method executes the corresponding method of the learners it contains. If a learner does not have that method, an error is raised.
Suppose the Pipeline has n learners:
transform: executes the transform method of each learner in turn
inverse_transform: executes the inverse_transform method of each learner, in reverse order
fit: executes fit and transform on the first n-1 learners in turn, then executes fit on the nth learner (the last learner)
predict: transforms the data with the first n-1 learners, then executes the predict method of the nth learner
score: transforms the data with the first n-1 learners, then executes the score method of the nth learner
set_params: sets the parameters of any learner, addressed as <step name>__<parameter name>
get_params: gets the parameters of the pipeline and of its learners, using the same naming
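
A few of these methods can be tried out with the sketch below. It assumes synthetic data and hypothetical step names ("std", "minmax", "svc"). The first pipeline contains only transformers, so transform and inverse_transform are both available and a round trip should recover the input. The second one shows the <step name>__<parameter name> addressing used by set_params and get_params.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=120, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# A pipeline made only of transformers supports transform and inverse_transform
tr = Pipeline([("std", StandardScaler()), ("minmax", MinMaxScaler())])
Xt = tr.fit(X).transform(X)          # std.transform, then minmax.transform
X_back = tr.inverse_transform(Xt)    # minmax first, then std (reverse order)
round_trip_ok = np.allclose(X_back, X)

# Parameters of any step are addressed as <step name>__<parameter name>
pipe = Pipeline([("std", StandardScaler()), ("svc", LinearSVC())])
pipe.set_params(svc__C=0.1, std__with_mean=False)
c_value = pipe.get_params()["svc__C"]
with_mean = pipe.named_steps["std"].with_mean

print(round_trip_ok, c_value, with_mean)
```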

Example

General steps

Preprocess the data first, for example by handling missing values
Standardize the data
Reduce the dimensionality
Select features
Apply the classification or prediction algorithm (the estimator)
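
The five steps above can be chained in a single pipeline, as in the sketch below. It rests on several assumptions: the data is synthetic with missing values injected at random, SimpleImputer (scikit-learn 0.20 or later) stands in for the missing value processing, and the step names and the values of n_components and k are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 5% of the values replaced by NaN (illustration only)
X, y = make_classification(n_samples=200, n_features=12, n_informative=5,
                           random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # missing value processing
    ("scale", StandardScaler()),                  # standardization
    ("pca", PCA(n_components=6)),                 # dimension reduction
    ("select", SelectKBest(f_classif, k=3)),      # feature selection
    ("clf", LogisticRegression()),                # final estimator
])
pipe.fit(X, y)

predictions = pipe.predict(X)
n_selected = int(pipe.named_steps["select"].get_support().sum())
print(predictions.shape, n_selected)
```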

Load the data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Breast Cancer Wisconsin dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'breast-cancer-wisconsin/wdbc.data', header=None)
# Column 1 holds the diagnosis label (M/B); columns 2 onward hold the features
X, y = df.values[:, 2:], df.values[:, 1]
# Encode the string labels as integers
encoder = LabelEncoder()
y = encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
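
If the UCI server is not reachable, an offline alternative may be the copy of the Wisconsin diagnostic data that ships with scikit-learn. The sketch below assumes that load_breast_cancer provides the same 569 samples and 30 features as wdbc.data. Note that the bundled copy encodes the labels as 0 = malignant and 1 = benign, which is the opposite of what LabelEncoder produces from the M/B strings, and the row order may differ, so the split and the scores are not guaranteed to match.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Bundled copy of the Wisconsin diagnostic data (assumed equivalent to wdbc.data)
data = load_breast_cancer()
X, y = data.data, data.target
target_names = list(data.target_names)

# Same split settings as in the article
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=0)
print(X.shape, X_train.shape, X_test.shape)
```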

Combining the algorithm steps with Pipeline

Pipeline


def Examples_SklearnOrg_Pipline(X_train, X_test, y_train, y_test):
    from sklearn import svm
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    anova_filter = SelectKBest(f_regression, k=5)
    # clf = svm.LinearSVR(kernel='linear')
    clf = svm.LinearSVC()
    anova_svm = Pipeline([('sc', StandardScaler()),
                        ('pca', PCA(n_components=2)),('anova', anova_filter), ('svc', clf)])
    # Parameters of any step can be set through the step name, using the
    # <step name>__<parameter> syntax. Here k="all" is used for SelectKBest,
    # because after PCA only 2 features remain (fewer than k=5),
    # and the parameter C of the SVM is set to 0.1.
    anova_svm.set_params(anova__k="all", svc__C=.1).fit(X_train, y_train)

    # prediction_train = anova_svm.predict(X_train)
    # prediction_test = anova_svm.predict(X_test)
    score_train = anova_svm.score(X_train, y_train)
    score_test = anova_svm.score(X_test, y_test)

    # print("prediction_train :", prediction_train)
    # print("prediction_test :", prediction_test)
    print("score_train :", score_train)
    print("score_test :", score_test)
    # Get the mask of the features selected by anova_filter
    selected_by_key = anova_svm.named_steps['anova'].get_support()
    print(selected_by_key)

    # Another way to get the features selected by anova_filter
    selected_by_attr = anova_svm.named_steps.anova.get_support()
    print(selected_by_attr)

# UCI_sc_pca_logisticRe(X_train, X_test, y_train, y_test)  # helper not defined in this article
Examples_SklearnOrg_Pipline(X_train, X_test, y_train, y_test)

Output:

score_test : 0.921052631579
[ True  True]
[ True  True]

References

Pipeline mechanism in SKLearn, Inside_Zhang
Wang Zhenglin, "Python vs machine learning data scientist a small target", 2017.03
sklearn official website: scikit-learn.org/stable/modu…