Introduction

Pipelines implement streaming (chained) workflows.

The pipeline mechanism (also translated as "pipeline learner") is used in machine learning mainly to reuse a set of fitted parameters on new data sets (such as a test set). Using a pipeline can significantly reduce the amount of code. Overall, it is a very practical and enjoyable technique.
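To make the idea of reusing fitted parameters concrete, here is a minimal sketch. The synthetic data from make_classification, the LogisticRegression classifier, and the step names 'sc' and 'clf' are assumptions chosen for illustration and are not part of the original example. It compares reapplying a fitted scaler by hand with letting a pipeline do it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Manual approach: the scaler's mean and variance are learned on the
# training set and must be reapplied by hand to the test set.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# Pipeline approach: one fit call and one predict call; the parameters
# fitted on the training set are reused on the test set automatically.
pipe = Pipeline([('sc', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both approaches should give the same predictions.
print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline version removes the two explicit transform calls, which is where most of the code saving comes from once several preprocessing steps are chained.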

Note: The pipeline mechanism is more an innovation in programming technique than in algorithms.

A typical pipeline runs its learners in this order: data normalization learner => feature selection learner => learner that performs prediction. Every learner except the last must provide a transform method, which is used to transform the data before it is passed to the next learner.
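The requirement that intermediate learners provide transform can be sketched as follows. ColumnClipper is a hypothetical transformer written only for this illustration, and the random data is likewise an assumption. The second pipeline puts a classifier (which has no transform method) in an intermediate position to show that scikit-learn rejects it.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ColumnClipper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips each column to its training range."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X), self.min_, self.max_)


# Random data, used here only for illustration (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# An intermediate step that implements fit and transform is accepted.
ok_pipe = Pipeline([('clip', ColumnClipper()), ('clf', LogisticRegression())])
ok_pipe.fit(X, y)

# An intermediate step without transform is rejected with a TypeError.
# Depending on the scikit-learn version the check runs at construction
# time or at fit time, so both calls sit inside the try block.
try:
    bad_pipe = Pipeline([('clf1', LogisticRegression()),
                         ('clf2', LogisticRegression())])
    bad_pipe.fit(X, y)
    rejected = False
except TypeError:
    rejected = True

print(rejected)
```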

Common methods and properties

Sklearn official documentation

Parameters

  • steps: a list of (name, transform) tuples (each implementing fit/transform), chained in the order given, with the last object being an estimator.
  • memory: caching parameter, an instance of joblib.Memory (sklearn.externals.joblib.Memory in older scikit-learn releases) or a string, optional (default = None).
  • Attribute named_steps: a Bunch object, a read-only dictionary-like attribute that gives attribute access to any step by the name the user gave it. The key is the step name and the value is the step object. A step can also be reached as named_steps.<step name>.
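As a rough illustration of these parameters, the sketch below builds a pipeline with a memory cache and reads a fitted step back through named_steps. The synthetic data, the temporary cache directory, and the step names are assumptions made for this example.

```python
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data, used here only for illustration (assumption).
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

cache_dir = tempfile.mkdtemp()  # temporary cache location (assumption)
try:
    pipe = Pipeline(
        steps=[('sc', StandardScaler()),
               ('pca', PCA(n_components=2)),
               ('svc', LinearSVC())],
        memory=cache_dir,  # a string is treated as the cache directory path
    )
    pipe.fit(X, y)

    # Two equivalent ways to reach a step by its user-given name.
    pca_by_key = pipe.named_steps['pca']
    pca_by_attr = pipe.named_steps.pca
    components_shape = pca_by_key.components_.shape
    print(pca_by_key is pca_by_attr, components_shape)
finally:
    # Remove the temporary cache directory.
    shutil.rmtree(cache_dir, ignore_errors=True)
```

Caching is mainly useful when fitting the transformers is expensive and the same pipeline is refitted many times, for example during a parameter search.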

Methods

  • A Pipeline method executes the corresponding method of its learners. If a required learner does not have that method, an error is raised.
  • Suppose the Pipeline has n learners:
  • transform: executes the transform method of each learner in turn.
  • inverse_transform: executes the inverse_transform method of each learner, in reverse order.
  • fit: executes fit and transform on the first n-1 learners in turn, then executes fit on the nth (last) learner.
  • predict: passes the data through transform of the first n-1 learners, then executes predict of the nth learner.
  • score: passes the data through transform of the first n-1 learners, then executes score of the nth learner.
  • set_params: sets parameters of the pipeline's learners; a parameter of a given step is addressed as <step name>__<parameter name>.
  • get_params: gets the parameters of the pipeline and its learners.
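A few of these methods can be tried on a transformer-only pipeline. The following is a sketch under stated assumptions: the random data and the step names 'sc' and 'pca' are chosen only for illustration. It checks that inverse_transform undoes transform when PCA keeps every component, then changes a step parameter through set_params.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random data, used here only for illustration (assumption).
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 4))

# PCA keeps all 4 components here, so no information is lost.
pipe = Pipeline([('sc', StandardScaler()), ('pca', PCA(n_components=4))])

# transform runs each step in turn; inverse_transform undoes the steps
# in reverse order (PCA first, then the scaler).
Z = pipe.fit(X).transform(X)
X_back = pipe.inverse_transform(Z)
round_trip_ok = np.allclose(X_back, X)

# Step parameters are addressed as <step name>__<parameter name>.
pipe.set_params(pca__n_components=2)
n_components = pipe.get_params()['pca__n_components']
Z2 = pipe.fit(X).transform(X)

print(round_trip_ok, n_components, Z2.shape)
```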

Example

General steps

  • Data is preprocessed first, such as missing value processing
  • Standardization of data
  • Dimension reduction
  • Feature selection algorithm
  • Classification or prediction algorithm (the estimator)
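The general steps above can be sketched as a single pipeline. This is one possible arrangement, not the article's own code: the synthetic data with injected missing values, the mean imputation strategy, the component and feature counts, and the LogisticRegression classifier are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 5% of the values replaced by NaN (assumption).
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)
rng = np.random.RandomState(0)
X[rng.rand(*X.shape) < 0.05] = np.nan

workflow = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),  # missing value processing
    ('sc', StandardScaler()),                    # standardization
    ('pca', PCA(n_components=5)),                # dimension reduction
    ('select', SelectKBest(f_classif, k=3)),     # feature selection
    ('clf', LogisticRegression()),               # final estimator
])
workflow.fit(X, y)

# Accuracy on the training data, shown only to confirm the chain runs.
accuracy = workflow.score(X, y)
print(accuracy)
```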

The flow chart

Load the data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Breast Cancer Wisconsin dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'breast-cancer-wisconsin/wdbc.data', header=None)

# Column 0 is the sample ID, column 1 the diagnosis label, columns 2+ the features
X, y = df.values[:, 2:], df.values[:, 1]
encoder = LabelEncoder()
y = encoder.fit_transform(y)  # encode the string labels as integers

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
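If the UCI URL is unreachable, a possible offline alternative is the copy of the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn. This is a sketch of a substitute loader, not the article's own method. Note that load_breast_cancer encodes malignant as 0 and benign as 1, which is the opposite of what LabelEncoder produces from the 'B'/'M' labels, so scores are not guaranteed to match the output shown in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Bundled copy of the dataset: 569 samples, 30 numeric features.
X, y = load_breast_cancer(return_X_y=True)

# Same split settings as in the article's loading code.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=0)

print(X.shape, X_train.shape, X_test.shape)
```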

Combining the algorithm steps with Pipeline

Pipeline


def Examples_SklearnOrg_Pipline(X_train, X_test, y_train, y_test):
    from sklearn import svm
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_classif
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # ANOVA F-test filter; f_classif is the score function for classification targets
    anova_filter = SelectKBest(f_classif, k=5)
    clf = svm.LinearSVC()
    anova_svm = Pipeline([('sc', StandardScaler()),
                          ('pca', PCA(n_components=2)),
                          ('anova', anova_filter),
                          ('svc', clf)])
    # Parameters of any step can be set as <step name>__<parameter name>.
    # PCA leaves only 2 features, so k=5 would exceed the number available;
    # use k="all" in SelectKBest instead, and set the parameter C of the SVM.
    anova_svm.set_params(anova__k="all", svc__C=.1).fit(X_train, y_train)

    # prediction_train = anova_svm.predict(X_train)
    # prediction_test = anova_svm.predict(X_test)
    score_train = anova_svm.score(X_train, y_train)
    score_test = anova_svm.score(X_test, y_test)

    # print("prediction_train :", prediction_train)
    # print("prediction_test :", prediction_test)
    print("score_train :", score_train)
    print("score_test :", score_test)

    # Boolean mask of the features selected by anova_filter
    support_by_key = anova_svm.named_steps['anova'].get_support()
    print(support_by_key)

    # The same mask, reached through attribute access on named_steps
    support_by_attr = anova_svm.named_steps.anova.get_support()
    print(support_by_attr)

# UCI_sc_pca_logisticRe is not defined in this section, so its call is commented out
# UCI_sc_pca_logisticRe(X_train, X_test, y_train, y_test)
Examples_SklearnOrg_Pipline(X_train, X_test, y_train, y_test)

Output:

score_test : 0.921052631579
[ True  True]
[ True  True]

References

  • Pipeline mechanism in sklearn, Inside_Zhang
  • Wang Zhenglin, << Python vs machine learning data scientist a small target >>, 2017.03
  • scikit-learn official website: scikit-learn.org/stable/modu…