Machine Learning 028: Five minutes to teach you how to build a machine learning pipeline

(Python libraries and versions used in this article: Python 3.6, NumPy 1.14, scikit-learn 0.19, Matplotlib 2.2)

Modern large-scale industrial production is inseparable from the assembly line: with it, thousands of identical products can be produced easily and at a much lower cost, so assembly-line operation has greatly raised the level of industrial production.

Can this idea of pipelined processing be carried over to machine learning? Can the whole process of data cleaning, data normalization, data preprocessing, feature selection, supervised learning, and model evaluation be turned into a machine learning pipeline? If so, it would save a great deal of time when building an AI model and greatly improve the efficiency of building good AI.

Here, Furnace AI can tell you with certainty that it can be done, and building such a machine learning pipeline is so easy that it takes as little as five minutes.


1. Pipeline Step 1: Prepare the data set

The data set itself is not important in this project, so we will use the samples_generator module of scikit-learn to generate some sample data. Although NumPy's random module also has functions for generating random data, NumPy is better suited to producing simple sample data, whereas the datasets module in scikit-learn can generate data sets tailored to machine learning models.

The APIs commonly used in scikit-learn's datasets module are:

  1. Generate regression model data using make_regression

  2. Generate classification model data using make_hastie_10_2, make_classification, or make_multilabel_classification

  3. Generate cluster model data with make_blobs

  4. Generate grouped multidimensional normal distribution data using make_gaussian_quantiles

# Prepare the data set
from sklearn.datasets import samples_generator
# Use this function to generate sample data
X,y=samples_generator.make_classification(n_informative=4,
                                          n_features=20,
                                          n_redundant=0,
                                          random_state=5)
# Generate a classification dataset with 100 samples, 20 features, 2 categories, and no redundant features.
# print(X.shape) # (100, 20)
# print(y.shape) # (100,)
# print(X[:3])
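As a side note, the other generators listed above are used in much the same way. Below is a minimal sketch of make_regression and make_blobs; the parameter values are illustrative assumptions, not taken from this project.

# Illustrative sketch only -- parameter values are assumptions, not from this project
from sklearn.datasets import make_regression, make_blobs

# Regression data: 100 samples, 5 features, a noisy continuous target
X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=5)

# Clustering data: 100 samples scattered around 3 cluster centers
X_blob, y_blob = make_blobs(n_samples=100, n_features=2, centers=3, random_state=5)

# print(X_reg.shape, y_reg.shape)    # (100, 5) (100,)
# print(X_blob.shape, y_blob.shape)  # (100, 2) (100,)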


2. Pipeline Step 2: Build feature selectors

Once the data set is prepared, we need to extract its most important features, that is, the features that have the greatest influence on our classification result, which reduces model complexity while maintaining prediction accuracy. The scikit-learn framework provides the feature-selection class SelectKBest() for exactly this purpose; we only need to specify the number of features k to keep. The code below is very simple.

# create a feature selector
from sklearn.feature_selection import SelectKBest, f_regression
feature_selector=SelectKBest(f_regression,k=10) 
# Of the 20 features in total, we select the 10 most important ones
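As an aside, since the target y here is categorical, scikit-learn also provides f_classif (an ANOVA F-test for classification targets) as a score function. Here is a minimal sketch of fitting the selector on its own and inspecting the per-feature scores; the variable names below are my own:

# Standalone sketch (assumes X, y from step 1); f_classif is scikit-learn's
# ANOVA F-test score function for classification targets
from sklearn.feature_selection import SelectKBest, f_classif

selector_demo = SelectKBest(f_classif, k=10)
selector_demo.fit(X, y)  # compute a univariate score for each of the 20 features
# print(selector_demo.scores_)                    # per-feature F-scores
# print(selector_demo.get_support(indices=True))  # indices of the 10 selected features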


3. Pipeline Step 3: Build the classifier

Next we need to build a classifier model. Many classification algorithms have been covered in my previous articles, such as SVM, random forests, and naive Bayes; here we build a simple random forest classifier as an example.

# create a classifier
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=50,max_depth=4)
# build a random forest classifier as an example
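As noted above, the classifier is just one interchangeable module; any other scikit-learn classifier could be dropped into the same slot. A brief sketch with two alternatives (the parameter values are illustrative):

# Alternative classifiers that could take the place of the random forest
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

svm_classifier = SVC(kernel='rbf', C=1.0)  # a support-vector machine alternative
nb_classifier = GaussianNB()               # a naive Bayes alternative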


4. Pipeline Step 4: Assemble the complete pipeline

The above steps are equivalent to building various product processing modules. In this step, we need to assemble these modules into a fully operational machine learning pipeline. The code is simple, as shown below.

# Step 4: Assemble the complete pipeline
from sklearn.pipeline import Pipeline
pipeline=Pipeline([('selector',feature_selector),
                   ('rf_classifier',classifier)])
# Modify the pipeline's parameter settings:
# suppose we want the feature selector to keep 5 features instead of 10,
# and the random forest classifier to use 25 estimators instead of 50
pipeline.set_params(selector__k=5,
                    rf_classifier__n_estimators=25)

-------------------------------------------------------------------

Pipeline(memory=None, steps=[('selector', SelectKBest(k=5, score_func=<function f_regression at 0x...>)),
('rf_classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=4, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min...,
n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False))])

-------------------------------------------------------------------

As the output above shows, this pipeline contains only two modules: the feature selector selector and the classifier rf_classifier, each followed by its own parameters.
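If you are not sure which names set_params() accepts, the pipeline can list them itself. A small sketch, assuming the pipeline object built above:

# Keys follow the '<step name>__<parameter>' convention used by set_params()
for param_name in sorted(pipeline.get_params().keys()):
    print(param_name)  # e.g. selector__k, rf_classifier__n_estimators, ...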

As with any model, the remaining work is to train it on the data set and then use the trained model to make predictions for new samples. The code is as follows:

# Input data into the pipeline
pipeline.fit(X,y) # Train the pipeline

predict_y=pipeline.predict(X) # Predict samples with the trained pipeline
# print(predict_y)

# Evaluate the pipeline's model performance
print('pipeline model score: {:.3f}'.format(pipeline.score(X,y)))

-------------------------------------------------------------------

pipeline model score: 0.960

-------------------------------------------------------------------

With a score of 0.960 on the training set, the pipeline model seems to perform well.
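Keep in mind that this 0.960 is measured on the same data the pipeline was trained on, so it is likely optimistic. A minimal sketch of cross-validating the whole pipeline for a less biased estimate (the 5-fold setting is my own choice):

# Cross-validating the whole pipeline means feature selection is redone inside each fold
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(pipeline, X, y, cv=5)
print('cross-validated accuracy: {:.3f} +/- {:.3f}'.format(cv_scores.mean(), cv_scores.std()))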

We built a feature selector above, but how do we know which features were selected and which were discarded? The following code shows how:

# view the features selected by the feature selector:
feature_status=pipeline.named_steps['selector'].get_support()
# get_support() returns a boolean array; True means that feature was selected
selected_features=[]
for count,item in enumerate(feature_status):
    if item: selected_features.append(count)
print('selected features by pipeline, (0-indexed): \n{}'.format(
        selected_features))

-------------------------------------------------------------------

selected features by pipeline, (0-indexed): [5, 9, 10, 11, 15]

-------------------------------------------------------------------

It can be seen that the pipeline automatically selects five features (we specified k=5 earlier), and the five most important features are those with indices 5, 9, 10, 11, and 15.

######################## Summary ########################

1. Building a machine learning pipeline is very simple. You just need to build the basic building blocks of machine learning and then assemble them.

2. The feature selector above chooses the most important features through a univariate statistical test and then extracts the best features from the feature vectors. After such a test, each feature in the vector space receives an evaluation score; based on these scores, the best K features are selected. Once the K features are extracted and a K-dimensional feature vector is formed, that vector can serve as the training input for the classifier.

3. This pipeline has many advantages: machine learning models can be built simply and quickly, the K most important features can be extracted conveniently, and the constructed model can be evaluated rapidly. In short, it is an essential tool for building AI models quickly.

4. The pipeline above integrates only feature selection and a classifier. The one regret is that I have not yet found how to integrate data processing and data cleaning into the pipeline; if anyone has found a way, please contact me.

##########################################################


Note: The code for this article has been uploaded to (my GitHub); you are welcome to download it.

References:

1. Classic Examples of Python Machine Learning, by Prateek Joshi, translated by Tao Junjie and Chen Xiaoli