Table of contents

    • Technology introduction
      • Core Technology Stack
    • Implementation
      • Data
      • Implementation
        • Class library loading and data reading
        • Parameters
        • Methods
    • Conclusion

Technology introduction

Automated machine learning (I): Automatic optimization of hyperparameters

Automated machine learning is a methodology for building machine learning models automatically, and it mainly covers three aspects: first, hyperparameter optimization; second, automated feature engineering and automatic selection of the machine learning algorithm; third, neural architecture search. This article focuses on the second aspect: we will use TPOT to perform automated feature engineering and automatic algorithm selection.

In machine learning, a model's parameters are learned from the training data; these are the ordinary parameters of the algorithm, and learning suitable values for them from data in order to build a strong model is the core goal of machine learning. However, machine learning algorithms also have hyperparameters, which are the settings that must be chosen by hand, such as the kernel function of an SVM, the alpha of Lasso, the maximum depth and splitting criteria of a decision tree, or the subsampling rate and tree type of a random forest. Optimizing these hyperparameters automatically rather than manually is the first step of automated machine learning. Hyperparameter optimization can be treated as a special non-convex optimization problem, although we can also look at machine learning from a higher level.
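As a small illustration (not from the original article, and using scikit-learn's Lasso purely as an example), the distinction looks like this: alpha is a hyperparameter we choose by hand, while the coefficients are ordinary parameters learned from the data.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Toy data, just for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# alpha is a hyperparameter: set by hand (or by a search), not learned from the data
model = Lasso(alpha=0.1)
model.fit(X, y)

# coef_ and intercept_ are ordinary parameters: estimated from the training data
print(model.coef_, model.intercept_)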

A normal machine learning workflow includes data reading, preprocessing, feature construction, model selection, hyperparameter optimization, and possibly ensembling, and in most cases these parts need to be iterated on repeatedly until we end up with a good machine learning pipeline. Automating that whole loop is the focus of this article, and the main tool used here is TPOT.
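To make the comparison concrete, here is a minimal hand-written version of such a pipeline in scikit-learn; the steps, dataset, and parameter grid below are only illustrative assumptions. TPOT's job is essentially to search over choices like these automatically.

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A manually assembled pipeline: preprocessing -> feature reduction -> model
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('clf', RandomForestClassifier(random_state=42)),
])

# Hand-picked hyperparameter grid; TPOT searches this kind of space for us
param_grid = {
    'pca__n_components': [2, 5, 8],
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [None, 5],
}

X, y = load_wine(return_X_y=True)  # small built-in dataset, for illustration only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)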

TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic algorithms. Simply put, TPOT intelligently explores thousands of possible pipelines to find the best one for a dataset, thereby automating the most tedious part of machine learning. In general, TPOT can automatically handle the feature work (feature selection, feature preprocessing, feature construction, etc.) as well as model selection and hyperparameter tuning.

Core Technology Stack

  • tpot
  • xgboost
  • lightgbm
  • scikit-learn

Implementation

Data

Here is the dataset we use; it can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

The data is the UCI red wine quality dataset, labelled with quality scores (integers), so it can be used for both regression and classification.
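A minimal way to pull this file straight into pandas is sketched below; this is not the author's loading code (which reads a local copy later on), and it assumes the standard UCI formatting, where the file is semicolon-separated with a header row.

import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
wine = pd.read_csv(url, sep=';')   # the UCI wine quality files use ';' as the separator

print(wine.shape)                  # should be roughly 1599 rows x 12 columns
print(wine['quality'].unique())    # integer scores, usable for classification or regression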

AutoML is not as simple as fitting one model to a dataset. It considers multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in pipelines with multiple preprocessing steps (missing-value imputation, scaling, PCA, feature selection, etc.), the hyperparameters of all of these models and preprocessing steps, and multiple ways to ensemble or stack the algorithms within the pipeline.

As a result, TPOT takes some time to run on larger datasets, but it is important to understand why. With the default TPOT settings (100 generations with a population size of 100), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number in context, think of a grid search over 10,000 hyperparameter combinations for a machine learning algorithm and how long that would take. Those 10,000 pipeline configurations are each evaluated with 10-fold cross-validation, which means roughly 100,000 models are fitted and evaluated on the training data. Even for a simple model like a decision tree, this is a time-consuming process.
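The arithmetic behind that estimate, written out explicitly (the 10-fold cross-validation is the assumption used in the text):

generations = 100       # TPOT default
population_size = 100   # TPOT default
cv_folds = 10           # the cross-validation assumed above

pipelines_evaluated = generations * population_size   # 10,000 pipeline configurations
model_fits = pipelines_evaluated * cv_folds            # ~100,000 individual model fits
print(pipelines_evaluated, model_fits)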

A typical TPOT run will take hours to days to complete (unless the dataset is very small), but you can always interrupt the run partway through and look at the best results so far. TPOT also provides a warm_start parameter that lets you resume a TPOT run from where it stopped. Since this article is more of a demo, we will not run it for too long.
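For example, a run can be capped in time and then resumed later. The sketch below only illustrates the warm_start and max_time_mins parameters mentioned here, with made-up values; X_train and y_train refer to the training split created in the code further down.

from tpot import TPOTClassifier

# Limit the whole search to 30 minutes; warm_start keeps the evolved population
# so that a later fit() call continues from where this one stopped.
tpot = TPOTClassifier(max_time_mins=30, warm_start=True, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)   # can be interrupted; the best pipeline so far is kept

# ...later, continue the same search instead of starting from scratch
tpot.fit(X_train, y_train)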

Implementation

Class library loading and data reading

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# NOTE: in this dataset the outcome column is labeled 'quality'
tpot_data = pd.read_csv('/home/fonttian/Data/dataset/wine/wine.csv', dtype=np.float64)
labels = tpot_data['quality']
tpot_data = tpot_data.drop('quality', axis=1)

X_train, X_test, y_train, y_test = \
            train_test_split(tpot_data.values, labels.values, train_size=0.75, test_size=0.25,random_state=42)
/home/fonttian/anaconda3/envs/keras/lib/python3.8/site-packages/tpot/builtins/__init__.py:36: UserWarning: Warning: optional dependency `torch` is not available. - skipping import of NN models.
  warnings.warn("Warning: optional dependency `torch` is not available. - skipping import of NN models.")

After we execute the above code, we get a warning telling us that the neural network module will not be used for automated machine learning because PyTorch is not installed.

At the end of the code, we used sklearn's train_test_split to split the dataset 3:1, with the former part used for training and the latter for testing.

Parameters

The TPOT interface is designed to be as similar to scikit-learn as possible, and TPOT can be imported just like any regular Python module. To use TPOT, all you need is the simple code below. Since TPOT is an automated machine learning project built on a genetic algorithm, the parameters we pass in when creating a TPOT model are mostly those required by the genetic algorithm. The main parameters are explained as follows:

  • generations: int, optional (default=100), the number of iterations to run the pipeline optimization process. Must be positive.
  • population_size: int, optional (default=100), the number of individuals retained in each generation. Must be positive.
  • cv: the number of folds for cross-validation. Must be positive.
  • random_state: random seed used to control randomness.
  • verbosity: how much information to print: 0 prints nothing at all, 1 prints a little, 2 prints more information plus a progress bar, 3 prints everything plus the progress bar.

In addition, there are several more important parameters, such as warm_start, which controls whether to reuse the results of the previous fit and continue training. There are also several other common parameters (a combined usage sketch follows after this list):

  • offspring_size: defaults to 100, the number of offspring produced in each generation. Must be positive.
  • mutation_rate: the mutation rate. The default value is 0.9.
  • crossover_rate: the crossover rate. The default value is 0.1. Generally, no change is required.
  • scoring: the evaluation function used internally.
  • subsample: the fraction of the training data sampled during training. The default value is 1, i.e. 100%.
  • n_jobs: the number of parallel jobs to use. The default is 1; -1 means use as many CPU cores as possible, and -2 means use all but one.
  • max_time_mins: how many minutes to spend optimizing the pipeline. The default is None, meaning no time limit is applied.
  • max_eval_time_mins: how many minutes to spend evaluating a single pipeline. The default is 5, i.e. five minutes.
  • early_stop: early stopping, a common parameter; the optimization ends when there has been no improvement for this many generations.
  • config_dict: Python dictionary, string, or None, optional (default=None), a configuration dictionary for customizing the operators and parameters that TPOT searches during optimization. Since the default configuration is usually used directly, this parameter is rarely needed; if you do need it, refer to this link: epistasislab.github.io/tpot/using/…
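As a quick illustration of how these parameters fit together, here is a sketch of a TPOTClassifier constructed with most of them; the specific values are arbitrary choices for demonstration, not recommendations from the original article.

from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=10,          # number of GA iterations
    population_size=50,      # individuals kept per generation
    offspring_size=50,       # offspring produced per generation
    mutation_rate=0.9,
    crossover_rate=0.1,
    scoring='accuracy',      # internal evaluation metric
    cv=5,
    subsample=0.8,           # train on 80% of the rows in each evaluation
    n_jobs=-1,               # use all CPU cores
    max_time_mins=60,        # hard cap on total optimization time
    max_eval_time_mins=5,    # cap on evaluating a single pipeline
    early_stop=3,            # stop if no improvement for 3 generations
    random_state=42,
    verbosity=2,
)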

Methods

The methods are simpler; there are only four you will typically call:

  • fit(features, target, sample_weight=None, groups=None): runs the TPOT optimization process on the given training data.
  • predict(features): uses the optimized pipeline to predict the target values for the test set.
  • score(testing_features, testing_target): returns the score of the optimized pipeline on the given test data, using the user-specified scoring function.
  • export(output_file_name): exports the optimized best machine learning pipeline as Python code.
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)

pipeline_optimizer.fit(X_train, y_train)

print(pipeline_optimizer.score(X_test, y_test))
HBox(children=(value=0.0, description='Optimization Progress', max=120.0, style=ProgressStyle(de...

Generation 1 - Current best internal CV score: 0.6430648535564853
Generation 2 - Current best internal CV score: 0.6430648535564853
Generation 3 - Current best internal CV score: 0.6430648535564853
Generation 4 - Current best internal CV score: 0.6822838214783822
Generation 5 - Current best internal CV score: 0.6822838214783822

Best pipeline: ExtraTreesClassifier(RandomForestClassifier(ExtraTreesClassifier(PCA(input_matrix, iterated_power=5, svd_solver=randomized), bootstrap=False, criterion=entropy, max_features=0.7500000000000001, min_samples_leaf=4, min_samples_split=16, n_estimators=100), bootstrap=True, criterion=gini, max_features=0.15000000000000002, min_samples_leaf=7, min_samples_split=15, n_estimators=100), bootstrap=False, criterion=entropy, max_features=0.9500000000000001, min_samples_leaf=9, min_samples_split=15, n_estimators=100)
0.665

From the output above, we can see that TPOT does show its progress during training, including a progress bar. However, as its documentation says, TPOT tends to produce poor results when given little time, and training for hours or even days is quite normal. Also note that although the best pipeline is printed by default after training, this is obviously not the most convenient way to capture it. It is better to use the export method to write the optimized best pipeline directly to Python code. Here is the output:

# export code
pipeline_optimizer.export('tpot_exported_pipeline.py')
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.6822838214783822
exported_pipeline = make_pipeline(
    PCA(iterated_power=5, svd_solver="randomized"),
    StackingEstimator(estimator=ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.7500000000000001, min_samples_leaf=4, min_samples_split=16, n_estimators=100)),
    StackingEstimator(estimator=RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.15000000000000002, min_samples_leaf=7, min_samples_split=15, n_estimators=100)),
    ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.9500000000000001, min_samples_leaf=9, min_samples_split=15, n_estimators=100))# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state'.42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
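The exported script stops at predict. One simple way to check the result, as a small addition that is not part of TPOT's own output, is to score the predictions against the held-out targets:

from sklearn.metrics import accuracy_score

# Compare the exported pipeline's predictions with the held-out labels
print(accuracy_score(testing_target, results))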

In addition, there is a corresponding class for regression, TPOTRegressor, whose parameters and usage are similar to those of the classifier, so the details will not be repeated here.
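A minimal regression counterpart would look like the sketch below, treating the integer quality scores as a continuous target; the settings and the choice of scorer are arbitrary examples rather than part of the original article.

from tpot import TPOTRegressor

# Same interface as TPOTClassifier; 'neg_mean_squared_error' is a standard sklearn scorer
reg = TPOTRegressor(generations=5, population_size=20, cv=5,
                    scoring='neg_mean_squared_error',
                    random_state=42, verbosity=2)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))
reg.export('tpot_wine_regressor.py')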

Conclusion

As a fully automated machine learning tool, TPOT can optimize everything from feature construction and feature preprocessing to model selection and hyperparameter optimization, and in the end it gives us the best machine learning pipeline it found. It also has real drawbacks: the results are poor unless it is allowed to train for hours or even days, because only then can it explore enough pipeline combinations to reach a good result, and for professionals that is a long time. On the other hand, its very low operational burden means it is easy to use, and given enough time TPOT can actually produce better results than an average engineer, so it is bound to find plenty of room for application.