• By Han Xinzi @Showmeai
  • Tutorial address: www.showmeai.tech/tutorials/4…
  • Article address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. To reprint, please contact the platform and the author and indicate the source

1. AutoML and automated machine learning

In this series, you have learned how to build machine learning applications with ShowMeAI. Building a baseline model is easy, but model selection and tuning for generalization performance are difficult. Choosing the right model is a process that requires considerable computation, time, and effort.

AutoML (Automated Machine Learning) automates the construction of end-to-end machine learning pipelines so that they can be applied to problems in practical scenarios.
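To make the idea concrete, the loop that an AutoML tool automates can be sketched in a few lines. The sketch below is a toy random search, not FLAML's algorithm: the learner names, the search ranges, and the `evaluate` function (a stand-in for "train a model and measure its validation loss") are all made up for illustration.

```python
import random


def toy_automl(candidates, evaluate, max_trials=50, seed=0):
    """Try random (learner, config) pairs and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best = None  # (loss, learner_name, config)
    for _ in range(max_trials):
        name = rng.choice(sorted(candidates))
        # Draw one value per hyperparameter from its (low, high) range.
        config = {hp: rng.uniform(lo, hi) for hp, (lo, hi) in candidates[name].items()}
        loss = evaluate(name, config)
        if best is None or loss < best[0]:
            best = (loss, name, config)
    return best


# Hypothetical search spaces: learner name -> {hyperparameter: (low, high)}.
candidates = {
    "learner_a": {"x": (0.0, 1.0)},
    "learner_b": {"x": (0.0, 1.0)},
}


def evaluate(name, config):
    # Stand-in for training a model and measuring its validation loss.
    offset = 0.2 if name == "learner_a" else 0.1
    return offset + (config["x"] - 0.5) ** 2


best_loss, best_learner, best_config = toy_automl(candidates, evaluate)
print(best_learner, round(best_loss, 4), best_config)
```

A real AutoML library replaces the random draws with a smarter search strategy and the trial count with a time budget, but the bookkeeping (try a configuration, record its loss, keep the best) is the same.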

In this article, we will introduce FLAML (A Fast and Lightweight AutoML Library), an automated machine learning framework developed by Microsoft.

2. Introduction to FLAML

2.1 FLAML characteristics

The official website summarizes FLAML’s features as follows:

  • For common machine learning tasks such as classification and regression, FLAML can quickly find high-quality models with minimal resource consumption. It supports classical machine learning models and deep neural networks.
  • It is easy to customize or extend. Users can choose the level of customization they need:
    • Minimal customization (setting a computing resource limit)
    • Medium customization (such as setting the scikit-learn learner, search space, and metric)
    • Full customization (custom training and evaluation code).
  • It supports fast, low-cost automatic tuning and can handle large search spaces. FLAML is powered by a cost-effective hyperparameter optimization and learner selection method developed by Microsoft Research.

2.2 Installation Method

We can easily install FLAML with pip

pip install flaml

There are some optional installation options, as follows:

(1) Notebook sample support

If you want to run the official notebook code example, install the [notebook] option:

pip install flaml[notebook]

(2) More model learner support

  • If we want FLAML to support the CatBoost model, add the [catboost] option when installing
pip install flaml[catboost]
  • If we want FLAML to support Vowpal Wabbit, add the [vw] option when installing
pip install flaml[vw]
  • If we want FLAML to support the time series forecasters Prophet and statsmodels, add the [forecast] option when installing
pip install flaml[forecast]

(3) Distributed tuning support

  • ray
pip install flaml[ray]
  • nni
pip install flaml[nni]
  • blendsearch
pip install flaml[blendsearch]
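After installing, it can be handy to check which optional dependencies are actually importable. The helper below is a small sketch using only the standard library; the import names in `OPTIONAL_MODULES` are assumptions about how each extra's main package is imported and may differ across versions.

```python
import importlib.util

# Assumed top-level import name for each optional extra (adjust as needed).
OPTIONAL_MODULES = {
    "catboost": "catboost",
    "vw": "vowpalwabbit",
    "ray": "ray",
    "nni": "nni",
}


def available_extras(modules=OPTIONAL_MODULES):
    """Return {extra_name: bool}, True when the module can be located."""
    return {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in modules.items()
    }


print(available_extras())
```

An extra reported as False simply means the corresponding package was not found in the current environment.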

3. FLAML usage examples

3.1 Automatic Mode

Here we use a binary classification data set to demonstrate the fully automatic mode of the FLAML library. (You can run the following code in a Jupyter notebook; for IDE and environment configuration, refer to the ShowMeAI illustrated Python tutorial on installation and environment settings.)

!pip install flaml[notebook]

(1) Loading data and preprocessing

We download the Airlines dataset from OpenML. The modeling task is to predict whether a flight will be delayed, given its scheduled departure information.

from flaml.data import load_openml_dataset
# data_dir is where the downloaded data is cached (here: the current directory)
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')

From the running result, you can see the dimensions of the training set, test set, and labels.

(2) Run FLAML automatic mode

Let's run FLAML AutoML directly. When configuring a run, we can specify the task type, time budget, error metric, learner list, whether to subsample, the resampling strategy, and so on. Any parameter that is not set takes its default value (for example, the default learner list for classification is ['lgbm', 'xgboost', 'xgb_limitdepth', 'catboost', 'rf', 'extra_tree', 'lrl1']).

# Import the tool library and initialize the AutoML object
from flaml import AutoML
automl = AutoML()

# Parameter settings
settings = {
    "time_budget": 240,  # total time limit (in seconds)
    "metric": 'accuracy',  # candidates: 'r2', 'rmse', 'mae', 'mse', 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'log_loss', 'mape', 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'
    "task": 'classification',  # task type
    "log_file_name": 'airlines_experiment.log',  # flaml log file
    "seed": 7654321,  # random seed
}

# Run automated machine learning
automl.fit(X_train=X_train, y_train=y_train, **settings)

As can be seen from the running results, the automated machine learning process ran experiments on ['lgbm', 'xgboost', 'xgb_limitdepth', 'catboost', 'rf', 'extra_tree', 'lrl1'] and reported the corresponding results.

(3) Optimal model and evaluation results

print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1 - automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

The result is as follows

Best ML learner: lgbm
Best hyperparameter config: {'n_estimators': 1071, 'num_leaves': 25, 'min_child_samples': 36, 'learning_rate': 0.10320258241974468, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.08547376339713011, 'FLAML_sample_size': 364083}
Best accuracy on validation data: 0.6696
Training duration of best run: 9.274 s

After the run, you can obtain the optimal model, its configuration, the evaluation result, and other information through the attributes of the AutoML object. The optimal model here is a LightGBM model built from 1071 trees.
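The validation accuracy is reported as 1 - automl.best_loss because the search minimizes a loss, and for the accuracy metric that loss is 1 - accuracy. The relationship can be illustrated with a hand-rolled sketch; the function names and the tiny label lists below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def accuracy_loss(y_true, y_pred):
    """Loss to minimize: lower is better, 0 means every prediction is right."""
    return 1.0 - accuracy(y_true, y_pred)


y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]  # one mistake out of five

loss = accuracy_loss(y_true, y_pred)
print('loss =', loss, 'accuracy =', 1 - loss)
```

Turning every metric into a loss lets the same search code handle "higher is better" metrics (accuracy, r2) and "lower is better" metrics (log_loss, mse) uniformly.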

Further, we can retrieve the optimal model itself with the following code.

# Optimal model
automl.model.estimator

The result is as follows

LGBMClassifier(learning_rate=0.10320258241974468, max_bin=1023, min_child_samples=36, n_estimators=1071, num_leaves=25, reg_alpha=0.0009765625, reg_lambda=0.08547376339713011, verbose=1)

(4) Model storage and loading

# Model storage and persistence
import pickle
with open('automl.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

# Model loading
with open('automl.pkl', 'rb') as f:
    automl = pickle.load(f)

# Predict on the test set
y_pred = automl.predict(X_test)
print('Predicted labels', y_pred)
print('True labels', y_test)
# Probability of the positive class (second column)
y_pred_proba = automl.predict_proba(X_test)[:, 1]

The running results are as follows:

Predicted labels ['1' '0' '1' ... '1' '0' '0']
True labels 118331    0
328182    0
335454    0
520591    1
344651    0
         ..
367080    0
203510    1
254894    0
296512    1
362444    0
Name: Delay, Length: 134846, dtype: category
Categories (2, object): ['0' < '1']

As you can see, the AutoML object makes predictions on the test set in the same way as a model you build yourself.
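The save and load steps rely only on Python's pickle module, so the round trip can be tried out on any picklable object. The sketch below uses a plain dictionary as a stand-in for the fitted object and a temporary directory instead of the working directory; the dictionary contents are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted object: any picklable Python object works the same way.
state = {
    "best_estimator": "lgbm",
    "best_config": {"n_estimators": 1071, "num_leaves": 25},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state.pkl"
    # Write with the most compact protocol supported by this Python version.
    with open(path, "wb") as f:
        pickle.dump(state, f, pickle.HIGHEST_PROTOCOL)
    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == state, restored is state)
```

The restored object is equal to the original but is a separate copy. Keep in mind that pickle files should only be loaded from trusted sources, and that loading generally requires the same library versions that were used when saving.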

# Test set evaluation
from flaml.ml import sklearn_metric_loss_score
print('accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred, y_test))
print('roc_auc', '=', 1 - sklearn_metric_loss_score('roc_auc', y_pred_proba, y_test))
print('log_loss', '=', sklearn_metric_loss_score('log_loss', y_pred_proba, y_test))

The evaluation results are as follows:

accuracy = 0.6720332824110467
roc_auc = 0.7253276908529442
log_loss = 0.6034449031876942
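To see what the roc_auc and log_loss numbers measure, both can be computed by hand on a tiny example. This is a plain-Python sketch for binary labels coded as 0/1, written for clarity rather than speed; the four labels and probabilities are made up.

```python
import math


def log_loss(y_true, proba, eps=1e-15):
    """Average negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, proba):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, proba):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, proba) if y == 1]
    neg = [p for y, p in zip(y_true, proba) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
proba = [0.1, 0.4, 0.35, 0.8]  # predicted probability of class 1

print('roc_auc =', roc_auc(y_true, proba))
print('log_loss =', log_loss(y_true, proba))
```

AUC depends only on how the examples are ranked, while log loss also penalizes probabilities that are confidently wrong, which is why the two metrics can disagree about which model is better.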

(5) View historical log details

We can view the details of each trial that AutoML ran through the following code.

from flaml.data import get_output_from_log
time_history, best_valid_loss_history, valid_loss_history, config_history, metric_history = \
    get_output_from_log(filename=settings['log_file_name'], time_budget=240)
for config in config_history:
    print(config)

The results are as follows

{'Current Learner': 'lgbm', 'Current Sample': 10000, 'Current Hyper-parameters': {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, 'learning_rate': 0.09999999999999995, 'log_max_bin': 8, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 1.0, 'FLAML_sample_size': 10000}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, 'learning_rate': 0.09999999999999995, 'log_max_bin': 8, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 1.0, 'FLAML_sample_size': 10000}}
{'Current Learner': 'lgbm', 'Current Sample': 10000, 'Current Hyper-parameters': {'n_estimators': 4, 'num_leaves': 14, 'min_child_samples': 15, 'learning_rate': 0.22841390623808822, 'log_max_bin': 9, 'colsample_bytree': 1.0, 'reg_alpha': 0.0014700173967242716, 'reg_lambda': 7.624911621832711, 'FLAML_sample_size': 10000}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 4, 'num_leaves': 14, 'min_child_samples': 15, 'learning_rate': 0.22841390623808822, 'log_max_bin': 9, 'colsample_bytree': 1.0, 'reg_alpha': 0.0014700173967242716, 'reg_lambda': 7.624911621832711, 'FLAML_sample_size': 10000}}
{'Current Learner': 'lgbm', 'Current Sample': 10000, 'Current Hyper-parameters': {'n_estimators': 4, 'num_leaves': 25, 'min_child_samples': 12, 'learning_rate': 0.5082200481556802, 'log_max_bin': 8, 'colsample_bytree': 0.9696263001275751, 'reg_alpha': 0.0028107036379524425, 'reg_lambda': 3.716898117989413, 'FLAML_sample_size': 10000}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 4, 'num_leaves': 25, 'min_child_samples': 12, 'learning_rate': 0.5082200481556802, 'log_max_bin': 8, 'colsample_bytree': 0.9696263001275751, 'reg_alpha': 0.0028107036379524425, 'reg_lambda': 3.716898117989413, 'FLAML_sample_size': 10000}}
{'Current Learner': 'lgbm', 'Current Sample': 10000, 'Current Hyper-parameters': {'n_estimators': 23, 'num_leaves': 14, 'min_child_samples': 15, 'learning_rate': 0.22841390623808822, 'log_max_bin': 9, 'colsample_bytree': 1.0, 'reg_alpha': 0.0014700173967242718, 'reg_lambda': 7.624911621832699, 'FLAML_sample_size': 10000}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 23, 'num_leaves': 14, 'min_child_samples': 15, 'learning_rate': 0.22841390623808822, 'log_max_bin': 9, 'colsample_bytree': 1.0, 'reg_alpha': 0.0014700173967242718, 'reg_lambda': 7.624911621832699, 'FLAML_sample_size': 10000}}
{'Current Learner': 'lgbm', 'Current Sample': 10000, 'Current Hyper-parameters': {'n_estimators': 101, 'num_leaves': 12, 'min_child_samples': 24, 'learning_rate': 0.07647794276357095, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.001749539645587163, 'reg_lambda': 4.373760956394571, 'FLAML_sample_size': 10000}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 101, 'num_leaves': 12, 'min_child_samples': 24, 'learning_rate': 0.07647794276357095, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.001749539645587163, 'reg_lambda': 4.373760956394571, 'FLAML_sample_size': 10000}}
{'Current Learner': 'lgbm', 'Current Sample': 40000, 'Current Hyper-parameters': {'n_estimators': 101, 'num_leaves': 12, 'min_child_samples': 24, 'learning_rate': 0.07647794276357095, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.001749539645587163, 'reg_lambda': 4.373760956394571, 'FLAML_sample_size': 40000}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 101, 'num_leaves': 12, 'min_child_samples': 24, 'learning_rate': 0.07647794276357095, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.001749539645587163, 'reg_lambda': 4.373760956394571, 'FLAML_sample_size': 40000}}
{'Current Learner': 'lgbm', 'Current Sample': 40000, 'Current Hyper-parameters': {'n_estimators': 361, 'num_leaves': 11, 'min_child_samples': 32, 'learning_rate': 0.13528717598813866, 'log_max_bin': 9, 'colsample_bytree': 0.9851977789068981, 'reg_alpha': 0.0038372002422749616, 'reg_lambda': 0.25113531892556773, 'FLAML_sample_size': 40000}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 361, 'num_leaves': 11, 'min_child_samples': 32, 'learning_rate': 0.13528717598813866, 'log_max_bin': 9, 'colsample_bytree': 0.9851977789068981, 'reg_alpha': 0.0038372002422749616, 'reg_lambda': 0.25113531892556773, 'FLAML_sample_size': 40000}}
{'Current Learner': 'lgbm', 'Current Sample': 364083, 'Current Hyper-parameters': {'n_estimators': 361, 'num_leaves': 11, 'min_child_samples': 32, 'learning_rate': 0.13528717598813866, 'log_max_bin': 9, 'colsample_bytree': 0.9851977789068981, 'reg_alpha': 0.0038372002422749616, 'reg_lambda': 0.25113531892556773, 'FLAML_sample_size': 364083}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 361, 'num_leaves': 11, 'min_child_samples': 32, 'learning_rate': 0.13528717598813866, 'log_max_bin': 9, 'colsample_bytree': 0.9851977789068981, 'reg_alpha': 0.0038372002422749616, 'reg_lambda': 0.25113531892556773, 'FLAML_sample_size': 364083}}
{'Current Learner': 'lgbm', 'Current Sample': 364083, 'Current Hyper-parameters': {'n_estimators': 547, 'num_leaves': 46, 'min_child_samples': 60, 'learning_rate': 0.281323306091088, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.001643352694266288, 'reg_lambda': 0.14719738747481906, 'FLAML_sample_size': 364083}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 547, 'num_leaves': 46, 'min_child_samples': 60, 'learning_rate': 0.281323306091088, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.001643352694266288, 'reg_lambda': 0.14719738747481906, 'FLAML_sample_size': 364083}}
{'Current Learner': 'lgbm', 'Current Sample': 364083, 'Current Hyper-parameters': {'n_estimators': 1071, 'num_leaves': 25, 'min_child_samples': 36, 'learning_rate': 0.10320258241974468, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.08547376339713011, 'FLAML_sample_size': 364083}, 'Best Learner': 'lgbm', 'Best Hyper-parameters': {'n_estimators': 1071, 'num_leaves': 25, 'min_child_samples': 36, 'learning_rate': 0.10320258241974468, 'log_max_bin': 10, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.08547376339713011, 'FLAML_sample_size': 364083}}

We can plot the learning curve on the validation set as follows:

import matplotlib.pyplot as plt
import numpy as np
plt.title('Learning Curve')
plt.xlabel('Wall Clock Time (s)')
plt.ylabel('Validation Accuracy')
plt.scatter(time_history, 1 - np.array(valid_loss_history))
plt.step(time_history, 1 - np.array(best_valid_loss_history), where='post')
plt.show()
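In this plot the scatter points are the individual trials, and the step line follows the best validation result seen so far. Assuming the per-trial losses are listed in time order, that running best can be sketched with the standard library (the loss values below are made up):

```python
from itertools import accumulate

# Hypothetical validation losses of five consecutive trials.
valid_loss_history = [0.40, 0.38, 0.39, 0.35, 0.36]

# Running minimum: the best loss seen up to and including each trial.
best_so_far = list(accumulate(valid_loss_history, min))

for trial, (loss, best) in enumerate(zip(valid_loss_history, best_so_far), start=1):
    print(f'trial {trial}: loss={loss:.2f} best so far={best:.2f}')
```

The running minimum never increases, which is why the step line in an accuracy plot (1 - loss) only moves upward.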

(6) Compare with default XGBoost/LightGBM results

Let's compare against XGBoost and LightGBM models trained on this dataset with all default parameters, as shown below

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Training fit
xgb = XGBClassifier()
cat_columns = X_train.select_dtypes(include=['category']).columns
X = X_train.copy()
X[cat_columns] = X[cat_columns].apply(lambda x: x.cat.codes)
xgb.fit(X, y_train)

lgbm = LGBMClassifier()
lgbm.fit(X_train, y_train)

# Test set estimation
X = X_test.copy()
X[cat_columns] = X[cat_columns].apply(lambda x: x.cat.codes)
y_pred_xgb = xgb.predict(X)

y_pred_lgbm = lgbm.predict(X_test)

# Evaluate effectiveness
print('Default xgboost accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred_xgb, y_test))
print('Default LGBM accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred_lgbm, y_test))
print('flaml (4min) accuracy', '=', 1 - sklearn_metric_loss_score('accuracy', y_pred, y_test))

The final result is as follows:

Default xgboost accuracy = 0.6676060098186078
Default LGBM accuracy = 0.6602346380315323
flaml (4min) accuracy = 0.6720332824110467

The comparison shows that the best model found by FLAML outperforms the XGBoost and LightGBM models with default parameters.

3.2 Custom learner

In addition to using the FLAML tool library in fully automated mode, we can also customize some of its components and achieve custom tuning. For example, we can set “model”, “parameter search space”, “candidate learner”, “model optimization index” and so on.

(1) Custom model

Regularized greedy forests (RGF) is a machine learning approach that is not currently included in FLAML. RGF has many tuning parameters, the most critical of which are: [max_leaf, n_iter, n_tree_search, opt_interval, min_samples_leaf]. To run the custom/new learner, the user needs to provide the following information:

  • Implementation of a custom/new learner
  • A list of hyperparameter names and types
  • Rough range of hyperparameters (i.e., upper/lower bounds)

In the sample code below, RGF information is packed in a python class called MyRegularizedGreedyForest.

from flaml.model import SKLearnEstimator
from flaml import tune
from flaml.data import CLASSIFICATION

class MyRegularizedGreedyForest(SKLearnEstimator):
    def __init__(self, task='binary', **config):
        '''Constructor

        Args:
            task: A string of the task type, one of
                'binary', 'multi', 'regression'
            config: A dictionary containing the hyperparameter names
                and 'n_jobs' as keys. n_jobs is the number of parallel threads.
        '''
        super().__init__(task, **config)

        # task is 'binary' or 'multi' for classification tasks
        if task in CLASSIFICATION:
            from rgf.sklearn import RGFClassifier

            self.estimator_class = RGFClassifier
        else:
            from rgf.sklearn import RGFRegressor

            self.estimator_class = RGFRegressor

    @classmethod
    def search_space(cls, data_size, task):
        '''[required method] search space

        Returns:
            A dictionary of the search space.
            Each key is the name of a hyperparameter, and value is a dict with
                its domain (required) and low_cost_init_value, init_value,
                cat_hp_cost (if applicable).
                e.g., {'domain': tune.randint(lower=1, upper=10), 'init_value': 1}.
        '''
        space = {
            'max_leaf': {'domain': tune.lograndint(lower=4, upper=data_size[0]), 'init_value': 4, 'low_cost_init_value': 4},
            'n_iter': {'domain': tune.lograndint(lower=1, upper=data_size[0]), 'init_value': 1, 'low_cost_init_value': 1},
            'n_tree_search': {'domain': tune.lograndint(lower=1, upper=32768), 'init_value': 1, 'low_cost_init_value': 1},
            'opt_interval': {'domain': tune.lograndint(lower=1, upper=10000), 'init_value': 100},
            'learning_rate': {'domain': tune.loguniform(lower=0.01, upper=20.0)},
            'min_samples_leaf': {'domain': tune.lograndint(lower=1, upper=20), 'init_value': 20},
        }
        return space

    @classmethod
    def size(cls, config):
        '''[optional method] memory size of the estimator in bytes

        Args:
            config - the dict of the hyperparameter config

        Returns:
            A float of the memory size required by the estimator to train the
            given config
        '''
        max_leaves = int(round(config['max_leaf']))
        n_estimators = int(round(config['n_iter']))
        return (max_leaves * 3 + (max_leaves - 1) * 4 + 1.0) * n_estimators * 8

    @classmethod
    def cost_relative2lgbm(cls):
        '''[optional method] relative cost compared to lightgbm'''
        return 1.0
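Most of the domains in this search space are log-scaled, which means values are spread evenly across orders of magnitude rather than evenly across the raw range. The sketch below illustrates that idea with the standard library only; it is not FLAML's sampler, and the helper names and the choice to treat the upper bound of the integer version as exclusive are assumptions.

```python
import math
import random


def sample_loguniform(lower, upper, rng):
    """Draw a float whose logarithm is uniform between log(lower) and log(upper)."""
    return math.exp(rng.uniform(math.log(lower), math.log(upper)))


def sample_lograndint(lower, upper, rng):
    """Integer version: floor a log-uniform draw and clamp it to [lower, upper - 1]."""
    value = int(sample_loguniform(lower, upper, rng))
    return min(upper - 1, max(lower, value))


rng = random.Random(0)
learning_rates = [sample_loguniform(0.01, 20.0, rng) for _ in range(1000)]
n_tree_search = [sample_lograndint(1, 32768, rng) for _ in range(1000)]

# Roughly half of the draws fall below the geometric mean of the bounds.
geo_mean = math.sqrt(0.01 * 20.0)
print(sum(lr < geo_mean for lr in learning_rates) / len(learning_rates))
```

For a range like 0.01 to 20, a plain uniform draw would almost never produce values below 0.1, whereas the log-scaled draw visits each decade about equally often.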

(2) Run AutoML with the custom model

After adding RGF to the list of learners, we run AutoML, which tunes the hyperparameters of RGF along with those of the default learners.

automl = AutoML()
automl.add_learner(learner_name='RGF', learner_class=MyRegularizedGreedyForest)

# Add configuration
settings = {
    "time_budget": 10,  # total running time in seconds
    "metric": 'accuracy',
    "estimator_list": ['RGF', 'lgbm', 'rf', 'xgboost'],  # list of ML learners
    "task": 'classification',  # task type
    "log_file_name": 'airlines_experiment_custom_learner.log',  # flaml log file
    "log_training_metric": True,  # whether to log training metric
}
automl.fit(X_train=X_train, y_train=y_train, **settings)

(3) Custom optimization metric

We can customize optimization metrics for the model. In the example code below, we combine training loss and validation loss as custom optimization metrics and optimize them to minimize the loss.

def custom_metric(X_val, y_val, estimator, labels, X_train, y_train,
                  weight_val=None, weight_train=None, config=None,
                  groups_val=None, groups_train=None):
    from sklearn.metrics import log_loss
    import time
    start = time.time()
    y_pred = estimator.predict_proba(X_val)
    pred_time = (time.time() - start) / len(X_val)
    val_loss = log_loss(y_val, y_pred, labels=labels,
                        sample_weight=weight_val)
    y_pred = estimator.predict_proba(X_train)
    train_loss = log_loss(y_train, y_pred, labels=labels,
                          sample_weight=weight_train)
    alpha = 0.5
    # two elements are returned:
    # the first element is the metric to minimize as a float number,
    # the second element is a dictionary of the metrics to log
    return val_loss * (1 + alpha) - alpha * train_loss, {
        "val_loss": val_loss, "train_loss": train_loss, "pred_time": pred_time
    }

automl = AutoML()
settings = {
    "time_budget": 10,  # total running time in seconds
    "metric": custom_metric,  # pass the custom metric function here
    "task": 'classification',  # task type
    "log_file_name": 'airlines_experiment_custom_metric.log',  # flaml log file
}
automl.fit(X_train=X_train, y_train=y_train, **settings)
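The value returned by custom_metric, val_loss * (1 + alpha) - alpha * train_loss, can be rewritten as val_loss + alpha * (val_loss - train_loss). In other words, it is the validation loss plus a penalty proportional to the gap between validation and training loss. The short sketch below checks that the two forms agree; the function names and loss values are made up for illustration.

```python
def combined_loss(val_loss, train_loss, alpha=0.5):
    """Form used in the custom metric above."""
    return val_loss * (1 + alpha) - alpha * train_loss


def gap_penalized_loss(val_loss, train_loss, alpha=0.5):
    """Equivalent form: validation loss plus a penalty on the generalization gap."""
    return val_loss + alpha * (val_loss - train_loss)


# Two hypothetical models with the same validation loss but different training loss.
well_fit = combined_loss(val_loss=0.60, train_loss=0.55)
overfit = combined_loss(val_loss=0.60, train_loss=0.30)

print(well_fit, overfit)
```

With equal validation loss, the model whose training loss is much lower (a larger gap, suggesting overfitting) receives the worse score, so the search is nudged toward configurations that generalize more evenly.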

3.3 Sklearn pipeline tuning

FLAML can work together with a scikit-learn Pipeline for automatic model tuning. Here we again use the Airlines dataset as the scenario to explain its usage.

(1) Load the data set

# Data set preparation
from flaml.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(
    dataset_id=1169, data_dir='./', random_state=1234, dataset_format='array')

(2) Build a modeling pipeline

import sklearn
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from flaml import AutoML
set_config(display='diagram')
imputer = SimpleImputer()
standardizer = StandardScaler()
automl = AutoML()
automl_pipeline = Pipeline([
    ("imputuer",imputer),
    ("standardizer", standardizer),
    ("automl", automl)
])
automl_pipeline

The output is as follows

Pipeline(steps=[('imputuer', SimpleImputer()),
                ('standardizer', StandardScaler()),
                ('automl', )])

(3) Parameter settings and AutoML fitting

# Settings
settings = {
    "time_budget": 60,  # total time limit (in seconds)
    "metric": 'accuracy',  # options: ['accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'f1', 'log_loss', 'mae', 'mse', 'r2']
    "task": 'classification',  # task type
    "estimator_list": ['xgboost', 'catboost', 'lgbm'],
    "log_file_name": 'airlines_experiment.log',  # flaml log file
}

# Fitting: parameters prefixed with "automl__" are forwarded to the "automl" step
automl_pipeline.fit(X_train, y_train,
                    automl__time_budget=settings['time_budget'],
                    automl__metric=settings['metric'],
                    automl__estimator_list=settings['estimator_list'],
                    automl__log_training_metric=True)
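The automl__ prefix follows the scikit-learn convention of addressing a pipeline step's parameter as step name, double underscore, parameter name. The helpers below are a simplified, plain-Python illustration of that naming scheme (they are not scikit-learn code, and their names are made up); they show how a whole settings dictionary could be prefixed in one go.

```python
def prefix_params(step_name, params):
    """Turn {'time_budget': 60} into {'automl__time_budget': 60}."""
    return {f"{step_name}__{key}": value for key, value in params.items()}


def split_param(name):
    """Split 'automl__time_budget' into ('automl', 'time_budget')."""
    step, _, param = name.partition("__")
    return step, param


settings = {
    "time_budget": 60,
    "metric": "accuracy",
    "estimator_list": ["xgboost", "catboost", "lgbm"],
}

fit_params = prefix_params("automl", settings)
for name, value in fit_params.items():
    print(split_param(name), value)
```

Under that convention, a call such as automl_pipeline.fit(X_train, y_train, **prefix_params("automl", settings)) would forward every entry of the dictionary to the automl step, instead of listing each one by hand.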

(4) Retrieve the optimal model

# Get the automl object from the pipeline
automl = automl_pipeline.steps[2][1]
# Get the best config and best learner
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1 - automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))
automl.model

The running results are as follows:

Best ML learner: xgboost
Best hyperparameter config: {'n_estimators': 63, 'max_leaves': 1797, 'min_child_weight': 0.07275175679381725, 'learning_rate': 0.06234183309508761, 'subsample': 0.9814772488195874, 'colsample_bylevel': 0.810466508891351, 'colsample_bytree': 0.8005378817953572, 'reg_alpha': 0.5768305704485758, 'reg_lambda': ..., 'FLAML_sample_size': 364083}
Best accuracy on validation data: 0.6721
Training duration of best run: 15.45 s
<flaml.model.XGBoostSklearnEstimator at 0x7f03a5eada00>

(5) Test set evaluation and model storage

# Model storage
import pickle
with open('automl.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

# Test set prediction and evaluation
y_pred = automl_pipeline.predict(X_test)
print('Predicted labels', y_pred)
print('True labels', y_test)
y_pred_proba = automl_pipeline.predict_proba(X_test)[:, 1]
print('Predicted probas ', y_pred_proba[:5])

The result is as follows

Predicted labels [0 1 1 ... 0 1 0]
True labels [0 0 0 ... 1 0 1]
Predicted probas  [0.3764987 0.6126277 0.699604 0.27359942 0.25294745]

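3.4 Automatic Tuning of XGBoost

Here we briefly describe how to tune one of the most commonly used models, XGBoost, using FLAML. This example is a regression task, so X_train and y_train must hold a regression data set rather than the Airlines classification data used so far.

(1) Tool library import and basic settings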

# Import the tool library
from flaml import AutoML
automl = AutoML()

# Parameter settings
settings = {
    "time_budget": 120,  # total running time in seconds
    "metric": 'r2',  # primary metrics for regression can be chosen from: ['mae','mse','r2','rmse','mape']
    "estimator_list": ['xgboost'],  # list of ML learners; we tune xgboost in this example
    "task": 'regression',  # task type
    "log_file_name": 'houses_experiment.log',  # flaml log file
}

(2) Automated machine learning fitting

automl.fit(X_train=X_train, y_train=y_train, **settings)

(3) Optimal model and evaluation

We can output the optimal model configuration and details

# Optimal model
print('Best hyperparameter config:', automl.best_config)
print('Best r2 on validation data: {0:.4g}'.format(1 - automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))

Running results:

Best hyperparameter config: {'n_estimators': 776, 'max_leaves': 160, 'min_child_weight': 32.57408640781376, 'learning_rate': 0.034786853332414935, 'subsample': 0.9152991332236934, 'colsample_bylevel': 0.5656764254642628, 'colsample_bytree': 0.7313266091895249, 'reg_alpha': 0.005771390107656191, 'reg_lambda': ...}
Best r2 on validation data: 0.834
Training duration of best run: 9.471 s

We can retrieve the optimal model

automl.model.estimator

The results are as follows:

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.5656764254642628, colsample_bynode=1, colsample_bytree=0.7313266091895249, gamma=0, gpu_id=-1, grow_policy='lossguide', importance_type='gain', interaction_constraints='', learning_rate=0.034786853332414935, max_delta_step=0, max_depth=0, max_leaves=160, min_child_weight=32.57408640781376, missing=nan, monotone_constraints='()', n_estimators=776, n_jobs=-1, num_parallel_tree=1, random_state=0, reg_alpha=0.005771390107656191, reg_lambda=1.49126672786588, scale_pos_weight=1, subsample=0.9152991332236934, tree_method='hist', use_label_encoder=False, validate_parameters=1, verbosity=0)

Feature importance can also be plotted for the XGBoost model

import matplotlib.pyplot as plt
plt.barh(X_train.columns, automl.model.estimator.feature_importances_)
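A bar chart of feature importances is easier to read when the bars are sorted. The helper below is a small plain-Python sketch of that preprocessing step; the function name, feature names, and importance values are made up for illustration.

```python
def top_features(names, importances, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    pairs = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Hypothetical feature names and importances.
names = ["rooms", "income", "age", "population"]
importances = [0.10, 0.55, 0.05, 0.30]

ranked = top_features(names, importances, k=3)
for name, importance in ranked:
    print(f'{name}: {importance:.2f}')
```

The sorted names and values can then be passed to plt.barh in place of the unsorted columns.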

(4) Model storage

import pickle
with open('automl.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

(5) Test set estimation and model evaluation

# Test set prediction
y_pred = automl.predict(X_test)
print('Predicted labels', y_pred)
print('True labels', y_test)

# Test set evaluation
from flaml.ml import sklearn_metric_loss_score
print('r2', '=', 1 - sklearn_metric_loss_score('r2', y_pred, y_test))
print('mse', '=', sklearn_metric_loss_score('mse', y_pred, y_test))
print('mae', '=', sklearn_metric_loss_score('mae', y_pred, y_test))
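For reference, the three regression metrics printed above can be computed by hand. The plain-Python sketch below uses made-up values; note that r2 is a score (higher is better), which is why the code above reports it as 1 minus the loss, while mse and mae are already losses.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print('r2 =', r2(y_true, y_pred))
print('mse =', mse(y_true, y_pred))
print('mae =', mae(y_true, y_pred))
```

An r2 of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.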

3.5 Automatic Tuning of LightGBM

LightGBM's tuning process is very similar to XGBoost's; only the estimator specified in the parameter configuration needs to change. The rest is the same, as follows:

# Import the tool library
from flaml import AutoML
automl = AutoML()

# Parameter configuration
settings = {
    "time_budget": 240,  # total running time in seconds
    "metric": 'r2',  # primary metrics for regression can be chosen from: ['mae','mse','r2','rmse','mape']
    "estimator_list": ['lgbm'],  # list of ML learners; we tune lightgbm in this example
    "task": 'regression',  # task type
    "log_file_name": 'houses_experiment.log',  # flaml log file
    "seed": 7654321,  # random seed
}

# Automated machine learning fitting and tuning
automl.fit(X_train=X_train, y_train=y_train, **settings)

References

  • Illustrated machine learning algorithms | from beginner to master series
  • Airlines dataset
  • Example notebook code
