Simple use

    >>> import autosklearn.classification
    >>> cls = autosklearn.classification.AutoSklearnClassifier()
    >>> cls.fit(X_train, y_train)
    >>> predictions = cls.predict(X_test)

This example is from the front page of the official website. auto-sklearn's fit and predict interfaces closely mirror sklearn's, but fit here automatically searches for the best machine learning algorithm as well as the best data preprocessing algorithm, and predict returns the predictions of the best model found.

The fit() method defaults to time_left_for_this_task=3600, and the automatic tuning time of each algorithm defaults to per_run_time_limit=360. The officially recommended budget is 24 hours, that is, time_left_for_this_task=86400, with one hour of computation per model, that is, per_run_time_limit=3600.

Note: auto-sklearn really does need a lot of time to reach a good final result in real tests. With only a few hours it is likely to produce a worse result than manual tuning. You can speed up the search by restricting the algorithm selection. Although auto-sklearn generally achieves better results than manual tuning given sufficient computation time, I think you will often still need other methods such as sklearn's built-in grid search or random search, or an optimizer such as Hyperopt.
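For comparison, a manual search with sklearn's built-in grid search might look like the following sketch (the tiny parameter grid here is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# illustrative grid; a real search would cover a much wider space
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```

Unlike auto-sklearn, the search space here is fixed by hand, but it finishes in seconds rather than hours.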

Restricted search domain

Example:

    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120, per_run_time_limit=120,
        include_estimators=["random_forest"], exclude_estimators=None,
        include_preprocessors=["no_preprocessing"], exclude_preprocessors=None)

include_estimators lists the estimators to be searched, and exclude_estimators lists the estimators to exclude from the search. The two are mutually exclusive; see the manual for details.

Preprocessing in auto-sklearn is divided into two stages: data preprocessing and feature preprocessing. Data preprocessing includes one-hot encoding of categorical features, handling of missing values, and normalization of features and samples; it currently cannot be turned off. Feature preprocessing is a single transformer (sklearn's name for a feature-processing algorithm, just as estimator is its name for a machine learning algorithm) that performs feature selection or transforms the features into a different space (e.g., PCA). As shown in the code above, it can be turned off with include_preprocessors=["no_preprocessing"].

exclude_preprocessors works analogously to exclude_estimators: the listed preprocessors are excluded from the search. It is incompatible with include_preprocessors.

Save computational data and models

Save operational data

From the earlier examples we know that auto-sklearn produces two folders while running, which hold the files generated by the run, and whose locations we can set ourselves. However, both folders are deleted by default once the run finishes, so if we want to inspect the files inside them, we need to set the corresponding parameters:

    # the first example, with the run data saved
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120, per_run_time_limit=120,
        tmp_folder='/home/fonttian/Data/Auto-sklearn/tmp/example_output_example_tmp',
        output_folder='/home/fonttian/Data/Auto-sklearn/tmp/example_output_example_out',
        delete_tmp_folder_after_terminate=False,
        delete_output_folder_after_terminate=False)

As shown in the code above, you set the locations of the two folders and then set delete_tmp_folder_after_terminate=False and delete_output_folder_after_terminate=False. With both parameters set to False, the two folders are kept after the run completes instead of being deleted.

With the parameters above, you can also enable SMAC's shared model by setting shared_mode=True.

Save the model

According to the front page of the official website, model persistence works just as in sklearn. We can do something like the following:

    import pickle

    s = pickle.dumps(automl)
    with open('example_output_pickle.pkl', 'wb') as f:
        f.write(s)
    with open('example_output_pickle.pkl', 'rb') as f:
        s2 = f.read()
    clf = pickle.loads(s2)
    predictions = clf.predict(X_test)
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

Or use joblib to store and load the model. There is little difference between the two, though joblib is generally the better choice for models containing large arrays. Note, however, that as versions change, the same .pkl file may not load correctly under a different version:

    from sklearn.externals import joblib

    joblib.dump(automl, 'example_output_joblib.pkl')
    clf = joblib.load('example_output_joblib.pkl')

The rest of the evaluation functions are covered in a separate note.

Prevent overfitting

The parameters for preventing overfitting are resampling_strategy and resampling_strategy_arguments.

The former is a string parameter with the following options:

  • ‘holdout’ : splits the data set into train:test; the default train_size is 0.67, that is, train:test = 0.67:0.33
  • ‘holdout-iterative-fit’ : same split as above, but uses iterative fitting where possible
  • ‘cv’ : performs cross-validation

The latter is a dictionary whose keys depend on the chosen strategy:

  • ‘holdout’ : {‘train_size’: float}
  • ‘holdout-iterative-fit’ : {‘train_size’: float}
  • ‘cv’ : {‘folds’: int}

In addition, if the two parameters above are used, refit() must be called once after fit() has completed and the best model has been selected, so that the model can be retrained on the full training data before prediction. Calling predict() directly after fit() alone, without refit(), will raise the following error:

    'strategy %s, please call refit().' % self._resampling_strategy)
    NotImplementedError: Predict is currently not implemented for resampling strategy cv, please call refit().

Chinese translation links

  • Website home page
  • AutoSklearn manual