
Tom and Jerry, Episode 70 — Doraemon (1952)

In my daily work I often use AutoML (Automated Machine Learning). I have also used AutoML several times to assist my main model when participating in ML contests, and I have taken part in AutoML contests twice. I think the idea behind AutoML, automating the modeling process, is great, but the field is overhyped. In the future, important concepts such as feature engineering and meta-learning for hyperparameter optimization could unlock AutoML's potential. For now, though, one-stop AutoML is a tool that just adds cost.

All of the data and operations discussed below concern tabular data.

What is AutoML?

The data science process

Data science projects involve several basic steps: framing the question from a business perspective (choosing the task and the metric of success), collecting data (gathering, cleaning, exploring), modeling and evaluating model performance, and deploying the model to production and monitoring how it performs.

The Cross-Industry Standard Process for Data Mining (CRISP-DM)

Each part of the process is critical to the success of the project, but machine learning experts tend to see modeling as the core. A well-designed machine learning model can bring a lot of potential value to a company.

Given a dataset and a goal (a metric whose value should be as high as possible), the data scientist solves an optimization problem during modeling. The process is complex and requires a variety of skills:

1. Feature engineering is an art as much as a science;

2. Hyperparameter optimization requires a deep understanding of the algorithms and core ML concepts;

3. Software engineering skills are needed to produce simple, understandable, easy-to-use code.

That’s where AutoML comes in.

ML modeling is an art, a science, and software engineering.

AutoML

The input to AutoML is data and a task (classification, regression, recommendation, and so on); the output is a production-ready model that can predict on unseen data. Every decision in a data-driven pipeline is a hyperparameter, and the key to AutoML is to find hyperparameters that give a good score in a reasonable amount of time.
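To make this concrete, here is a minimal sketch of that search, assuming a toy setup with scikit-learn: every pipeline choice (imputation strategy, model family) is treated as a hyperparameter, and a random search under a fixed trial budget keeps the best-scoring pipeline. This is an illustration of the idea, not the internals of any particular AutoML library.

```python
import random

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# Every pipeline decision is a hyperparameter: how to impute,
# which model family to use, how that model is configured.
search_space = {
    "impute": ["mean", "median"],
    "model": [
        lambda: LogisticRegression(max_iter=1000),
        lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    ],
}

random.seed(0)
best_score, best_pipe = -1.0, None
for _ in range(4):  # the time budget, expressed here as a trial count
    pipe = make_pipeline(
        SimpleImputer(strategy=random.choice(search_space["impute"])),
        StandardScaler(),
        random.choice(search_space["model"])(),
    )
    score = cross_val_score(pipe, X, y, cv=3, scoring="roc_auc").mean()
    if score > best_score:
        best_score, best_pipe = score, pipe

print(round(best_score, 3))
```

Real AutoML systems replace the random choices with smarter search (Bayesian optimization, genetic programming), but the shape of the problem is the same.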

• AutoML handles data preprocessing: how to deal with class imbalance; how to fill null values; whether to remove, replace, or keep outliers; how to encode categorical and multi-category columns; how to avoid target leakage; how to avoid memory errors; and so on;

• AutoML generates and selects meaningful new features;

• AutoML selects a model (linear models, K-nearest neighbors, gradient boosting, neural networks, etc.);

• AutoML tunes the hyperparameters of the selected model (e.g., the number of trees and the subsampling rate in tree models, or the learning rate and number of epochs in neural networks);

• If possible, AutoML builds an ensemble of models to further improve the score.

The driving forces behind AutoML

AutoML will fill the gap between “supply” and “demand” in the data science market

Today, more and more companies are either just starting to collect data or want to extract value from the data they already have, and everyone wants a piece of the action. However, there are few data scientists experienced enough to meet companies' needs. Supply and demand in the market are out of balance, and the gap keeps widening. AutoML promises to fill it.

But can AutoML create value as a one-stop solution? In my opinion, the answer is no.

What these companies need is a process; AutoML is just a tool, and a lack of strategy cannot be compensated for by better tooling. Before you start using AutoML, consider working with a consulting firm to develop a data science strategy. Tellingly, most AutoML vendors offer consulting alongside their solutions.

Not a good plan, it seems (South Park, S2e17)

AutoML can help data science teams save time

According to the 2018 Kaggle Machine Learning and Data Science Survey, 15 to 26 percent of a typical data science project's time is spent on modeling or model selection. The process costs both human effort and computation time, and it has to be repeated whenever the goal or the data changes (for example, when new features are added). AutoML can save data scientists this time so they can focus on more important things (like office-chair sword fighting).

It only takes a few lines of code to use AutoML

If a company's data science team says that modeling is not its most important task, then something is clearly wrong with the company's data science process. Often, even a small improvement in model performance brings a large benefit to the company, in which case the time spent on modeling pays for itself.

An oversimplified rule:

If (gain from model > cost of DS team's time) → the time savings are not needed.

If (gain from model <= cost of DS team's time) → are you solving the right problem?

Writing scripts for your data science team's routine tasks can save you time down the road, and it is a much better option than a one-stop solution. I have written scripts that automate daily tasks, including automated feature generation, feature filtering, model training, and hyperparameter tuning, and I use them every day.
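As an illustration, one such scripted task might look like the feature-filtering helper below. The function name and thresholds are hypothetical, my own sketch of the idea rather than the author's actual scripts: it drops near-constant columns and one column of each highly correlated pair.

```python
import numpy as np


def filter_features(X, names, var_tol=1e-9, corr_tol=0.98):
    """Drop near-constant columns, then drop one column of each
    highly correlated pair. Thresholds are illustrative."""
    X = np.asarray(X, dtype=float)
    # Keep only columns with non-trivial variance (avoids NaN correlations).
    keep = [i for i in range(X.shape[1]) if X[:, i].var() > var_tol]
    selected = []
    for i in keep:
        # Keep column i only if it is not too correlated with
        # any column we have already selected.
        if all(abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) < corr_tol
               for j in selected):
            selected.append(i)
    return X[:, selected], [names[i] for i in selected]


X = np.array([[1.0, 1.0, 2.0],
              [2.0, 1.0, 4.0],
              [3.0, 1.0, 6.0]])
Xf, kept = filter_features(X, ["a", "const", "2a"])
print(kept)  # the constant column and the duplicate of "a" are dropped
```

A small library of helpers like this, written once, automates the boring part of the pipeline without handing the whole process to a black box.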

Does AutoML outperform the average data scientist?

Unfortunately, apart from the open-source AutoML Benchmark, we have no benchmarks that effectively compare tabular AutoML with human data scientists. A few months ago, on July 1, 2019, several authors published an article comparing the performance of a tuned random forest with several AutoML libraries. The results are shown below:

Out of curiosity, I decided to set up my own benchmark. I compared the performance of my model with AutoML solutions on three binary classification datasets: credit data, KDD promotion data, and mortgage loan data. Each dataset was randomly split into a training set of 60% (stratified by target) and a test set of the remaining 40%.
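A split like this can be reproduced with scikit-learn's `train_test_split`; the synthetic dataset below is only a stand-in for the three real datasets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A stand-in dataset with an imbalanced binary target (~10% positives).
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# 60% train / 40% test, stratified by the target as in the benchmark.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)

print(len(X_tr), len(X_te))  # 600 400
```

Stratification matters here: with imbalanced targets, a plain random split can leave the train and test sets with noticeably different positive rates.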

My baseline model is very simple, with no deep data exploration and no advanced features:

1. Stratified KFold;

2. CatBoost Encoder for categorical columns (readers unfamiliar with the CatBoost Encoder can refer to my earlier article, Benchmarking Categorical Encoders);

3. Arithmetic operations (+, -, *, /) applied to pairs of numeric columns, capped at 500 new features;

4. LightGBM with default parameters;

5. An ensemble of out-of-fold (OOF) rank predictions.
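Step 3 of this baseline can be sketched as follows. This is my own illustrative reconstruction of the pairwise-feature idea (the CatBoost Encoder, LightGBM, and the OOF ensemble are omitted for brevity); the function name and the division-by-zero handling are assumptions, not the author's code.

```python
import itertools

import numpy as np


def pairwise_features(X, limit=500):
    """Generate +, -, *, / combinations of numeric column pairs,
    capped at `limit` new features."""
    X = np.asarray(X, dtype=float)
    ops = (
        np.add,
        np.subtract,
        np.multiply,
        # Guard division: put 0 where the denominator is 0.
        lambda a, b: np.divide(a, b, out=np.zeros_like(a), where=b != 0),
    )
    new_cols = []
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        for op in ops:
            if len(new_cols) >= limit:
                return np.column_stack(new_cols)
            new_cols.append(op(X[:, i], X[:, j]))
    return np.column_stack(new_cols) if new_cols else X[:, :0]


X = np.random.RandomState(0).rand(10, 4)
F = pairwise_features(X, limit=500)
print(F.shape)  # 10 rows, C(4,2) pairs * 4 ops = 24 new features
```

With a 500-feature cap, wide tables are truncated early, which keeps the feature matrix (and LightGBM's training time) bounded.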

I used two AutoML libraries, H2O and TPOT, and trained them with several time budgets ranging from 15 minutes to 6 hours. The results were quite surprising. The metric is:

Score = (ROC AUC / ROC AUC of my baseline) * 100%
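In other words, each AutoML run is scored relative to the baseline's ROC AUC, so a value below 100% means AutoML lost to the baseline:

```python
def relative_score(model_auc: float, baseline_auc: float) -> float:
    """Score = (ROC AUC / ROC AUC of my baseline) * 100%."""
    return model_auc / baseline_auc * 100


# A hypothetical AutoML run that underperforms the baseline:
print(relative_score(0.75, 0.80))  # ~93.75, i.e. below 100%
```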

First, my baseline model beat AutoML in almost every case. This was a bit of a letdown: I had been planning to let AutoML do the dirty work while I relaxed, but that is not going to happen.

Second, AutoML's score did not improve as the training budget increased. No matter how long we wait, whether 15 minutes or six hours, we get the same low score.

AutoML doesn’t always get high marks.

Conclusion

1. If the company is new to data science, hire a consultant.

2. Automate your work whenever possible.

3. A one-stop solution is not a good option — it scores too low.

PS: Don’t treat the engine like a car

This article is all about tools, but it’s important to realize that modeling is only one part of the pipeline of a data science project. I think the analogy of a project to a car is very appropriate. In this way, the output of the modeling — the machine learning model — can be thought of as a car engine.

There is no doubt that the engine is the most essential component of a car, but it is not the whole car. No amount of time spent creating thoughtful, complex features, choosing neural network architectures, or tuning model parameters will pay off if you ignore the rest.

If the problem is framed in the wrong direction (the business perspective), if the data is biased and the model needs retraining (data exploration), or if the model is too complex to run in production (the deployment phase), then the model will be of no practical use, no matter how high it scores.

In the end, you might find yourself in a silly situation: after days or weeks of painstaking modeling, you’re riding a slow bike with a car engine in its basket.

Tools are the foundation; strategy is the key.
