From Dataquest

By Sebastian Flennerhag

Compiled by Heart of the Machine

Ensembling, which combines the predictions of multiple machine learning models to achieve accuracy that no single model can match, has become a staple of almost every winning Kaggle solution. So how do we build ensembles in Python? Sebastian Flennerhag, a PhD candidate in the School of Computer Science and Social Statistics at the University of Manchester, gives a brief introduction.


Stack models efficiently in Python

Ensembles are fast becoming the hottest and most popular approach in applied machine learning. Today, nearly every winning Kaggle solution uses an ensemble, as do many data science pipelines.

Simply put, an ensemble combines the predictions of different models to produce a final prediction, and as a rule of thumb, the more models we combine, the better the result. Moreover, because an ensemble combines different baseline predictions, it performs at least as well as the best baseline model. Ensembles give us a performance boost almost for free!

Ensemble diagram. An input array X is fed through two preprocessing pipelines to a set of base learners f(i). The ensemble combines all base-learner predictions into the final prediction array P. (Image source: ml-ensemble.com/)


This article covers the basics of ensembles: what they are and why they work, and provides a hands-on tutorial for building basic ensembles. By reading this article, you will:

  • Understand the basics of ensembles;
  • Know how to code an ensemble;
  • Understand the main advantages and disadvantages of ensembles.


Predicting Republican and Democratic donations

In this article, we will use a U.S. political contributions data set to explain how ensembles work. The data set was originally prepared by FiveThirtyEight's Ben Wieder, who looked through U.S. government records of political donations and found that when scientists give money to politicians, they usually choose Democrats.

The claim is based on the share of donations going to Republicans versus Democrats, but there is more to dig into: which disciplines are most likely to give to Republicans, for example, or which states are most likely to give to Democrats. We will go a step further and predict which party a donation is most likely to go to.

The data used here is a slightly modified version (www.dataquest.io/blog/large_…). We removed donations to parties other than Republicans and Democrats to make the exposition clearer, and also dropped duplicates and less interesting features. The data-processing script is available at: www.dataquest.io/blog/large_… . The data looks as follows:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

### Import data
# Always good to set a seed for reproducibility
SEED = 222
np.random.seed(SEED)

df = pd.read_csv('input.csv')

### Training and test set
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def get_train_test(test_size=0.95):
    """Split Data into train and test sets."""
    y = 1 * (df.cand_pty_affiliation == "REP")
    X = df.drop(["cand_pty_affiliation"], axis=1)
    X = pd.get_dummies(X, sparse=True)
    X.drop(X.columns[X.std() == 0], axis=1, inplace=True)
    return train_test_split(X, y, test_size=test_size, random_state=SEED)

xtrain, xtest, ytrain, ytest = get_train_test()

# A look at the data
print("\nExample data:")
df.head()

df.cand_pty_affiliation.value_counts(normalize=True).plot(
   kind="bar", title="Share of No. donations")
plt.show()

The figure above is the statistical basis for Ben’s claim. Indeed, 75 percent of donations go to Democrats. Let’s look at the features that can be used. We have the donor, transaction details, and recipient data:

We will use ROC-AUC to evaluate model performance. If you haven't used this metric before: a random guess scores 0.5, while a model that ranks every positive above every negative scores a perfect 1.0.
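As a quick illustration of the scale of the metric, the labels and scores below are toy values (not part of the donations data), just to show how ranking quality maps to the score:

# A tiny, made-up illustration of the ROC-AUC scale; labels and scores are toy values.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
print(roc_auc_score(y_true, [0.1, 0.2, 0.7, 0.9]))  # 1.0: every positive ranked above every negative
print(roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5]))  # 0.5: no ranking information, like random guessing
print(roc_auc_score(y_true, [0.9, 0.7, 0.2, 0.1]))  # 0.0: every positive ranked below every negative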


What is an ensemble?

Imagine you're playing trivia. Playing alone, there may be some questions you have no clue about. If we want to score high, we need to assemble a team that covers all the relevant topics. This is the basic idea behind ensembles: combining the predictions of multiple models averages out their idiosyncratic errors and yields better overall predictions.

An important question is how to combine the predictions. In the trivia example, it is easy to imagine the team using majority voting to decide which answer to pick. Machine learning classification is no different: predicting the most common class label corresponds to a majority-voting rule. But there are many other ways to combine predictions, and more generally we can use yet another model to learn how best to combine them.

The structure of a basic ensemble. The input is fed to a series of models, and a meta-learner combines their predictions. (Image source: flennerhag.com/2017-04-18-…)


Understanding ensembles through decision trees

We will illustrate ensembles with a simple, interpretable model: a decision tree, which predicts using if-then rules. The deeper the tree, the more complex the patterns it can capture, but also the more likely it is to overfit. So we need another way to build complex models out of decision trees, and an ensemble of different decision trees is one such way.

We use the following helper functions to visualize the decision tree:

import pydotplus  # you can install pydotplus with: pip install pydotplus 
from IPython.display import Image
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

def print_graph(clf, feature_names):
    """Print decision tree."""
    graph = export_graphviz(
        clf,
        label="root",
        proportion=True,
        impurity=False,
        out_file=None,
        feature_names=feature_names,
        class_names={0: "D", 1: "R"},
        filled=True,
        rounded=True
    )
    graph = pydotplus.graph_from_dot_data(graph)
    return Image(graph.create_png())

Let's fit a decision tree with a single node (one decision rule) on the training data and check its performance on the test set:

t1 = DecisionTreeClassifier(max_depth=1, random_state=SEED)
t1.fit(xtrain, ytrain)
p = t1.predict_proba(xtest)[:, 1]

print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t1, xtrain.columns)

Decision tree ROC-AUC score: 0.672


Each leaf node records the share of training samples it contains, the class distribution within it, and its predicted class label. Our decision tree bases its prediction on whether the transaction amount exceeds 101.5, yet it makes the same prediction on both sides! Given that 75 percent of donations go to Democrats, that is not surprising, but it doesn't take full advantage of the data we have. So let's use three levels of decision rules and see what we get:

t2 = DecisionTreeClassifier(max_depth=3, random_state=SEED)
t2.fit(xtrain, ytrain)
p = t2.predict_proba(xtest)[:, 1]

print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t2, xtrain.columns)


Decision tree ROC-AUC score: 0.751

This model is not much better than the single-rule tree: it predicts that a mere 5% of donations go to Republicans, well below the true 25%. A closer look shows that the tree uses many dubious splitting rules: a whopping 47.3% of observations end up in the leftmost leaf node, and another 35.9% in the second leaf from the right. Many leaves therefore add little value, and making the tree deeper would only lead to overfitting.

Holding depth fixed, a decision tree can be made more complex by increasing its "width", that is, by creating several decision trees and joining them together: an ensemble of decision trees. To see why such an ensemble helps, consider how we might get a decision tree to explore different patterns from the tree above. The simplest solution is to remove features that appear early in the tree. If we drop the transaction amount feature (transaction_amt), the root of the tree above, the new decision tree looks like this:

drop = ["transaction_amt"]

xtrain_slim = xtrain.drop(drop, axis=1)
xtest_slim = xtest.drop(drop, axis=1)

t3 = DecisionTreeClassifier(max_depth=3, random_state=SEED)
t3.fit(xtrain_slim, ytrain)
p = t3.predict_proba(xtest_slim)[:, 1]


print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t3, xtrain_slim.columns)

Decision tree ROC-AUC score: 0.740


The ROC-AUC score is similar to that of the previous tree, but the predicted share of Republican donations rises to 7.3 percent: still low, but a bit higher than before. Importantly, in contrast to the first tree, whose rules mostly concerned the transaction itself, this tree focuses more on where the candidate lives. We now have two models with similar predictive power that operate on different rules. They are therefore likely to make different prediction errors, which we can average out with an ensemble.


Why does averaging predictions work?

Suppose we want to generate predictions for two observations, where the true label of the first is Republican and that of the second is Democrat. In this example, model 1 predicts Democrat and model 2 predicts Republican, as shown in the following table:

If we use the standard 50% cutoff for class prediction, each decision tree gets one observation right and one wrong. We create an ensemble by averaging the models' class probabilities. In this example, model 2 is confident in its prediction for observation 1, while model 1 is relatively uncertain. The ensemble weighs both predictions and ends up siding with model 2, correctly predicting Republican. For the second observation the tables are turned, and the ensemble correctly predicts Democrat:

If the ensemble contains more than two decision trees, it predicts according to the majority; that is why an ensemble that averages classifier predictions is also known as a majority-voting classifier. When the ensemble averages predicted probabilities it is called soft voting; averaging predicted class labels is called hard voting.
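To make the difference concrete, here is a small sketch with three hypothetical classifiers and made-up class-1 probabilities for two observations; it is only meant to show how soft and hard voting can disagree:

import numpy as np

# Rows are observations, columns are models; values are made-up predicted
# probabilities of class 1 (Republican).
probas = np.array([[0.40, 0.45, 0.90],   # observation 1
                   [0.60, 0.55, 0.10]])  # observation 2

# Soft voting: average the probabilities, then apply the 50% cutoff.
soft = (probas.mean(axis=1) >= 0.5).astype(int)
# Hard voting: apply the cutoff per model, then take the majority class.
hard = ((probas >= 0.5).sum(axis=1) > probas.shape[1] / 2).astype(int)

print("Soft voting:", soft)  # [1 0]: the confident third model tips the average
print("Hard voting:", hard)  # [0 1]: two of the three models out-vote it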

Of course, ensembling is not a panacea. You may have noticed in the example above that averaging only helps if the prediction errors are uncorrelated: if both models had made wrong predictions, the ensemble could not have corrected them. Moreover, under soft voting, a model that makes a wrong prediction with a high probability can pull the ensemble into the wrong call. In general, an ensemble will not get every prediction right, but it is expected to do better than its underlying models.


A forest is an ensemble of trees

Returning to our prediction problem, let's see if we can build an ensemble from our two decision trees. First, check the error correlation: highly correlated errors make for poor ensembles.

p1 = t2.predict_proba(xtest)[:, 1]
p2 = t3.predict_proba(xtest_slim)[:, 1]

pd.DataFrame({"full_data": p1,
"red_data": p2}).corr()

There is some correlation, but not too much: there is still plenty of room to exploit the variance in the predictions. To build the ensemble, we simply average the two models' predictions.

p1 = t2.predict_proba(xtest)[:, 1]
p2 = t3.predict_proba(xtest_slim)[:, 1]
p = np.mean([p1, p2], axis=0)
print("Average of decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))

Average of decision tree ROC-AUC score: 0.783


Indeed, the ensemble scores higher than either tree on its own. With more, and more diverse, trees we could improve the score even further. But which features should we drop when designing each tree?

A quick and effective approach is to randomly select a subset of features, fit one decision tree on each draw, and average their predictions. This process is known as bootstrap aggregating (usually abbreviated to bagging), and when applied to decision trees the resulting model is a random forest. Let's see what a random forest can do for us. We use the scikit-learn implementation to build an ensemble of 10 decision trees, each fit on a subset of 3 features.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
   n_estimators=10,
   max_features=3,
   random_state=SEED
)

rf.fit(xtrain, ytrain)
p = rf.predict_proba(xtest)[:, 1]
print("Average of decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))

Average of decision tree ROC-AUC score: 0.844

Random forest is a huge improvement on our previous model. But there are limits to what you can do with decision trees alone. It’s time to expand our horizons.


Ensembles as averaged predictions

So far, we have seen two important aspects of ensembles:

1. The less correlated the prediction errors, the better the ensemble

2. The more models, the better

For this reason, it is a good idea to use models that are as different as possible (as long as they perform well). So far we have relied on a simple average, but later we will see how to use more complex combinations. To keep track of progress, we can write the ensemble of n base learners f_i formally as the average e(x) = (1/n) ∑_i f_i(x).

There is no restriction on which models to include: decision trees, linear models, kernel-based models, non-parametric models, neural networks, or even other ensembles! Keep in mind, though, that the more models we include, the slower the ensemble becomes.

To build an ensemble of different models, we start by benchmarking a set of scikit-learn classifiers on the data set. To avoid repeating code, we use the helper functions below:

# A host of Scikit-learn models
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline


def get_models():
    """Generate a library of base learners."""
    nb = GaussianNB()
    svc = SVC(C=100, probability=True)
    knn = KNeighborsClassifier(n_neighbors=3)
    lr = LogisticRegression(C=100, random_state=SEED)
    nn = MLPClassifier((80, 10), early_stopping=False, random_state=SEED)
    gb = GradientBoostingClassifier(n_estimators=100, random_state=SEED)
    rf = RandomForestClassifier(n_estimators=10, max_features=3, random_state=SEED)

    models = {'svm': svc,
              'knn': knn,
              'naive bayes': nb,
              'mlp-nn': nn,
              'random forest': rf,
              'gbm': gb,
              'logistic': lr,
              }

    return models


def train_predict(model_list):
    """Fit models in list on training set and return preds"""
    P = np.zeros((ytest.shape[0], len(model_list)))
    P = pd.DataFrame(P)

    print("Fitting models.")
    cols = list()
    for i, (name, m) in enumerate(model_list.items()):
        print("%s..." % name, end="", flush=False)
        m.fit(xtrain, ytrain)
        P.iloc[:, i] = m.predict_proba(xtest)[:, 1]
        cols.append(name)
        print("done")

    P.columns = cols
    print("Done.\n")
    return P


def score_models(P, y):
    """Score model in prediction DF"""
    print("Scoring models.")
    for m in P.columns:
        score = roc_auc_score(y, P.loc[:, m])
        print("%-26s: %.3f" % (m, score))
    print("Done.\n")

We are now ready to create a prediction matrix P, in which each column holds the predictions of one model, and to score each model on the test set:

models = get_models()
P = train_predict(models)
score_models(P, ytest)

This is our baseline. The Gradient Boosting Machine (GBM) performs best, followed by simple logistic regression. For our ensemble strategy to work, the prediction errors must be relatively uncorrelated:

# You need ML-Ensemble for this figure: you can install it with: pip install mlens
from mlens.visualization import corrmat

corrmat(P.corr(), inflate=False)
plt.show()

Errors are significantly correlated, which is to be expected for models that perform reasonably well, since it is typically the outliers that are hard to get right. Still, most correlations fall in the 50-80% range, so there is decent room for improvement. In fact, if we look at error correlations in terms of class predictions, things look even more promising:

corrmat(P.apply(lambda pred: 1*(pred >= 0.5) - ytest.values).corr(), inflate=False)
plt.show()

To create an ensemble, we proceed as before and average the predictions; as expected, the ensemble outperforms the baselines. Averaging is a simple process, and if we store the model predictions, we can start with a simple ensemble and grow it on the fly as we train new models.

print("Ensemble ROC-AUC score: %.3f" % roc_auc_score(ytest, P.mean(axis=1)))

Ensemble ROC-AUC score: 0.884


Visualizing how ensembles work

We have already grasped the error-averaging mechanism behind ensembles: an ensemble smooths out decision boundaries by averaging away individual mistakes. A decision boundary shows us how an estimator carves the feature space into neighborhoods within which all observations are predicted to have the same class label. By averaging the base learners' decision boundaries, the ensemble obtains a smoother boundary that generalizes more naturally.

The figure below illustrates this on the iris data set, where the estimators try to classify three species of flower. The base learners all have some undesirable quirks in their boundaries, but the ensemble has a relatively smooth decision boundary that agrees with the observations. Remarkably, the ensemble both adds complexity to the model and acts as a regularizer!

Example decision boundaries for three models and their ensemble
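The original figure is not reproduced here, but a comparable plot is easy to sketch. The snippet below is a minimal, illustrative version on the iris data, using two features for plotting and an arbitrary trio of base learners combined with scikit-learn's VotingClassifier; it is not the exact setup behind the figure described above.

# Illustrative sketch of decision boundaries for three base learners and a soft-vote
# ensemble on iris; models and settings are arbitrary, chosen only for the plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

X, y = load_iris(return_X_y=True)
X = X[:, [0, 2]]  # sepal length and petal length, to keep the plot two-dimensional

base = [("tree", DecisionTreeClassifier(max_depth=4, random_state=SEED)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("logit", LogisticRegression(max_iter=1000))]
ensemble = ("soft vote ensemble", VotingClassifier(base, voting="soft"))

# Grid over the feature space on which to evaluate each decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, clf) in zip(axes, base + [ensemble]):
    clf.fit(X, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)        # predicted class regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)  # actual observations
    ax.set_title(name)
plt.show()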


When the task is classification, another way to understand an ensemble is to inspect the Receiver Operating Characteristic (ROC) curve, which shows how an estimator trades off precision against recall. Typically, different base learners make different trade-offs: some achieve higher precision by sacrificing recall, and others do the opposite.

A nonlinear meta-learner, on the other hand, can adjust which models it relies on for each prediction. This means it can greatly reduce unnecessary sacrifices, maintaining high precision while increasing recall (or vice versa). In the figure below, the ensemble gives up much less precision in order to increase recall.

from sklearn.metrics import roc_curve

def plot_roc_curve(ytest, P_base_learners, P_ensemble, labels, ens_label):
    """Plot the roc curve for base learners and ensemble."""
    plt.figure(figsize=(10, 8))
    plt.plot([0, 1], [0, 1], 'k--')

    cm = [plt.cm.rainbow(i)
          for i in np.linspace(0, 1.0, P_base_learners.shape[1] + 1)]

    for i in range(P_base_learners.shape[1]):
        p = P_base_learners[:, i]
        fpr, tpr, _ = roc_curve(ytest, p)
        plt.plot(fpr, tpr, label=labels[i], c=cm[i + 1])

    fpr, tpr, _ = roc_curve(ytest, P_ensemble)
    plt.plot(fpr, tpr, label=ens_label, c=cm[0])

    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(frameon=False)
    plt.show()


plot_roc_curve(ytest, P.values, P.mean(axis=1), list(P.columns), "ensemble")

Ensembles beyond a simple average

But given the variation in prediction errors, wouldn't you expect a larger improvement? One issue is that some models perform considerably worse than others, yet they influence the average just as much. This can be devastating with imbalanced data sets: under soft voting, a single model that makes extreme predictions (i.e. close to 0 or 1) can dominate the averaged prediction.

An important factor for us is whether the models capture the true share of Republican donations. A quick check shows that all models predict too low a share of Republican donations, some much worse than others:

p = P.apply(lambda x: 1*(x >= 0.5).value_counts(normalize=True))
p.index = ["DEM", "REP"]
p.loc["REP", :].sort_values().plot(kind="bar")
plt.axhline(0.25, color="k", linewidth=0.5)
plt.text(0., 0.23, "True share republicans")
plt.show() 

We could try to improve the ensemble by removing the worst offender, the multi-layer perceptron (MLP):

include = [c for c in P.columns if c not in ["mlp-nn"]]
print("Truncated ensemble ROC-AUC score: %.3f" % roc_auc_score(ytest, P.loc[:, include].mean(axis=1)))

Truncated ensemble ROC-AUC score: 0.883


Not really an improvement: we need a smarter way of prioritizing models. Clearly, removing models from an ensemble is rather drastic, since the removed models may still carry important information. What we really want is to learn a sensible set of weights to use when averaging the predictions. This turns the ensemble into a parametric model that needs to be trained.


Learning to combine predictions

Learning a weighted average means that each model f_i gets a weight parameter ω_i ∈ (0, 1) that scales its prediction, with all weights required to sum to 1. The ensemble is now defined as e(x) = ∑_i ω_i f_i(x).

This is a small change from the previous definition, but the interesting point is that once the models have generated their predictions p_i = f_i(x), learning the weights is equivalent to fitting a linear regression on those predictions, y ≈ ∑_i ω_i p_i, subject to the constraints on the weights.

Moreover, we do not have to restrict ourselves to a linear model. Suppose instead we fit a nearest-neighbor model: the ensemble would then take a local average over the nearest neighbors of a given observation, allowing it to adapt to changes in model performance as the input changes.
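As a purely illustrative sketch of what learning constrained weights looks like, the snippet below fits non-negative least squares on the earlier prediction matrix P and normalizes the weights to sum to 1. It reuses P and ytest only for illustration; as the next section explains, the weights should really be learned on predictions for data the base learners were not trained on.

# Illustrative only: learning weights on the test-set prediction matrix P from
# earlier leaks information; the proper data split is covered in the next section.
from scipy.optimize import nnls

w, _ = nnls(P.values, ytest.values)  # non-negative least squares: ytest ≈ P · w
w = w / w.sum()                      # normalize so the weights sum to 1

weighted_pred = P.values.dot(w)
print("Learned weights:", dict(zip(P.columns, np.round(w, 3))))
print("Weighted ensemble ROC-AUC score: %.3f" % roc_auc_score(ytest, weighted_pred))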


Implementing an ensemble

To build this kind of ensemble, we need:

1. A library of base learners that generate predictions;

2. A meta-learner that learns how best to combine these predictions;

3. A method for splitting the training data between the base learners and the meta-learner.

The base learners take the raw input and generate a set of predictions. If our original data set is a matrix X of shape (n_samples, n_features), the base-learner library outputs a new prediction matrix P_base of shape (n_samples, n_base_learners), where each column holds the predictions of one base learner. The meta-learner is trained on P_base.

This makes it critical to handle the training set X properly. In particular, if we both train the base learners on X and use them to predict X, the meta-learner will be trained on the base learners' training errors, yet at test time it will face their test errors.

We therefore need a strategy for generating a prediction matrix P that reflects test errors. The simplest strategy is to split the full data set X in two: train the base learners on one half, have them predict the other half, and feed those predictions to the meta-learner. This is simple and fast, but some data goes unused. For small and medium-sized data sets the information loss can be severe, leaving both the base learners and the meta-learner performing poorly.

To ensure full coverage of the data set, we can use cross-validation instead. There are many ways to perform cross-validation, but before getting to that, let's implement the ensemble step by step.


Step 1: Define the library of base learners

These are the models that take the input data and generate predictions; they can be anything from linear regressions to neural networks to other ensembles. As always, diversity is strength! The only thing to keep in mind is that the more models we add, the slower the ensemble runs. Here, I'll reuse the earlier set of models:

base_learners = get_models()

Step 2: Define a meta-learner

There is no consensus on which meta-learner to use, but popular choices are linear models, kernel-based models (support vector machines and the KNN algorithm), and decision-tree-based models. You can even use another ensemble as the meta-learner: in that case you end up with a two-layer ensemble, somewhat reminiscent of a feed-forward neural network.

Here, we will use a Gradient Boosting Machine. To ensure the GBM explores local patterns, we restrict each of its 1,000 decision trees to train on a random subset of 4 base learners and 50% of the input data. This way, the GBM learns to weigh the base learners' predictions differently in different neighborhoods of the input space.

meta_learner = GradientBoostingClassifier(
    n_estimators=1000,
    loss="exponential",
    max_features=4,
    max_depth=3,
    subsample=0.5,
    learning_rate=0.005,
    random_state=SEED
)


Step 3: Define a procedure for generating training and prediction sets

For simplicity, we split the full training set into a training set and a prediction set for the base learners. This method is sometimes called "blending". Terminology differs between communities, though, so it is not always easy to know what type of cross-validation a given ensemble is using.

xtrain_base, xpred_base, ytrain_base, ypred_base = train_test_split(
    xtrain, ytrain, test_size=0.5, random_state=SEED)

We now have a training set (xtrain_base, ytrain_base) and a prediction set (xpred_base, ypred_base) for the base learners, and we are ready to generate the prediction matrix for the meta-learner.


Step 4: Train the base learner on the training set

To train the base learner on the training data, we run as usual:

def train_base_learners(base_learners, inp, out, verbose=True):
    """Train all base learners in the library."""
    if verbose: print("Fitting models.")
    for i, (name, m) in enumerate(base_learners.items()):
        if verbose: print("%s..." % name, end="", flush=False)
        m.fit(inp, out)
        if verbose: print("done")

To train the base learners, we run:

train_base_learners(base_learners, xtrain_base, ytrain_base)


Step 5: Generate base learner predictions

With the base learners fitted, we can now generate the predictions used to train the meta-learner. Note that we generate these predictions for observations that were not used to train the base learners: for each observation x(j) in the base-learner prediction set, we generate the set of base-learner predictions p(j) = (f_1(x(j)), ..., f_n(x(j))).

If you implement your own ensemble, pay special attention to how you index the rows and columns of the prediction matrix: with a two-way split of the data this is easy, but with full cross-validation it becomes more challenging.

def predict_base_learners(pred_base_learners, inp, verbose=True):
    """Generate a prediction matrix."""
    P = np.zeros((inp.shape[0], len(pred_base_learners)))

    if verbose: print("Generating base learner predictions.")
    for i, (name, m) in enumerate(pred_base_learners.items()):
        if verbose: print("%s..." % name, end="", flush=False)
        p = m.predict_proba(inp)
        # With two classes, need only predictions for one class
        P[:, i] = p[:, 1]
        if verbose: print("done")

    return P

To generate the predictions, we run:

P_base = predict_base_learners(base_learners, xpred_base)


Step 6: Train the meta-learner

The prediction matrix P_base reflects test-time performance and can be used to train the meta-learner:

meta_learner.fit(P_base, ypred_base)

That's it! We now have a fully trained ensemble that can be used to predict new data. To generate a prediction for an observation x(j), we first feed it to the base learners, which output the set of predictions p(j) = (f_1(x(j)), ..., f_n(x(j))), and we then feed these into the meta-learner, which returns the ensemble's final prediction.

Now that we have a clear picture of how ensemble learning works, it's time to see what it can predict on the political donations data set:

def ensemble_predict(base_learners, meta_learner, inp, verbose=True):
    """Generate predictions from the ensemble."""
    P_pred = predict_base_learners(base_learners, inp, verbose=verbose)
    return P_pred, meta_learner.predict_proba(P_pred)[:, 1]

To generate and score the predictions, we run:

P_pred, p = ensemble_predict(base_learners, meta_learner, xtest)
print("\nEnsemble ROC-AUC score: %.3f" % roc_auc_score(ytest, p))

Ensemble ROC-AUC score: 0.881

As expected, the ensemble beats the best estimator from the earlier benchmark, but it still does not beat the simple-average ensemble. That is because the base learners and the meta-learner were each trained on only half of the data, so a lot of information was lost. To prevent this, we need a cross-validation strategy.


Training with cross-validation

During cross-validated training of the base learners, a copy of each base learner is fitted on k-1 folds and used to predict the held-out fold; this is repeated until every fold has been predicted. The more folds we specify, the less data is held out from each fit, which makes the cross-validated predictions less noisy and better proxies for test-time performance, at the cost of a significantly longer training time. Fitting an ensemble with cross-validation is often called stacking, and the resulting ensemble is known as a Super Learner.

To understand how this works, think of it as adding an outer loop to our previous ensemble. The outer loop iterates over the folds, holding each out for prediction in turn while the rest of the data is used for training; the inner loop trains the base learners and generates the prediction data. Here is a simple stacking implementation:

from sklearn.base import clone

def stacking(base_learners, meta_learner, X, y, generator):
    """Simple training routine for stacking."""

    # Train final base learners for test time
    print("Fitting final base learners...", end="")
    train_base_learners(base_learners, X, y, verbose=False)
    print("done")

    # Generate predictions for training meta learners
    # Outer loop:
    print("Generating cross-validated predictions...")
    cv_preds, cv_y = [], []
    for i, (train_idx, test_idx) in enumerate(generator.split(X)):

        fold_xtrain, fold_ytrain = X[train_idx, :], y[train_idx]
        fold_xtest, fold_ytest = X[test_idx, :], y[test_idx]

        # Inner loop: step 4 and 5
        fold_base_learners = {name: clone(model)
                              for name, model in base_learners.items()}
        train_base_learners(
            fold_base_learners, fold_xtrain, fold_ytrain, verbose=False)

        fold_P_base = predict_base_learners(
            fold_base_learners, fold_xtest, verbose=False)

        cv_preds.append(fold_P_base)
        cv_y.append(fold_ytest)
        print("Fold %i done" % (i + 1))

    print("CV-predictions done")

    # Be careful to get rows in the right order
    cv_preds = np.vstack(cv_preds)
    cv_y = np.hstack(cv_y)

    # Train meta learner
    print("Fitting meta learner...", end="")
    meta_learner.fit(cv_preds, cv_y)
    print("done")

    return base_learners, meta_learner

Let's walk through the steps. First, we fit the final base learners on all of the data: in contrast to the blended ensemble above, the base learners used at test time are trained on the full training set. We then iterate over the folds, and within each fold over the base learners, to generate cross-validated predictions. These predictions are stacked to form the meta-learner's training set, so the meta-learner is also trained on the full data.

The basic difference between blending and stacking is therefore that stacking allows both the base learners and the meta-learner to train on the entire data set. Using 2-fold cross-validation, we can measure the difference this makes in our case:

from sklearn.model_selection import KFold

# Train with stacking
cv_base_learners, cv_meta_learner = stacking(
   get_models(), clone(meta_learner), xtrain.values, ytrain.values, KFold(2))

P_pred, p = ensemble_predict(cv_base_learners, cv_meta_learner, xtest, verbose=False)
print("\nEnsemble ROC-AUC score: %.3f" % roc_auc_score(ytest, p))

Ensemble ROC-AUC score: 0.889


Stacking yields a considerable performance boost: in fact, it achieves the best score so far. This behavior is typical for small and medium-sized data sets, where the information loss from blending matters. As the amount of data grows, the performance of blending and stacking converges.

Stacking has its drawbacks, however, especially speed. In general, when using cross-validation, we need to be aware of these issues:

1. Computational complexity

2. Structural complexity (risk of information leakage)

3. Memory usage

Understanding these is important for using ensembles efficiently, so let's go through each in turn.


1. Computational complexity

Suppose we stack with 10 folds. This requires training all base learners 10 times on 90% of the data, and then training them once more on the full data set. With 4 base learners, the ensemble takes roughly 40 times longer to train than the best single base learner.
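As a back-of-the-envelope check of that factor, assuming every base learner takes roughly the same time to fit:

# Rough count of base-learner fits for a 10-fold stack with 4 base learners,
# assuming roughly equal fit times per learner.
n_folds, n_base_learners = 10, 4
cv_fits = n_folds * n_base_learners  # out-of-fold fits, each on ~90% of the data
final_fits = n_base_learners         # refits on the full data for test time
print("Base learner fits: %d (vs. 1 fit for a single model)" % (cv_fits + final_fits))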

But each CV fit is independent of the others, so we do not need to fit the models sequentially. If we could fit all folds in parallel, the ensemble would be only about 4 times slower than the best base learner, a huge improvement. Ensembles are prime candidates for parallelization, and exploiting this mechanism as fully as possible is critical. If all folds of all models can be fit in parallel, the time penalty becomes negligible. To illustrate this, the figure below shows a benchmark from ML-Ensemble comparing the time needed to fit an ensemble by stacking or blending on four threads, sequentially versus in parallel.

Even with this moderate degree of parallelism, computation time drops considerably. However, parallelization brings a number of potentially thorny issues of its own, such as race conditions, deadlocks, and memory blow-up.
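As a minimal sketch of how the independent fold fits could be run in parallel, the snippet below uses joblib (which ships with scikit-learn) together with the helper functions defined earlier in this article; fit_fold is a hypothetical helper mirroring the inner loop of the stacking() routine, and this is not how ML-Ensemble is implemented internally.

# Sketch of parallelizing the cross-validated fits with joblib; fit_fold is a
# hypothetical helper, not part of the original tutorial.
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_fold(base_learners, X, y, train_idx, test_idx):
    """Fit clones of the base learners on one fold and predict the held-out part."""
    learners = {name: clone(m) for name, m in base_learners.items()}
    train_base_learners(learners, X[train_idx], y[train_idx], verbose=False)
    return predict_base_learners(learners, X[test_idx], verbose=False), y[test_idx]

folds = list(KFold(10).split(xtrain.values))
results = Parallel(n_jobs=4)(
    delayed(fit_fold)(get_models(), xtrain.values, ytrain.values, tr, te)
    for tr, te in folds)

cv_preds = np.vstack([p for p, _ in results])
cv_y = np.hstack([y_fold for _, y_fold in results])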


2. Structural complexity

When we decide to train the meta-learner on the full training set, we need to worry about information leakage. Leakage arises when the samples used to train the base learners are also the samples they generate predictions for, for instance by mixing up folds or using the wrong training subset. When information leaks into the meta-learner's training set, it learns from errors it will never see at test time: garbage in, garbage out. And finding such bugs is very difficult.


3. Memory usage

The last issue concerns parallelization, especially with multiprocessing in Python. In this case, each child process gets its own memory and needs a copy of all the relevant data from the parent process. An unoptimized implementation therefore copies all of the program's data, consuming a lot of memory and wasting time on serialization. Preventing this requires sharing data memory between processes, which in turn easily leads to data corruption.


Takeaway: use a toolkit

The upshot is that you should use a package that has been tested and built for your machine learning needs. In fact, once you use an ensemble toolkit, building ensembles becomes very simple: all you have to do is pick the base learners, the meta-learner, and a method for training the ensemble.

Fortunately, there are many toolkits available today for every popular programming language, albeit with different flavors. I'll list some of them at the end of this article. For now, let's take one of them and see how the ensemble approach does on the political donations data set. Here, we use ML-Ensemble to build the generalized ensemble from before, but now with 10-fold cross-validation.

from mlens.ensemble import SuperLearner

# Instantiate the ensemble with 10 folds
sl = SuperLearner(
   folds=10,
   random_state=SEED,
   verbose=2,
   backend="multiprocessing"
)

# Add the base learners and the meta learner
sl.add(list(base_learners.values()), proba=True) 
sl.add_meta(meta_learner, proba=True)

# Train the ensemble
sl.fit(xtrain, ytrain)

# Predict the test set
p_sl = sl.predict_proba(xtest)

print("\nSuper Learner ROC-AUC score: %.3f" % roc_auc_score(ytest, p_sl[:, 1]))

Fitting 2 layers
Processing layer-1             done | 00:02:03
Processing layer-2             done | 00:00:03
Fit complete                        | 00:02:08

Predicting 2 layers
Processing layer-1             done | 00:00:50
Processing layer-2             done | 00:00:02
Predict complete                    | 00:00:54

Super Learner ROC-AUC score: 0.890

It’s that simple!


Comparing the ROC curve of the Super Learner with that of the simple-average ensemble shows how the Super Learner, by using all of the data, sacrifices only a small amount of recall for a given level of precision.

plot_roc_curve(ytest, p.reshape(-1, 1), P.mean(axis=1), ["Simple average"], "Super Learner")

Going further

Beyond the ensembles described in this article there are many other variants, but the basics are always the same: a library of base learners, a meta-learner, and a training procedure. By tweaking how these components fit together, we can design all sorts of specific ensemble forms. For more advanced ensembles, see this article: mlwave.com/kaggle-ense… .

When it comes to software, everyone has their own preferences. As ensemble methods have grown in popularity, so has the number of ensemble toolkits. Ensembles actually caught on in the statistics community first, so R has many libraries built for them, and more tools have appeared for Python and other languages in recent years. Each toolkit caters to different needs and is at a different stage of maturity, so I recommend shopping around before deciding.


The following table lists some of these tools:

Original link: www.dataquest.io/blog/introd…

This article was compiled by Heart of the Machine; please contact this official account for reprint authorization.