Editor’s note: Insurance industry data scientist Alan Marazzi demonstrates in R the power and simplicity of decision tree-based models.

This is a brief, mostly non-technical introduction to decision tree-based models, together with an R implementation of each of them. Since this article is already long enough, some code has been omitted. But don’t worry: you can find the complete code in the companion repository: https://github.com/alanmarazzi/trees-forest

Why decision tree-based models are still so popular

Decision trees are a family of machine learning algorithms commonly used for classification. Because they are simple and effective, they are also among the first algorithms beginners learn. You probably won’t find them in the latest machine learning papers and research, but decision tree-based models are still widely used in real-world projects.

One of the main reasons for their widespread use is their simplicity and interpretability. Here’s a simple decision tree that predicts whether the weather will be cloudy or not.

This approach allows us to predict a variable from incoming data, but perhaps more importantly we can infer the relationships between the predictors. That means we can start at the bottom and see what factors contribute to the cloudiness.

For example, if the wind is light and it looks like it’s going to rain, it’s cloudy. For simple models, these rules can be learned and applied by humans, or we can generate a list to aid the decision-making process. By visualizing the decision tree, we can understand how the machine works, classifying some days as cloudy and others as not cloudy.

Although this may seem trivial, in many cases we need to know why the model makes certain predictions. Consider a model that predicts the risk for patients arriving with chest pain. After testing many advanced models, doctors wanted to figure out why the algorithm was sending home certain patients who were actually at risk. So they ran a decision tree-based model on the data and found that the algorithm rated asthmatic patients with chest pain as low risk.

This was a huge mistake. Doctors know very well that asthma plus chest pain must be treated immediately, which means such patients are admitted right away. You see the problem, right? The data used for modeling suggested these patients were low risk precisely because all of them had been treated promptly, so very few of them died afterwards.

When to use a decision tree-based model

As mentioned earlier, a decision tree is great when interpretability is important, even though it may only be used to understand what went wrong. In fact, decision tree-based models can become very complex, increasing accuracy at the expense of interpretability. There is a trade-off.

Another reason to use decision trees is that they are easy to understand and explain. Given a few strong predictors, you can build decision tree models that both machines and humans can apply. An example that just occurred to me is a decision tree that predicts whether a customer will eventually buy something.

Benchmarks are also where these methods shine: you quickly discover that, for classification tasks, even fairly simple decision tree-based models are hard to beat by a large margin. I personally often run a random forest (described below) on whatever data set I’m working on and then try to beat it.

Setting up R

Before you start, you may need to configure the R environment.

Install the following packages:

trees_packages <- c("FFTrees",
                    "evtree",
                    "party",
                    "randomForest",
                    "intubate",
                    "dplyr")
install.packages(trees_packages)

These are the main packages for decision tree-based modeling and data handling in R, but they are far from the only ones: there are dozens of packages for almost any decision tree-based model you might want to use, so just run a search on Crantastic.

It’s tree planting time! I decided to use the Titanic dataset, one of the most well-known datasets in the machine learning community. You can get it from Kaggle (kaggle.com/c/titanic) or from GitHub (alanmarazzi/trees-forest). I’ll jump straight to data cleaning and modeling, but if you need help downloading or loading the data, or if anything is unclear, you can refer to my previous article Data Science in Minutes or to the full code in the GitHub repository.

Preparing the data

First, let’s look at what the data looks like:

I really don’t like data sets with uppercase variable names. Fortunately, we can convert the names to lowercase with one line of code, using the tolower() function:

names(titanic) <- tolower(names(titanic))

The sex and embarked variables are then converted to factors (categorical variables):

titanic$sex <- as.factor(titanic$sex)
titanic$embarked <- as.factor(titanic$embarked)

One of the most important steps in modeling is dealing with missing values (NA). Many R models can handle missing values automatically, but most simply remove observations containing missing values. This means there is less training data for models to learn from, which almost certainly leads to lower accuracy.

There are various techniques for filling NAs: with the mean, the median, or the mode, or by using a model to predict their values. In our example we will use a linear regression to replace the missing values of the age variable in the data set.
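Just for orientation, here is what the simplest of those options looks like: a minimal sketch of median imputation, which is not the approach used in this article.

# A quick alternative (not used below): fill the NAs in age with the median
titanic_median <- titanic
titanic_median$age[is.na(titanic_median$age)] <- median(titanic_median$age, na.rm = TRUE)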

At first glance this idea seems a little scary, a little weird, and you might be thinking, “Are you saying that in order to improve my model, I should use another model? !” But it’s not as difficult as it looks, especially if we use linear regression.

First let’s look at how many NAs there are in the age variable:

mean(is.na(titanic$age))
[1] 0.1986532

Nearly 20% of the passengers had no age record, which means that if we had run the model directly on the dataset without replacing the missing values, we would have only 714 training items instead of 891.

It’s time to run a linear regression on the data:

age_prediction <- lm(age ~ survived + pclass + fare, data = titanic)
summary(age_prediction)

What did we do? We told R to solve the following linear equation, finding the appropriate values for α and the βs.

age = α + β1∗survived + β2∗pclass + β3∗fare

We then call the summary() function on the model we created to see the results of the linear regression. R will give some statistics that we need to look at to understand the data:

Call:
lm(formula = age ~ survived + pclass + fare, data = titanic)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.457  -8.523  -1.128   8.060  47.505 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.14124    2.04430  26.484  < 2e-16 ***
survived     -6.81709    1.06801  -6.383    3e-10 ***
pclass       -9.12040    0.72469 -12.585  < 2e-16 ***
fare         -0.03671    0.01112  -3.302  0.00101 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.03 on 710 degrees of freedom
  (177 observations deleted due to missingness)
Multiple R-squared: …,  Adjusted R-squared: 0.195
F-statistic: … on 3 and 710 DF,  p-value: < 2.2e-16

The first line above (Call) tells us which model produced this output, then come the residuals, followed by the coefficients. Here we can look at the estimates of the coefficients, their standard errors, their t-values and p-values. Then there are some other statistics. Note that R actually removed the rows containing NA (177 observations deleted due to missingness).

Now we can use this model to fill NA. We use the predict() function:

titanic$age[is.na(titanic$age)] <- predict(age_prediction,
    newdata = titanic[is.na(titanic$age), ])

Logistic regression baseline

For a binary classification problem like survival, logistic regression is hard to beat. We will use it to predict the Titanic survivors and take the result as our baseline.

Don’t worry, logistic regression in R is pretty straightforward. We load the dplyr and intubate packages and then call the glm() function to run the logistic regression. glm() needs three things: the predictors, such as age, class and so on; the response variable, which here is survived; and the family argument, which specifies the type of outcome to model, here binomial.

library(dplyr)     # data processing
library(intubate)  # modeling workflows

# ntbt_glm is the %>%-friendly version of glm()
logi <- titanic %>% 
    select(survived, pclass, sex, age, sibsp) %>% 
    ntbt_glm(survived ~ ., family = binomial)
summary(logi)

Let’s take a look at what the logistic regression model predicts:

logi_pred <- predict(logi, type = "response")

# Convert the predicted probabilities to "survived" (1) or "not" (0)
survivors_logi <- rep(0, nrow(titanic))
survivors_logi[logi_pred > .5] <- 1

# This will be our baseline
table(model = survivors_logi, real = titanic$survived)

The confusion matrix above shows the results of the model on the training data: 572 passengers were predicted to die (0) and 319 to survive (1). The diagonal of the matrix shows that 480 and 250 observations were predicted correctly, while 92 passengers predicted to die actually survived and 69 predicted to survive did not.
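If you want the accuracy as a single number, it can be computed directly from the vectors above; a one-line check, assuming the survivors_logi vector we just built:

# Proportion of correctly classified passengers on the training data
mean(survivors_logi == titanic$survived)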

For such an out-of-the-box model, an accuracy of 82% is pretty good. But we want to test it on unseen data, so let’s load the test set and see how the model performs there.

test <- read.csv(paste0("https://raw.githubusercontent.com/",
        "alanmarazzi/trees-forest/master/data/test.csv"),
    stringsAsFactors = FALSE,
    na.strings = "")
names(test) <- tolower(names(test))
test$sex <- as.factor(test$sex)

Below we predict survival on the test data:

test_logi_pred <- predict(logi, test, type = "response")
surv_test_logi <- data.frame(PassengerId = test$passengerid,
                             Survived = rep(0, nrow(test)))
surv_test_logi$Survived[test_logi_pred > .5] <- 1
write.csv(surv_test_logi, "results/logi.csv", row.names = FALSE)

We save the results as a CSV because the test data is not labeled: we don’t know whether our predictions are correct, so we have to upload them to Kaggle to find out. The final model was correct 77.5% of the time.

Fast-and-frugal trees

Finally, tree planting time! The first model we will try is a fast-and-frugal decision tree. This is basically the simplest possible model, and we will use the FFTrees R package to build it.

# Keep a copy of our cleaned data while loading FFTrees
# (the package ships its own titanic dataset)
titanicc <- titanic
library(FFTrees)
titanic <- titanicc
rm(titanicc)

With the package loaded, we simply apply the FFTrees() function to the selected variables.

fftitanic <- titanic %>% 
    select(age, pclass, sex, sibsp, fare, survived) %>% 
    ntbt(FFTrees, survived ~ .)

The model takes a while to run because more than one fast-and-frugal tree has to be trained and tested. The final result is an FFTrees object containing all the FFTs that were tested:

fftitanic
[1] "An FFTrees object containing 8 trees using 4 predictors {sex,pclass,fare,age}"
[1] "FFTrees AUC: "
[1] "My favorite training tree is #5, here is how it performed:"
                        train
n                      891.00
p(Correct)               0.79
Hit Rate (HR)            0.70
False Alarm Rate (FAR)   0.16
D-Prime                  1.52

We see that the algorithm tested eight trees with up to four predictors, with the best performance being tree number five. Then we saw some statistics about the tree. These outputs are helpful, but visualizations help us better understand what’s going on:

plot(fftitanic,
     main = "Titanic",
     decision.names = c("Not Survived", "Survived"))

There is a lot of information in this graph, from top to bottom: number of observations, number of classifications, decision tree, diagnostic data. Let’s focus on the decision tree.

The first node of the tree considers the sex variable: if the passenger is a woman (sex != male), we exit the tree right away and predict survival. Brutal, but quite effective. If the passenger is male, we move to the second node, pclass: if he travelled third class, we exit and predict death. Next comes the fare node: if the fare was below roughly £26.96, we again predict death. The last node considers age: death is predicted if the passenger is older than about 21.35 years, survival otherwise.

In the Performance area of the chart, we are most concerned about the confusion matrix on the left, which we can compare with the confusion matrix obtained by logistic regression.

In addition, we can also look at the ROC curve on the right. The FFTrees package automatically runs logistic regression and CART (another decision tree-based model) on the data for comparison. Looking closely at the diagram, we see that the circle representing logistic regression is almost completely covered by the circle of tree 5, meaning that the two models are comparable.

Now let’s classify the test data and submit the results to Kaggle. As I said before, these trees are extremely simple. When I explained how the tree works above, every node was described with an “if”, which means we can build a rule-list classifier with the same structure, or even memorize the rules and classify new cases by hand.

ffpred <- ifelse(test$sex != "male", 1,
          ifelse(test$pclass > 2, 0,
          ifelse(test$fare < 26.96, 0,
          ifelse(test$age >= 21.36, 0, 1))))
ffpred[is.na(ffpred)] <- 0

With just four nested ifelse statements we can classify the whole dataset. There were only 2 NAs, so I decided to classify them as “not survived”. Then we simply upload the resulting CSV to Kaggle to see how the model behaves.
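The submission code itself is omitted here; a sketch in the same style as the logistic regression submission above would look roughly like this (the results/fftrees.csv path is just an example):

# Build the Kaggle submission file from the ffpred vector
surv_test_fft <- data.frame(PassengerId = test$passengerid,
                            Survived = ffpred)
write.csv(surv_test_fft, "results/fftrees.csv", row.names = FALSE)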

Our four if-else statements performed just 1% worse than the benchmark. Considering the simplicity of the model, this is an excellent result.

The party package

The party package uses conditional inference trees, which are more sophisticated than fast-and-frugal trees. In a nutshell, a conditional inference tree chooses its splits not only by measuring variable importance but also by running statistical tests on the distribution of the data. Although they are more complex, conditional inference trees are simple to use: once the package is loaded, they can be created with the ctree() function.

library(party)
partyTitanic <- titanic %>% 
    select(age, pclass, sex, sibsp, fare, survived) %>% 
    ntbt(ctree, as.factor(survived) ~ .)

After fitting the model we can call the package’s plotting function, plot(partyTitanic), to visualize the resulting tree. We don’t care about all the bells and whistles here, just the final decision tree, so a few optional arguments are used to make the output a little cleaner.

plot(partyTitanic, main = "Titanic prediction", type = "simple",
     inner_panel = node_inner(partyTitanic,
                              pval = FALSE),
     terminal_panel = node_terminal(partyTitanic,
                                    abbreviate = TRUE,
                                    digits = 1,
                                    fill = "white"))

Unfortunately, larger trees take up more space, and with just a few extra nodes the plot becomes almost illegible. Comparing this tree with the FFT above, we can see it is more complex: where before we simply predicted death for every male, the model now tries to split the males across several further conditions.

The added complexity reduces the training error by 15%. This is an improvement over the FFTree above.

train_party <- predict(partyTitanic)
table(tree = train_party, real = titanic$survived)

Unfortunately, we are about to learn one of the most important lessons in machine learning: on the test set, only 73.7% of the observations were correctly classified!

How is that possible, you may ask? What we just witnessed is overfitting: some of the variables the model considered turned out to be noise, so the results improved on the training set but deteriorated on unseen data. There are several ways to deal with this, such as pruning, which means cutting branches, for example by limiting the maximum depth of the tree. Pruning combined with cross-validation is likely to improve the results on the test data.
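As a rough illustration of the idea (not part of the original analysis), party lets you cap the depth of a conditional inference tree through its controls argument; the maxdepth value below is an arbitrary choice:

# Grow a shallower tree; maxdepth = 3 is an arbitrary illustrative value
partyTitanic_small <- titanic %>% 
    select(age, pclass, sex, sibsp, fare, survived) %>% 
    ntbt(ctree, as.factor(survived) ~ .,
         controls = ctree_control(maxdepth = 3))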

Ensemble models

So far we have worked with single learners, meaning we look for a solution with one model. Another family of machine learning algorithms are ensembles, models built by combining many so-called weak learners. The theory behind them is that by using many learners (decision trees, in our case) and combining their choices we can obtain very good results.

Ensemble methods differ in how the individual models are created and in how their results are combined. It may seem messy, but some of these methods work well out of the box and are a good choice when optimizing for results.

The purpose of ensembling is to reduce variance. For example, above we obtained good results on the training set but a large error rate on the test set. If we have different training sets and different models, each will make different errors, and combining them gives better results.
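A toy simulation (mine, not from the original post) shows the intuition: the average of many independent noisy estimates has a much smaller variance than any single one of them.

set.seed(42)
# Variance of one noisy estimate vs. the average of 25 such estimates
single   <- replicate(1000, mean(rnorm(30)))
averaged <- replicate(1000, mean(replicate(25, mean(rnorm(30)))))
var(single)    # around 1/30
var(averaged)  # roughly 25 times smaller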

We’ll look at three different ensemble algorithms: bagging, random forests and boosting.

Bagging

Bagging’s main idea is fairly simple: if we train many large decision trees on different training sets, we end up with many high-variance, low-bias models. By averaging their predictions we obtain a classifier with relatively low variance and low bias.

One problem you may have noticed is that we don’t have many training sets. To get around this we create them with the bootstrap, which is simply repeated sampling with replacement.

# Sample with replacement
boot_x <- function(x, size) {
    sample(x, size, replace = TRUE)
}

# Repeat the sampling reps times
bootstrapping <- function(x, reps, size) {
    y <- list()
    for (i in seq_len(reps)) {
        y[[i]] <- boot_x(x, size)
    }
    y
}

# The result is a list of bootstrap samples
z <- bootstrapping(x, 500, 20)

To run bagging on the Titanic data we can use the randomForest package. That is because bagging is just like a random forest, the only difference being how many predictors are considered when growing each tree: in bagging we consider every predictor in the data set, which we can enforce through the mtry parameter.

library(randomForest)

# To reproduce the results, fix a seed with set.seed()
titanic_bag <- titanic %>% 
    select(survived, age, pclass, sex, sibsp, fare, parch) %>% 
    ntbt_randomForest(as.factor(survived) ~ ., mtry = 6)

Note that we wrap survived in as.factor(): this makes the function grow classification trees rather than regression trees (yes, decision trees can also be used for regression).

Bagging creates 500 trees by default. If you want to add more trees, you can pass in the ntree parameter and set it to a higher value.

One problem with the code above is that it skips observations containing NAs without making a prediction for them. To produce a Kaggle submission without doing any further feature engineering, I decided to replace the NAs in the test set with the median. Unfortunately this limits the predictive power of the model, and the result is a 66.5% correct prediction rate.
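The imputation code is not shown in the article; a minimal version of what I mean, assuming age and fare are the numeric test columns still containing NAs, would be:

# Replace the remaining NAs in the test set with the column medians
test$age[is.na(test$age)]   <- median(test$age, na.rm = TRUE)
test$fare[is.na(test$fare)] <- median(test$fare, na.rm = TRUE)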

Random forests

Random forests are among the best-known machine learning algorithms because they work ridiculously well out of the box. A random forest is almost the same as bagging, except that it uses weaker learners: only a limited number of predictors is considered when growing each tree.

You may ask what difference it makes whether we use all of the predictors or only some of them. The answer is that with all predictors available, the first couple of splits will likely be the same across trees grown on different bootstrap samples, because the strongest predictors dominate the top of each tree. The 500 trees created by bagging will therefore be very similar, and so will their predictions.

To limit this behavior, a random forest restricts the candidate predictors at each split through the mtry parameter. We can use cross-validation to pick the “best” value, or try rules of thumb such as ncol(data)/3 and sqrt(ncol(data)), but in this case I set mtry to 3.

I recommend that you experiment with different values and then see what happens to better understand the random forest algorithm.
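For instance, a small experiment of the kind I mean (my own sketch, not part of the original analysis) compares the out-of-bag error for a few mtry values:

# randomForest and dplyr are already loaded above
titanic_model <- titanic %>% 
    select(survived, age, pclass, sex, sibsp, fare, parch)

# Fit a forest for each mtry value and print its final out-of-bag error
for (m in 2:6) {
    set.seed(456)
    rf <- randomForest(as.factor(survived) ~ ., data = titanic_model,
                       mtry = m, ntree = 500)
    cat("mtry =", m, "OOB error =",
        round(rf$err.rate[rf$ntree, "OOB"], 3), "\n")
}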

set.seed(456)
titanic_rf <- titanic %>% 
    select(survived, age, pclass, sex, sibsp, fare, parch) %>% 
    ntbt_randomForest(as.factor(survived) ~ ., mtry = 3, ntree = 5000)

The result is 74.6% accuracy, much better than bagging (bagging used only 500 trees, while the random forest used 5,000), but still worse than logistic regression.

There are many implementations of random forests; we could also try the party package and build a random forest out of conditional inference trees.

set.seed(415)
titanic_rf_party <- titanic %>% 
    select(survived, age, pclass, sex, sibsp, fare, parch) %>% 
    ntbt(cforest, as.factor(survived) ~ .,
         controls = cforest_unbiased(ntree = 5000, mtry = 3))

As you can see, the code is pretty much the same as before, but is the result pretty much the same?

This result is almost a tie with logistic regression.

Boosting

Boosting is a slow-learning algorithm, unlike the “brute force” approaches above. Bagging and random forests avoid overfitting by growing thousands of decision trees and averaging all their predictions. Boosting works differently: one tree is grown, then another tree is grown on the results of the first, and so on.

Boosting learns more slowly than the other decision tree-based algorithms, which helps prevent overfitting but also requires us to tune the learning rate carefully. Boosting takes parameters similar to a random forest’s, as you can see in the code below.

library(gbm)
set.seed(999)
titanic_boost <- titanic %>% 
    select(survived, age, pclass, sex, sibsp, fare, parch) %>% 
    ntbt(gbm, survived ~ .,
         distribution = "bernoulli",
         n.trees = 5000,
         interaction.depth = 3)

Through intubate’s ntbt() we call the gbm() function from the package of the same name and set distribution = "bernoulli" to tell it this is a classification problem. The n.trees argument specifies the number of trees to grow, and interaction.depth sets the maximum depth of each tree.
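The learning rate mentioned earlier is controlled in gbm by the shrinkage argument; a quick variation on the model above (the 0.001 value is just an illustration, not the setting behind the reported result):

# Same model, but with an explicit, smaller learning rate
titanic_boost_slow <- titanic %>% 
    select(survived, age, pclass, sex, sibsp, fare, parch) %>% 
    ntbt(gbm, survived ~ .,
         distribution = "bernoulli",
         n.trees = 5000,
         interaction.depth = 3,
         shrinkage = 0.001)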

The result is 76% accuracy, in line with logistic regression, the random forests and FFTrees.

What we learned

  • Complex model > simple model == false. Logistic regression and FFTrees are hard to beat, and with a little feature engineering we can push the performance of simple models even further.

  • Feature engineering > complex model == true. Feature engineering is an art. It is one of the most powerful weapons a data scientist has, and we can use it to improve our predictions.

  • Building models == fun! Data science is fun. Although R can be a bit frustrating at times, studying it is rewarding on the whole. If you want more details, or a step-by-step guide, you can visit the GitHub repository mentioned at the beginning of this article, which contains the complete code.

If you liked this post, please leave a comment and share it. You can also subscribe to my blog rdisorder.eu

Original article: https://www.rdisorder.eu/2016/12/21/dont-get-lost-in-a-forest/