Author: Prateek Joshi | Compiled by: VK | Source: Analytics Vidhya

A simple analogy between decision trees and random forests

Let’s start with a thought experiment that will illustrate the difference between a decision tree and a random forest model.

Suppose a bank has to approve a small loan for a customer, and it needs to make the decision quickly. The bank checks the applicant’s credit history and finances and finds that they have not repaid an earlier loan, so it rejects the application.

The problem is that the loan amount is tiny compared with the bank’s huge coffers, so it could have been approved with very little risk. As a result, the bank loses an opportunity to earn money.

Now, another loan application comes in a few days later, but this time the bank adopts a different strategy: multiple decision-making processes. Sometimes it checks the credit history first; at other times it checks the customer’s financial status and the loan amount first. The bank then combines the results of these multiple decision-making processes to decide whether to give the customer the loan.

Even though this process takes longer than the previous one, the bank profits from it. This is a classic example of collective decision making outperforming a single decision-making process. Now, you know what these two processes represent, right?

These represent decision trees and random forests respectively! We’ll explore this idea in detail here, dive into the major differences between the two approaches, and answer the key question: which algorithm should you use?

Table of Contents

  1. Introduction to Decision Trees

  2. Introduction to Random Forest

  3. Random Forest vs. Decision Tree

  4. Why is a random forest better than a decision tree?

  5. Decision trees and Random Forests – When should you choose which algorithm?

Introduction to Decision Trees

A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems. A decision tree is simply a series of sequential decisions made to reach a specific result. Here is an illustration of a decision tree in action (using our example above):

Let’s understand how the tree works.

First, it checks whether the customer has a good credit history. Based on that, it splits the customers into two groups: those with good credit histories and those with bad credit histories. It then checks the customer’s income and again splits them into two groups. Finally, it checks the loan amount requested by the customer. Based on the outcomes of checking these three features, the decision tree decides whether the customer’s loan should be approved.

The features/attributes and conditions can change depending on the data and the complexity of the problem, but the overall idea remains the same. So, a decision tree makes a series of decisions based on a set of features/attributes in the data, which in this case are credit history, income, and loan amount.
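To make the idea concrete, here is a minimal sketch of this example tree written as plain Python. The feature names and thresholds are made up for illustration and are not learned from any data:

# Illustrative only: the example tree's sequential decisions as plain Python.
# The thresholds below are hypothetical, not learned from data.
def approve_loan(good_credit_history, income, loan_amount):
    if not good_credit_history:
        return False                  # bad credit history -> reject
    if income < 3000:                 # hypothetical income threshold
        return False
    return loan_amount <= 200000      # hypothetical loan-amount threshold

print(approve_loan(True, 5000, 150000))   # True  -> approve
print(approve_loan(False, 5000, 150000))  # False -> reject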

Now, you might be thinking:

Why did the decision tree check credit history first, rather than income?

This is known as feature importance, and the order in which attributes are checked is decided according to criteria such as Gini impurity or information gain. Explaining these concepts is beyond the scope of this article, but you can refer to other resources to learn all about decision trees.

Note: The aim of this article is to compare decision trees and random forests, so I won’t go into the basic concepts in detail.
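Still, for a quick feel of what one of those criteria measures, here is a tiny illustrative helper (not part of the original article) that computes the Gini impurity of a set of class labels:

from collections import Counter

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity([1, 1, 1, 1]))  # 0.0: a pure node
print(gini_impurity([1, 1, 0, 0]))  # 0.5: a perfectly mixed node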

Introduction to Random Forest

The decision tree algorithm is easy to understand and interpret. But often, a single tree is not enough to produce effective results. This is where the random forest comes in.

Random forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees to make decisions. As the name suggests, it is a “forest” of trees!

But why do we call it a “random” forest? Because it is a forest of randomly created decision trees. Each node in a decision tree works on a random subset of features to compute its output. The random forest then combines the outputs of the individual decision trees to generate the final output.

To put it simply:

The random forest algorithm combines the outputs of multiple (randomly created) decision trees to produce the final output.

This process of combining the outputs of multiple individual models (also known as weak learners) is called ensemble learning.
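As a rough sketch of the idea (illustrative only; a real random forest also trains each tree on a bootstrapped sample and random feature subsets), combining classifiers by majority vote looks like this:

import numpy as np

# Each row holds one hypothetical tree's predictions for four samples.
tree_predictions = np.array([
    [1, 0, 1, 1],  # tree 1
    [1, 1, 1, 0],  # tree 2
    [0, 0, 1, 1],  # tree 3
])

# Majority vote across trees for each sample.
final_prediction = (tree_predictions.mean(axis=0) >= 0.5).astype(int)
print(final_prediction)  # [1 0 1 1]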

Now the question is, how do we decide which algorithm to choose between a decision tree and a random forest? Let’s see what they look like in practice before we draw any conclusions!

Random Forest vs. Decision Tree

In this section, we will use Python to solve a binary classification problem using both a decision tree and a random forest. We will then compare their results and see which one fits our problem best.

We will work on the loan prediction dataset. This is a binary classification problem where we have to determine whether a person should be given a loan based on a certain set of features.

Dataset: datahack.analyticsvidhya.com/contest/pra…

Step 1: Load the libraries and dataset

Start by importing the required Python libraries and the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('dataset.csv')
df.head()

The dataset consists of 614 rows and 13 features, including credit history, marital status, loan amount, and gender. Here, the target variable is Loan_Status, which indicates whether a person should be given a loan.

Step 2: Data preprocessing

Now comes the most critical part of any data science project: data preprocessing and feature engineering. In this section, I’ll deal with the categorical variables in the data and impute the missing values.

I will impute the missing values in the categorical variables with the mode, and in the continuous variables with the mean (of the respective columns). We will also label encode the categorical values in the data.

# Data preprocessing: label encoding
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['Married'] = df['Married'].map({'Yes': 1, 'No': 0})
df['Education'] = df['Education'].map({'Graduate': 1, 'Not Graduate': 0})
df['Dependents'].replace('3+', 3, inplace=True)
df['Self_Employed'] = df['Self_Employed'].map({'Yes': 1, 'No': 0})
df['Property_Area'] = df['Property_Area'].map({'Semiurban': 1, 'Urban': 2, 'Rural': 3})
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})

# Missing-value imputation: mode for categorical columns, mean for continuous ones
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])
for col in ['LoanAmount', 'Loan_Amount_Term']:
    df[col] = df[col].fillna(df[col].mean())

Step 3: Create training and test sets

Now we split the dataset into a training set and a test set in an 80:20 ratio:

X=df.drop(columns=['Loan_ID','Loan_Status']).values
Y=df['Loan_Status'].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Let’s look at the shape of the training set and test set created:

print('Shape of X_train=>',X_train.shape)
print('Shape of X_test=>',X_test.shape)
print('Shape of Y_train=>',Y_train.shape)
print('Shape of Y_test=>',Y_test.shape)

That’s great! Now we are ready to move on to the next stage, where we will build decision trees and random forest models!

Step 4: Build and evaluate the model

Now that we have the training and test sets, it’s time to train our models and classify the loan applications. First, we will train a decision tree on this dataset:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X_train, Y_train)
dt_pred_train = dt.predict(X_train)

Next, we will evaluate the model using the F1-Score. The F1-Score is the harmonic mean of precision and recall, given by:

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Let’s use F1 scores to evaluate the performance of our model:

print('Training Set Evaluation F1-Score =>', f1_score(Y_train, dt_pred_train))
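The out-of-sample check referred to below is not shown above. A minimal sketch of it, reusing the fitted dt model and the held-out split from Step 3, would look like this:

dt_pred_test = dt.predict(X_test)
print('Testing Set Evaluation F1-Score =>', f1_score(Y_test, dt_pred_test))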

Here, you can see that the decision tree performs well on the in-sample evaluation, but its performance drops sharply on the out-of-sample evaluation. Why do you think that is? The model is overfitting here. Can a random forest solve this problem?

Building the random forest model

Let’s look at a random forest model in action:

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(criterion='entropy', random_state=42)
rfc.fit(X_train, Y_train)
rfc_pred_train = rfc.predict(X_train)
print('Training Set Evaluation F1-Score =>', f1_score(Y_train, rfc_pred_train))
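Again, the out-of-sample comparison mentioned next is not shown above; a minimal sketch, reusing the fitted rfc model and the split from Step 3:

rfc_pred_test = rfc.predict(X_test)
print('Testing Set Evaluation F1-Score =>', f1_score(Y_test, rfc_pred_test))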

Here, we can clearly see that the random forest model performs better than the decision tree in the out-of-sample evaluation. Let’s discuss the reasons behind this in the next section.

Why is a random forest better than a decision tree?

A random forest leverages the power of multiple decision trees. It does not rely on the feature importance given by a single decision tree. Let’s take a look at the feature importance each algorithm assigns to the different features:

feature_importance = pd.DataFrame({
    'rfc': rfc.feature_importances_,
    'dt': dt.feature_importances_
}, index=df.drop(columns=['Loan_ID', 'Loan_Status']).columns)
feature_importance.sort_values(by='rfc', ascending=True, inplace=True)

index = np.arange(len(feature_importance))
fig, ax = plt.subplots(figsize=(18, 8))
rfc_feature = ax.barh(index, feature_importance['rfc'], 0.4, color='purple', label='Random Forest')
dt_feature = ax.barh(index + 0.4, feature_importance['dt'], 0.4, color='lightgreen', label='Decision Tree')
ax.set(yticks=index + 0.4, yticklabels=feature_importance.index)
ax.legend()
plt.show()

As shown in the figure above, the decision tree model assigns high importance to a particular set of features. The random forest, however, chooses features randomly during training, so it does not depend heavily on any specific set of features. This is a special characteristic of the random forest.

Therefore, the random forest can generalize over the data better. This randomized feature selection makes a random forest much more accurate than a decision tree.
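In scikit-learn, this per-split feature randomness is controlled by the max_features parameter of RandomForestClassifier. A small sketch reusing the training data from Step 3, with parameter values chosen for illustration rather than tuned for this dataset:

from sklearn.ensemble import RandomForestClassifier

# Each split considers only a random subset of features (here the square root
# of the total number of features); n_estimators sets the number of trees.
rfc_random = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
rfc_random.fit(X_train, Y_train)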

So which should you choose, a decision tree or a random forest?

A random forest is suitable when we have a large dataset and interpretability is not a major concern.

Decision trees are easier to interpret and understand. Because a random forest combines multiple decision trees, it becomes more difficult to interpret. The good news is that it’s not impossible to explain a random forest.
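To see why a single tree is considered easy to interpret, scikit-learn can print its learned rules as text. A small sketch reusing the dt model fitted in Step 4 (the feature-name list is derived from the same columns used to build X):

from sklearn.tree import export_text

feature_names = list(df.drop(columns=['Loan_ID', 'Loan_Status']).columns)
print(export_text(dt, feature_names=feature_names))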

Moreover, a random forest takes longer to train than a single decision tree. You should take this into account, because as we increase the number of trees in a random forest, the time required to train it increases as well. This can often be critical when you are working against a tight deadline on a machine learning project.
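A quick way to see this trade-off for yourself, assuming the training data from Step 3 is available, is to time the fit for different numbers of trees:

import time
from sklearn.ensemble import RandomForestClassifier

# Training time grows roughly with the number of trees.
for n_trees in (10, 100, 500):
    start = time.time()
    RandomForestClassifier(n_estimators=n_trees, random_state=42).fit(X_train, Y_train)
    print(n_trees, 'trees:', round(time.time() - start, 2), 's')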

But I would say that, despite being unstable and dependent on a particular set of features, decision trees are really useful because they are easier to interpret and faster to train. Anyone with very little knowledge of data science can use decision trees to make quick, data-driven decisions.

Conclusion

That covers what you need to know about decision trees versus random forests. Machine learning can be tricky when you’re new to it, but this article should have clarified the differences and similarities for you.

Original article: www.analyticsvidhya.com/blog/2020/0…
