Author | Prashant Gupta

Translator | AI100 (rgznai100)



Trees have many analogies in real life, and it turns out that the tree structure has also had a broad influence on machine learning, covering both classification and regression.


In decision analysis, a decision tree can represent the decision process and its outcome clearly and explicitly. As the name suggests, it uses a tree-shaped model of decisions. Decision trees are commonly used in data mining to derive a strategy for reaching a particular goal, and they are also widely used in machine learning, which is the focus of this post.


How can an algorithm be represented as a tree?


For this, let’s look at a basic example: using the Titanic data set to predict whether each passenger survived. The model below uses three features/attributes/columns from the data set: sex, age, and sibsp (the number of siblings or spouses travelling with the passenger).

A decision tree is drawn from top to bottom, with the topmost node called the root node. As shown above, the bold black text represents a condition/internal node, based on which the tree splits into branches/edges. A branch end that no longer splits is a decision/leaf node; in this case, the survival or death of the passenger corresponds to the green and red leaf nodes respectively.


Although a real data set will have far more features and the tree above would only be a small branch of a much larger one, you can’t deny the simplicity of the algorithm. The importance of each feature is clear, and the relationships in the data are easy to see. This methodology is commonly known as learning a decision tree from data, and the tree above is called a classification tree because the goal is to classify passengers as survived or died. Regression trees are represented in the same way, except that they predict continuous values such as the price of a house. In general, decision tree algorithms are referred to as CART (Classification And Regression Trees).


So what is the mechanism behind a decision tree? Growing a tree involves deciding which feature to split on, what condition to split under, and when to stop splitting. And since a tree generally grows somewhat arbitrarily, it also needs to be pruned back into shape. Let’s start with a common splitting technique.


Recursive binary splitting

In recursive binary splitting, all features are considered, different split points are tried, and a cost function is used to evaluate each candidate split. The split with the best (lowest) cost is selected.


Take the classification tree built on the Titanic data set above: at the first split, at the root node, all attributes/features are considered and the training data is divided into groups based on each of them. We have three features, so there are three candidate splits. We then use a cost function to calculate the cost of each split, and the algorithm selects the split with the lowest cost, in this case the passenger’s sex. The algorithm is recursive in nature, because each resulting group can be subdivided again in the same way. Because of this, the algorithm is also known as a greedy algorithm: at every step it is eager to reduce the cost as much as possible. This also makes the root node the best predictor/classification point.
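To make the greedy, recursive procedure concrete, here is a minimal sketch of recursive binary splitting (not the author's code; the function names and the dictionary-based tree are my own). Here `group_cost` stands in for whichever cost function is used, such as the ones described in the next section.

```python
import numpy as np

def best_split(X, y, group_cost):
    """Try every feature and every observed value as a candidate split point,
    and return the (cost, feature, threshold) triple with the lowest total cost."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left_mask = X[:, feature] <= threshold
            left, right = y[left_mask], y[~left_mask]
            if len(left) == 0 or len(right) == 0:
                continue  # not a real split, one side is empty
            cost = group_cost(left) + group_cost(right)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best

def build_tree(X, y, group_cost, depth=0, max_depth=3):
    """Greedily pick the cheapest split, then recurse into each resulting group."""
    split = best_split(X, y, group_cost)
    if split is None or depth == max_depth:
        # Leaf node: for a 0/1 survival target, the mean is the survival proportion.
        return {"leaf": True, "prediction": float(y.mean())}
    _, feature, threshold = split
    mask = X[:, feature] <= threshold
    return {
        "leaf": False,
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], group_cost, depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], group_cost, depth + 1, max_depth),
    }
```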


Cost of a split


Let’s take a closer look at the cost functions used for classification and regression. In both cases, the cost function tries to find the most homogeneous groups after a split, that is, groups whose members have similar responses. This way we can be more confident about which path a test data point will follow.


Predicting house prices: when the decision tree starts splitting, all the features of the training data are considered. For a particular grouping of the training data, the mean of the responses in that group is used as the group’s prediction, and the cost of the grouping is the sum of squared errors, sum((y − prediction)²). This cost is computed over all data points for every candidate split, and the split with the lowest cost is selected. Another cost function involves the reduction of standard deviation; more information can be found here: http://www.saedsayad.com/decision_tree_reg.htm.
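As a small illustration (the function and variable names are mine, not from the original), the regression cost of one candidate grouping can be computed as the sum of squared differences between each response and the group mean:

```python
import numpy as np

def regression_cost(y_group):
    """Sum of squared errors when the group mean is used as the prediction,
    i.e. sum((y - prediction)^2) over every training response in the group."""
    prediction = y_group.mean()
    return float(((y_group - prediction) ** 2).sum())

# Example: two candidate groups of house prices produced by one split
left = np.array([200_000, 210_000, 195_000])
right = np.array([400_000, 390_000])
total_cost = regression_cost(left) + regression_cost(right)
```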



To evaluate how good a split is, we use the Gini score to measure how mixed the training data is after splitting: G = sum(pk * (1 − pk)), where pk is the proportion of inputs of class k within a particular group. A perfect classification occurs when all inputs in a group belong to the same class: each pk is then either 1 or 0, and G is 0. The worst case occurs when a group is split evenly between the two classes: for binary classification, pk = 0.5 and G = 0.5.
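A minimal sketch of this score (again, the function name is mine), showing the two extreme cases described above:

```python
import numpy as np

def gini(y_group, classes=(0, 1)):
    """Gini score G = sum(pk * (1 - pk)) over the class proportions pk in one group."""
    proportions = np.array([np.mean(y_group == c) for c in classes])
    return float((proportions * (1 - proportions)).sum())

print(gini(np.array([1, 1, 1, 1])))  # 0.0 -> all inputs in one class, perfect split
print(gini(np.array([0, 1, 0, 1])))  # 0.5 -> 50/50 split, worst case for two classes
```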


When to stop splitting?


Then you might ask: when does a decision tree stop splitting? A problem usually has a large set of relevant features, which produces a large number of splits and, in turn, a huge tree. Such a complex tree is prone to overfitting, so we need to know when to stop splitting.


One method is to set a minimum number of training inputs for each leaf node. For example, we can require that each leaf contain at least 10 passengers and ignore any leaf that would take fewer than 10. Another approach is to set a maximum depth for the model, where the maximum depth of a decision tree is the number of splits along the longest path from the root to a leaf.
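With scikit-learn (the library mentioned at the end of this post), both stopping rules map directly to constructor parameters; the values below are only illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

# Stop growing when a leaf would hold fewer than 10 passengers,
# and never let any root-to-leaf path exceed 4 splits.
model = DecisionTreeClassifier(min_samples_leaf=10, max_depth=4)
```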


Pruning


The performance of a decision tree can be further improved by pruning, which removes branches that rely on unimportant features. This reduces the complexity of the tree and therefore increases its predictive power by reducing overfitting.


Pruning can begin at either the root or the leaves. The simplest method starts at the leaves and replaces each node with its most popular class; the change is kept if it does not hurt the accuracy of the tree. This is called reduced error pruning. More sophisticated methods can also be used, such as cost complexity pruning, where a learning parameter is used to weigh whether a node can be removed based on the size of its subtree. This is also known as weakest link pruning.
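In newer versions of scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter; a rough sketch (the data set and alpha value are just placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A larger alpha prunes away more of the weakest links (sub-trees), giving a smaller tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)
```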


Advantages of classification and regression trees


  • Easy to understand, explain, and visualize.

  • Decision trees implicitly perform variable screening or feature selection.

  • Can handle both numerical and categorical data, as well as multi-output problems.

  • Decision trees require relatively little data preparation from users.

  • Nonlinear relationships between parameters do not affect tree performance.


Disadvantages of classification and regression trees


  • Decision trees tend to create overly complex trees that do not generalize the data well. This is known as overfitting.

  • Decision trees can be unstable, because small variations in the data can produce a completely different tree. This is called variance, and it needs to be reduced.

  • Greedy algorithms cannot guarantee that the generated decision tree is globally optimal. This can be mitigated by training multiple trees in which the features and samples are randomly sampled with replacement.

  • Decision trees produce biased trees if some classes dominate the data set. It is therefore recommended to balance the data set before fitting a decision tree.


These are the very basic concepts of decision trees. A very popular library for implementing the algorithm is scikit-learn. It has a very good API, and you can build a model with just a few lines of Python code.
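For example, a minimal end-to-end example (using a built-in toy data set rather than the Titanic data) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data
```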


Original link: https://medium.com/towards-data-science/decision-trees-in-machine-learning-641b9c4e8052