1. The problem

Zhang Wuji is an overpowered, "bug-level" character in Jin Yong's world: wherever he goes, beautiful women follow. Although Jin Yong himself said that Xiao Zhao was his favorite of all his characters, it was Zhao Min who ended up at Zhang Wuji's side, which is a pity.

So I wondered: if they were placed in front of me, whom would I choose?

2. The analysis

This is a serious question, so we need proper tools. A decision tree, for example, is a great fit.

A decision tree is one of the most intuitive algorithms in supervised learning. We can think of it as a series of if/then statements that keep refining the data until a result is reached. Each split starts from the best available feature, and the decision path is chosen by minimizing a loss function. In the resulting tree, every possible case is covered by exactly one path, so the rules are both unique and complete.

For example, a teacher can predict grades from characteristics such as attendance rate, number of questions answered, and assignment submission rate. If the attendance rate is above 90% and the assignment submission rate is above 90%, the grade is predicted as A; if the attendance rate is below 90% but the assignment submission rate is above 90%, it is predicted as B, and so on.
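
As a toy illustration, the teacher example can be written directly as nested if/then rules. This is only a sketch: the two rules above come from the text, while the remaining branches and the function name predict_grade are hypothetical.

def predict_grade(attendance, submission_rate):
    # Hand-written "decision tree" for the teacher example; rates are fractions in [0, 1].
    if attendance > 0.9:
        if submission_rate > 0.9:
            return "A"      # rule from the text
        return "B"          # hypothetical branch
    else:
        if submission_rate > 0.9:
            return "B"      # rule from the text
        return "C"          # hypothetical branch

print(predict_grade(0.95, 0.92))  # A
print(predict_grade(0.80, 0.95))  # B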

In principle, a decision tree selects features by information gain. For the training data set, the information gain of each feature is calculated, and the feature with the maximum information gain is used for the next split. Each split further subdivides the branch created by the previous one.
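
As a reference point, here is a minimal NumPy sketch of how entropy and information gain could be computed for one candidate feature; it only illustrates the idea and is not how scikit-learn implements it internally.

import numpy as np

def entropy(labels):
    # H(D) = -sum_k p_k * log2(p_k), where p_k are the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # g(D, A) = H(D) - H(D|A): entropy reduction after splitting on feature A
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy check: a feature aligned with the labels has a higher gain than an unrelated one
y = np.array([0, 0, 0, 1, 1, 1])
aligned = np.array([0, 0, 0, 1, 1, 0])
unrelated = np.array([0, 1, 0, 1, 0, 1])
print(information_gain(aligned, y), information_gain(unrelated, y))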

Note that a decision tree will keep splitting until it can classify every case, which easily leads to overfitting. There are two ways to avoid this: pre-pruning and post-pruning. Pre-pruning constrains the tree while it is being built, for example by specifying a maximum depth; post-pruning first grows the complete tree and then prunes it, folding unimportant splits back into their parent nodes. Scikit-learn's DecisionTreeClassifier relies mainly on pre-pruning parameters such as max_depth, although recent versions also offer cost-complexity post-pruning via ccp_alpha.
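
A rough sketch of the post-pruning route, assuming scikit-learn 0.22 or later and the same breast cancer data used in the implementation section below; the value 0.01 for ccp_alpha is only an example, not a recommendation.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cost-complexity post-pruning: the tree is grown fully, then subtrees whose
# contribution does not justify their complexity are removed.
# Larger ccp_alpha values prune more aggressively.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned.fit(X_train, y_train)
print("Pruned tree depth:", pruned.get_depth())
print("Test set score: {:.2f}".format(pruned.score(X_test, y_test)))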

Although decision trees are easy to understand, the theory behind them is not trivial; the underlying principles are explained in detail in reference [5].

3. The implementation

We use the classic breast cancer dataset to demonstrate the decision tree.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import sklearn.datasets as datasets

# Load the breast cancer dataset and hold out part of it for testing
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit an unconstrained decision tree on the training set
tree = DecisionTreeClassifier()
tree = tree.fit(X_train, y_train)

print("Training set score: {:.2f}".format(tree.score(X_train, y_train)))
print("Test set score: {:.2f}".format(tree.score(X_test, y_test)))

The test results

Training set score: 1.00
Test set score: 0.95

Looks good. But when we inspect the rules the model has learned, it turns out to be far too complicated:
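
The text dump below looks like the output of sklearn.tree.export_text; assuming the tree fitted above, it can be reproduced roughly like this:

from sklearn.tree import export_text

# Print the learned tree as nested if/then rules; when no feature names are
# passed in, the columns are labelled feature_0, feature_1, and so on.
print(export_text(tree, decimals=2))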

|--- feature_22 <= 117.45
|   |--- feature_27 <= 0.16
|   |   |--- feature_13 <= 35.44
|   |   |   |--- feature_22 <= 112.80
|   |   |   |   |--- feature_21 <= 30.15
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- feature_21 > 30.15
|   |   |   |   |   |--- feature_22 <= 101.30
|   |   |   |   |   |   |--- feature_20 <= 14.41
|   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |--- feature_20 > 14.41
|   |   |   |   |   |   |   |--- feature_0 <= 13.08
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |--- feature_0 > 13.08
|   |   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- feature_22 > 101.30
|   |   |   |   |   |   |--- feature_2 <= 95.75
|   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |--- feature_2 > 95.75
|   |   |   |   |   |   |   |--- class: 1
|   |   |   |--- feature_22 > 112.80
|   |   |   |   |--- feature_26 <= 0.35
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- feature_26 > 0.35
|   |   |   |   |   |--- feature_15 <= 0.04
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- feature_15 > 0.04
|   |   |   |   |   |   |--- class: 1
|   |   |--- feature_13 > 35.44
|   |   |   |--- feature_23 <= 874.85
|   |   |   |   |--- feature_25 <= 0.08
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- feature_25 > 0.08
|   |   |   |   |   |--- feature_26 <= 0.29
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |--- feature_26 > 0.29
|   |   |   |   |   |   |--- feature_18 <= 0.03
|   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |--- feature_18 > 0.03
|   |   |   |   |   |   |   |--- class: 1
|   |   |   |--- feature_23 > 874.85
|   |   |   |   |--- class: 0
|   |--- feature_27 > 0.16
|   |   |--- feature_22 <= 97.16
|   |   |   |--- feature_21 <= 28.01
|   |   |   |   |--- class: 1
|   |   |   |--- feature_21 > 28.01
|   |   |   |   |--- class: 0
|   |   |--- feature_22 > 97.16
|   |   |   |--- class: 0
|--- feature_22 > 117.45
|   |--- feature_19 <= 0.00
|   |   |--- feature_2 <= 116.60
|   |   |   |--- class: 1
|   |   |--- feature_2 > 116.60
|   |   |   |--- class: 0
|   |--- feature_19 > 0.00
|   |   |--- class: 0

Such a complex set of rules, learned from a dataset of only a few hundred samples, would be unacceptable in practice. The training accuracy is 100%, so there is clearly overfitting. Let's set a maximum depth limit.

# Pre-pruning: cap the tree at a maximum depth of 3
tree = DecisionTreeClassifier(max_depth=3)
tree = tree.fit(X_train, y_train)

print("Training set score: {:.2f}".format(tree.score(X_train, y_train)))
print("Test set score: {:.2f}".format(tree.score(X_test, y_test)))

The results:

Training set score: 0.95
Test set score: 0.94

The test score drops by only 1 percentage point, but the model complexity is dramatically reduced:

|--- feature_22 <= 117.45
|   |--- feature_27 <= 0.16
|   |   |--- feature_13 <= 35.44
|   |   |   |--- class: 1
|   |   |--- feature_13 > 35.44
|   |   |   |--- class: 1
|   |--- feature_27 > 0.16
|   |   |--- feature_22 <= 97.16
|   |   |   |--- class: 1
|   |   |--- feature_22 > 97.16
|   |   |   |--- class: 0
|--- feature_22 > 117.45
|   |--- feature_27 <= 0.09
|   |   |--- feature_19 <= 0.00
|   |   |   |--- class: 1
|   |   |--- feature_19 > 0.00
|   |   |   |--- class: 0
|   |--- feature_27 > 0.09
|   |   |--- class: 0

4. To summarize

Today we gave an overview of the Decision Tree and its implementation. Decision trees are very useful models, and the results are easy to interpret, so they are widely used.
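
As a small illustration of that interpretability, and continuing from the tree fitted above, scikit-learn exposes how much each feature contributed to the splits; feature_22 corresponds to "worst perimeter" in this dataset.

import numpy as np
from sklearn.datasets import load_breast_cancer

# Feature importances sum to 1; the largest values belong to the features
# the tree actually splits on near the root.
data = load_breast_cancer()
importances = tree.feature_importances_
for i in np.argsort(importances)[::-1][:3]:
    print(data.feature_names[i], round(importances[i], 3))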

Back to the original question: using a decision tree model, should we choose Zhao Min or Xiao Zhao? The answer, of course, is to choose Lijie.

Lijie is the best girl in the world, ho ho ~

The relevant code has been uploaded to GitHub (https://github.com/jetorz/Data2Science); you are welcome to Star it.

5. Communicate

Learning alone without friends leaves one ignorant and ill-informed. There is a "Data and Statistical Science" WeChat exchange group, which includes senior practitioners in the data industry, overseas PhDs and master's graduates, and more. Friends interested in data science, data analysis, machine learning, and artificial intelligence are welcome to join and study together.

You can scan the QR code below and add Lijie on WeChat; send "machine learning" as the verification message to be invited into the group.


6. Extension

6.1. Read more

  1. Linear regression model – machine learning
  2. Principle and implementation of logistic regression model – machine learning
  3. Parameter standardization – machine learning
  4. What exactly is the neural network behind the fire? – Machine learning
  5. How to make a spam recognizer from scratch with Support Vector machines – machine learning
  6. K- Nearest Neighbors, How are you? – Machine learning
  7. For the alliance or for the horde? – K means

6.2. References

  1. G. James, D. Witten, T. Hastie, R. Tibshirani, An introduction to statistical learning: with applications in R. New York: Springer, 2013.
  2. T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, 2nd ed. New York, NY: Springer, 2009.
  3. W. Härdle, L. Simar, Applied multivariate statistical analysis, 3rd ed. Heidelberg; New York: Springer, 2012.
  4. Zhou Zhihua, Machine Learning. Beijing: Tsinghua University Press, 2016.
  5. Li Hang, Statistical learning methods. Beijing: Tsinghua University Press, 2012.

