# The first chapter

## 1.1 Basic Terms

- The data set
- Example, sample
- Attributes, characteristics
- Attribute values
- Attribute space, sample space, input space
- The feature vectors
- dimension
- Training data, training samples, training sets
- Label
- Example:

According to whether the training data has marker information, learning tasks can be roughly divided into two categories:

- Supervised Learning
- “Unsupervised Learning”

Classification and regression are the representatives of the former, while clustering is the representative of the latter.

It should be noted that the goal of machine learning is to make the learned model work well with the “new sample” rather than just working well with the training sample. The generalization ability of the learned model to fit new samples is called generalization ability. The model with strong generalization ability can be well applied to the whole sample space.

# The second chapter

## 2.1 Empirical error and overfitting

Error rate: The proportion of the number of misclassified samples A to the total number of samples M E= A/M

Accuracy =1- Error rate

Error: The difference between the actual predicted output of the learner and the actual output of the sample

“Training error” or “empirical error” : The error of the learner on the training set

“Generalization error” : error in a new sample

There are many factors that may lead to over-fitting, among which the most common situation is that the learning ability is too strong to learn all the unusual characteristics contained in the training sample, while the under-fitting is usually caused by the low learning ability. Underfitting is easy to overcome, such as extending branches in decision tree learning and increasing the number of training rounds in neural network learning, while overfitting is troublesome. Overfitting is a key obstacle to machine learning.

## 2.2 Assessment methods

In general, we can evaluate the generalization error of the learner by experimental testing and then make a selection. To do this, a “testingset” is used to test the discriminant ability of the learner on the new sample, and then the “testingerror” on the test set is used as the approximation of the generalization error.

The training set S and test set T are generated from the data set by proper processing. Here are a few common practices.

### 2.2.1 set aside method

The set aside method directly divides dataset D into two mutually exclusive parts, one of which is used as training set S and the other as test set T.

The usual ratio of training sets to test sets is 70% : 30%. At the same time, there are two points for attention in the division of training set and test set:

- Keep data distribution as consistent as possible. Avoid any additional bias introduced by the data partitioning process that may affect the final results. In the classification task, the sampling method that retains the proportion of categories is called “stratified sampling”.
- Several random partitions are used to avoid the instability of the single use of the reserved method.
- Note the ratio of training sets to test sets

### 2.2.2 Cross validation method

In the cross-validation method, data set D is divided into K mutually exclusive subsets of similar size. Each time, the union of K − 1 subsets is used as the training set, and the remaining subset is used as the test set. Perform k training and tests, and eventually return the mean of k test results. Also known as k-fold cross validation.

The retention method is a special case of k-folding cross-validation k = m (m is the number of samples). That is, one sample at a time is used as the test set. This method is expensive to calculate.

### 2.2.3 self-help method

The self-help method is based on self-help sampling (with back sampling). Each time a sample is randomly selected from D, put into D ‘, and then put the sample back into D. After repeated m times, a data set containing M samples is obtained.

In this way, m training samples are still used, but about 1/3 of the samples that do not appear in the training set are used as test sets.

Advantages:

- Self-help is useful when the data set is small and it is difficult to effectively divide the training/test set.
- The self-help method can generate multiple different training sets from the initial data set, which is of great benefit to methods such as ensemble learning

Disadvantages:

- However, self-help changes the distribution of the initial data set, which introduces an estimation bias.

When the initial amount of data is sufficient, the set aside method and cross validation method are more commonly used

### 2.2.4 Tuning participation in the final model

Parameter tuning is to select a range and change step for each parameter and select the appropriate parameter from the candidate parameters.

## 2.3 Performance Measurement

To evaluate the generalization performance of the learner, we need not only effective experimental estimation methods, but also evaluation criteria to measure the generalization ability of the model, which is called performance measurement. Performance measures reflect the task requirements, so the “good or bad” of the model is relative, not only depending on the data and algorithm, but also depending on what the task requirements are.

The most commonly used performance measure for regression tasks is “mean square error”

The following describes the performance measures commonly used for classification tasks

### 2.3.1 Error rate and accuracy

Error rate and accuracy are often used in classification tasks. Error rate is the proportion of the number of incorrectly classified samples in the total number of samples in the test sample, and accuracy is the proportion of the number of correctly classified samples in the total number of samples in the test sample.

Classification error rate:

Accuracy:

If the accuracy of the test data set is high or the error rate is small, the generalization ability of the model is strong. Otherwise, the generalization ability is weak.

The generalization ability of the model expressed by the accuracy of the test data set is not applicable when the ratio of positive and negative samples is large. Accuracy is used to represent the generalization ability of the test data set, and the ratio of positive and negative samples in the test data set should be balanced (1:1).

### 2.3.2 Precision, recall and F1

For example, in information retrieval, we often care about “how much of the retrieved information is of interest to users” and “how much of the information that users are interested in is retrieved”. In video surveillance, we look at “what percentage of criminals whose faces are recognized are real criminals” and “what percentage of all criminals are identified.”

Error rate calculation is more general, precision and recall are more suitable for such requirements of performance measures.

Precision is concerned with the percentage of positive samples that are screened.

Recall rate is concerned with what percentage of positive samples are screened out.

Obfuscation matrix is the basis for computing precision and recall or other model performance evaluation methods.

Definition of confusion matrix:

TP: True positive, that is, both the real and predicted results are positive. FP: False positive, that is, the real result is a negative example and the predicted result is a positive example. TN: True negative, where both the real result and the predicted result are negative examples. FN: False negative example, that is, the real result is a positive example, the predicted result is a negative example. (The predictions were false, and so were the facts)

TP+FP+TN+FN= Total number of samples

Precision P(Precision) and Recall R(Recall) are defined as:

How many of the predicted (filtered) examples are true

The amount that is predicted to be true, as a percentage of the amount that is actually true

Learning model, therefore, P (Y | X) = 1 for test data set output probability, a series of positive samples according to probability by order, and then in turn set threshold, if it is greater than the threshold, the positive samples; Otherwise, it’s a negative sample. Each threshold setting has the corresponding recall rate and recall rate. Therefore, by taking the recall rate as the abscissa and the recall rate as the ordinate, the recall rate-recall curve can be obtained and the “P-R” curve can be detected.

(1) If the p-R curve of one learning model completely covers the p-R curve of the other learning model, the performance of the former is better than the latter. That is, with the same recall rate, the higher the recall rate, the better the generalization performance of the model, for example, model A is better than model B. (2) If p-R curves of two learning models cross each other, the model can be evaluated by “break-event Point” (BEP), which is the value of “precision = recall rate”. According to the figure above, the equilibrium point of model A is greater than that of model B, that is, model A is superior to model B. (3) Because BEP is too simplified, F1 metric is more commonly used:

The bigger F1, the better performance.

(4) F1 metric believes that recall rate and precision rate are of the same importance. If recall rate and precision rate are of different importance, if the information recommended to users is as interesting as possible, then precision rate is more important; When capturing fugitives, we hope to miss as few fugitives as possible. At this time, the recall rate is more important.

To describe the relative importance of precision and recall, the general form of the F1 metric is used$F\beta$

Among them,The relative importance of recall to precision is measured.When degenerates into standard F1;Time recall rate is more important;, accuracy is more important.

In many cases, we have multiple binary confusion matrices. For example, we get one confusion matrix each time after many training/tests. Or training/testing on multiple data sets, hoping to estimate the “global” performance of the algorithm; Even when performing multi-classification tasks, each combination of two categories corresponds to a confusion matrix; . In short, we hope to comprehensively investigate the precision and recall on n binary confusion matrices.

### 2.3.3 ROC与AUC

#### ROC

In different tasks, we adopt different cut-off points according to different task requirements. If more attention is paid to the “accuracy” (want to find more accurate), should be in the first place to truncate (find positive examples more sure); If we pay more attention to the “recall rate” (want to check positive cases more fully), we should truncate them in the lower position of the ranking.

The P-R curve measures the generalization performance of the learning model from the perspective of precision and recall, while the ROC curve measures the generalization performance of the learning model from a more general situation. If there is no prior condition, it is recommended to use the ROC curve to measure the generalization performance of the model. The idea of drawing ROC curve is consistent with p-R curve. The learning model estimates the probability that test samples are positive samples from large to small, and then sets threshold values according to the probability in turn. It is considered that test samples larger than the threshold are positive samples, and those smaller than the threshold are negative samples. The True Positive Rate (TPR) and False Positive Rate (FPR) are calculated for each threshold setting.

TPR and FPR are defined as follows:

The abscissa of the ROC curve is the false positive case rate, and the ordinate is the true case rate. The curve is as follows:

This paper explains the first and last two points of ROC curve: The test data set contains N positive samples and M negative samples. If the threshold is set to the maximum, the learning model predicts negative samples for all test samples, and the confusion matrix is as follows:

TPR = TP / (TP + FN) = 0 / (0 + N) = 0

FPR = FP / (TN + FP) = 0 / (0 + M) = 0

Therefore, when the threshold is set to the maximum, both TPR and FPR are 0\

If the threshold value is less than the value estimated by all models that the test samples are positive samples, the test samples are all positive samples, and the confusion matrix is as follows:

TPR = TP / (TP + FN) = N / (N + 0) = 1

FPR = FP / (TN + FP) = M / (M + 0) = 1

Therefore, when the threshold is set to the minimum, both TPR and FPR are 1.

Area Under Curve (AUC) refers to the Area of ROC Curve, which can be solved by trapezoidal Area method.

#### Definition of AUC values

The AUC value is the area covered by the ROC curve. Obviously, the greater the AUC, the better the classification effect of the classifier.

AUC = 1 is a perfect classifier. When adopting this prediction model, perfect prediction can be obtained no matter what threshold is set. For the vast majority of prediction situations, there is no perfect classifier.

0.5 < AUC < 1, better than random guess. This classifier (model) can have predictive value if the threshold is set properly.

AUC = 0.5, the model has no predictive value as a random guess (e.g., lost copper plate).

AUC < 0.5, worse than random guess; But as long as you always make a counter-prediction, it’s better than a random guess.

#### The physical significance of AUC values

Assuming that the output of classifier is socRE (confidence degree) of samples belonging to positive class, the physical meaning of AUC is, the probability that score of positive sample is greater than score of negative sample when taking any pair of samples (positive and negative).

#### Calculation of AUC value

(1) The first method: AUC is the area under the ROC curve, so we can calculate the area directly. The area is the sum of each small trapezoidal area, and the precision of calculation is related to the precision of threshold.

(2) The second method: According to the physical meaning of AUC, we calculate the probability of score of positive samples being greater than that of negative samples. The binary groups NM (N is the number of positive samples, M is the number of negative samples) were taken to compare score, and finally AUC was obtained. The time complexity is O(NM).

(3) The third method: Similar to the second method, directly calculate the probability of score with positive sample greater than negative sample. We first rank all samples according to score and represent them by rank, such as the sample with the largest score, rank=n(n= n +M), followed by n-1. So for the sample with the largest rank in the positive sample (rank_max), there are m-1 other positive samples that are lower than his score, then there are (rank_max- (m-1) negative samples that are lower than his score. And then rank_second-1 -(M-2). Finally, the probability that positive samples are greater than negative samples is:

### 2.3.4 Cost sensitive error rate and cost curve

#### Cost sensitive error rate

It is called cost sensitive error rate evaluation method if the factors of different loss costs caused by different types of classification errors are considered when evaluating the performance of learning model. Suppose the training data set D contains positive example set D+ and negative example set D-. If the loss cost of classification error is the same, it is equal cost. The cost matrix is as follows:

Cost sensitive error rate of the model:

II(.) Is an indicator function, 1 if * is true, 0 otherwise

The loss costs of classification errors are different, that is, non-equal costs. In the figure below, costij represents the cost of predicting class I samples as Class J samples. The cost matrix is as follows:

Cost sensitive error rate of the model:

II(*) is the indicator function, 1 if true, 0 otherwise

This definition is the generalization of the loss cost of the calculation model. If the loss cost of the classification error is the same, let, is consistent with the previous expression of cost sensitive error rate.

No matter what loss cost function is used, model optimization is equivalent to minimizing cost sensitive error rate.

#### The cost curve

The ROC curve reflects the generalization ability of the learning model under the premise of equal cost (the loss cost of misclassification is the same), and the “cost curve” reflects the expected total cost of the learning model under the premise of non-equal cost. The smaller the expected total cost, the stronger the generalization ability of the model.

The abscissa of the cost curve is the probability cost of the normalized positive example, and the probability of the positive example is P. The given probability of the positive example is the prior probability, which ranges from 0 to 1. The vertical axis is the normalized loss cost. Cost curve studies the relationship between positive sample prior probability and loss cost.

Normalized positive example probability cost:

The cost curve is shown as follows:

Where, the grey shaded part is the expected total cost of the model. The smaller the expected total cost, the better the generalization performance of the model. On the contrary, the worse the model generalization performance is.

Meaning of expected total cost: learning the minimum loss cost of the model under positive example prior probability, and summing the minimum loss cost under all positive example prior probability.

Reference links:

- zhuanlan.zhihu.com/p/53692548
- zhuanlan.zhihu.com/p/53781144