Reprinted from: https://www.zhihu.com/question/30643044/answer/225095821

This answer looks at model evaluation for classification problems in the context of MNIST, the “Hello World” dataset of machine learning. There are many kinds of classification problems, such as binary classification, multiclass classification, multilabel classification, and multioutput classification. This article introduces evaluation methods and metrics for classification models, taking binary classification as the main subject.

The MNIST dataset

Corinna Cortes of Google Labs and Yann LeCun of New York University’s Courant Institute built MNIST, a database of 70,000 handwritten digit images, which can be accessed easily from scikit-learn. (Note: the code below downloads the dataset over the network; an incomplete download due to network problems can sometimes cause errors. The complete file, mnist-original.mat, is provided at the end of this article.)

[Code screenshot: fetching the MNIST dataset]
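The original code is only available as a screenshot; below is a minimal sketch of how the dataset can be fetched with scikit-learn. The original post presumably used the older fetch_mldata('MNIST original') call, which no longer works in recent versions, so fetch_openml is used here instead; treat the exact call and the variable names as assumptions.

from sklearn.datasets import fetch_openml

# Download MNIST: 70,000 images, each flattened to 28 x 28 = 784 pixel features.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target']   # X: (70000, 784), y: (70000,) string labels
print(X.shape, y.shape)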

Each image is 28 by 28 pixels, and each pixel is a feature, so each sample in the MNIST dataset has 784 features. Using the binary problem of identifying whether an image is a 5, this article introduces k-fold cross-validation, precision, recall, the F1 score, the P-R curve, and the ROC curve.

The 70,000 samples were divided into a training set and a test set at a ratio of 6:1 (60,000 training images and 10,000 test images). To ensure that no digit is missing from any validation fold in the subsequent cross-validation, the sorted data were shuffled.


[Code screenshot: splitting the data into training and test sets and shuffling the training set]
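A minimal sketch of that split and shuffle, continuing from the loading sketch above (the variable names are my own, not necessarily the original's):

import numpy as np

# Conventional MNIST split: first 60,000 samples for training, last 10,000 for testing.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# Shuffle the training set so that every digit can appear in every cross-validation fold.
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]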

For convenience, this article trains a stochastic gradient descent classifier (SGDClassifier) on the binary classification problem of whether a picture is a 5.

[Code screenshot: training an SGDClassifier on the 5-vs-not-5 task]
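A sketch of that training step, continuing from the code above; the random_state value and the string comparison for labels are assumptions (fetch_openml returns labels as strings):

from sklearn.linear_model import SGDClassifier

# Binary target vectors: True for images of a 5, False for everything else.
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42)   # fixed seed for reproducibility
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([X_train[0]])              # predict whether a single image is a 5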


Performance evaluation method: k-fold cross-validation

When modeling a problem, we can choose from a variety of learning algorithms, and even the same learning algorithm with different parameter configurations will produce significantly different models. So which learning algorithm should we choose, and which parameter configuration should we use? This is the “model selection” problem in machine learning. The ideal solution, of course, is to choose the model with the smallest generalization error, but we cannot obtain the generalization error directly [1]. Instead, we usually estimate it experimentally on held-out data and make the choice accordingly. In Machine Learning, Prof. Zhou describes three such methods: the hold-out method, cross-validation, and the bootstrap method. This article mainly introduces the commonly used cross-validation method (also known as k-fold cross-validation).

[Figure: schematic of k-fold cross-validation]


Generally, the dataset is divided into a training set and a test set. When the sample size is limited, in order to make full use of the data for evaluating the algorithm, the dataset D is randomly divided into k folds (by stratified sampling). Each time, one fold is used as the test set and the remaining k-1 folds are used as the training set, as shown above.

This can be implemented with scikit-learn:

[Code screenshot: 3-fold cross-validation with cross_val_score]
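The screenshot presumably calls cross_val_score; a sketch under that assumption, continuing from the classifier above:

from sklearn.model_selection import cross_val_score

# 3-fold cross-validation (cv=3), scored by accuracy as discussed below.
scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
print(scores)   # three accuracy values, typically all above 0.95 on this task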

The parameter cv=3 splits the data into three parts, i.e. 3-fold cross-validation. The scores look very high, above 95% on average, using accuracy (the proportion of correct predictions) as the metric. But it is not that simple: the digit 5 accounts for only about 10% of the whole dataset, so even an “algorithm” that declares every picture to be not-5 would be about 90% accurate! The choice of evaluation metric is therefore very important. According to the scikit-learn documentation, the scoring parameter can be set in many ways; see the documentation for details. Some metrics for evaluating a model are introduced below.
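To make this concrete, here is a small illustration (my own, not from the original post) of a do-nothing classifier that always predicts “not 5” and still scores around 90% accuracy:

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    """A 'classifier' that ignores the data and always predicts 'not 5'."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)

# Roughly 0.90 accuracy in each fold, without learning anything at all.
print(cross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring='accuracy'))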

Performance evaluation metrics

Before introducing the metrics themselves, let's first introduce the confusion matrix.


[Figure: confusion matrix layout, with actual classes as rows and predicted classes as columns]

TN (true negatives): the actual digit is not 5, and it is predicted as not 5

FN (false negatives): the actual digit is 5, but it is predicted as not 5

FP (false positives): the actual digit is not 5, but it is predicted as 5

TP (true positives): the actual digit is 5, and it is predicted as 5

[Code screenshot: confusion matrix computed on the training set]
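A sketch of how that confusion matrix can be computed, continuing from the earlier code; cross_val_predict is assumed here because it yields out-of-fold predictions for every training sample:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Out-of-fold predictions for every training sample, then the confusion matrix.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
print(confusion_matrix(y_train_5, y_train_pred))
# Layout for this binary task:
# [[TN, FP],
#  [FN, TP]]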

Precision

Formula:

Precision = TP / (TP + FP)

Precision reflects the proportion of true positives among the samples the classifier judges to be positive, that is, the proportion of actual 5s among all images predicted to be 5. From the confusion matrix above, accuracy = (53272 + 4344) / 60000 = 96.03%, while precision = 4344 / (4344 + 1307) = 76.87%. Quite a bit lower than the accuracy, isn't it? If this is still not clear, take a closer look at the confusion matrix diagram above.

Recall

Formula:

Recall = TP / (TP + FN)

Recall reflects the proportion of actual positives that are correctly judged as positive, i.e. the fraction of all images that really are 5s that my classifier manages to predict as 5. From the same confusion matrix (FN = 60000 - 53272 - 1307 - 4344 = 1077), recall = 4344 / (4344 + 1077) ≈ 80.13%.

F1 score

Formula:

F1 = 2 × Precision × Recall / (Precision + Recall) = 2TP / (2TP + FP + FN)

The F1 score is the harmonic mean of precision and recall; it takes into account both the precision and the recall of the classification model.


[Code screenshot: computing precision, recall, and F1 with scikit-learn]
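A sketch of those computations with scikit-learn's metric functions, continuing from the predictions above:

from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_train_5, y_train_pred))  # TP / (TP + FP)
print(recall_score(y_train_5, y_train_pred))     # TP / (TP + FN)
print(f1_score(y_train_5, y_train_pred))         # harmonic mean of the two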


Why are there so many metrics? This is determined by the nature of the classification task. For example, in a product recommendation system we want to understand customer needs accurately and avoid pushing content that users are not interested in, so precision matters more. In disease detection we do not want to miss any case, so recall matters more. When both matter, the F1 score is a useful reference metric.

For a more intuitive analysis, the model can also be evaluated graphically:

P-R curve

As the name implies, this is a plot with precision and recall as its coordinates.


[Figures: precision-recall plots for the SGD classifier]
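A sketch of how such a plot can be produced; the use of decision_function scores and the plotting details are assumptions, not the original code:

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

# Decision scores rather than hard predictions, so the threshold can be varied.
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method='decision_function')
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()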

The ROC curve

The Receiver Operating Characteristic (ROC) curve is a comprehensive indicator of sensitivity and specificity. It reveals the relationship between the two: by sweeping the threshold on the classifier's continuous output, a series of sensitivity and specificity values is computed, and a curve is drawn with sensitivity on the vertical axis and (1 - specificity) on the horizontal axis. The larger the area under the curve (AUC), the higher the diagnostic accuracy. On the ROC curve, the point closest to the top-left corner of the plot corresponds to the threshold with both high sensitivity and high specificity.

The horizontal axis of the ROC curve is the false positive rate (FPR), defined as:

FPR = FP / (FP + TN)

The vertical axis is the true positive rate (TPR), defined as:

TPR = TP / (TP + FN)

AUC (Area Under Curve): the area under the ROC curve, which for a useful classifier typically lies between 0.5 and 1. As a single number, the AUC directly measures the quality of the classifier: the larger the value, the better.

The AUC can also be interpreted as a probability: if you randomly pick one positive sample and one negative sample, the AUC is the probability that the current classifier assigns the positive sample a higher score than the negative one. The larger the AUC, the more likely the classifier is to rank positive samples ahead of negative samples, and therefore the better it separates the classes.


[Figures: ROC curve of the SGD classifier]
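A sketch of computing and plotting the ROC curve and the AUC, reusing the decision scores from the P-R sketch above; the diagonal reference line is my own addition:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plt.plot(fpr, tpr)                 # ROC curve of the SGD classifier
plt.plot([0, 1], [0, 1], 'k--')    # diagonal = random guessing (AUC 0.5)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

print(roc_auc_score(y_train_5, y_scores))  # area under the ROC curve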


So when should you use the P-R curve and when should you use the ROC curve to evaluate a model? When the positive and negative classes are roughly balanced, the ROC and P-R curves show similar trends, but when negative samples greatly outnumber positive ones they can look completely different: the ROC curve may remain almost unchanged while the P-R curve reflects a much larger change. In addition, when we are more concerned with FP (false positives) than FN (false negatives), the P-R curve should be used for evaluation.

References

[1] Zhou Zhihua. Machine Learning. Tsinghua University Press, 2016.

[2] ROC curve and AUC value as performance metrics for machine learning classifiers

[3] ROC curve and PR curve

[4] Hands-On Machine Learning with Scikit-Learn and TensorFlow
