Classification evaluation metrics include accuracy, precision, recall, AUC, the F1 score, etc.

1. Accuracy

Accuracy is the proportion of correctly classified samples out of the total number of samples.

Conversely, the proportion of misclassified samples out of the total number of samples is called the "error rate".

Let’s say that a of m samples are misclassified.


acc = 1 - a/m
# calculate accuracy: fraction of predictions that match the true labels
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))  # y_test: true labels, y_pred: predicted labels

Null accuracy

Null accuracy is the accuracy that would be obtained by always predicting the most frequent class.

import numpy as np

y_count = np.bincount(y_test)              # number of samples per class
ii = np.nonzero(y_count)[0]                # classes that actually appear
print(list(zip(ii, y_count[ii])))          # (class, count) pairs
print('null acc', y_count.max() / len(y_test))  # proportion of the largest class

Per-class accuracy

import numpy as np
from collections import Counter

def cal_acc(true_labels, pred_labels):
    """true_labels: e.g. [2, 1, 4]; pred_labels: e.g. [1, 2, 3].
    Returns the accuracy of each class as a dict {class: acc}."""
    total = Counter(true_labels)           # number of samples per class
    keys = np.unique(true_labels)
    acc = {key: 0 for key in keys}
    for i in range(len(pred_labels)):
        if pred_labels[i] == true_labels[i]:
            acc[pred_labels[i]] += 1       # count correct predictions per class
    for key in keys:
        acc[key] = acc[key] / total[key]   # correct / total for that class
    return acc
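
For example, with made-up labels (purely illustrative values):

print(cal_acc([1, 1, 2, 2], [1, 2, 2, 2]))  # per-class accuracy: class 1 -> 0.5, class 2 -> 1.0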

2. AUC, Precision, Recall, ROC curve

AUC is a metric for evaluating binary classifiers; it equals the probability that a randomly chosen positive sample is assigned a higher predicted score than a randomly chosen negative sample.

Precision P = TP / (TP + FP), i.e. the proportion of samples predicted as positive that are actually positive.

Recall R = TP / (TP + FN), i.e. the proportion of actual positive samples that are predicted as positive.

Accuracy acc = (TP + TN) / (P + N), where P and N are the numbers of positive and negative samples.
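
As a quick sketch, these three quantities can be computed with scikit-learn; the toy binary labels below are illustrative only:

import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 0])   # true labels (toy data)
y_pred = np.array([1, 0, 0, 1, 1, 0])   # predicted labels (toy data)

print('precision', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('recall', recall_score(y_true, y_pred))        # TP / (TP + FN)
print('accuracy', accuracy_score(y_true, y_pred))    # (TP + TN) / (P + N)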

ROC curve and AUC area

FPR: false positive rate (the x-axis of the ROC curve);

FPR is the proportion of negative samples that are incorrectly predicted as positive.


FPR = \frac{FP}{TN + FP}

TPR: true positive rate (the y-axis of the ROC curve);

TPR is the proportion of actual positive samples that are correctly predicted as positive.


TPR = \frac{TP}{TP + FN}

The ROC curve (Receiver Operating Characteristic curve) is drawn with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis.

The area enclosed by the ROC curve and the x-axis is called the AUC, which can also be used as a performance measure of the classifier: the larger the area, the better the classifier. Note that since we need a series of (FPR, TPR) pairs, the classifier must output the probability of being positive rather than only a hard label.

Suppose we have obtained the probability output for every sample (the probability of belonging to the positive class); the question now is how to vary the "discrimination threshold". We rank the test samples by their predicted probability of being positive.

The figure below gives an example with 20 test samples. The column "Class" is the true label of each sample (p for positive, n for negative), and "Score" is the predicted probability that the sample is positive.

In order of probability:

Next, we take each "Score" value in turn, from high to low, as the threshold. When a test sample's predicted probability of being positive is greater than or equal to the threshold, it is classified as positive; otherwise it is classified as negative.

For example, for the fourth sample in the figure, whose "Score" is 0.6, samples 1, 2, 3, and 4 are classified as positive because their scores are greater than or equal to 0.6, while the remaining samples are classified as negative. Each choice of threshold yields one (FPR, TPR) pair, i.e. one point on the ROC curve. In this way we obtain 20 pairs of FPR and TPR values, which are plotted as the ROC curve below:

Setting the threshold to 1 and to 0 gives the two points (0, 0) and (1, 1) on the ROC curve, respectively; connecting all the (FPR, TPR) points yields the ROC curve, and the more threshold values we use, the smoother the curve. With a threshold of 1, no samples are predicted positive, so TPR = FPR = 0. With a threshold of 0, every sample is predicted positive: all positive samples are predicted correctly, so TPR = 1, and all negative samples are predicted as positive, so FPR = 1.
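
This threshold sweep is what sklearn.metrics.roc_curve performs; a minimal sketch with toy labels and scores (illustrative values only):

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 1, 0, 1, 0, 0])                 # true labels (toy data)
y_scores = np.array([0.9, 0.7, 0.6, 0.4, 0.3, 0.1])   # predicted probability of being positive

fpr, tpr, thresholds = roc_curve(y_true, y_scores)    # one (fpr, tpr) point per threshold
print('AUC', auc(fpr, tpr))                           # area under the ROC curve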

Several properties of ROC curve:

  1. The closer the ROC curve is to the upper-left corner, the higher the accuracy; the threshold corresponding to the point closest to the upper-left corner is the one with the smallest total error (i.e. the smallest sum of FP and FN);
  2. The ROC curve does not change as the class distribution changes, so it can be used to evaluate models on imbalanced datasets.

Code to calculate the AUC:

Formula 1: According to the definition of AUC (the area under the ROC curve),

AUC = \sum \frac{FP}{N} \cdot \frac{TP}{P}

which, after discretization, gives

AUC = \frac{\sum I\left(P_{\text{positive}}, P_{\text{negative}}\right)}{M \times N}

Here M and N denote the numbers of positive and negative samples respectively, and I is the indicator function: it takes the value 1 when the positive sample's predicted probability is larger than the negative sample's, 0.5 when they are equal, and 0 otherwise.

import numpy as np
from sklearn.metrics import roc_auc_score

def naive_auc(labels, preds):
    """AUC = I(positive, negative) / (M * N).
    M: number of positive samples; N: number of negative samples.
    For every (positive, negative) pair, count 1 if the positive sample's score is
    higher and 0.5 if the scores are equal, then divide by the total number of pairs.
    Complexity is dominated by the sort, O(K log K) for K samples."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    total_pair = n_pos * n_neg

    labels_preds = zip(labels, preds)
    # sort by score (and by label to break ties), O((M+N)log(M+N))
    labels_preds = sorted(labels_preds, key=lambda x: (x[1], x[0]))
    accumulated_neg = 0
    satisfied_pair = 0
    for i in range(len(labels_preds)):
        if labels_preds[i][0] == 1:
            # positive sample: every negative seen so far has a lower (or equal) score
            satisfied_pair += accumulated_neg
            # scores can tie between positives and negatives; each tied pair counts 0.5
            j = i - 1
            while j >= 0:
                if labels_preds[j][0] == 0:
                    if labels_preds[i][1] == labels_preds[j][1]:
                        satisfied_pair -= 0.5
                    else:
                        break
                j -= 1
        else:
            accumulated_neg += 1
    auc = satisfied_pair / float(total_pair)
    return auc


if __name__ == "__main__":
    y_true = np.array([1, 1, 0, 0, 1, 1, 0])
    y_scores = np.array([0.8, 0.7, 0.5, 0.5, 0.5, 0.5, 0.3])

    auc1 = roc_auc_score(y_true, y_scores)
    auc2 = naive_auc(y_true, y_scores)
    print("auc1", auc1)
    print("auc2", auc2)

Running results:

auc1 0.8333333333333334
auc2 0.8333333333333334

Formula 2:

An equivalent form is

AUC = \frac{\sum_{i \in \text{positive class}} \operatorname{rank}_i - \frac{M(1+M)}{2}}{M \times N}

Here the samples are sorted by predicted score in ascending order, and rank runs from 1 (lowest score) to M + N (highest score); M and N are again the numbers of positive and negative samples.

def cal_auc2(labels, preds):
    """AUC via the rank formula: (sum of positive ranks - M(M+1)/2) / (M*N)."""
    # sort by score ascending; among tied scores, negatives come before positives
    sort_label = sorted(zip(labels, preds), key=lambda x: (x[1], x[0]))
    pos = sum(labels)
    neg = len(labels) - pos
    total_pairs = pos * neg
    satisfied_pair = 0
    for rank, item in enumerate(sort_label):
        if item[0] == 1:
            satisfied_pair += (rank + 1)   # ranks are 1-based
            # a tie between a positive and a negative score should only count 0.5
            j = rank - 1
            while j >= 0:
                if sort_label[j][0] == 0:
                    if item[1] == sort_label[j][1]:
                        satisfied_pair -= 0.5
                    else:
                        break
                j -= 1
    auc = (satisfied_pair - pos * (1 + pos) / 2) / total_pairs
    return auc
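
As a quick check, using the same y_true and y_scores as in the naive_auc example above, cal_auc2 reproduces the same value:

print(cal_auc2(y_true, y_scores))  # 0.8333..., matching roc_auc_score and naive_auc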

The difference between ROC curve and P-R curve

The P-R curve is plotted with precision on the vertical axis and recall on the horizontal axis. When the number of negative samples in the test set increases, the P-R curve changes significantly, while the ROC curve remains largely unchanged:
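
A minimal sketch for obtaining both curves with scikit-learn, assuming binary arrays y_true (true labels) and y_scores (predicted positive-class probabilities):

from sklearn.metrics import precision_recall_curve, roc_curve

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_scores)  # points of the P-R curve
fpr, tpr, roc_thresholds = roc_curve(y_true, y_scores)                       # points of the ROC curve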

3. F1 score


F_1 = \frac{2}{1/\text{precision} + 1/\text{recall}}

or equivalently:


F_1 = \frac{2 \times TP}{N + TP - TN}

where N is the total number of samples.
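
As a quick sanity check with made-up counts TP = 1, FP = 1, FN = 1, TN = 3 (so N = 6): the usual form gives F_1 = 2TP / (2TP + FP + FN) = 2/4 = 0.5, and the formula above gives F_1 = (2 × 1) / (6 + 1 - 3) = 0.5 as well, since N + TP - TN = 2TP + FP + FN.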

In practice we often search over thresholds to find the one that maximizes F_1 (by default a sample is predicted positive when the model's predicted probability exceeds 0.5).

import numpy as np

def f1_smart(y_true, y_pred):
    """f1 = 2*P*R/(P + R); P = TP/(TP + FP); R = TP/(TP + FN).
    Returns the best F1 over all thresholds and the corresponding threshold."""
    args = np.argsort(y_pred)   # indices sorted by predicted score, ascending
    tp = y_true.sum()           # total number of positives
    # for each cut point, fs = TP / (predicted positives + actual positives), so F1 = 2 * fs
    fs = (tp - np.cumsum(y_true[args[:-1]])) / np.arange(y_true.shape[0] + tp - 1, tp, -1)
    res_idx = np.argmax(fs)
    # the threshold is the midpoint between the two scores around the best cut point
    return 2 * fs[res_idx], (y_pred[args[res_idx]] + y_pred[args[res_idx + 1]]) / 2

y_true = np.array([1, 1, 0, 0, 0])
y_pred = np.array([0.2, 0.3, 0.5, 0.1, 0.1])
f1, threshold = f1_smart(y_true, y_pred)

A similar method for finding the optimal threshold:

from tqdm import tqdm
from sklearn.metrics import f1_score

def threshold_search(y_true, y_proba):
    """Grid-search the threshold in [0, 1) with step 0.01 and keep the best F1."""
    best_threshold = 0
    best_score = 0
    for threshold in tqdm([i * 0.01 for i in range(100)], disable=True):
        score = f1_score(y_true=y_true, y_pred=y_proba > threshold)
        if score > best_score:
            best_threshold = threshold
            best_score = score
    search_result = {'threshold': best_threshold, 'f1': best_score}
    return search_result

4. Averaging

F1, ROC, and AUC are defined for binary classification, but they can also be applied to multi-class problems. For example, with three classes we can compute a precision P and a recall R for each class (treating it as one-vs-rest) and then macro-average them; alternatively, we can first accumulate the binary counts TP, TN, FP, FN over the three classes and then compute P and R from those totals.

Macro-averaging:

The macro average is the arithmetic mean of a metric computed independently for each class. Macro precision P_macro, macro recall R_macro, and macro F1 are defined as follows.


P_{\text{macro}} = \frac{1}{n} \sum_{i=1}^{n} P_{i}

R_{\text{macro}} = \frac{1}{n} \sum_{i=1}^{n} R_{i}

F_{\text{macro}} = \frac{2 \times P_{\text{macro}} \times R_{\text{macro}}}{P_{\text{macro}} + R_{\text{macro}}}

Micro-averaging:

The per-class counts TP_i, FP_i, FN_i, TN_i are accumulated over all classes; equivalently, every sample in the dataset is counted, regardless of its class, in a single global confusion matrix, and the metrics are then computed from these global counts.
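
In terms of the accumulated per-class counts, this gives:

P_{\text{micro}} = \frac{\sum_{i=1}^{n} TP_{i}}{\sum_{i=1}^{n} TP_{i} + \sum_{i=1}^{n} FP_{i}}

R_{\text{micro}} = \frac{\sum_{i=1}^{n} TP_{i}}{\sum_{i=1}^{n} TP_{i} + \sum_{i=1}^{n} FN_{i}}

F_{\text{micro}} = \frac{2 \times P_{\text{micro}} \times R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}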

The difference:

Macro-averaging treats every class equally, so it is more strongly influenced by performance on the rare classes; micro-averaging treats every sample equally, so it is dominated by performance on the frequent classes. When the classes are balanced, either metric can be used.
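
A minimal sketch with scikit-learn, using toy three-class labels (illustrative only) and the average parameter to switch between the two schemes:

from sklearn.metrics import precision_score, f1_score

y_true = [1, 1, 2, 2, 3, 3]   # true labels (toy data)
y_pred = [1, 2, 2, 3, 3, 1]   # predicted labels (toy data)

print('macro P ', precision_score(y_true, y_pred, average='macro'))
print('micro P ', precision_score(y_true, y_pred, average='micro'))
print('macro F1', f1_score(y_true, y_pred, average='macro'))
print('micro F1', f1_score(y_true, y_pred, average='micro'))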

5. Confusion matrix

The confusion matrix tabulates, for every (true class, predicted class) pair, the number of samples; precision, recall, and the F1-score can all be derived from it.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

ref = np.array([1, 1, 2, 2, 3, 3])    # true labels
pred = np.array([1, 2, 2, 3, 3, 1])   # predicted labels

report = classification_report(ref, pred, digits=4)
print("report", report)
conf_mat = confusion_matrix(ref, pred)
print("Accuracy", np.sum(np.diag(conf_mat)) / len(ref))

Here precision P, recall R, and F1 are obtained from classification_report, while accuracy is computed as the sum of the diagonal elements of the confusion matrix (the correctly classified samples) divided by the total number of samples.

report              precision    recall  f1-score   support

          1     0.5000    0.5000    0.5000         2
          2     0.5000    0.5000    0.5000         2
          3     0.5000    0.5000    0.5000         2

avg / total     0.5000    0.5000    0.5000         6

Accuracy 0.5
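
Per-class precision and recall can also be read directly off the confusion matrix; a small sketch reusing conf_mat from the example above (in scikit-learn's convention, rows are true classes and columns are predicted classes):

precision_per_class = np.diag(conf_mat) / conf_mat.sum(axis=0)  # correct / predicted count per class
recall_per_class = np.diag(conf_mat) / conf_mat.sum(axis=1)     # correct / true count per class
print(precision_per_class, recall_per_class)                    # both [0.5 0.5 0.5] for this example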

6. MAPE

In regression, although the RMSE loss describes the deviation between the predicted value and the true value, the RMSE metric degrades badly when there are outliers. MAPE is a more robust metric than RMSE.

MAPE, the mean absolute percentage error, is defined as

MAPE = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|

where A_t is the actual value and F_t is the forecast value.

Compared with RMSE, MAPE normalizes the error at each point, which reduces the influence of the absolute error caused by an outlier.

import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)

    # Note: assumes y_true contains no zeros and does not handle
    # mixed 1d representations
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

7. SMAPE

SMAPE (symmetric mean absolute percentage error) is a modified, symmetric version of MAPE.

With MAPE, when the actual value is small, even a small absolute deviation produces a very large percentage error; SMAPE mitigates this by normalizing by the average magnitude of the actual and forecast values.

SMAPE = \frac{100\%}{n} \sum_{t=1}^{n} \frac{|F_t - A_t|}{(|A_t| + |F_t|)/2}

where A_t is the actual value and F_t is the forecast value.

Implementation:

def smape(A, F):
    # A, F: arrays of actual values and forecasts; returns SMAPE in percent
    return 100 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))
