
Overview

Intuitively speaking, precision is the classifier's ability not to label a negative sample as positive, and recall is the classifier's ability to find all the positive samples.

The F-measures (the F1 and Fβ scores) can be interpreted as a weighted harmonic mean of precision and recall.

An Fβ score lies in the range [0, 1], where 1 indicates the best possible model and 0 the worst.

When β = 1, Fβ reduces to F1, meaning that precision and recall are weighted equally.
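To make the "weighted harmonic mean" interpretation explicit, the Fβ definition (given again in the binary section below) can be rewritten as:

\frac{1}{F_\beta} = \frac{1}{1 + \beta^2} \cdot \frac{1}{\text{precision}} + \frac{\beta^2}{1 + \beta^2} \cdot \frac{1}{\text{recall}}

With β = 1 both weights equal 1/2, so F1 is the ordinary harmonic mean of precision and recall.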

For details on precision, recall, and the F-measures, please refer to my other blog post: 10 Minutes to Master Classification Algorithm Evaluation Metrics.

precision_recall_curve: computes the precision-recall curve from the true labels and the scores given by the classifier, by varying the decision threshold.

average_precision_score: computes the average precision (AP) from the predicted scores. The value lies between 0 and 1, the higher the better. AP is defined as


\text{AP} = \sum_n (R_n - R_{n-1}) P_n

where P_n and R_n are the precision and recall at the n-th threshold.

This implementation is not interpolated, which differs from computing the area under the precision-recall curve with the trapezoidal rule: the latter uses linear interpolation and can be overly optimistic.
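As a quick illustration, here is a minimal sketch (reusing the toy labels and scores from the binary example later in this post) that evaluates AP directly from the formula above and compares it with the trapezoidal area under the same curve; sklearn.metrics.auc applies the trapezoidal rule:

from sklearn import metrics
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = metrics.precision_recall_curve(y_true, y_scores)

# AP = sum_n (R_n - R_{n-1}) P_n, evaluated directly on the curve points
ap_by_hand = -np.sum(np.diff(recall) * precision[:-1])
print(ap_by_hand)                                          # 0.8333...
print(metrics.average_precision_score(y_true, y_scores))   # 0.8333...

# Linearly interpolated (trapezoidal) area under the same curve, for comparison
print(metrics.auc(recall, precision))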

Note: this implementation is restricted to binary classification and multi-label classification tasks.

The following functions let you compute precision, recall, and F-measures:

Function                                              Description
average_precision_score(y_true, y_score, *)           Compute average precision (AP) from prediction scores
f1_score(y_true, y_pred, *[, labels, ...])            Compute the F1 score
fbeta_score(y_true, y_pred, *, beta[, ...])           Compute the F-beta score
precision_recall_curve(y_true, probas_pred, *)        Compute precision-recall pairs for different probability thresholds
precision_recall_fscore_support(y_true, y_pred, *)    Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred, *[, labels, ...])     Compute the precision
recall_score(y_true, y_pred, *[, labels, ...])        Compute the recall

Note:

The precision_recall_curve function is restricted to the binary case. The average_precision_score function works only for binary classification and multi-label classification tasks.

Binary classification scenario

In binary classification tasks, the terms "positive" and "negative" refer to the classifier's prediction, and the terms "true" and "false" refer to whether that prediction matches the external judgment (the actual value). Given these definitions, we can formulate the following table:

                        Actual class: positive     Actual class: negative
Predicted positive      tp (true positive)         fp (false positive)
Predicted negative      fn (false negative)        tn (true negative)
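As a minimal sketch (using the same toy labels as the sample code below), the four cells of this table can be read off scikit-learn's confusion matrix:

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [0, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 2 0 1 1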

In this case, precision, recall, and the F-measure are defined as follows:


\text{precision} = \frac{tp}{tp + fp},

\text{recall} = \frac{tp}{tp + fn},

F_\beta = (1 + \beta^2) \frac{\text{precision} \times \text{recall}}{\beta^2\, \text{precision} + \text{recall}}.

Sample code:

from sklearn import metrics
import numpy as np
import pprint

y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]

# The default value of the average parameter is 'binary', which is used for binary classification
print(metrics.precision_score(y_true, y_pred))
print(metrics.precision_score(y_true, y_pred, average='binary'))

print(metrics.recall_score(y_true, y_pred))
print(metrics.recall_score(y_true, y_pred, average='binary'))

print(metrics.f1_score(y_true, y_pred))
print(metrics.f1_score(y_true, y_pred, average='binary'))


print(metrics.fbeta_score(y_true, y_pred, beta=0.5))

print(metrics.fbeta_score(y_true, y_pred, beta=1))

print(metrics.fbeta_score(y_true, y_pred, beta=2))
print("-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --")

pprint.pprint(metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5))
print("-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --")

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# The PR curve is plotted from precision and recall: recall on the x-axis, precision on the y-axis.
# A series of thresholds is set; for each threshold the corresponding recall and precision are computed, giving the points of the PR curve.
precision, recall, thresholds = metrics.precision_recall_curve(y_true, y_scores)
print("", precision, "\n", recall, "\n", thresholds)
# y_true are the true labels, y_scores are the predicted probabilities (scores), thresholds are the decision thresholds.
# If y_score >= the threshold, the sample is predicted positive; otherwise it is predicted negative.
# Note that the last values of precision and recall are 1 and 0 respectively, and they have no corresponding threshold.

# In this example the data set contains 2 actual positive samples and 2 actual negative samples.

# If index=0, thresholds[index]=0.35, and the predicted labels are [0, 1, 1, 1]:
# tp=2, fp=1, fn=0, so precision=0.67, recall=1

# If index=1, thresholds[index]=0.4, and the predicted labels are [0, 1, 0, 1]:
# tp=1, fp=1, fn=1, so precision=0.5, recall=0.5

# If index=2, thresholds[index]=0.8, and the predicted labels are [0, 0, 0, 1]:
# tp=1, fp=0, fn=1, so precision=1, recall=0.5


print("-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --")
print(metrics.average_precision_score(y_true, y_scores))

Running results:

1.0
1.0
0.5
0.5
0.6666666666666666
0.6666666666666666
0.8333333333333334
0.6666666666666666
0.5555555555555556
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
(array([0.66666667, 1.        ]), array([1. , 0.5]), array([0.71428571, 0.83333333]), array([2, 2]))
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 [0.66666667 0.5        1.         1.        ]
 [1.  0.5 0.5 0. ]
 [0.35 0.4  0.8 ]
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
0.8333333333333333

Multi-class and multi-label scenarios

In multi-class and multi-label classification tasks, the concepts of precision, recall, and the F-measure can be applied to each label independently.

There are several ways to combine results across labels, specified by the average argument of the average_precision_score (multi-label only), f1_score, fbeta_score, precision_recall_fscore_support, precision_score, and recall_score functions.

Note: if all labels are included, "micro" averaging in a multi-class setting produces precision, recall, and F values that are all identical to accuracy.

Also note that "weighted" averaging may produce an F-score that does not lie between precision and recall.

To make this clearer, refer to the following notation:

  • y denotes the set of predicted (sample, label) pairs (the classifier's output)
  • \hat{y} denotes the set of true (sample, label) pairs (the actual values)
  • L denotes the set of labels
  • S denotes the set of samples
  • y_s denotes the subset of y for sample s
  • y_l denotes the subset of y with label l
  • Similarly, \hat{y}_s and \hat{y}_l are subsets of \hat{y}
  • P(A, B) := \frac{\left| A \cap B \right|}{\left| A \right|} denotes precision, where A is the set of predicted pairs and B is the set of true pairs
  • R(A, B) := \frac{\left| A \cap B \right|}{\left| B \right|} denotes recall, where A is the set of predicted pairs and B is the set of true pairs
  • F_\beta(A, B) := (1 + \beta^2) \frac{P(A, B) \times R(A, B)}{\beta^2 P(A, B) + R(A, B)}

The metrics are then defined, for each value of the average parameter, as follows:

"micro":
    Precision: P(y, \hat{y})
    Recall: R(y, \hat{y})
    F_beta: F_\beta(y, \hat{y})

"samples":
    Precision: \frac{1}{\left| S \right|} \sum_{s \in S} P(y_s, \hat{y}_s)
    Recall: \frac{1}{\left| S \right|} \sum_{s \in S} R(y_s, \hat{y}_s)
    F_beta: \frac{1}{\left| S \right|} \sum_{s \in S} F_\beta(y_s, \hat{y}_s)

"macro":
    Precision: \frac{1}{\left| L \right|} \sum_{l \in L} P(y_l, \hat{y}_l)
    Recall: \frac{1}{\left| L \right|} \sum_{l \in L} R(y_l, \hat{y}_l)
    F_beta: \frac{1}{\left| L \right|} \sum_{l \in L} F_\beta(y_l, \hat{y}_l)

"weighted":
    Precision: \frac{1}{\sum_{l \in L} \left| \hat{y}_l \right|} \sum_{l \in L} \left| \hat{y}_l \right| P(y_l, \hat{y}_l)
    Recall: \frac{1}{\sum_{l \in L} \left| \hat{y}_l \right|} \sum_{l \in L} \left| \hat{y}_l \right| R(y_l, \hat{y}_l)
    F_beta: \frac{1}{\sum_{l \in L} \left| \hat{y}_l \right|} \sum_{l \in L} \left| \hat{y}_l \right| F_\beta(y_l, \hat{y}_l)

None:
    Precision: \langle P(y_l, \hat{y}_l) \mid l \in L \rangle (one value per label)
    Recall: \langle R(y_l, \hat{y}_l) \mid l \in L \rangle
    F_beta: \langle F_\beta(y_l, \hat{y}_l) \mid l \in L \rangle
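As a sanity check, here is a minimal sketch (reusing the toy multi-class labels from the sample code below) that reproduces the "macro" and "weighted" rows of the table by hand from the per-label precisions:

from sklearn import metrics
import numpy as np

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Per-label precision P(y_l, \hat{y}_l) and support |\hat{y}_l|
p_l = metrics.precision_score(y_true, y_pred, average=None)
support = np.bincount(y_true)

print(p_l.mean())                                # "macro": unweighted mean over labels
print(metrics.precision_score(y_true, y_pred, average='macro'))

print(np.average(p_l, weights=support))          # "weighted": support-weighted mean
print(metrics.precision_score(y_true, y_pred, average='weighted'))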

Remark:

For a detailed description of the classification evaluation metrics (precision, recall, and F-measures), please refer to my other blog post: 10 Minutes to Master Classification Algorithm Evaluation Metrics.

For a detailed description of the average parameter, please refer to my other blog post: Sklearn overview of the metrics functions for evaluating models in different classification scenarios.

Sample code:

from sklearn import metrics
import pprint

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(metrics.precision_score(y_true, y_pred, average='macro'))
print(metrics.recall_score(y_true, y_pred, average='micro'))
print(metrics.f1_score(y_true, y_pred, average='weighted'))
print(metrics.fbeta_score(y_true, y_pred, average='macro', beta=0.5))
print("+ + + + + + + + + + + +")

# For multi-class problems with a "negative class", some labels can be excluded:
# For example, after excluding label 0, no label is correctly recalled
print(metrics.recall_score(y_true, y_pred, labels=[1, 2], average='micro'))

# Labels not present in the data sample may be taken into account in the macro average.
print(metrics.precision_score(y_true, y_pred, labels=[0, 1, 2, 3], average='macro'))
print("-- -- -- -- -- -- -- -- -- -- -- --")

# Compute precision, recall, F-value and support (number of true instances) for each label
pprint.pprint(metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5, average=None))
print("-- -- -- -- -- -- -- -- -- -- -- --")

# Compute the metrics only for the specified labels
print(metrics.precision_score(y_true, y_pred, average=None, labels=[0, 2]))
print(metrics.recall_score(y_true, y_pred, average=None, labels=[0, 2]))
print("-- -- -- -- -- -- -- -- -- -- -- --")

# Include all labels and compute the metrics for each label
print(metrics.precision_score(y_true, y_pred, average=None))
print(metrics.recall_score(y_true, y_pred, average=None))
print("-- -- -- -- -- -- -- -- -- -- -- --")

# Compute the macro average over the specified labels
print(metrics.precision_score(y_true, y_pred, average='macro', labels=[0, 2]))
print(metrics.recall_score(y_true, y_pred, average='macro', labels=[0, 2]))


Running results:

0.2222222222222222
0.3333333333333333
0.26666666666666666
0.23809523809523805
+ + + + + + + + + + + +
0.0
0.16666666666666666
-- -- -- -- -- -- -- -- -- -- -- --
(array([0.66666667, 0.        , 0.        ]),
 array([1., 0., 0.]),
 array([0.71428571, 0.        , 0.        ]),
 array([2, 2, 2]))
-- -- -- -- -- -- -- -- -- -- -- --
[0.66666667 0.        ]
[1. 0.]
-- -- -- -- -- -- -- -- -- -- -- --
[0.66666667 0.         0.        ]
[1. 0. 0.]
-- -- -- -- -- -- -- -- -- -- -- --
0.3333333333333333
0.5

In the multi-class scenario with all labels included, the sample code for micro averaging is as follows:

from sklearn import metrics
y_true = [0, 1, 2, 0, 4, 2, 3]
y_pred = [0, 1, 1, 0, 4, 1, 2]
print(metrics.recall_score(y_true, y_pred, average='micro'))
print(metrics.f1_score(y_true, y_pred, average='micro'))
print(metrics.precision_score(y_true, y_pred, average='micro'))
print(metrics.accuracy_score(y_true, y_pred))

Running results:

0.5714285714285714
0.5714285714285714
0.5714285714285714
0.5714285714285714

Conclusion

In binary classification scenarios, use average='binary' (the default). In multi-class scenarios, average is usually 'micro', 'macro', or 'weighted'. In multi-label classification scenarios, 'samples' averaging is also available.
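A minimal sketch (with hypothetical toy data) illustrating the typical average choice for each scenario:

from sklearn import metrics
import numpy as np

# Binary classification: average='binary' (the default)
print(metrics.f1_score([0, 1, 0, 1], [0, 1, 0, 0], average='binary'))

# Multi-class: 'micro', 'macro' or 'weighted'
print(metrics.f1_score([0, 1, 2, 2], [0, 1, 2, 1], average='macro'))

# Multi-label (binary indicator matrix): 'samples' is also available
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
print(metrics.f1_score(y_true, y_pred, average='samples'))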

Reference documentation

  • Precision, recall and F-measures