# Original link:tecdat.cn/?p=11160

For classification problems, classifier performance is typically defined in terms of the confusion matrix associated with the classifier. From the entries of the confusion matrix, sensitivity (recall), specificity, and precision can be calculated.

All of these performance metrics are readily available for binary classification problems.

## Non-scoring classifier data

To demonstrate the performance measures for non-scoring classifiers in the multi-class setting, let us consider a classification problem with \(N = 100\) observations and five classes denoted by \(G = \{1, \ldots, 5\}\):

```
ref.labels <- c(rep("A", 45), rep("B" , 10), rep("C", 15), rep("D", 25), rep("E", 5))
predictions <- c(rep("A", 35), rep("E", 5), rep("D", 5),
rep("B", 9), rep("D", 1),
rep("C", 7), rep("B", 5), rep("C", 3),
rep("D", 23), rep("C", 2),
rep("E", 1), rep("A", 2), rep("B", 2))
df <- data.frame("Prediction" = predictions, "Reference" = ref.labels, stringsAsFactors = TRUE) # factors, since levels() is used later
```

## Accuracy and weighted accuracy

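In general, multi-class accuracy is defined as the fraction of correct predictions:

\[
\text{accuracy} = \frac{1}{N} \sum_{k=1}^{|G|} \sum_{x: g(x) = k} I\left(g(x) = \hat{g}(x)\right)
\]

where \(g(x)\) is the true class of observation \(x\), \(\hat{g}(x)\) is its predicted class, and \(I\) is the indicator function, which returns 1 if the classes match and 0 otherwise.

To make the measure more sensitive to the performance on individual classes, we can assign a weight \(w_k\) to every class such that \(\sum_{k=1}^{|G|} w_k = 1\). The higher the value of \(w_k\) for a class, the greater the influence of observations from that class on the weighted accuracy. The weighted accuracy is the weighted sum of the per-class accuracies:

\[
\text{weighted accuracy} = \sum_{k=1}^{|G|} \frac{w_k}{N_k} \sum_{x: g(x) = k} I\left(g(x) = \hat{g}(x)\right)
\]

where \(N_k\) is the number of observations whose true class is \(k\).

To weight all classes equally, we can set \(w_k = \frac{1}{|G|}\) for all \(k \in \{1, \ldots, |G|\}\). Note that when using anything other than equal weights, it is hard to find a rational argument for a particular combination of weights.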

### Calculating accuracy and weighted accuracy

The accuracy is easy to calculate:

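```
calculate.accuracy <- function(predictions, ref.labels) {
  # fraction of predictions that match the reference labels
  hits <- as.character(predictions) == as.character(ref.labels)
  return(length(which(hits)) / length(ref.labels))
}
calculate.w.accuracy <- function(predictions, ref.labels, weights) {
  lvls <- levels(as.factor(ref.labels))
  if (length(weights) != length(lvls)) {
    stop("Number of weights should agree with the number of classes.")
  }
  if (!isTRUE(all.equal(sum(weights), 1))) { # tolerant floating-point comparison
    stop("Weights do not sum to 1")
  }
  # accuracy within each reference class
  accs <- sapply(lvls, function(x) {
    idx <- which(ref.labels == x)
    calculate.accuracy(predictions[idx], ref.labels[idx])
  })
  # weight the per-class accuracies
  return(sum(weights * accs))
}
acc <- calculate.accuracy(df$Prediction, df$Reference)
print(paste0("Accuracy is: ", round(acc, 2)))
n.classes <- length(unique(df$Reference))
weights <- rep(1 / n.classes, n.classes) # equal weights
w.acc <- calculate.w.accuracy(df$Prediction, df$Reference, weights)
print(paste0("Weighted accuracy is: ", round(w.acc, 2)))
```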

```
## [1] "Accuracy is: 0.78"
```

```
## [1] "Weighted accuracy is: 0.69"
```
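As a check on the weighted accuracy, the per-class accuracies can be read off the simulated labels (35 of 45 correct for A, 9 of 10 for B, 10 of 15 for C, 23 of 25 for D, and 1 of 5 for E). Assuming equal weights \(w_k = 1/5\), the arithmetic works out as follows:

\[
\frac{1}{5}\left(\frac{35}{45} + \frac{9}{10} + \frac{10}{15} + \frac{23}{25} + \frac{1}{5}\right) \approx \frac{1}{5}\left(0.778 + 0.900 + 0.667 + 0.920 + 0.200\right) \approx 0.69
\]

The overall accuracy is \((35 + 9 + 10 + 23 + 1)/100 = 0.78\). It is higher because the well-predicted large classes A and D dominate the count, whereas equal weights give the poorly predicted small class E the same influence as every other class.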

## Micro and macro averages of the F1 score

Micro and macro averages represent two ways of interpreting confusion matrices in the multi-class setting. Here, we need to compute a confusion matrix for every class \(g_i \in G = \{1, \ldots, K\}\) such that the \(i\)-th confusion matrix considers class \(g_i\) as the positive class and all other classes \(g_j\) with \(j \neq i\) as the negative class. Because each confusion matrix pools all observations from the other classes into the negative class, the number of true negatives grows, especially when there are many classes.

To illustrate why the growing number of true negatives is problematic, imagine there are 10 classes with 10 observations each. Then the confusion matrix for one of the classes may have the following structure:

| Prediction/Reference | Class 1 | Other classes |
|---|---|---|
| Class 1 | 8 | 10 |
| Other classes | 2 | 80 |

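Based on this matrix, the specificity would be \(\frac{80}{80 + 10} = 88.9\%\), even though class 1 was correctly predicted in only 8 of 18 instances (precision 44.4%). Because the number of true negatives dominates, the specificity is inflated. The F1 score, which is built from precision and recall only, does not depend on the true negatives.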
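The effect can be reproduced with a few lines of base R. This is a minimal sketch that rebuilds the example table by hand. The object names (`tab`, `tp`, `fp`, `fn`, `tn`) are chosen here for illustration only.

```r
# One-vs-all confusion matrix from the example above
# (rows: prediction, columns: reference)
tab <- matrix(c(8, 2, 10, 80), nrow = 2,
              dimnames = list(Prediction = c("Class 1", "Other"),
                              Reference  = c("Class 1", "Other")))
tp <- tab["Class 1", "Class 1"]
fp <- tab["Class 1", "Other"]
fn <- tab["Other", "Class 1"]
tn <- tab["Other", "Other"]

specificity <- tn / (tn + fp) # depends on the true negatives
precision   <- tp / (tp + fp) # ignores the true negatives
recall      <- tp / (tp + fn) # ignores the true negatives
print(round(c(specificity = specificity, precision = precision, recall = recall), 3))
```

Adding further negative observations to the bottom-right cell would push the specificity towards 1 while leaving precision and recall unchanged.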

In the following, we will use \(TP_i\), \(FP_i\), and \(FN_i\) to denote the true positives, false positives, and false negatives in the confusion matrix associated with the \(i\)-th class. Furthermore, let precision be denoted by \(P\) and recall by \(R\).
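For reference, the standard definitions of the two averages can be written with the symbols just introduced. The subscripts and the per-class quantities \(P_i = \frac{TP_i}{TP_i + FP_i}\) and \(R_i = \frac{TP_i}{TP_i + FN_i}\) are notation assumed here. The micro average pools the counts of all classes before computing precision and recall:

\[
P_{\rm micro} = \frac{\sum_{i=1}^{|G|} TP_i}{\sum_{i=1}^{|G|} (TP_i + FP_i)}, \qquad
R_{\rm micro} = \frac{\sum_{i=1}^{|G|} TP_i}{\sum_{i=1}^{|G|} (TP_i + FN_i)}, \qquad
F1_{\rm micro} = 2 \, \frac{P_{\rm micro} \, R_{\rm micro}}{P_{\rm micro} + R_{\rm micro}}
\]

The macro average first computes precision and recall per class and then averages them with equal weight:

\[
P_{\rm macro} = \frac{1}{|G|} \sum_{i=1}^{|G|} P_i, \qquad
R_{\rm macro} = \frac{1}{|G|} \sum_{i=1}^{|G|} R_i, \qquad
F1_{\rm macro} = 2 \, \frac{P_{\rm macro} \, R_{\rm macro}}{P_{\rm macro} + R_{\rm macro}}
\]

Here \(F1_{\rm macro}\) is taken as the harmonic mean of the macro-averaged precision and recall, which is what the `get.macro.f1` function further down computes. Averaging the per-class F1 scores is another common convention and generally gives a slightly different value.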

### Computing micro and macro averages in R

Here, I demonstrate how the micro and macro averages of the F1 score can be computed in R.

We will use the `confusionMatrix` function from the `caret` package to determine the confusion matrices, one for each class, stored in the list `cm`.
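The code that builds `cm` is not shown, so the following is a hedged sketch of how it could be done rather than the original listing. It assumes `cm` is a list with one `confusionMatrix` object per class, which is how `cm[[1]]$byClass` is used afterwards. The simulated labels are repeated so that the block runs on its own.

```r
library(caret) # provides confusionMatrix()

# Same simulated labels as above, stored as factors
ref.labels <- c(rep("A", 45), rep("B", 10), rep("C", 15), rep("D", 25), rep("E", 5))
predictions <- c(rep("A", 35), rep("E", 5), rep("D", 5),
                 rep("B", 9), rep("D", 1),
                 rep("C", 7), rep("B", 5), rep("C", 3),
                 rep("D", 23), rep("C", 2),
                 rep("E", 1), rep("A", 2), rep("B", 2))
df <- data.frame(Prediction = factor(predictions), Reference = factor(ref.labels))

# One confusion matrix per class: in the i-th iteration, the i-th class
# is passed as the positive class
cm <- vector("list", length(levels(df$Reference)))
for (i in seq_along(cm)) {
  positive.class <- levels(df$Reference)[i]
  cm[[i]] <- confusionMatrix(df$Prediction, df$Reference, positive = positive.class)
}
```

With more than two classes, `caret` reports one-vs-all statistics for every class in `byClass`, so the `positive` argument here mainly records which class each list element refers to.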

Now we can summarize the performance of all classes:

```
metrics <- c("Precision", "Recall")
print(cm[[1]]$byClass[, metrics])
```

```
##          Precision    Recall
## Class: A 0.9459459 0.7777778
## Class: B 0.5625000 0.9000000
## Class: C 0.8333333 0.6666667
## Class: D 0.7931034 0.9200000
## Class: E 0.1666667 0.2000000
```

These values indicate that, overall, performance is fairly high. However, our hypothetical classifier underperforms for individual classes such as class B (precision) and class E (both precision and recall). We will now examine how the micro- and macro-averaged F1 scores are affected by the predictions of the model.

### Micro-averaged F1: overall performance

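To determine \(F1_{\rm micro}\), we need \(TP_i\), \(FP_i\), and \(FN_i\) for all classes. The function `get.micro.f1` then simply sums up these counts and computes the F1 score as defined above.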
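The definition of `get.micro.f1` is not listed, so here is a minimal sketch of what such a function could look like. It assumes every element of `cm` exposes `$table` (rows: prediction, columns: reference) and `$positive`, as `caret`'s `confusionMatrix` objects do. The helper `get.conf.stats` and the toy objects `truth`, `guess`, and `toy.cm` are hypothetical names used only to keep the block runnable without `caret`.

```r
# Per-class counts from a list of one-vs-all confusion matrices
get.conf.stats <- function(cm) {
  stats <- lapply(cm, function(x) {
    others <- colnames(x$table) != x$positive
    c(tp = x$table[x$positive, x$positive],
      fp = sum(x$table[x$positive, others]), # predicted positive, truly another class
      fn = sum(x$table[others, x$positive])) # truly positive, predicted as another class
  })
  out <- as.data.frame(do.call(rbind, stats))
  rownames(out) <- vapply(cm, function(x) x$positive, character(1))
  out
}

# Micro-averaged F1: pool the counts over all classes, then compute F1
get.micro.f1 <- function(cm) {
  s <- get.conf.stats(cm)
  pr <- sum(s$tp) / (sum(s$tp) + sum(s$fp))
  re <- sum(s$tp) / (sum(s$tp) + sum(s$fn))
  2 * pr * re / (pr + re)
}

# Stand-in for caret objects (hypothetical 3-class toy data)
truth <- factor(c("a", "a", "a", "b", "b", "c"))
guess <- factor(c("a", "a", "b", "b", "c", "c"))
toy.cm <- lapply(levels(truth), function(l) {
  list(table = table(guess, truth), positive = l)
})
print(get.conf.stats(toy.cm))
print(get.micro.f1(toy.cm))
```

True negatives are deliberately left out of the counts, since they are not needed for precision, recall, or F1.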

```
micro.f1 <- get.micro.f1(cm)
print(paste0("Micro F1 is: ", round(micro.f1, 2)))
```

```
## [1] "Micro F1 is: 0.78"
```

With a value of 0.78, \(F1_{\rm micro}\) coincides with the overall accuracy. This is expected when every observation carries exactly one label: each misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the micro-averaged precision and recall both equal \(78/100\).

### Macro-averaged F1: class-specific performance

Since each of the confusion matrices in `cm` already stores the one-vs-all prediction performance, we only need to extract these values from one of the matrices and then calculate \(F1_{\rm macro}\) as defined above:

```
get.macro.f1 <- function(cm) {
  by.class <- cm[[1]]$byClass # a single matrix is sufficient
  re <- sum(by.class[, "Recall"]) / nrow(by.class)    # macro-averaged recall
  pr <- sum(by.class[, "Precision"]) / nrow(by.class) # macro-averaged precision
  f1 <- 2 * ((re * pr) / (re + pr))
  return(f1)
}
macro.f1 <- get.macro.f1(cm)
print(paste0("Macro F1 is: ", round(macro.f1, 2)))
```

```
## [1] "Macro F1 is: 0.68"
```

With a value of 0.68, \(F1_{\rm macro}\) is clearly smaller than the micro-averaged F1 (0.78). Because the macro average gives every class the same weight, the poor precision and recall for the small class E pull it down.

Note that for the current data set, the micro- and macro-averaged F1 relate to each other in much the same way as the overall accuracy (0.78) and the weighted accuracy (0.69).

## Precision-recall curves and AUC

The area under the ROC curve (AUC) is a useful tool for evaluating the quality of class separation for soft classifiers. In the multi-class setting, we can visualize the performance of multi-class models through their one-vs-all precision-recall curves. The AUC can also be generalized to the multi-class setting.

### One-vs-all precision-recall curves

We can visualize the performance of a multi-class model by plotting the performance of \(K\) binary classifiers.

The approach is based on fitting \(K\) one-vs-all classifiers, where in the \(i\)-th iteration class \(g_i\) is set as the positive class and all classes \(g_j\) with \(j \neq i\) are treated together as the negative class. Note that this approach should not be used to plot conventional ROC curves (TPR versus FPR), because the large number of negative instances created by the dichotomization would cause the FPR to be underestimated. Instead, precision and recall are plotted for each one-vs-all classifier.
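The loop that follows uses several objects that are not defined in the text (`response`, `iris.train`, `iris.test`, `colors`, `aucs`) and draws onto an existing plot. A minimal setup sketch follows. The seed and the 60/40 train/test split are assumptions made for illustration, so the exact AUC values reported below may not be reproduced with them.

```r
# Assumed setup for the one-vs-all loop; seed and split ratio are illustrative
data(iris)                # Species holds the three classes
response <- iris$Species
set.seed(12345)
train.idx <- sample(seq_len(nrow(iris)), 0.6 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]

# Empty canvas for the precision-recall curves, one color per class
plot(x = NA, y = NA, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Recall", ylab = "Precision", bty = "n")
colors <- c("red", "blue", "green")
legend("bottomleft", levels(response), lty = 1, bty = "n", col = colors)

aucs <- rep(NA, length(levels(response))) # filled with one AUC per class
```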

```
library(klaR) # NaiveBayes()
library(ROCR) # prediction(), performance()
for (i in seq_along(levels(response))) {
  cur.class <- levels(response)[i]
  # one-vs-all labels: TRUE for the current class, FALSE otherwise
  binary.labels <- as.factor(iris.train$Species == cur.class)
  model <- NaiveBayes(binary.labels ~ ., data = iris.train[, -5])
  pred <- predict(model, iris.test[, -5], type = "raw")
  score <- pred$posterior[, "TRUE"] # posterior for positive class
  test.labels <- iris.test$Species == cur.class
  pred <- prediction(score, test.labels)
  perf <- performance(pred, "prec", "rec")
  roc.x <- unlist(perf@x.values) # recall
  roc.y <- unlist(perf@y.values) # precision
  lines(roc.y ~ roc.x, col = colors[i], lwd = 2)
  # store AUC ("auc" in ROCR is the area under the ROC curve)
  auc <- performance(pred, "auc")
  auc <- unlist(slot(auc, "y.values"))
  aucs[i] <- auc
}
```

```
print(paste0("Mean AUC under the precision-recall curve is: ", round(mean(aucs), 2)))
```

```
## [1] "Mean AUC under the precision-recall curve is: 0.97"
```

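The plot indicates that setosa can be predicted very well, while versicolor and virginica are harder to predict. The mean AUC of 0.97 indicates that the model separates the three classes very well. Note that ROCR's `auc` measure is the area under the ROC curve, so this value is the mean one-vs-all ROC AUC rather than the area under the precision-recall curves.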

### Generalizations of the AUC for the multi-class setting

### Generalized AUC for a single decision value

When a single quantity allows for the separation of the classes, the AUC can be determined using the `multiclass.roc` function from the `pROC` package.
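The call that produced the output below is not shown. As an illustration only, the following sketch applies `multiclass.roc` to the `aSAH` data set that ships with `pROC`, using the multi-level outcome `gos6` as the response and the single marker `s100b` as the predictor. Whether this is the data behind the reported value is an assumption.

```r
library(pROC) # provides multiclass.roc() and the aSAH example data

data(aSAH)
# gos6: outcome with several levels; s100b: a single continuous marker
res <- multiclass.roc(aSAH$gos6, aSAH$s100b)
print(res$auc)
```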

```
## Multi-class area under the curve: 0.654
```

The AUC computed by this function is simply the mean of the AUCs from all pairwise class comparisons.

### The generalized AUC

The generalization of the AUC considered here is the one proposed by Hand and Till (2001).
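In brief, for classes labeled \(1, \ldots, c\) with \(n_i\) observations in class \(i\), the measure of Hand and Till (2001) can be summarized as follows. The symbol \(S_i\) and the layout of the formulas are notation chosen here:

\[
\hat{A}(i \mid j) = \frac{S_i - n_i (n_i + 1) / 2}{n_i \, n_j}, \qquad
\hat{A}(i, j) = \frac{1}{2} \left[ \hat{A}(i \mid j) + \hat{A}(j \mid i) \right], \qquad
M = \frac{2}{c (c - 1)} \sum_{i < j} \hat{A}(i, j)
\]

Here \(S_i\) is the sum of the ranks of the class \(i\) observations when the members of classes \(i\) and \(j\) are ranked together, in increasing order, by their estimated probability of belonging to class \(i\). \(\hat{A}(i \mid j)\) estimates the probability that a randomly drawn member of class \(j\) receives a lower estimated probability of belonging to class \(i\) than a randomly drawn member of class \(i\), and \(M\) is the mean of \(\hat{A}(i, j)\) over all \(c(c-1)/2\) pairs of classes.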

There does not seem to be a publicly available implementation of the multi-class generalization of the AUC due to Hand and Till (2001), so I wrote one. The function `compute.A.conditional` determines \(\hat{A}(i \mid j)\). The `multiclass.auc` function computes \(\hat{A}(i, j)\) for all pairs of classes with \(i < j\) and then averages the resulting values. The output is the generalized AUC \(M\), and its attribute `pair_AUCs` holds the pairwise values \(\hat{A}(i, j)\).
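The helper that determines \(\hat{A}(i \mid j)\) is not listed, so the following is a hedged sketch of it, named `compute.A.conditional` and based on the rank-sum formula of Hand and Till (2001). It assumes `pred.matrix` has one column of estimated class probabilities per class, with column names equal to the class labels. Using `rank()` with its default midranks for ties is a choice made here, and the toy objects `toy` and `truth` are hypothetical.

```r
# Sketch of A(i|j): the probability that a randomly chosen member of class j
# has a lower estimated probability of belonging to class i than a randomly
# chosen member of class i
compute.A.conditional <- function(pred.matrix, i, j, ref.outcome) {
  i.idx <- which(ref.outcome == i)
  j.idx <- which(ref.outcome == j)
  pred.i <- pred.matrix[i.idx, i] # p(G = i) for members of class i
  pred.j <- pred.matrix[j.idx, i] # p(G = i) for members of class j
  ranks <- rank(c(pred.i, pred.j)) # ascending ranks, ties receive midranks
  ni <- length(pred.i)
  nj <- length(pred.j)
  Si <- sum(ranks[seq_len(ni)])    # sum of the ranks of the class i members
  (Si - ni * (ni + 1) / 2) / (ni * nj)
}

# Toy example (hypothetical scores): class "a" is perfectly separated from "b"
toy <- cbind(a = c(0.9, 0.8, 0.3, 0.2), b = c(0.1, 0.2, 0.7, 0.8))
truth <- factor(c("a", "a", "b", "b"))
print(compute.A.conditional(toy, "a", "b", truth))
```

For two classes, this quantity is the usual AUC obtained from the Wilcoxon rank-sum statistic, which is why perfect separation yields a value of 1.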

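```
multiclass.auc <- function(pred.matrix, ref.outcome) {
  labels <- colnames(pred.matrix)
  n.classes <- length(labels)
  class.pairs <- combn(labels, 2, simplify = FALSE)
  # A(i, j) = (A(i|j) + A(j|i)) / 2 for every pair of classes with i < j;
  # relies on the compute.A.conditional helper described above
  A.mean <- vapply(class.pairs, function(p) {
    mean(c(compute.A.conditional(pred.matrix, p[1], p[2], ref.outcome),
           compute.A.conditional(pred.matrix, p[2], p[1], ref.outcome)))
  }, numeric(1))
  names(A.mean) <- vapply(class.pairs, paste, character(1), collapse = "/")
  # M = 2 / (c * (c - 1)) * sum of A(i, j) over all pairs
  M <- 2 / (n.classes * (n.classes - 1)) * sum(A.mean)
  attr(M, "pair_AUCs") <- A.mean
  return(M)
}
model <- NaiveBayes(iris.train$Species ~ ., data = iris.train[, -5])
pred <- predict(model, iris.test[, -5], type = "raw")
pred.matrix <- pred$posterior
ref.outcome <- iris.test$Species
M <- multiclass.auc(pred.matrix, ref.outcome)
print(paste0("Generalized AUC is: ", round(as.numeric(M), 3)))
```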

```
## [1] "Generalized AUC is: 0.988"
```

```
print(attr(M, "pair_AUCs")) # pairwise AUCs
```

```
##    setosa/versicolor     setosa/virginica versicolor/virginica
##            1.0000000            1.0000000            0.9627329
```

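Using this approach, the generalized AUC is 0.988. The pairwise AUCs can be interpreted in the same way as an ordinary AUC: setosa is perfectly separated from the other two species (AUC of 1), whereas versicolor and virginica are slightly harder to tell apart (AUC of 0.96).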

## Summary

For multi-class problems:

- For hard classifiers, you can use the (weighted) accuracy as well as the micro- or macro-averaged F1 score.
- For soft classifiers, you can determine one-vs-all precision-recall curves or use the generalization of the AUC from Hand and Till.