Author | MUSKAN097   Compiled by | VK   Source | Analytics Vidhya

Introduction

You have successfully built a classification model. What do you do now? How do you evaluate its performance, i.e., how well it predicts the outcome? To answer these questions, let's look at the metrics used to evaluate classification models through a simple case study.

Let’s take a closer look at the concept through a case study

In this age of globalization, people often travel from one place to another. Airports can pose risks as passengers wait in lines, check in, visit food vendors, and use facilities such as restrooms. Tracking infected passengers at airports can help prevent the virus from spreading.

Consider that we have a machine learning model that categorizes passengers as COVID-19 positive or negative. There are four possible outcomes when making such predictions:

True Positive (TP): the model predicts that an observation belongs to a class, and it actually does. In this case, passengers who are predicted to be COVID-19 positive and are actually positive.

True Negative (TN): the model predicts that an observation does not belong to a class, and it actually does not. In this case, passengers who are predicted to be COVID-19 negative and are actually negative.

False Positive (FP): the model predicts that an observation belongs to a class, but it actually does not. In this case, passengers who are predicted to be COVID-19 positive but are actually negative.

False Negative (FN): the model predicts that an observation does not belong to a class, but it actually does. In this case, passengers who are predicted to be COVID-19 negative but are actually positive.

Confusion matrix

To better visualize the performance of the model, these four outcomes are arranged in a confusion matrix.
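As a rough sketch of how such a matrix could be computed in practice, here is a short example using scikit-learn. The label lists are invented purely for illustration; 1 marks a COVID-19 positive passenger, 0 a negative one.

```python
# A minimal sketch of building a confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # actual status of 10 hypothetical passengers
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]   # model predictions for those passengers

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```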

Accuracy

We want our model to get both the positive and the negative cases right. Accuracy gives the fraction of predictions our model got right. Formally, accuracy is defined as follows:

Accuracy = number of correct predictions / total number of predictions = (TP + TN) / (TP + TN + FP + FN)

Now, suppose an average of 50,000 passengers pass through the airport per day, and ten of them are COVID-19 positive.

A trivial way to get high accuracy would be to classify every passenger as COVID-19 negative. The confusion matrix would then be:

The accuracy in this case is:

Accuracy = 49990 / 50000 = 0.9998, or 99.98%
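A quick numeric check of this calculation, using the counts from the scenario above (49,990 true negatives, 10 false negatives, and no predicted positives):

```python
# Accuracy of the "label everyone negative" strategy, using the counts from the example.
tn, fp, fn, tp = 49_990, 0, 10, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.9998, i.e. 99.98%
```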

Amazing! But wait: does this really serve our purpose of correctly identifying COVID-positive passengers?

In this particular example, where we were trying to flag passengers as COVID-19 positive or negative in order to catch the right ones, we achieved 99.98% accuracy simply by marking everyone as COVID-19 negative.

This naive approach scores higher accuracy than almost any real model would, yet it does not solve the problem. The goal here is to identify passengers who are COVID-19 positive. Accuracy is a misleading metric in this case, because it is easy to achieve a very high accuracy score without achieving what we actually care about.

So accuracy is not a good way to evaluate a model in this case. Let's look at a very popular measure called recall.

Recall (sensitivity or true positive rate)

Recall is the fraction of actual positives that the model correctly identifies: Recall = TP / (TP + FN).

This is an important measure: out of all the truly positive passengers, what fraction did we correctly identify? Going back to our earlier strategy of marking every passenger negative, the recall is zero.

Recall = 0/10 = 0

Recall is therefore a better measure in this case: it exposes the flawed strategy of labeling every passenger as COVID-19 negative by giving it a recall of zero. We want to maximize recall.
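Sticking with the same hypothetical counts, a tiny sketch of the recall calculation:

```python
# Recall of the "label everyone negative" strategy.
tp, fn = 0, 10

recall = tp / (tp + fn)
print(recall)  # 0.0 -- none of the 10 positive passengers were caught
```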

Now consider the opposite strategy: label every passenger as COVID-19 positive. Everyone who walks into the airport gets a positive label from the model. Putting a positive label on every passenger is bad, because the cost of actually testing every passenger before boarding is huge.

The confusion matrix is as follows:

The recall will be:

Recall = 10/(10+0) = 1

That's a problem. Accuracy turned out to be a bad measure, because labeling everyone negative maximizes it. We hoped recall would be a good measure instead, but labeling everyone positive maximizes recall in the same way.

So recall on its own is not a good measure either.

There is another measure, called precision.

Precision

Precision is the fraction of predicted positives that are actually positive: Precision = TP / (TP + FP).

For our second flawed strategy, marking every passenger as positive, the precision would be:

Precision = 10 / (10 + 49990) = 0.0002

While this faulty strategy has a perfect recall of 1, it has a terrible precision of 0.0002.

This shows that recall alone is not a good measure; we need to consider precision as well.
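And the corresponding sketch for precision under the label-everyone-positive strategy, using the same made-up counts as before:

```python
# Precision of the "label everyone positive" strategy.
tp, fp = 10, 49_990

precision = tp / (tp + fp)
print(precision)  # 0.0002
```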

Consider one more scenario (the last one, I promise :P): flag only the passengers at the top of the list, i.e., those with the highest predicted likelihood of having COVID, as positive. Say we flag only one passenger. In this case, the confusion matrix is:

Precision = 1 / (1 + 0) = 1

The precision is perfect, but let's check the recall:

Recall = 1 / (1 + 9) = 0.1

In this case, precision is high, but recall is low.

Scenario | Accuracy | Recall | Precision
Classify all passengers as negative | High | Low | Low (undefined)
Classify all passengers as positive | Low | High | Low
Flag only the top passenger as positive | High | Low | High
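The table can be reproduced with a short sketch that evaluates all three scenarios. The counts are the ones used above, and precision is reported as NaN when no passenger is predicted positive:

```python
# Precision, recall and accuracy for the three scenarios discussed above.
def summarize(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, recall, precision

scenarios = {
    "all negative":  (0, 0, 10, 49_990),
    "all positive":  (10, 49_990, 0, 0),
    "top passenger": (1, 0, 9, 49_990),
}
for name, counts in scenarios.items():
    acc, rec, prec = summarize(*counts)
    print(f"{name:>14}: accuracy={acc:.4f} recall={rec:.2f} precision={prec}")
```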

In some cases, we know we want to maximize either recall or precision at the expense of the other. In this passenger-screening case, we really want to catch the COVID-positive passengers, because missing a positive passenger is very costly: letting COVID-positive people through increases transmission. So we are more interested in recall.

Unfortunately, you can't have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall trade-off.

Precision/recall trade-off

Many classification models output a probability between 0 and 1. When categorizing passengers as COVID-19 positive or negative, we want to avoid missing passengers who are actually positive. In particular, if a passenger is positive but our model fails to identify them, that is very bad, because the virus could spread if they board the plane. So even a slight suspicion of COVID should be labeled as positive.

So our strategy is: if the predicted probability is greater than 0.3, we mark the passenger as COVID-19 positive.

This results in higher recall and lower precision.

Now consider the opposite: we only want to classify a passenger as positive when we are confident they are positive. We set the probability threshold to 0.9, that is, classify passengers as positive when the probability is at least 0.9, and negative otherwise. This results in higher precision and lower recall.

So in general, for most classifiers, changing the probability threshold trades recall against precision.
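As a minimal illustration of this trade-off, the sketch below applies the two thresholds to a small, made-up array of predicted probabilities; any model that outputs probabilities would behave the same way:

```python
import numpy as np

# Hypothetical predicted probabilities of being COVID-19 positive.
probs = np.array([0.05, 0.20, 0.35, 0.50, 0.75, 0.92, 0.95])

# A low threshold flags more passengers as positive (higher recall, lower precision)...
print((probs >= 0.3).astype(int))  # [0 0 1 1 1 1 1]

# ...while a high threshold flags only the most confident cases (higher precision, lower recall).
print((probs >= 0.9).astype(int))  # [0 0 0 0 0 1 1]
```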

If you need to compare models with different precision and recall values, it is often convenient to combine precision and recall into a single metric: one that takes both into account when measuring performance.

F1 score

The F1 score is defined as the harmonic mean of the model's precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).

You may be wondering why we use the harmonic mean rather than the simple average. Unlike the simple average, the harmonic mean is dominated by the smaller of the two values, so a classifier only gets a high F1 score when both precision and recall are high.

For example, a model with a precision of 1 and a recall of 0 has a simple average of 0.5 but an F1 score of 0. If either value is low, the F1 score is low, no matter how high the other is. The F1 score therefore favors classifiers with similar precision and recall.

So if you are looking for a balance between precision and recall, the F1 score is a better yardstick.
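A small sketch of the harmonic-mean behaviour described above; the helper function below is written out only for illustration (scikit-learn's f1_score computes the same quantity from label arrays):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; 0 if either is 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(1.0, 0.0))   # 0.0   -- one bad metric drags the score down
print(f1_score(0.8, 0.6))   # ~0.686 -- balanced values give a reasonable score
print((1.0 + 0.0) / 2)      # 0.5   -- the simple average hides the zero recall
```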

AUC/ROC curve

The ROC curve is another common evaluation tool. It shows the sensitivity and specificity of the model at every possible decision threshold between 0 and 1. For classification problems with probability outputs, a threshold converts the probability into a class label, so changing the threshold changes the numbers in the confusion matrix. The important question is: how do you find the right threshold?

For each possible threshold, the ROC curve plots the false positive rate against the true positive rate.

False positive rate (FPR): the proportion of negative instances that are misclassified as positive, FPR = FP / (FP + TN).

True positive rate (TPR): the proportion of positive instances that are correctly predicted as positive, TPR = TP / (TP + FN). This is the same as recall.

The threshold can be set anywhere we like. Consider a low threshold, say 0.1: any passenger with a predicted probability below 0.1 is classified as negative, and anyone above 0.1 as positive. With such a low bar, most passengers are flagged positive, so both TPR and FPR are high.

If instead you set the bar high, say 0.9, few passengers are flagged positive, so both rates are low.

The ROC curve of the model, traced out by varying the threshold, is shown below.

As the figure shows, the true positive rate rises quickly at first, but beyond a certain threshold the gains taper off. Every increase in TPR comes at a price: an increase in FPR. In the initial stage, TPR increases faster than FPR.

Therefore, we can choose a threshold with high TPR and low FPR.
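As a sketch of how this choice could be made in code, scikit-learn's roc_curve returns the TPR/FPR pair for every candidate threshold; one common heuristic (Youden's J statistic, not discussed in the original article) picks the threshold that maximizes TPR minus FPR. The labels and probabilities below are made up:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J statistic: the threshold where TPR - FPR is largest.
best = np.argmax(tpr - fpr)
print(f"threshold={thresholds[best]:.2f}  TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")
```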

Now, let’s see what the different values of TPR and FPR tell us about this model.

Different models have different ROC curves. How do you compare them? As the graph above shows, the curve closest to the top-left corner belongs to the better model. One way to compare classifiers is to measure the area under the ROC curve (AUC).

AUC (Model 1) > AUC (Model 2) > AUC (Model 3)

So model 1 is the best.
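A minimal sketch of comparing two models by AUC with scikit-learn; the labels and score arrays below are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Hypothetical probability outputs from two different models.
model_1_scores = np.array([0.1, 0.3, 0.8, 0.9, 0.2, 0.7, 0.4, 0.6])
model_2_scores = np.array([0.4, 0.5, 0.6, 0.7, 0.6, 0.3, 0.2, 0.9])

auc_1 = roc_auc_score(y_true, model_1_scores)
auc_2 = roc_auc_score(y_true, model_2_scores)
print(auc_1, auc_2)  # the model with the larger AUC separates the classes better
```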

Conclusion

We looked at the different metrics used to evaluate classification models. Which metric to use, and when, depends largely on the nature of the problem. So go back to your model, ask yourself what the main goal is, choose the right metric, and evaluate your model.

The original link: www.analyticsvidhya.com/blog/2020/1…
