Author | PROCRASTINATOR  Compiled by | VK  Source | Analytics Vidhya

Overview

  • Understand how class weight optimization works and how to implement it with sklearn in logistic regression or any other algorithm

  • Learn how to overcome class-imbalanced data by modifying class weights, without using any sampling method

Introduction

A classification problem in machine learning is one where, given some input (independent variables), we have to predict a discrete target. The distribution of the discrete values can be very uneven. Because of this difference between the classes, an algorithm tends to get biased toward the majority class and handles the minority class poorly.

This difference in class frequencies affects the overall predictive power of the model.

It's not hard to get good accuracy on such problems, but that doesn't mean the model is good. We need to check whether the performance of these models carries any business meaning or value. That's why it's essential to understand the problem and the data, so you can use the right metric and optimize it with an appropriate method.

Table of contents

  • What is class imbalance?

  • Why deal with class imbalance?

  • What are class weights?

  • Class weights in logistic regression

  • Python implementation

    • Simple logistic regression
    • Weighted logistic regression ('balanced')
    • Weighted logistic regression (manual weights)
  • Tips for further improving your score

What is class imbalance?

Class imbalance is a problem that arises in machine learning classification tasks. It simply means that the frequencies of the target classes are highly unbalanced, that is, one class appears far more often than the other classes present. In other words, the target is biased toward the majority class.

Suppose we consider a binary classification problem in which the majority class has 10,000 samples and the minority class has only 100. In this case, the ratio is 100:1, meaning that for every 100 samples of the majority class, there is only one sample of the minority class. This is what we call class imbalance. Typical areas where we find such data are fraud detection, churn prediction, medical diagnosis, email classification, and so on.

We will work with a dataset from the medical domain to understand class imbalance properly. Here, we have to predict whether a person will suffer a heart stroke based on the given attributes (independent variables). To skip the data cleansing and preprocessing, we use a cleaned version of the data.

In the plot below, you can see the distribution of the target variable.

# Plot a bar chart of the target
# (assumes the cleaned dataset is already loaded into a pandas DataFrame called `data`)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
g = sns.barplot(x=data['stroke'], y=data['stroke'], palette='Set1',
                estimator=lambda x: len(x) / len(data))

# Annotate the bars with percentages
for p in g.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    g.text(x + width / 2,
           y + height,
           '{:.0%}'.format(height),
           horizontalalignment='center', fontsize=15)

# Set the labels
plt.xlabel('Heart Stroke', fontsize=14)
plt.ylabel('Percentage', fontsize=14)
plt.title('Percentage of patients who will/will not have a heart stroke', fontsize=16)

Here,

0: the patient did not have a heart stroke.

1: the patient had a heart stroke.

As you can see from the distribution, only 2% of the patients had a heart stroke, so this is a classic case of class imbalance.

Why deal with class imbalance?

So far, we have developed an intuition for class imbalance. But why do we need to overcome this problem, and what goes wrong if we model with this data as it is?

Most machine learning algorithms assume that the data is evenly distributed across classes. With class imbalance, the broader problem is that the algorithm becomes biased toward predicting the majority class (no heart stroke in our case). The algorithm does not have enough data to learn the patterns of the minority class (heart stroke).

Let’s take a real life example to understand this better.

Suppose you have moved from your hometown to a new city, and you have lived in the new city for a month. In your hometown, you are thoroughly familiar with every place: your home, the routes, the important shops, the tourist spots, and so on, because you spent your whole childhood there.

But in the new city, you don't have a clear idea of where everything is, and the chances of taking the wrong route and getting lost are very high. Here, your hometown is the majority class and the new city is the minority class.

The same thing happens with class imbalance. The model doesn't have enough information about the minority class, which is why the minority class ends up with a high misclassification error.

Note: To evaluate model performance, we will use the F1 score as the metric, not accuracy.

The reason is that even if we build a dumb model that predicts zero (no heart stroke) for every new observation, we will still get very high accuracy, because the model is biased toward the majority class.

Here, the model is very accurate but of no value to our problem statement. This is why we will use the F1 score as the evaluation metric. The F1 score is simply the harmonic mean of precision and recall. In general, the metric is chosen based on the business problem and the type of error we want to reduce, but the F1 score is the key metric for class imbalance problems.

Here is the formula for f1 scores:

f1 score = 2*(precision*recall)/(precision+recall)
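To make the formula concrete, here is a minimal sketch (using made-up labels and predictions, not our heart-stroke data) that computes precision, recall, and F1 with sklearn and checks them against the formula:

# A minimal sketch (toy labels, not the heart-stroke data) verifying the F1 formula
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # imbalanced toy target
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]   # toy predictions

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)

print('precision:', p, 'recall:', r)
print('f1 (sklearn):', f1_score(y_true, y_pred))
print('f1 (formula):', 2 * (p * r) / (p + r))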

Let's confirm this by training a model that always predicts the mode of the target variable and checking the scores we get:

# Train a baseline model that always predicts the mode of the target
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

pred_test = []
for i in range(0, 13020):   # one prediction (the training mode) for each of the 13,020 test rows
    pred_test.append(y_train.mode()[0])

# Print the accuracy and F1 scores
print('The accuracy for the mode model is:', accuracy_score(y_test, pred_test))
print('The f1 score for the mode model is:', f1_score(y_test, pred_test))

# Plot the confusion matrix (the conf_matrix helper is defined later, in the Python implementation section)
conf_matrix(y_test, pred_test)

The accuracy for the mode model is: 0.9819508448540707

The f1 score for the mode model is: 0.0

Here, the model has an accuracy of 0.98 on the test data, which looks like a good score. The F1 score, on the other hand, is zero, which indicates that the model performs poorly on the minority class. We can confirm this by looking at the confusion matrix.

The model predicts 0 (no heart stroke) for every patient. According to this model, no matter what symptoms a patient has, he or she will never have a heart stroke. Does it make sense to use this model?

Now that we know what class imbalance is and how it affects the performance of our model, we’ll shift our focus to what class weights are and how they can help improve model performance.

What are class weights?

Most machine learning algorithms do not handle biased class data well on their own. However, we can modify the existing training algorithm to take the skewed class distribution into account. This is done by giving different weights to the majority and minority classes. During training, the difference in weights influences how the classes are treated. The overall purpose is to penalize misclassification of the minority class by assigning it a higher class weight, while reducing the weight of the majority class.

To illustrate this more clearly, we will return to the city example we considered earlier.

Think of it this way: over the last month in the new city, instead of only going out when you needed to, you spent the whole month exploring it. Spending more time learning the city's routes and locations helps you get to know it better and reduces the chances of getting lost. This is exactly how class weights work.

During training, we give more weight to the minority class in the algorithm's cost function, so that it imposes a higher penalty on minority-class errors and the algorithm focuses on reducing them.

Note: There is a limit to how far you should increase the minority-class weight and decrease the majority-class weight. If you give a very high weight to the minority class, the algorithm is likely to become biased toward it, and the errors on the majority class will increase.

Most sklearn classifiers, and even some boosting libraries such as LightGBM and CatBoost, have a built-in parameter class_weight, which helps us optimize the score for the minority class, as we have just discussed.

By default, class_weight is None, meaning both classes receive equal weight. Alternatively, we can pass 'balanced' or a dictionary containing manual weights for the two classes.

When class_weight='balanced', the model automatically assigns class weights inversely proportional to the respective class frequencies.

To be more precise, the formula is:

wj = n_samples / (n_classes * n_samples_j)

Here,

  • wj is the weight of class j

  • n_samples is the total number of samples or rows in the dataset

  • n_classes is the total number of unique classes in the target

  • n_samples_j is the total number of rows belonging to class j

For our heart-stroke case:

n_samples = 43400, n_classes = 2 (0 and 1), n_samples_0 = 42617, n_samples_1 = 783

Weight of class 0:

w0 = 43400 / (2 * 42617) = 0.509

Weight of class 1:

w1 = 43400 / (2 * 783) = 27.713
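We can cross-check these numbers with sklearn's compute_class_weight helper. A minimal sketch, assuming a toy target vector rebuilt from the counts above (42,617 zeros and 783 ones):

# Verify the 'balanced' weights with sklearn (toy target built from the counts above)
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 42617 + [1] * 783)   # 43,400 samples in total
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

print(dict(zip([0, 1], weights)))   # roughly {0: 0.509, 1: 27.713}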

I hope this makes it a little clearer how class_weight='balanced' gives a higher weight to the minority class and a lower weight to the majority class.

While passing 'balanced' yields good results in most cases, for extreme class imbalance we can try engineering the weights ourselves. Later, we'll see how to find the best class-weight values in Python.

Class weights in logistic regression

We can modify any machine learning algorithm by adding different class weights to its cost function, but here we will focus specifically on logistic regression.

For logistic regression, we use log loss as the cost function. We do not use mean squared error as the cost function for logistic regression, because we use the sigmoid curve as the prediction function instead of a fitted line.

Plugging the sigmoid function into mean squared error produces a non-convex cost curve with many local minima, which makes it very difficult for gradient descent to converge to the global minimum. Log loss, however, is a convex function, so there is only a single minimum to converge to.

The log loss formula:

Log loss = -(1/N) * Σ [yi * log(ŷi) + (1 - yi) * log(1 - ŷi)]

Here,

  • N is the number of observations

  • yi is the actual value (class) of the target

  • ŷi is the predicted probability of the target class
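As a quick sanity check of the formula, the sketch below (with made-up probabilities, not our data) compares a hand-computed log loss against sklearn.metrics.log_loss:

# A minimal sketch (toy values) comparing the formula with sklearn's log_loss
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])            # actual classes
y_prob = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probabilities of class 1

manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print('manual log loss :', manual)
print('sklearn log loss:', log_loss(y_true, y_prob))   # should match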

Let's form a dummy table containing the actual values, the predicted probabilities, and the cost calculated with the log loss formula:

In this table, we have 10 observations: 9 from class 0 and 1 from class 1. The next column gives the predicted probability for each observation. Finally, applying the log loss formula, we get the cost (penalty).

After adding the weights into the cost function, the modified log loss function is:

Log loss = -(1/N) * Σ [w1 * yi * log(ŷi) + w0 * (1 - yi) * log(1 - ŷi)]

Here,

w0 is the class weight for class 0

w1 is the class weight for class 1

Now we’ll add the weight to see what effect it has on the cost penalty.

For the weight values, we will use the class_weight='balanced' formula.

w0 = 10 / (2 * 9) = 0.55

w1 = 10 / (2 * 1) = 5

Calculate the cost of the first value in the table:

Cost = -(5 * (0 * log(0.32)) + 0.55 * ((1 - 0) * log(1 - 0.32))) = -(0 + 0.55 * log(0.68)) = -(0.55 * (-0.386)) ≈ 0.21

Similarly, we can calculate the weighted cost for each observation, and the updated table looks like this:

From the table, we can see that a smaller weight is applied to the cost terms of the majority class, which produces a smaller error value and therefore smaller updates to the model coefficients. A larger weight applied to the cost terms of the minority class produces a larger error, which in turn leads to larger updates to the model coefficients. In this way, we can shift the bias of the model so that it reduces the errors on the minority class.

Conclusion:

Smaller weights result in smaller penalties and small updates to the model coefficients

Larger weights result in larger penalties and large updates to the model coefficients
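A minimal sketch of this effect, reproducing the first row of the table above (actual class 0, predicted probability 0.32) with the weights w0 = 0.55 and w1 = 5:

# Weighted log loss for a single observation (first row of the table above)
import numpy as np

w0, w1 = 0.55, 5        # 'balanced' weights for the 10-observation example
y, p = 0, 0.32          # actual class and predicted probability of class 1

unweighted = -(y * np.log(p) + (1 - y) * np.log(1 - p))
weighted = -(w1 * y * np.log(p) + w0 * (1 - y) * np.log(1 - p))

print('unweighted cost:', round(unweighted, 3))   # ~0.386
print('weighted cost  :', round(weighted, 3))     # ~0.212: the majority class is penalized less

# For a minority-class observation (y = 1), the cost would instead be scaled by w1 = 5.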

Python implementation

Here, we will use the same heart-stroke data for our predictions. First, we will train a simple logistic regression, then we will implement weighted logistic regression with class_weight='balanced'. Finally, we will try to find the best value for the class weights using grid search. The metric we will try to optimize is the F1 score.

1. Simple logistic regression

Here, we use the sklearn library to train our model with the default logistic regression settings. By default, the algorithm assigns equal weight to both classes.

# Import and train the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='newton-cg')
lr.fit(x_train, y_train)

# Predict on the test data
pred_test = lr.predict(x_test)

# Calculate and print the F1 score
f1_test = f1_score(y_test, pred_test)
print('The f1 score for the testing data:', f1_test)

# Function to plot a confusion matrix
def conf_matrix(y_test, pred_test):

    # Create the confusion matrix
    con_mat = confusion_matrix(y_test, pred_test)
    con_mat = pd.DataFrame(con_mat, range(2), range(2))

    plt.figure(figsize=(6, 6))
    sns.set(font_scale=1.5)
    sns.heatmap(con_mat, annot=True, annot_kws={"size": 16}, fmt='g', cmap='Blues', cbar=False)

# Call the function
conf_matrix(y_test, pred_test)

Test data F1 score: 0.0

With the simple logistic regression model, the F1 score is 0. Looking at the confusion matrix, we can confirm that our model predicts "no heart stroke" for every observation. This model is no better than the mode model we created earlier. Let's try adding some weight to the minority class and see whether that helps.

2. Weighted logistic regression (class_weight='balanced'):

We add the class_weight parameter to the logistic regression algorithm and pass the value 'balanced'.

# Import and train the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='newton-cg', class_weight='balanced')
lr.fit(x_train, y_train)

# Predict on the test data
pred_test = lr.predict(x_test)

# Calculate and print the F1 score
f1_test = f1_score(y_test, pred_test)
print('The f1 score for the testing data:', f1_test)

# Plot the confusion matrix
conf_matrix(y_test, pred_test)

Test data F1 score: 0.10098851188885921

By adding a single class_weight parameter to the logistic regression function, we improved the F1 score from 0 to about 0.10. We can see in the confusion matrix that the model now captures class 1 (heart stroke) much better, although the misclassification of class 0 (no heart stroke) has increased.

Can we improve the metric further by changing the class weights?

3. Weighted logistic regression (manual class weights):

Finally, we try to find the weights with the best score using grid search. We will search for weights between 0 and 1. The idea is that if we give the minority class a weight of n, the majority class gets a weight of 1 - n.

In this case, the weight magnitudes are not very large, but the ratio between the majority-class and minority-class weights can be very high.

For example:

w1 = 0.95

w0 = 1 - 0.95 = 0.05

w1:w0 = 19:1

As a result, errors on the minority class will be penalized 19 times more heavily than errors on the majority class.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
lr = LogisticRegression(solver='newton-cg')

# Set the range of class weights
weights = np.linspace(0.0, 0.99, 200)

# Create the parameter grid for the grid search
param_grid = {'class_weight': [{0: x, 1: 1.0 - x} for x in weights]}

# Fit the training data with 5-fold stratified grid search
gridsearch = GridSearchCV(estimator=lr,
                          param_grid=param_grid,
                          cv=StratifiedKFold(),
                          n_jobs=-1,
                          scoring='f1',
                          verbose=2).fit(x_train, y_train)

# Plot the score for different weight values
sns.set_style('whitegrid')
plt.figure(figsize=(12, 8))
weigh_data = pd.DataFrame({'score': gridsearch.cv_results_['mean_test_score'], 'weight': (1 - weights)})
sns.lineplot(x=weigh_data['weight'], y=weigh_data['score'])
plt.xlabel('Weight for class 1')
plt.ylabel('F1 score')
plt.xticks([round(i / 10, 1) for i in range(0, 11, 1)])
plt.title('Scoring for different class weights', fontsize=24)

From the plot, we can see that the F1 score peaks when the weight for the minority class is around 0.93.

Through grid search, we get the best class weights: 0.06467 for class 0 (the majority class) and 0.93532 for class 1 (the minority class).
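These values can be read directly from the fitted GridSearchCV object of the previous block; a small sketch (the commented output simply restates the weights reported above):

# Read the best class weights and cross-validated F1 score from the fitted grid search
print('Best class weights:', gridsearch.best_params_['class_weight'])
# e.g. {0: 0.06467336683417085, 1: 0.9353266331658292}
print('Best mean CV F1 score:', gridsearch.best_score_)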

Now that we have obtained the best class weights using stratified cross-validation and grid search, let's see how the model performs on the test data.

# Import and train the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='newton-cg',
                        class_weight={0: 0.06467336683417085, 1: 0.9353266331658292})
lr.fit(x_train, y_train)

# Predict on the test data
pred_test = lr.predict(x_test)

# Calculate and print the F1 score
f1_test = f1_score(y_test, pred_test)
print('The f1 score for the testing data:', f1_test)

# Plot the confusion matrix
conf_matrix(y_test, pred_test)

Test data F1 score: 0.15714644

By manually tuning the weight values, we improved the F1 score by roughly a further 6 percentage points. The confusion matrix also shows that, compared with the previous model, we predict class 0 better at the cost of more misclassification on class 1. It all depends on the business problem and the type of errors you want to reduce. Here, our focus was on improving the F1 score, and we were able to do that just by tweaking the class weights.

Tips for further improving your score

Feature engineering: for simplicity, we used only the given independent variables. You can try creating new features

Adjust the threshold: by default, all algorithms use a threshold of 0.5. You can try different threshold values and find the best one using grid search or randomized search (a minimal sketch follows after these tips)

Use advanced algorithms: for this walkthrough we used only logistic regression. You can try different bagging and boosting algorithms, and finally you can also try blending several algorithms
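For the threshold tip above, here is a minimal illustrative sketch (not part of the original walkthrough) that scans thresholds over the weighted model's predicted probabilities and picks the one with the best F1 score:

# Scan decision thresholds on the predicted probabilities of the fitted model
import numpy as np
from sklearn.metrics import f1_score

probs = lr.predict_proba(x_test)[:, 1]   # probability of class 1 (heart stroke)
thresholds = np.linspace(0.01, 0.99, 99)

scores = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print('Best threshold:', round(best_t, 2), 'with F1:', round(max(scores), 4))
# (ideally, tune the threshold on a validation set rather than the test set)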

Conclusion

I hope this article has given you an idea of how class weights help deal with class imbalance, and how easy they are to implement in Python.

Although we discussed class weights only for logistic regression, the idea is the same for any other algorithm; what changes is just the cost function that each algorithm uses to minimize the error and optimize results for the minority class.

Original article: www.analyticsvidhya.com/blog/2020/1…
