It has been a long time since my last article, and I am sure many readers have been disappointed by the wait; I apologize to everyone. The main reason for the delay is that I have been writing books on Python and R. I am excited to report that the book on data analysis and mining with Python is now complete, and I will continue writing the R material next. I hope readers will understand, and again, I am sorry for the late arrival of this new article.

The topic of this article is the handling of imbalanced data, a common situation in data mining. It covers the causes of and remedies for class imbalance, the principles behind them, and how to use Python's tools to rebalance a dataset.

Introduction to the SMOTE algorithm

In practice, readers may run into a thorny problem in classification tasks: the categorical dependent variable can be severely skewed, that is, the proportions between classes are badly out of balance. For example, in fraud detection, fraudulent observations make up only a small share of the sample; in customer churn problems, disloyal customers are often a small minority; and in response to a marketing campaign, only a small number of customers actually participate.

When the data are significantly imbalanced, predictions tend to be biased toward the class with more observations. How should this be handled? The simplest and crudest approach is to force a 1:1 ratio, either by discarding part of the majority class (undersampling) or by Bootstrap sampling the minority class with replacement (oversampling). Both have problems: undersampling throws away data and loses the implicit information it carries, while oversampling by simple copying with replacement tends to make the model overfit.
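To see why simple oversampling with replacement invites overfitting, here is a tiny illustrative sketch (plain Python with hypothetical data): every "new" observation is an exact copy of an existing one, so the model sees the same points over and over.

```python
import random

random.seed(42)
minority = ['a', 'b', 'c']  # three minority-class observations

# Bootstrap oversampling: draw with replacement until the class has 9 samples
oversampled = [random.choice(minority) for _ in range(9)]

# Every element is an exact duplicate of one of the three originals;
# no new information has been added to the data.
print(oversampled)
```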

To address the imbalance problem, Chawla proposed the SMOTE algorithm in 2002. SMOTE stands for Synthetic Minority Oversampling Technique, an improvement on the random oversampling scheme. It is now a standard method for handling imbalanced data and is widely recognized in both academia and industry. Its theoretical idea is briefly described next.

The basic idea of SMOTE is to analyze the minority-class samples and add artificially synthesized new samples to the dataset, thereby relieving the extreme imbalance in the original data. The algorithm uses the KNN technique, and the steps for synthesizing new samples are as follows:

1. For each minority-class sample xi, compute the distance from xi to every other minority-class sample and obtain its K nearest neighbors.
2. Set a sampling multiple N according to the degree of imbalance; for each minority sample xi, randomly select N samples from its K nearest neighbors.
3. For each selected neighbor xj, construct a new sample by random linear interpolation between xi and xj, and add the new samples to the original dataset.

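The steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the imblearn code; the function name smote_sample and the toy data are my own.

```python
import numpy as np

def smote_sample(X_minority, n_new, k=5, seed=1234):
    """Generate n_new synthetic minority samples by random interpolation."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)
    new_samples = []
    for _ in range(n_new):
        i = rng.randint(n)                       # pick a minority sample x_i
        d = np.linalg.norm(X - X[i], axis=1)     # distances to all minority samples
        neighbors = np.argsort(d)[1:k + 1]       # its k nearest neighbors (skip x_i itself)
        j = rng.choice(neighbors)                # randomly pick one neighbor x_j
        r = rng.rand()                           # random weight in (0, 1)
        new_samples.append(X[i] + r * (X[j] - X[i]))  # interpolate between x_i and x_j
    return np.array(new_samples)

# six minority-class points in two dimensions (toy data)
minority = np.array([[2, 3], [1, 1], [2, 1], [3, 2], [1, 2], [3, 3]])
synthetic = smote_sample(minority, n_new=4, k=3)
print(synthetic.shape)  # (4, 2)
```

Because each synthetic point lies on a segment between two real minority points, it always falls inside the region the minority class already occupies.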
To understand how SMOTE generates a new sample, consider the following figure, which illustrates the construction of artificial new samples:



As the figure shows, the solid dots far outnumber the five-pointed stars. To use SMOTE to synthesize additional points for the minority class, the following steps are required:

1. Take a minority-class sample, such as the point x1 marked in the figure.
2. Find its K nearest neighbors among the minority-class samples (K = 5 in the figure).
3. Randomly select one or more of those neighbors.
4. Construct a new point on the line segment between x1 and each selected neighbor.
For each randomly selected sample point, a new sample point is constructed using the following formula:

x_new = x_i + rand(0, 1) × (x_j − x_i)
Here, xi denotes a minority-class sample point (the sample x1 represented by the five-pointed star in the figure); xj denotes the sample point j randomly selected from its K nearest neighbors; and rand(0, 1) is a random number between 0 and 1.

Suppose the observed values of sample point x1 in the figure are (2, 3, 10, 7), and two sample points are randomly selected from its five nearest neighbors, with observed values (1, 1, 5, 8) and (2, 1, 7, 6). Substituting into the formula, with r1 and r2 the two random draws from (0, 1), the two new sample points are:

x_new1 = (2, 3, 10, 7) + r1 × ((1, 1, 5, 8) − (2, 3, 10, 7)) = (2 − r1, 3 − 2·r1, 10 − 5·r1, 7 + r1)

x_new2 = (2, 3, 10, 7) + r2 × ((2, 1, 7, 6) − (2, 3, 10, 7)) = (2, 3 − 2·r2, 10 − 3·r2, 7 − r2)

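The arithmetic can be checked with a short NumPy sketch. Since the article does not show which random numbers were drawn, r1 = 0.3 and r2 = 0.6 below are assumed values for illustration only.

```python
import numpy as np

x1 = np.array([2, 3, 10, 7])   # minority sample from the example
xj1 = np.array([1, 1, 5, 8])   # first randomly chosen neighbor
xj2 = np.array([2, 1, 7, 6])   # second randomly chosen neighbor

r1, r2 = 0.3, 0.6              # assumed random numbers in (0, 1)
new1 = x1 + r1 * (xj1 - x1)    # -> approximately [1.7, 2.4, 8.5, 7.3]
new2 = x1 + r2 * (xj2 - x1)    # -> approximately [2.0, 1.8, 8.2, 6.4]
print(new1)
print(new2)
```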

Oversampling with SMOTE is not too difficult: you can write a custom sampling function following the steps above. Alternatively, readers can use the imblearn module, whose over_sampling submodule provides the SMOTE class for generating new samples. The syntax and parameters of this class are as follows:

SMOTE(ratio='auto', random_state=None, k_neighbors=5, m_neighbors=10, out_step=0.5, kind='regular', svm_estimator=None, n_jobs=1)

ratio: the sampling ratio; the default 'auto' oversamples every minority class until it matches the size of the majority class.

random_state: the random seed; set it to a fixed value for reproducible results.

k_neighbors: the number of nearest neighbors used to synthesize new samples; the default is 5.

m_neighbors: the number of neighbors used to decide whether a sample is "in danger" when kind is 'borderline1', 'borderline2', or 'svm'; the default is 10.

out_step: the step size used for extrapolation when kind='svm'; the default is 0.5.

kind: the SMOTE variant to use: 'regular' (the default), 'borderline1', 'borderline2', or 'svm'.

svm_estimator: the support vector machine classifier used when kind='svm'.

n_jobs: the number of CPU cores used for the computation; the default is 1.
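As a sketch of what ratio='auto' implies, the following pure-Python bookkeeping (my own helper, not imblearn code) computes how many synthetic samples each minority class would need in order to reach the size of the majority class:

```python
from collections import Counter

def auto_target_counts(y):
    """How many synthetic samples each minority class needs to reach
    the majority class size (the bookkeeping behind ratio='auto')."""
    counts = Counter(y)
    majority = max(counts.values())
    return {cls: majority - n for cls, n in counts.items() if n < majority}

# 90 retained customers vs. 10 churned customers
y = ['no'] * 90 + ['yes'] * 10
print(auto_target_counts(y))  # {'yes': 80}: 80 synthetic 'yes' samples needed
```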

Practical application of classification algorithm

The dataset shared here comes from the customer transaction history of a German telecom company. It contains 4,681 records and 19 variables. The dependent variable churn is binary, with yes indicating a churned customer and no indicating a retained customer. The independent variables include whether the customer subscribes to an international calling plan or a voicemail plan, the number of text messages, call charges, call frequency, and so on. Using this dataset, we explore the effect of balancing the imbalanced data.

# import the required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import ensemble
from sklearn import metrics
from imblearn.over_sampling import SMOTE

# read the data
churn = pd.read_excel(r'C:\Users\Administrator\Desktop\Customer_Churn.xlsx')
churn.head()




# set a font that can render the Chinese labels
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
# set an equal aspect ratio so the pie is drawn as a circle
plt.axes(aspect='equal')
# count the two classes
counts = churn.churn.value_counts()
# draw the pie chart
plt.pie(x=counts,  # the data to plot
        labels=pd.Series(counts.index).map({'yes': 'churned', 'no': 'not churned'}),  # text labels
        autopct='%.2f%%')  # percentage format with two decimal places
# show the figure
plt.show()



As the pie chart shows, churned users account for only 8.3% of the sample, far fewer than retained users. The two classes can therefore be considered imbalanced, and modeling such data directly may produce inaccurate results. It is advisable to first build a random forest model on these data and see whether the predictions are biased.

The state and area_code variables in the original table encode the "state" and "area" the user belongs to; intuitively, they are unlikely to be important drivers of churn, so both are dropped from the table. In addition, international_plan and voice_mail_plan, which record whether the user subscribes to those services, are character-valued binary variables; they cannot be fed to the model directly and must first be converted to 0-1 values.

# drop the state and area_code variables
churn.drop(labels=['state', 'area_code'], axis=1, inplace=True)
# convert the binary variables international_plan and voice_mail_plan to 0-1 values
churn.international_plan = churn.international_plan.map({'no': 0, 'yes': 1})
churn.voice_mail_plan = churn.voice_mail_plan.map({'no': 0, 'yes': 1})
churn.head()




The table above shows the cleaned data. Next, the dataset is split into a training set and a test set; the classifier is built on the training set and validated on the test set:

# the predictors are all columns except the dependent variable
predictors = churn.columns[:-1]
# split the data into training and test sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    churn[predictors], churn.churn, random_state=12)
# build a random forest
rf = ensemble.RandomForestClassifier(n_estimators=300)
rf.fit(X_train, y_train)
# predict on the test set
pred = rf.predict(np.array(X_test))
# overall prediction accuracy of the model
print(metrics.accuracy_score(y_test, pred))
# per-class evaluation report
print(metrics.classification_report(y_test, pred))



As the results show, the model's overall accuracy exceeds 93%. However, the recall for the no class is 97%, while the recall for yes is only 62%. The large gap between the two confirms that the classifier favors the class with more samples (no).
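A toy calculation (hypothetical counts, not this dataset's actual confusion matrix) shows how overall accuracy can stay high while minority-class recall collapses:

```python
# Hypothetical confusion counts for a 1000-sample test set with 8% churn:
tn, fp = 900, 20    # 'no' class: 920 actual, 900 correctly identified
fn, tp = 50, 30     # 'yes' class: 80 actual, only 30 correctly identified

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_no = tn / (tn + fp)
recall_yes = tp / (tp + fn)
print(round(accuracy, 3))    # 0.93  -- looks excellent
print(round(recall_no, 3))   # 0.978 -- majority class is well covered
print(round(recall_yes, 3))  # 0.375 -- most churners are missed
```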

# compute the predicted churn probability, used to build the ROC curve
y_score = rf.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test.map({'no': 0, 'yes': 1}), y_score)
# compute the AUC
roc_auc = metrics.auc(fpr, tpr)
# shade the area under the ROC curve
plt.stackplot(fpr, tpr, color='steelblue', alpha=0.5, edgecolor='black')
# draw the ROC curve itself
plt.plot(fpr, tpr, color='black', lw=1)
# add the diagonal reference line
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
# add a text label with the AUC value
plt.text(0.5, 0.3, 'ROC curve (area = %0.3f)' % roc_auc)
# axis labels
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
# show the figure
plt.show()
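For reference, the AUC reported by metrics.auc is simply the trapezoidal-rule area under the (fpr, tpr) points; a minimal sketch with made-up ROC points:

```python
import numpy as np

# hypothetical ROC points (fpr must be increasing)
fpr = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.5, 0.7, 0.9, 1.0])

# trapezoidal rule: sum of interval width times average height
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
print(round(auc, 3))  # 0.765
```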




As the figure shows, the area under the ROC curve is 0.795. Since the AUC is below 0.8, the model is judged not yet reasonable. (AUC is commonly compared against 0.8; a value above 0.8 is taken to indicate a reasonable model.) Next, the SMOTE algorithm is applied to the data:

# run the SMOTE algorithm on the training data
over_samples = SMOTE(random_state=1234)
over_samples_X, over_samples_y = over_samples.fit_sample(X_train, y_train)
# class proportions before oversampling
print(y_train.value_counts() / len(y_train))
# class proportions after oversampling
print(pd.Series(over_samples_y).value_counts() / len(over_samples_y))




As the output shows, the class proportions in the original training set still differ greatly, but after applying SMOTE the two classes reach a 1:1 balance. A random forest classifier can now be rebuilt on the balanced data:

# build a random forest model on the balanced data
rf2 = ensemble.RandomForestClassifier(n_estimators=300)
rf2.fit(over_samples_X, over_samples_y)
# predict on the test set
pred2 = rf2.predict(np.array(X_test))
# overall prediction accuracy
print(metrics.accuracy_score(y_test, pred2))
# per-class evaluation report
print(metrics.classification_report(y_test, pred2))




As the results show, after remodeling on the balanced data the accuracy is still very high at 92.6% (only about 1% lower than the model built on the original imbalanced data), while the recall for yes rises by 10 percentage points to 72%. This is the benefit that balancing brings.

# compute the predicted churn probability, used to build the ROC curve
y_score = rf2.predict_proba(np.array(X_test))[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test.map({'no': 0, 'yes': 1}), y_score)
# compute the AUC
roc_auc = metrics.auc(fpr, tpr)
# shade the area under the ROC curve
plt.stackplot(fpr, tpr, color='steelblue', alpha=0.5, edgecolor='black')
# draw the ROC curve itself
plt.plot(fpr, tpr, color='black', lw=1)
# add the diagonal reference line
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
# add a text label with the AUC value
plt.text(0.5, 0.3, 'ROC curve (area = %0.3f)' % roc_auc)
# axis labels
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
# show the figure
plt.show()




The final AUC is 0.836, so the model can now be considered reasonably sound.


The original article was published at: 2018-05-13

Author: Liu Shunxiang

This article is from “Data THU”, a partner of the cloud community. For more information, follow “Data THU”.