In classification problems, data imbalance means that the number of samples in one class is much larger than the number in the other classes. The imbalance problem appears more often in binary classification than in multi-class classification. For example, in banking and finance data, most credit card accounts are in a normal state, and only a few show abnormal activity such as fraudulent use.

Algorithms cannot obtain enough information from an unbalanced data set to make accurate predictions for the minority class. It is therefore recommended to train on a balanced classification data set.

In this article, we will discuss how to use R to deal with imbalanced classification problems.

Introduction to the data set

The data set used in this article is a credit card transaction data set with 284,807 transactions in total, spread across 31 columns, of which 492 are credit card fraud records.
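As a minimal sketch of loading the data (the file name creditcard.csv is an assumption; use the path where you saved the download):

# Load the credit card data set; "creditcard.csv" is an assumed file name
creditcard_details <- read.csv("creditcard.csv")

# 284,807 observations of 31 variables: Time, V1-V28, Amount, Class
str(creditcard_details)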

Data columns

  • Time: the time (in seconds) between the transaction and the first transaction in the data set.
  • V1–V28: principal component variables obtained by PCA.
  • Amount: the transaction amount.
  • Class: the response variable. A value of 1 indicates a fraudulent transaction; otherwise it is 0.
(Figure: credit card transaction records)

Overview of this article

  • Exploratory analysis of the data set
    • Check for unbalanced data
    • Check the number of transactions per hour
    • Check the mean of the PCA variables
  • Data splitting
  • Train the model on the training set
  • Use sampling methods to build balanced data sets

Exploratory analysis of the data set

Let’s use R to summarize the data set and visualize its key features.

Check for unbalanced data

We can see the imbalance in the dependent variable by grouping the values of Class with the group_by function from the dplyr package:

library(dplyr)
creditcard_details$Class <- as.factor(creditcard_details$Class)
creditcardDF <- creditcard_details %>% group_by(Class) %>% summarize(Class_count = n())
print(head(creditcardDF))
# A tibble: 2 x 2
  Class Class_count
  <fct>       <int>
1 0          284315
2 1             492

Use ggplot to see the proportion of data for each category:
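One possible way to draw such a chart with ggplot2 (the log-scaled y axis is our addition, so that the 492 fraud records remain visible next to 284K normal ones):

library(ggplot2)

# Bar chart of class counts; a log10 y axis keeps the tiny fraud class visible
ggplot(creditcard_details, aes(x = Class, fill = Class)) +
  geom_bar() +
  scale_y_continuous(trans = "log10") +
  labs(x = "Class (0 = normal, 1 = fraud)", y = "Count (log10 scale)")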

(Figure: proportion of positive to negative samples)

Check the number of transactions per hour

To view the number of transactions per hour, we first need to normalize the time values and divide each day into four equal parts based on the time of day.
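A sketch of that normalization, assuming Time holds the seconds elapsed since the first transaction:

library(dplyr)

# Derive the hour of day from the elapsed seconds, then cut each day
# into four equal six-hour parts
creditcard_details <- creditcard_details %>%
  mutate(Hour    = floor(Time / 3600) %% 24,
         DayPart = cut(Hour, breaks = c(-1, 5, 11, 17, 23),
                       labels = c("00-05", "06-11", "12-17", "18-23")))

# Transaction counts per hour, split by class
table(creditcard_details$Hour, creditcard_details$Class)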

(Figure: distribution of transaction counts over time)

The chart above shows how the transactions from the two days are distributed over time. The comparison shows that most fraudulent transactions take place between 13:00 and 18:00.

Check the mean of the PCA variables

To detect anomalies in the data, we calculated the mean of each of the V1–V28 variables and examined their variances. As the figure below shows, the abnormal transactions (blue dots) have larger variance.
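A sketch of how those per-variable statistics can be computed (the reshaping approach with tidyr is our choice, not necessarily the original author's):

library(dplyr)
library(tidyr)

# Mean and variance of each PCA variable, separately for Class 0 and Class 1
pca_stats <- creditcard_details %>%
  select(Class, V1:V28) %>%
  pivot_longer(V1:V28, names_to = "Variable", values_to = "Value") %>%
  group_by(Class, Variable) %>%
  summarize(Mean = mean(Value), Variance = var(Value), .groups = "drop")

print(pca_stats)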

(Figure: variance of normal and abnormal records)

Data splitting

When modeling a prediction problem, the data needs to be split into a training set (80% of the data) and a test set (20% of the data). After the split, we perform feature scaling to standardize the range of the independent variables.
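A sketch of the split and the scaling; the caTools package is one common choice here, though the article does not name the splitting function it used:

library(caTools)

set.seed(123)
# Stratified 80/20 split on Class, so both sets contain fraud records
split     <- sample.split(creditcard_details$Class, SplitRatio = 0.8)
train_set <- subset(creditcard_details, split == TRUE)
test_set  <- subset(creditcard_details, split == FALSE)

# Feature scaling on the independent variables (Class is column 31)
train_set[, -31] <- scale(train_set[, -31])
test_set[, -31]  <- scale(test_set[, -31])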

(Figure: splitting into training and test sets)

Train the model on the training set

Building a model on the training set can be divided into the following steps:

  • Train the classifier on the training set.
  • Make predictions on the test set.
  • Evaluate the model’s predictions on the unbalanced data.

From the confusion matrix, the model achieves 99.9% accuracy on the test set, but this is an artifact of the sample imbalance, so for now let’s set aside the accuracy derived from the confusion matrix. Using the ROC curve instead, the model scores 78% on the test set, far lower than the earlier 99.9%.
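A sketch of this evaluation; the decision tree from rpart is an assumed classifier (the article does not name the algorithm), with caret's confusionMatrix and ROSE's roc.curve for scoring:

library(rpart)
library(caret)
library(ROSE)

# Train a classifier on the unbalanced training set and predict on the test set
tree_model <- rpart(Class ~ ., data = train_set)
pred_class <- predict(tree_model, newdata = test_set, type = "class")

# Accuracy looks near-perfect only because 99.8% of the records are Class 0
confusionMatrix(pred_class, test_set$Class)

# The ROC curve (AUC) gives a much less flattering score
roc.curve(test_set$Class, pred_class)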

(Figure: training results on the raw data)

Use sampling methods to build balanced data sets

Next we will use different sampling methods to balance the given data set, check the number of normal and abnormal records in each sampled data set, and finally build models on the balanced data sets.

(Figure: number of positive and negative samples in the original data)

Before sampling, the training set contains 394 abnormal records and roughly 227K normal records.

In R, the ROSE and DMwR packages help us execute a sampling strategy quickly. The ROSE (Random Over-Sampling Examples) package generates data using sampling and a smoothed bootstrap, and it provides a clean calling interface that helps us get things done quickly.

It supports the following sampling methods:

Oversampling

This method oversamples the minority class. Since the training set has roughly 227K majority records, it repeatedly samples the minority class (with replacement) until that class also reaches 227K records, bringing the total to about 454K. This method is selected by specifying method="over".
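A sketch using ROSE's ovun.sample; N is the desired total size, and 454000 assumes roughly 227K majority records in the training set:

library(ROSE)

# Resample the minority class with replacement until the total reaches N
over_data <- ovun.sample(Class ~ ., data = train_set,
                         method = "over", N = 454000)$data
table(over_data$Class)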

(Figure: class counts after oversampling)

Undersampling

This method is the counterpart of oversampling and also produces a data set with equal numbers of normal and abnormal records, but undersampling draws samples without replacement from the majority class. For the data set in this article, because there are so few abnormal records, undersampling discards most of the normal records and we lose key information contained in the sample. This method is selected by specifying method="under".
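A sketch, again via ovun.sample; N = 788 assumes the 394 fraud records counted above, giving two classes of equal size:

# Keep the 394 fraud records and draw 394 normal records without replacement
under_data <- ovun.sample(Class ~ ., data = train_set,
                          method = "under", N = 788)$data
table(under_data$Class)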

(Figure: class counts after undersampling)

Both Sampling

This method combines oversampling and undersampling: the majority class is undersampled without replacement, and the minority class is oversampled with replacement. This method is selected by specifying method="both".
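A sketch; p = 0.5 asks for roughly equal classes, and the total size N is an illustrative choice:

# Undersample the majority class and oversample the minority class at once
both_data <- ovun.sample(Class ~ ., data = train_set,
                         method = "both", p = 0.5, N = 10000, seed = 1)$data
table(both_data$Class)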

ROSE Sampling

The ROSE sampling method generates synthetic data and provides a better estimate of the original data.
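A sketch of the ROSE call; by default it returns a synthetic data set of the same size as the input, split roughly evenly between the classes:

# Generate a synthetic, balanced data set via a smoothed bootstrap
rose_data <- ROSE(Class ~ ., data = train_set, seed = 1)$data
table(rose_data$Class)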

Synthetic Minority Over-Sampling Technique (SMOTE) Sampling

This method can avoid the overfitting that may occur when minority-class samples are repeatedly copied into the data set; for example, an oversampled data set may contain nothing but duplicates of a subset of the minority class. Having reviewed these methods, we apply each of them to the original data set (a sketch of the SMOTE call follows) and then count the two sample types:
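The sketch below uses the SMOTE function from the DMwR package named earlier (note that DMwR has since been archived on CRAN, so installing it may require an archived snapshot; the perc.over and perc.under values are illustrative):

library(DMwR)

# perc.over = 200: create 2 synthetic minority cases per original fraud case
# perc.under = 200: keep 2 majority cases for each synthetic case created
smote_data <- SMOTE(Class ~ ., data = train_set,
                    perc.over = 200, perc.under = 200)
table(smote_data$Class)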

(Figure: number of positive and negative samples in each sampled data set)

The balanced training data sets are then used to retrain the classification model and make predictions on the test data. Since the original data set is unbalanced, we no longer use the accuracy computed from the confusion matrix as the evaluation metric; instead, we use the AUC captured by the ROC curve.
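A sketch of that retraining step, reusing the rpart tree from earlier on the SMOTE-balanced data (any of the balanced sets above could be substituted):

# Retrain on balanced data, predict on the untouched test set, score by AUC
tree_smote <- rpart(Class ~ ., data = smote_data)
pred_smote <- predict(tree_smote, newdata = test_set, type = "class")
roc.curve(test_set$Class, pred_smote)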

(Figure: model training results on the sampled data)

Conclusion

In this experiment, the model trained on the SMOTE-sampled data performed best. Because these sampling methods introduce little extra variance, combining them with a highly robust algorithm such as random forest can yield very high accuracy.

When dealing with an unbalanced data set, all of the sampling methods above can be tried in order to find the one that works best for that data set. For even better results, there are also more advanced sampling methods (see the SMOTE section of this article).

These sampling methods can also be easily implemented in Python; if you want to see the full code, check out the GitHub link provided below.

Training data sets and code

  • Training data set
  • R and Python implementation code for this article