This article is from OPPO's Internet Technology team. You are also welcome to follow our official account, OPPO_tech, where we share OPPO's cutting-edge Internet technology and activities.

In real production systems we often run into a headache: in classification problems, the categorical dependent variable can be severely imbalanced, that is, the proportions of the classes differ drastically.

To address data imbalance, Chawla proposed the SMOTE method in 2002, and it has been widely adopted in both academia and industry. This article analyzes the SMOTE algorithm, several of its variants, and the source code of a mainstream open-source implementation of SMOTE.

1. Overview of data imbalance

1.1 Common Data Imbalance Scenarios

Medical imaging: Cancer cell recognition, ratio of healthy cells to cancer cells 20:1

Astronomy: the ratio of other records to solar wind records is 26:1

CTR: The ratio of unclicked to clicked records is 57:1

In addition to the above examples, scenarios of data imbalance issues include fraudulent transactions, identifying customer churn rates (where the vast majority of customers continue to use the service), natural disasters such as earthquakes, and so on.

1.2 Accuracy assessment in unbalanced scenarios

When confronted with unbalanced data sets, machine learning algorithms tend to produce unsatisfactory classifiers. For any unbalanced data set, if the events to be predicted belong to the minority class and make up less than 5% of the data, they are usually called rare events. Accuracy is not an appropriate indicator of model performance in unbalanced scenarios.

In a utility fraud detection data set, there are the following data:

Total observations = 1000

Fraud observation (positive sample) = 20

Non-fraudulent observation (negative sample) = 980

Proportion of rare events = 2%

In this example, a classifier that assigns every instance to the majority class achieves 98% accuracy, while the 2% of observations in the minority class are effectively treated as noise and discarded. Yet that 2% is exactly what we care about.

So in this case, the accuracy grossly overestimates the performance of the model.
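To make the pitfall concrete, here is a tiny sketch (using the class counts from the example above) showing that a classifier which always predicts the majority class still scores 98% accuracy while catching none of the fraud:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 980 non-fraud (0) and 20 fraud (1) observations, as in the example above
y_true = np.array([0] * 980 + [1] * 20)
# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.98 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- yet every fraud case is missed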

1.3 AUC evaluation in unbalanced scenarios

When evaluating model performance on unbalanced data sets, the AUC, the area under the ROC curve, should be used instead. The ROC curve is the curve traced out by the true positive rate and false positive rate as the model's classification threshold is varied.

False positive rate: FPR = FP / (TN + FP), the probability that a negative sample is mistakenly classified as positive.

True positive rate: TPR = TP / (TP + FN), the probability that a positive sample is classified correctly.

The lower the FPR, the better, and the higher the TPR, the better.

The AUC value is a probability: if you randomly pick one positive sample and one negative sample, the AUC is the probability that the classifier ranks the positive sample ahead of the negative one according to its predicted score. The larger the AUC, the more likely the classifier is to rank positives above negatives, and therefore the better it separates the two classes.

General criteria of AUC:

  • 0.1-0.5: The model performs worse than a random guess

  • 0.5-0.7: Low, but good for forecasting stocks

  • 0.7-0.85: the effect is mediocre

  • 0.85-0.95: Good effect

  • 0.95-1: Very good, but generally unlikely

Here are a few animations to help you understand the AUC better:

In the animation, the orange curve on the left is the score distribution of the negative samples (the majority class) and the purple curve is that of the positive samples (the minority class); the boundary between the two distributions is the threshold the model uses to convert a probability into a 0/1 label (0.5 by default in, for example, logistic regression); the plot on the right shows how the ROC curve changes as the class distributions and the threshold change.

The better the classifier, i.e. the stronger its ability to separate positive and negative samples, the higher the true positive rate and the lower the false positive rate, so the ROC curve bends further toward the upper-left corner. When the ROC curve is the 45-degree diagonal, the model's discrimination ability is no better than random guessing.

The ROC curve is drawn the same way as the PR curve: traverse all thresholds of the model, compute the FPR and TPR at the current threshold, and connect the (FPR, TPR) points for all thresholds into a curve. It follows that moving the decision threshold does not change the model's AUC.

When faced with unbalanced data sets, the ROC curve ignores the class imbalance and only reflects the model's ranking ability. As shown in the figure, when the ratio of positive to negative samples changes but the model's discriminative ability stays the same, the shape of the ROC curve barely changes. Therefore AUC, rather than accuracy, should be used to evaluate model performance on unbalanced data sets.
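As an illustration (toy labels and scores, not from any model in this article), the snippet below traverses the thresholds with roc_curve and also checks the probabilistic interpretation of AUC described above:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted scores for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.65, 0.6, 0.8, 0.9])

# roc_curve traverses the thresholds and returns the (FPR, TPR) points
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))

# The same value from the probability interpretation: the chance that a random
# positive sample is scored higher than a random negative sample (ties count 0.5)
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
print(np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg]))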

1.4 Small experiment: the impact of unbalanced data on model performance

Here is a small experiment to test the model’s performance with different proportions of positive and negative samples.

sklearn.datasets.make_classification was used to generate binary classification data for the experiment, with the proportion of positive samples set to [0.01, 0.05, 0.1, 0.2, 0.5] respectively. The data were two-dimensional and the number of samples was 500. A decision tree with default parameters was fitted, its AUC was calculated, and the separation boundary was drawn.

from sklearn.datasets import *
from sklearn.model_selection import *
from sklearn import tree
from sklearn.metrics import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Decision boundary drawing function
def plot_decision_boundary(X_in, y_in, pred_func):
    x_min, x_max = X_in[:, 0].min() - .5, X_in[:, 0].max() + .5
    y_min, y_max = X_in[:, 1].min() - .5, X_in[:, 1].max() + .5
    h = 0.01
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.get_cmap('GnBu'))
    plt.scatter(X_in[:, 0], X_in[:, 1], c=y_in, cmap=plt.get_cmap('GnBu'))

# Train the decision tree under different proportions of positive and negative samples,
# draw the decision boundary, and calculate the AUC
for weight_minority in [0.01, 0.05, 0.1, 0.2, 0.5]:
    X, y = make_classification(n_samples=500, n_features=2, n_redundant=0,
                               random_state=2, n_clusters_per_class=1,
                               weights=(weight_minority, 1 - weight_minority))
    plt.scatter(X[:, 0], X[:, 1], c=y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=6)
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X_train, y_train)
    plot_decision_boundary(X, y, lambda x: clf.predict(x))
    plt.ion()
    plt.title("Decision Tree with imbalance rate: " + str(weight_minority))
    plt.show()
    print("current auc:"+str(roc_auc_score(y_test, clf.predict(X_test))))
    print("-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -")


Positive and negative sample ratio 1:100, AUC: 0.5

Positive and negative sample ratio 1:20, AUC: 0.8039

The positive and negative sample ratio is 1:10, AUC: 0.9462

The positive and negative sample ratio is 1:5, AUC: 0.9615

Positive and negative sample ratio 1:1, AUC: 0.9598

It can be seen that when the sample is unbalanced, the model performs poorly. When the proportion of a few classes increases, the AUC of the model rapidly improves and the separation boundary becomes more reasonable. When the proportion continues to increase, the AUC drops slightly because the positive and negative samples overlap.

2. Common methods to deal with the problem of data imbalance

This section will provide a simple summary of common methods to deal with data imbalance, which can be divided into four categories according to sample processing methods: oversampling, undersampling, oversampling + undersampling, and anomaly detection.

2.1 Oversampling

Random oversampling: minority-class samples are randomly duplicated multiple times.

SMOTE: new samples are created by interpolation; the principle is explained in detail in Section 3.

ADASYN: the principle is similar to SMOTE, except that when choosing seed samples it uses a KNN classifier to favour the minority samples that are more likely to be misclassified (a usage sketch of these samplers follows).
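For reference, all three samplers are available in imbalanced-learn. A minimal usage sketch on synthetic data (the class and parameter names assume a recent imbalanced-learn release):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# Synthetic imbalanced data: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=0)

for sampler in (RandomOverSampler(random_state=0),
                SMOTE(random_state=0),
                ADASYN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))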

2.2 Undersampling

Random undersampling

A portion of the majority-class samples is randomly selected and deleted. A major disadvantage of random undersampling is that it ignores the sample distribution: the sampling is entirely random, so important information in the majority class may be deleted by mistake.

EasyEnsemble

The majority-class samples are randomly divided into n subsets, each containing as many samples as the minority class, which is equivalent to undersampling. Each subset is then combined with the minority-class samples to train a separate model, and finally the n models are ensembled. Although each subset contains fewer samples than the full data, the total information is not reduced after the ensemble.
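A minimal hand-rolled sketch of this idea, assuming NumPy arrays with binary 0/1 labels where 1 is the minority class, and using plain decision trees where the original method uses boosted learners:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble_scores(X_train, y_train, X_test, n_subsets=5, seed=0):
    """EasyEnsemble-style sketch: one model per balanced subset, scores averaged."""
    rng = np.random.RandomState(seed)
    min_idx = np.flatnonzero(y_train == 1)          # minority-class indices
    maj_idx = np.flatnonzero(y_train == 0)          # majority-class indices
    scores = []
    for _ in range(n_subsets):
        # Undersample the majority class down to the size of the minority class
        sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub_maj])
        clf = DecisionTreeClassifier(random_state=seed).fit(X_train[idx], y_train[idx])
        scores.append(clf.predict_proba(X_test)[:, 1])
    # Integrate the n base learners by averaging their predicted probabilities
    return np.mean(scores, axis=0)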

BalanceCascade

BalanceCascade combines supervised selection with a Boosting-style procedure. In round n, a subset sampled from the majority class is combined with the minority-class samples to train a base learner H; after training, the majority-class samples that H classifies correctly are removed.

In round n+1, a new subset is drawn from the remaining majority-class samples and combined with the minority-class samples for training, and the different base learners are finally ensembled. The supervised aspect of BalanceCascade lies in using each round's base learner to select which majority-class samples to keep, and its Boosting flavour lies in discarding the correctly classified samples each round so that subsequent base learners focus on the samples that are still misclassified.

NearMiss

NearMiss is a prototype-selection method that picks the most representative majority-class samples for training, mainly to alleviate the information loss of random undersampling. It uses heuristic rules to select samples and comes in three variants:

NearMiss-1: select the majority-class samples whose average distance to their K nearest minority-class samples is smallest.

NearMiss-2: select the majority-class samples whose average distance to their K farthest minority-class samples is smallest.

NearMiss-3: for each minority-class sample, select its K nearest majority-class samples, ensuring that every minority-class sample is surrounded by some majority-class samples.

Tomek Link

A Tomek link is the closest pair of samples from different classes: the two samples are each other's nearest neighbours and belong to different classes. If two samples form a Tomek link, then either one of them is noise or both lie near the class boundary. Removing Tomek links therefore "cleans out" the overlapping samples between classes, so that the nearest neighbours of every remaining sample belong to its own class, which makes classification easier.

Edited Nearest Neighbours (ENN)

For a majority-class sample, if more than half of its K nearest neighbours do not belong to the majority class, the sample is removed. A variation of this method removes the sample only if none of its K neighbours belong to the majority class.

Finally, the biggest drawback of these data-cleaning techniques is that the amount of undersampling cannot be controlled. Because they rely on the k-nearest-neighbour rule and, in practice, most majority-class samples are surrounded by other majority-class samples, the number of majority-class samples that can actually be removed is rather limited.
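All of the undersampling methods above also have ready-made implementations in imbalanced-learn. The sketch below (synthetic data; class names as found in recent imbalanced-learn releases) shows how the resulting class counts differ between the prototype-selection and cleaning approaches:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import (RandomUnderSampler, NearMiss,
                                     TomekLinks, EditedNearestNeighbours)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=0)

samplers = [RandomUnderSampler(random_state=0),
            NearMiss(version=1),           # version=1/2/3 picks the NearMiss rule
            TomekLinks(),                  # cleaning: remove Tomek links
            EditedNearestNeighbours()]     # cleaning: the ENN rule
for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))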

2.3 Oversampling + Undersampling

Many experiments have shown that SMOTE + ENN, SMOTE + Tomek, using a combination of oversampling and undersampling together gives better results than these two methods alone.
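Both combinations are available off the shelf in imbalanced-learn's combine module; a minimal sketch on synthetic data:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=0)

# SMOTE first oversamples the minority class, then ENN / Tomek links clean the result
for sampler in (SMOTEENN(random_state=0), SMOTETomek(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))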

2.4 Anomaly Detection Methods

When the minority-class samples do not come from a single distribution, anomaly detection methods can be considered to separate the majority class from the minority class.

Statistical method detection

The statistical method is also relatively simple, generally divided into two steps:

  • First, it is assumed that the full data obey a certain distribution, such as the common normal distribution, Poisson distribution, etc.

  • Then calculate the probability (density) of each point under this distribution, which reduces to evaluating the density function determined by the mean and variance; a minimal sketch follows.
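A minimal sketch of these two steps on one-dimensional data (the normal assumption, the injected anomalies and the density cutoff are all illustrative choices):

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# 990 "normal" points plus 10 injected anomalies far from the bulk of the data
x = np.concatenate([rng.normal(0, 1, 990), rng.normal(8, 1, 10)])

# Step 1: assume a normal distribution and estimate its parameters
mu, sigma = x.mean(), x.std()

# Step 2: density of each point under the fitted distribution; very low density = likely anomaly
density = stats.norm(mu, sigma).pdf(x)
print(np.sum(density < 1e-4))   # number of points the fitted model finds very unlikely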

OneClassSVM

Only one class of data is available for training; the other class (the outliers) is absent, so the decision boundary must be learned from a single class. Training looks for a hyperplane that encloses the positive examples in the sample; at prediction time, that boundary is used to decide, and samples falling inside it are regarded as positive (normal) samples.
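A hedged sketch with sklearn's OneClassSVM, training on "normal" data only and then scoring a mix of normal points and obvious outliers (the data here are synthetic):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(500, 2))             # only one class is available at training time
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),   # normal points
                    rng.uniform(4, 6, size=(5, 2))])  # points far outside the learned region

clf = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale').fit(X_train)
print(clf.predict(X_test))   # +1 = inside the learned boundary, -1 = flagged as outlier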

Isolation Forest

In iForest, an anomaly is defined as a point that is "easy to isolate", which can be understood as a sparsely distributed point far away from any dense cluster. In feature space, sparsely populated regions indicate that events rarely occur there, so data falling in such regions can be considered anomalous.

Isolation Forest is an unsupervised anomaly detection method for continuous numerical data: no labelled samples are needed for training, but the features must be continuous. iForest uses a very efficient strategy to find the points that are easy to isolate: the data set is recursively and randomly partitioned until every sample is isolated. Under this random partitioning strategy, anomalies tend to end up with much shorter paths.
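A minimal sketch with sklearn's IsolationForest; the contamination rate and the injected outliers are made-up values for illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(495, 2)),   # dense cluster
               rng.uniform(5, 8, size=(5, 2))])   # sparse, far-away points

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)              # -1 for points that are isolated with short paths
scores = iso.decision_function(X)    # lower score = easier to isolate = more anomalous
print(np.sum(labels == -1))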

3. The principle of SMOTE

3.1 What is Smote

In 2002, Chawla proposed the SMOTE (Synthetic Minority Over-sampling TEchnique) algorithm, which has been widely adopted in academia and industry. The core idea of SMOTE is to create additional samples by interpolating between minority-class samples.

Why SMOTE? Why not simply oversample at random? The following example comes from the original SMOTE paper.

The data come from a breast cancer X-ray (mammography) dataset; red points are the minority class and green points are the majority class. The figure shows the decision-tree separation boundary obtained after direct random oversampling: random oversampling overfits severely. The model only memorizes the few seed samples that were repeatedly oversampled and treats the other minority samples as noise!

Specifically, for a minority-class sample x_i, SMOTE first finds its k nearest minority-class neighbours (the value of k must be specified in advance), where distance is the Euclidean distance in the n-dimensional feature space. One of the k neighbours, x̂_i, is then chosen at random and a new sample is generated with the formula:

x_new = x_i + δ × (x̂_i − x_i)

where x̂_i is the selected nearest neighbour and δ ∈ [0, 1] is a random number. Because the interpolation is linear, the synthetic sample always lies on the straight line segment between x_i and x̂_i; the illustration uses k = 3 nearest neighbours.

Below is the pseudocode from the original paper for readers who are interested; the core is line 22, the interpolation logic.
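The pseudocode itself is not reproduced here, but the interpolation step is short. Below is a minimal NumPy sketch of it (not the paper's exact code), operating only on the minority-class samples X_min:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_interpolate(X_min, k=5, n_new=100, seed=0):
    """Minimal sketch of the SMOTE interpolation step over minority samples X_min."""
    rng = np.random.RandomState(seed)
    # k + 1 neighbours, because every point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    new_samples = []
    for _ in range(n_new):
        i = rng.randint(len(X_min))       # pick a seed minority sample x_i
        j = neigh[i, rng.randint(k)]      # pick one of its k minority-class neighbours
        delta = rng.uniform()             # random number in [0, 1)
        new_samples.append(X_min[i] + delta * (X_min[j] - X_min[i]))  # interpolate on the segment
    return np.array(new_samples)

X_min = np.random.RandomState(1).normal(size=(30, 2))
print(smote_interpolate(X_min, k=3, n_new=5))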

3.2 Variants of SMOTE

Border-line SMOTE

The original SMOTE picks the minority seed samples for synthesis at random, without considering the surrounding samples, which easily causes two problems:

  1. If the selected minority samples are surrounded by minority samples, the newly synthesized samples will not provide much useful information. This is like support vector machines where points far away from margin have little effect on the decision boundary.

  2. If the selected minority samples are surrounded by the majority samples, such samples may be noise, then the newly synthesized samples will overlap with the majority of the surrounding samples, making it difficult to classify.

In general, we hope that the newly synthesized sample of a few classes will be near the boundary of the two classes, which will often provide enough information for classification. And that’s what the following border-line SMOTE algorithm wants to do.

This algorithm will first divide all the minority class samples into three categories, as shown in the figure below:

  • “Noise” : All k-nearest neighbor samples belong to the majority class

  • “Danger” : more than half of the K-nearest neighbor samples belong to the majority category

  • “Safe” : More than half of the k-nearest neighbor samples belong to a small class

Borderline-SMOTE then randomly selects seed samples only from those in the "danger" state and applies SMOTE to them to create new samples. Samples in the "danger" state are minority samples near the class boundary, and samples near the boundary are the ones most likely to be misclassified. So instead of treating all minority samples equally as the original SMOTE does, Borderline-SMOTE only synthesizes new samples from the minority samples close to the border.

SVM SMOTE

An SVM classifier is used to find the support vectors, and new samples are then synthesized based on those support vectors. Similar to Borderline-SMOTE, SVM-SMOTE also classifies each minority sample as safe, danger or noise using its k nearest neighbours, and then trains the SVM with the danger samples.

Kmeans SMOTE

The samples are clustered before synthesis, and new minority samples are then synthesized in each cluster according to the cluster density. In the clustering step, k-means groups the samples into k clusters. The filtering step selects the clusters to oversample, keeping those with a high proportion of minority samples. The number of synthetic samples is then allocated, assigning more to clusters where the minority samples are sparsely distributed. Finally, the oversampling step applies SMOTE within each selected cluster to reach the target ratio of minority to majority instances.

SMOTE-NC

None of the methods above can handle categorical features, because interpolation cannot be computed on them. When generating a new sample, SMOTE-NC interpolates the continuous features as usual and sets each categorical feature to the value that occurs most frequently among the seed sample's nearest neighbours.
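Most of these variants ship with imbalanced-learn (KMeans-SMOTE may require a newer release, so it is left out here). A hedged usage sketch on synthetic data, where the categorical column for SMOTE-NC is faked purely for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE, SMOTENC

X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=0)

for sampler in (BorderlineSMOTE(kind='borderline-1', random_state=0),
                SVMSMOTE(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))

# SMOTE-NC needs to be told which columns are categorical
X[:, 0] = (X[:, 0] > 0).astype(int)        # pretend column 0 is a categorical feature
sm_nc = SMOTENC(categorical_features=[0], random_state=0)
X_res, y_res = sm_nc.fit_resample(X, y)
print('SMOTENC', Counter(y_res))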

Below is a visualization comparing the oversampling results of SMOTE and several of its variants:

With the original SMOTE, a sample created by interpolation may well cross the true separation boundary into another class's region, while the variants above handle this problem much better.

4. SMOTE source code analysis in imbalanced-learn

A popular implementation of SMOTE comes from the scikit-learn-contrib project imbalanced-learn, whose API follows sklearn's conventions. Here is a sample of how SMOTE is used:

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTE
>>> X, y = make_classification(n_classes=2, class_sep=2,
...                            weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
...                            n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})

>>> sm = SMOTE(random_state=42)
>>> X_res, y_res = sm.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))

Resampled dataset shape Counter({0: 900, 1: 900})

The class inheritance relationship in the source code is as follows:

Below we analyse the source code using Borderline-SMOTE as an example, walking through the key code of the core classes from subclass to superclass (SMOTE -> BorderlineSMOTE -> BaseSMOTE).

SMOTE

The most basic class to use is SMOTE. Through it you can either run the original SMOTE method or, via the (deprecated) kind parameter, dispatch to BorderlineSMOTE or SVMSMOTE.

Samples are generated by calling SMOTE's fit_resample method, which does two things:

The first thing is validation: it checks that the KNN estimator instance is valid and inspects the kind of the current SMOTE instance; if the algorithm is BorderlineSMOTE or SVMSMOTE, the _sample method of the corresponding class is bound to the current object.

The second thing is to call the _sample method. For the basic, original SMOTE, _sample trains a KNN model, obtains the k nearest neighbours of every minority sample, and then calls _make_samples. The interpolation logic mentioned earlier lives in _make_samples, which is implemented by BaseSMOTE, the superclass of SVMSMOTE and BorderlineSMOTE, and is discussed later.

If the algorithm is BorderlineSMOTE or SVMSMOTE, the _sample method implemented in the corresponding superclass (SVMSMOTE or BorderlineSMOTE) is called instead.

class SMOTE(SVMSMOTE, BorderlineSMOTE):
    
    def __init__(self,
                 sampling_strategy='auto',
                 random_state=None,
                 k_neighbors=5,
                 m_neighbors='deprecated',
                 out_step='deprecated',
                 kind='deprecated',
                 svm_estimator='deprecated',
                 n_jobs=1,
                 ratio=None):
        # FIXME: in 0.6 call super()
        BaseSMOTE.__init__(self, sampling_strategy=sampling_strategy,
                           random_state=random_state, k_neighbors=k_neighbors,
                           n_jobs=n_jobs, ratio=ratio)
        self.kind = kind
        self.m_neighbors = m_neighbors
        self.out_step = out_step
        self.svm_estimator = svm_estimator
        self.n_jobs = n_jobs

    # Mainly used to check that the KNN estimator instance is valid, and also to check the kind of the current SMOTE instance
    def _validate_estimator(self):
        # FIXME: in 0.6 call super()
        BaseSMOTE._validate_estimator(self)
        # FIXME:Remove in 0.6 after deprecation cycle

        # Determine the model type and bind the model's SAMPLE method to the current object
        if self.kind != 'deprecated' and not (self.kind == 'borderline-1' or
                                              self.kind == 'borderline-2'):
            if self.kind not in SMOTE_KIND:
                raise ValueError('Unknown kind for SMOTE algorithm.'
                                 ' Choices are {}. Got {} instead.'.format(
                                     SMOTE_KIND, self.kind))
            else:
                warnings.warn('"kind" is deprecated in 0.4 and will be '
                              'removed in 0.6. Use SMOTE, BorderlineSMOTE or '
                              'SVMSMOTE instead.', DeprecationWarning)
            # If the kind is borderline, bind the _sample method of the BorderlineSMOTE class
            if self.kind == 'borderline1' or self.kind == 'borderline2':
                self._sample = types.MethodType(BorderlineSMOTE._sample, self)
                self.kind = ('borderline-1' if self.kind == 'borderline1'
                             else 'borderline-2')

            elif self.kind == 'svm':
                self._sample = types.MethodType(SVMSMOTE._sample, self)

                if self.out_step == 'deprecated':
                    self.out_step = 0.5
                else:
                    warnings.warn('"out_step" is deprecated in 0.4 and will '
                                  'be removed in 0.6. Use SVMSMOTE class '
                                  'instead.', DeprecationWarning)

                if self.svm_estimator == 'deprecated':
                    warnings.warn('"svm_estimator" is deprecated in 0.4 and '
                                  'will be removed in 0.6. Use SVMSMOTE class '
                                  'instead.', DeprecationWarning)
                if (self.svm_estimator is None or
                        self.svm_estimator == 'deprecated'):
                    self.svm_estimator_ = SVC(gamma='scale',
                                              random_state=self.random_state)
                elif isinstance(self.svm_estimator, SVC):
                    self.svm_estimator_ = clone(self.svm_estimator)
                else:
                    raise_isinstance_error('svm_estimator', [SVC],
                                           self.svm_estimator)

            if self.kind != 'regular':
                if self.m_neighbors == 'deprecated':
                    self.m_neighbors = 10
                else:
                    warnings.warn('"m_neighbors" is deprecated in 0.4 and '
                                  'will be removed in 0.6. Use SVMSMOTE class '
                                  'or BorderlineSMOTE instead.',
                                  DeprecationWarning)

                self.nn_m_ = check_neighbors_object(
                    'm_neighbors', self.m_neighbors, additional_neighbor=1)
                self.nn_m_.set_params(**{'n_jobs': self.n_jobs})

    # FIXME:To be removed in 0.6
    def _fit_resample(self, X, y):
        self._validate_estimator()
        return self._sample(X, y)

    # Sampling key functions
    def _sample(self, X, y):
        # FIXME:Uncomment in version 0.6
        # self._validate_estimator()
        X_resampled = X.copy()
        y_resampled = y.copy()

        for class_sample, n_samples in self.sampling_strategy_.items():
            if n_samples == 0:
                continue
            target_class_indices = np.flatnonzero(y == class_sample)
            X_class = safe_indexing(X, target_class_indices)
      # Train a KNN classifier to get the K nearest neighbors of a few classes
            self.nn_k_.fit(X_class)
            nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
            # The actual oversampling method _make_samples is implemented by BaseSMOTE
            X_new, y_new = self._make_samples(X_class, y.dtype, class_sample,
                                              X_class, nns, n_samples, 1.0)
            # Supports oversampling of both dense and sparse data
            if sparse.issparse(X_new):
                X_resampled = sparse.vstack([X_resampled, X_new])
                sparse_func = 'tocsc' if X.format == 'csc' else 'tocsr'
                X_resampled = getattr(X_resampled, sparse_func)()
            else:
                X_resampled = np.vstack((X_resampled, X_new))
            y_resampled = np.hstack((y_resampled, y_new))

        return X_resampled, y_resampled

BorderlineSMOTE: this class can be invoked indirectly by passing kind='borderline1'/'borderline2' to SMOTE, or used directly. The core function of the class is the _sample method, which does two things:

The first thing, similar to SMOTE, is validation: it checks that the KNN estimator is valid and whether the user requested borderline-1 or borderline-2.

The second thing is the logic that generates the new samples. The interpolation itself again calls BaseSMOTE's _make_samples method; the difference from the original SMOTE is that the minority samples are first divided into safe and danger sets (also implemented in BaseSMOTE), the seed samples are taken from the danger set, and new samples are generated following the different strategies of borderline-1 and borderline-2.

See the comments in the code for the detailed logic.

class BorderlineSMOTE(BaseSMOTE):
       def __init__(self,
                 sampling_strategy='auto',
                 random_state=None,
                 k_neighbors=5,
                 n_jobs=1,
                 m_neighbors=10,
                 kind='borderline-1'):
        super().__init__(
            sampling_strategy=sampling_strategy,
            random_state=random_state,
            k_neighbors=k_neighbors,
            n_jobs=n_jobs,
            ratio=None)
        self.m_neighbors = m_neighbors
        self.kind = kind

    def _validate_estimator(self):
        super()._validate_estimator()
        self.nn_m_ = check_neighbors_object('m_neighbors', self.m_neighbors,
                                            additional_neighbor=1)
        self.nn_m_.set_params(**{'n_jobs': self.n_jobs})
        if self.kind not in ('borderline-1', 'borderline-2'):
            raise ValueError('The possible "kind" of algorithm are '
                             '"borderline-1" and "borderline-2".'
                             'Got {} instead.'.format(self.kind))

    # FIXME: rename _sample -> _fit_resample in 0.6
    def _fit_resample(self, X, y):
        return self._sample(X, y)

    def _sample(self, X, y):
        self._validate_estimator()
         # to get the copy
        X_resampled = X.copy()
        y_resampled = y.copy()

        for class_sample, n_samples in self.sampling_strategy_.items():
            if n_samples == 0:
                continue

            # Get the index of a few classes
            target_class_indices = np.flatnonzero(y == class_sample)

            # Get a sample list of a few classes
            X_class = safe_indexing(X, target_class_indices)

            # Train KNN model with full samples (this model is used to calculate risk samples)
            self.nn_m_.fit(X)

            # Get index list of dangerous samples
            danger_index = self._in_danger_noise(
                self.nn_m_, X_class, class_sample, y, kind='danger')

            # Skip if there are no hazardous samples
            if not any(danger_index):
                continue

            # Train a KNN model on the minority-class samples only
            self.nn_k_.fit(X_class)
 
            # Get the nearest neighbor of the dangerous sample
            nns = self.nn_k_.kneighbors(safe_indexing(X_class, danger_index),
                                        return_distance=False)[:, 1:]

            # divergence between borderline-1 and borderline-2
            # Borderline-1: the interpolation neighbours come from the minority class only
            if self.kind == 'borderline-1':
                # Create synthetic samples for borderline points.
                X_new, y_new = self._make_samples(
                    safe_indexing(X_class, danger_index), y.dtype,
                    class_sample, X_class, nns, n_samples)
                if sparse.issparse(X_new):
                    X_resampled = sparse.vstack([X_resampled, X_new])
                else:
                    X_resampled = np.vstack((X_resampled, X_new))
                y_resampled = np.hstack((y_resampled, y_new))

            # Borderline-2: the interpolation neighbour may belong to any class
            elif self.kind == 'borderline-2':
                random_state = check_random_state(self.random_state)
                fractions = random_state.beta(10, 10)

                # only minority
                X_new_1, y_new_1 = self._make_samples(
                    safe_indexing(X_class, danger_index),
                    y.dtype,
                    class_sample,
                    X_class,
                    nns,
                    int(fractions * (n_samples + 1)),
                    step_size=1.)

                # we use a one-vs-rest policy to handle the multiclass in which
                # new samples will be created considering not only the majority
                # class but all over classes.
                X_new_2, y_new_2 = self._make_samples(
                    safe_indexing(X_class, danger_index),
                    y.dtype,
                    class_sample,
                    safe_indexing(X, np.flatnonzero(y != class_sample)),
                    nns,
                    int((1 - fractions) * n_samples),
                    step_size=0.5)

                if sparse.issparse(X_resampled):
                    X_resampled = sparse.vstack(
                        [X_resampled, X_new_1, X_new_2])
                else:
                    X_resampled = np.vstack((X_resampled, X_new_1, X_new_2))
                y_resampled = np.hstack((y_resampled, y_new_1, y_new_2))
        return X_resampled, y_resampled

BaseSMOTE

BaseSMOTE is the superclass of SVMSMOTE and BorderlineSMOTE and implements the methods the subclasses need, namely the _make_samples, _in_danger_noise and _generate_sample methods mentioned earlier.

  • _make_samples mainly implements the logic for sample traversal.

  • _in_danger_noise implements danger/safe/noise logic.

  • _generate_sample is the interpolation logic mentioned earlier.

See the comments for a detailed code analysis.

SMOTE_KIND = ('regular', 'borderline1', 'borderline2', 'svm')
class BaseSMOTE(BaseOverSampler):
    """Base class for the different SMOTE algorithms."""
    def __init__(self,
                 sampling_strategy='auto',
                 random_state=None,
                 k_neighbors=5,
                 n_jobs=1,
                 ratio=None):
        super().__init__(
            sampling_strategy=sampling_strategy, ratio=ratio)
        self.random_state = random_state
        self.k_neighbors = k_neighbors
        self.n_jobs = n_jobs

    def _validate_estimator(self):
        """Check the NN estimators shared across the different SMOTE algorithms. """
        self.nn_k_ = check_neighbors_object(
            'k_neighbors', self.k_neighbors, additional_neighbor=1)
        self.nn_k_.set_params(**{'n_jobs': self.n_jobs})

    # Function that makes the synthetic samples
    def _make_samples(self,
                      X,
                      y_dtype,
                      y_type,
                      nn_data,
                      nn_num,
                      n_samples,
                      step_size=1.):
        """A support function that returns artificial samples constructed along the line connecting nearest neighbours. Parameters  ---------- X : {array-like, sparse matrix}, shape (n_samples, n_features) Points from which the points will be created. y_dtype : dtype The data type of the targets. y_type : str or int The minority target value, just so the function can return the target values for the synthetic variables with correct length in a clear format. nn_data : ndarray, shape (n_samples_all, n_features) Data set carrying all the neighbours to be used nn_num : ndarray, shape (n_samples_all, k_nearest_neighbours) The nearest neighbours of each sample in `nn_data`. n_samples : int The number of samples to generate. step_size : float, optional (default=1.) The step size to create samples. Returns ------- X_new : {ndarray, sparse matrix}, shape (n_samples_new, n_features) Synthetically generated samples. y_new : ndarray, shape (n_samples_new,) Target values for synthetic samples. """
         # Get the current instance of Random_state
        random_state = check_random_state(self.random_state)
        # Get an array whose length equals the number of samples to generate;
        # each value encodes which (seed sample, neighbour) pair will be used
        # (a single integer stores both the row and the column coordinate)
        samples_indices = random_state.randint(
            low=0, high=len(nn_num.flatten()), size=n_samples)
        # Step size factor, default 1; if greater than 1 the generated sample lies on
        # the extension line beyond the seed sample and its nearest neighbour
        steps = step_size * random_state.uniform(size=n_samples)
        # Integer division by the number of neighbours gives the row index (the seed sample)
        rows = np.floor_divide(samples_indices, nn_num.shape[1])
        # The remainder gives the column index (which of the k neighbours)
        cols = np.mod(samples_indices, nn_num.shape[1])
        # Create the label column for the generated samples
        y_new = np.array([y_type] * len(samples_indices), dtype=y_dtype)

        # If the input X is a sparse matrix
        if sparse.issparse(X):
            # Record the rows, columns and values used to build the sparse result
            row_indices, col_indices, samples = [], [], []

            for i, (row, col, step) in enumerate(zip(rows, cols, steps)):
                # If the current sample is not empty (has non-zero entries)
                if X[row].nnz:
                    # Generate synthetic samples
                    sample = self._generate_sample(X, nn_data, nn_num,
                                                   row, col, step)
                    # Record the row and column indices
                    row_indices += [i] * len(sample.indices)
                    col_indices += sample.indices.tolist()
                    # Record sample
                    samples += sample.data.tolist()
             # Return the sparse matrix csr_matrix composed of generated samples
            return (sparse.csr_matrix((samples, (row_indices, col_indices)),
                                      [len(samples_indices), X.shape[1]],
                                      dtype=X.dtype),
                    y_new)

        # If X is not a sparse matrix
        else:
            # Construct the ndarray that will hold the generated samples
            X_new = np.zeros((n_samples, X.shape[1]), dtype=X.dtype)
            for i, (row, col, step) in enumerate(zip(rows, cols, steps)):
                # Generate the sample for the current (row, col) pair and place it at position i
                X_new[i] = self._generate_sample(X, nn_data, nn_num,
                                                 row, col, step)
            return X_new, y_new

    # Function that generates a single sample
    def _generate_sample(self, X, nn_data, nn_num, row, col, step):
        r"""Generate a synthetic sample.

        The rule for the generation is:

        .. math::
           \mathbf{s_{s}} = \mathbf{s_{i}} + \mathcal{u}(0, 1) \times
           (\mathbf{s_{i}} - \mathbf{s_{nn}}) \,

        where \mathbf{s_{s}} is the new synthetic samples, \mathbf{s_{i}} is
        the current sample, \mathbf{s_{nn}} is a randomly selected neighbors of
        \mathbf{s_{i}} and \mathcal{u}(0, 1) is a random number between [0, 1).

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            Points from which the points will be created.

        nn_data : ndarray, shape (n_samples_all, n_features)
            Data set carrying all the neighbours to be used.

        nn_num : ndarray, shape (n_samples_all, k_nearest_neighbours)
            The nearest neighbours of each sample in `nn_data`.

        row : int
            Index pointing at feature vector in X which will be used
            as a base for creating new sample.

        col : int
            Index pointing at which nearest neighbor of base feature vector
            will be used when creating new sample.

        step : float
            Step size for new sample.

        Returns
        -------
        X_new : {ndarray, sparse matrix}, shape (n_features,)
            Single synthetically generated sample.

        """
        # X[row] is the seed sample
        # step is the random interpolation ratio
        # nn_num[row, col] is the index of the chosen nearest neighbour within the full sample set
        # nn_data[nn_num[row, col]] is the neighbour sample used for the interpolation
        return X[row] - step * (X[row] - nn_data[nn_num[row, col]])

    # Check whether the samples are in danger or are noise
    def _in_danger_noise(self, nn_estimator, samples, target_class, y,
                         kind='danger'):
        """Estimate if a set of sample are in danger or noise.

        Used by BorderlineSMOTE and SVMSMOTE.

        Parameters
        ----------
        nn_estimator : estimator
            An estimator that inherits from
            :class:`sklearn.neighbors.base.KNeighborsMixin` use to determine
            if a sample is in danger/noise.

        samples : {array-like, sparse matrix}, shape (n_samples, n_features)
            The samples to check if either they are in danger or not.

        target_class : int or str
            The target corresponding class being over-sampled.

        y : array-like, shape (n_samples,)
            The true label in order to check the neighbour labels.

        kind : str, optional (default='danger')
            The type of classification to use. Can be either:

            - If 'danger', check if samples are in danger,
            - If 'noise', check if samples are noise.

        Returns
        -------
        output : ndarray, shape (n_samples,)
            A boolean array where True refer to samples in danger or noise.
        """
        # Take the k nearest neighbours of each target sample and put them into a matrix
        x = nn_estimator.kneighbors(samples, return_distance=False)[:, 1:]
 
        # Label matrix of the neighbours: majority-class neighbours are 1, minority-class neighbours are 0
        nn_label = (y[x] != target_class).astype(int)
        # Number of majority-class neighbours for each sample
        n_maj = np.sum(nn_label, axis=1)

        # If it is dangerous value
        if kind == 'danger':
            # Samples are in danger for m/2 <= m' < m
            # If the number of majority-class neighbours is at least half of k but less than k, the sample is in danger
            # Bitwise AND over the two boolean conditions, which is more efficient than looping
            return np.bitwise_and(n_maj >= (nn_estimator.n_neighbors - 1) / 2,
                                  n_maj < nn_estimator.n_neighbors - 1)
        elif kind == 'noise':
            # Samples are noise for m = m'
            # All neighbors are majority classes, that is noise
            return n_maj == nn_estimator.n_neighbors - 1
        else:
            raise NotImplementedError

Source link: github.com/scikit-lear…

5. SMOTE in practice

5.1 SMOTE parameters in imbalanced-learn


SMOTE(ratio='auto', random_state=None, k_neighbors=5, m_neighbors=10,
      out_step=0.5, kind='regular', svm_estimator=None, n_jobs=1)

ratio: specifies the resampling proportion. If a string is given, it can be 'minority' (resample the minority class), 'majority' (resample the majority class), 'not minority' (an undersampling setting) or 'all' (an oversampling setting); the default 'auto' is equivalent to 'all'. If a dict is given, the keys are the class labels and the values are the desired number of samples for each class;

Random_state: Specifies the seed of the random number generator. Defaults to None to indicate that the default random number generator is used.

K_neighbors: specifies the number of neighbors. The default value is 5.

M_neighbors: specifies the number of samples to be randomly selected from the neighbor samples. Default is 10.

kind: specifies which SMOTE variant is used to create new samples. The default 'regular' applies the original SMOTE to randomly chosen minority samples; the other options are 'borderline1', 'borderline2' and 'svm';

svm_estimator: specifies the SVM classifier, defaulting to sklearn.svm.SVC. This parameter is used to generate support vectors with the SVM classifier, from which new minority-class samples are synthesized;

n_jobs: specifies the number of CPUs used by the SMOTE algorithm during oversampling. The default of 1 means the algorithm runs on a single CPU, i.e. without parallelism.
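Putting the parameters together, here is a hedged example using the (older) signature documented above; note that in newer imbalanced-learn releases ratio and kind have been replaced by sampling_strategy and the dedicated BorderlineSMOTE / SVMSMOTE classes:

from imblearn.over_sampling import SMOTE

# Oversample only the minority class, interpolating with 5 nearest neighbours
sm = SMOTE(ratio='minority', random_state=0, k_neighbors=5, kind='regular')

# Or request explicit per-class sample counts with a dict
# (class label -> desired number of samples after resampling)
sm_dict = SMOTE(ratio={1: 900}, random_state=0)

# X_res, y_res = sm.fit_sample(X, y)   # older API name, as used in the example below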

5.2 Applying SMOTE to the telecom CHURN data set

This data set comes from the historical customer records of a German telecom operator. It contains 5000 records and 17 features in total; the label churn is a binary variable, where yes means the customer churned and no means the customer did not.

The remaining independent variables include whether the customer subscribes to the international call plan and the voicemail plan, the number of voicemail messages, call minutes, call charges, the number of calls, and so on. Next, this data set is used to explore the effect of balancing the unbalanced data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import tree
from imblearn.over_sampling import SMOTE
from sklearn.metrics import *
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Clean data
churn=pd.read_csv(r'C:\Users\Administrator\Desktop\Work\data\churn_all.txt',sep='\t')
churn=churn.drop(['Instance_ID'],axis=1)
col=['State','Account Length','Area Code','Phone','Intl Plan','VMail Plan','VMail Message','Day Mins','Day Calls','Day Charge','Eve Mins','Eve Calls','Eve Charge','Night Mins','Night Calls','Night Charge','Intl Mins','Intl Calls','Intl Charge','CustServ Calls','churn']
churn.columns=col
churn['churn'].value_counts()
# The positive and negative sample ratio is 5:1
plt.rcParams['font.sans-serif'] = ['Microsoft Yahei']
plt.axes(aspect='equal')
counts=churn.churn.value_counts()
plt.pie(x=counts,labels=pd.Series(counts.index))
plt.show()
# Data cleaning
churn.drop(labels=['State','Area Code','Phone'],axis=1,inplace=True)
churn['Intl Plan']=churn['Intl Plan'].map({' no': 0, ' yes': 1})
churn['VMail Plan']=churn['VMail Plan'].map({' no': 0, ' yes': 1})
churn['churn']=churn['churn'].map({' False.': 0, ' True.': 1})
# Build training sets and test sets
predictors=churn.columns[:-1]
X_train,X_test,y_train,y_test=model_selection.train_test_split(churn[predictors],churn.churn,random_state=12)
# Train LR model with unbalanced data and check AUC
lr=LogisticRegression()
lr.fit(X_train,y_train)
pred=lr.predict(X_test)
roc_auc_score(y_test, pred)
# Plot the ROC curve and calculate the AUC
fpr,tpr,threshold=metrics.roc_curve(y_test,pred)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,color='steelblue',alpha=0.5,edgecolor='black')
plt.plot(fpr,tpr,color='black',lw=1)
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.text(0.5,0.3,'ROC curve (area = %0.3f)' % roc_auc)
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.show()
# SMOTE, resampling
over_samples=SMOTE(random_state=1234)
over_samples_X,over_samples_y=over_samples.fit_sample(X_train,y_train)
# Sample number comparison before and after sampling
print(y_train.value_counts())
print(pd.Series(over_samples_y).value_counts())
# Train LR model with the balanced data
lr2=LogisticRegression()
lr2.fit(over_samples_X,over_samples_y)
pred2=lr2.predict(X_test)
fpr,tpr,threshold=metrics.roc_curve(y_test,pred2)
roc_auc=metrics.auc(fpr,tpr)
plt.stackplot(fpr,tpr,color='steelblue',alpha=0.5,edgecolor='black')
plt.plot(fpr,tpr,color='black',lw=1)
plt.plot([0,1],[0,1],color='red',linestyle='--')
plt.text(0.5,0.3,'ROC curve (area = %0.3f)' % roc_auc)
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
plt.show()

The ratio of positive and negative samples is 5:1

AUC before oversampling: 0.55

AUC after oversampling: 0.735

The AUC of the model trained on the oversampled data improved by almost 20 percentage points!

References

1 How to handle Imbalanced Classification Problems in machine learning?

www.analyticsvidhya.com/blog/2017/0…

2 SMOTE: Synthetic Minority Over-sampling Technique

3 SMOTE algorithm analysis (CSDN blog)

Blog.csdn.net/qq_33472765…

4 imbalanced-learn User Guide

imbalanced-learn.org/en/stable/u…

5 A scikit-learn-contrib to tackle learning from imbalanced data

glemaitre.github.io/talks/2018_…

Finally, a few job listings.

OPPO Offers multiple positions in Internet Technology field:

The advertising backend team focuses on developing the core services of the advertising system, such as ad management, retrieval, billing and statistics. We invite candidates who are capable of designing and tuning distributed system architectures, have hands-on experience with highly available and highly concurrent systems, and have a strong interest in computational advertising to join us and build an intelligent advertising platform together.

Resume delivery: chenquan#oppo.com

The client team is committed to researching commercial monetization solutions for applications and games on Android phones and helping them monetize quickly through the commercial SDK. We sincerely invite Android developers who are interested in app and game monetization solutions and have over 3 years of development experience to join us and grow with the team and the business.

Resume delivery: liushun#oppo.com

The Data Tag team is committed to penetrating big data to understand the business interests of each OPPO user. We sincerely invite you with more than 2 years experience in data analysis, big data processing, machine learning/deep learning, NLP to join us and grow together with our team and business!

Resume: ping.wang#oppo.com