This GitHub project collects papers, code, frameworks, libraries, and other resources for imbalanced learning:

https://github.com/ZhiningLiu1998/awesome-imbalanced-learning

Preface

Class imbalance, also known as the long-tail problem, refers to classification problems in which the classes are not equally represented in the data set: some classes have a very large number of samples while others have very few. This is a very common problem in practical applications, for example fraud detection, prediction of rare adverse drug reactions, and gene family prediction. Class imbalance biases the classifier toward the majority class and degrades the performance of the classification model. The goal of imbalanced learning is therefore to overcome the class imbalance and learn an unbiased model from imbalanced data.

The directory is as follows:

  • Code base/framework
    • Python
    • R
    • Java
    • Scala
    • Julia
  • Papers
    • Review
    • Deep learning
    • Data resampling
    • Cost-sensitive learning
    • Ensemble learning
    • Anomaly detection
  • Other
    • Imbalanced datasets
    • Other resources

In addition, papers and frameworks marked with 🉑 are those the author specifically recommends as important or of high quality.


Code base/framework

Python

imbalanced-learn

Website: https://imbalanced-learn.org/stable/

Github:https://github.com/scikit-learn-contrib/imbalanced-learn

The official document: https://imbalanced-learn.readthedocs.io/en/stable/

Paper: http://www.jmlr.org/papers/volume18/16-365/16-365.pdf

imbalanced-learn is a Python library that provides resampling techniques commonly used on imbalanced data sets. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

🉑 written in Python, easy to use
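
As a quick illustration, here is a minimal sketch I added (not from the original post), assuming a recent imbalanced-learn version with the fit_resample API; resamplers follow the scikit-learn estimator style:

```python
# Minimal sketch: resampling with imbalanced-learn (assumes imbalanced-learn
# >= 0.4, where resamplers expose fit_resample).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# A toy binary problem with roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))                       # roughly {0: 900, 1: 100}

# Over-sample the minority class with SMOTE...
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_over))                  # classes are now balanced

# ...or randomly under-sample the majority class.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))                 # classes are now balanced
```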

R

  • smote_variants

Website: https://smote-variants.readthedocs.io/en/latest/

Documents: https://smote-variants.readthedocs.io/en/latest/

Github: https://github.com/analyticalmindsltd/smote_variants

A collection of 85 over-sampling techniques for imbalanced learning, including multi-class over-sampling and model-selection features (with R and Julia support)

  • caret

Website: https://cran.r-project.org/web/packages/caret/index.html

Documents: http://topepo.github.io/caret/index.html

Github:https://github.com/topepo/caret

Implements random under-sampling and over-sampling methods

  • ROSE

Website: https://cran.r-project.org/web/packages/ROSE/index.html

Documents: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3

Implements ROSE (Random Over-Sampling Examples)

  • DMwR

Website: https://cran.r-project.org/web/packages/DMwR/index.html

Documents: https://www.rdocumentation.org/packages/DMwR/versions/0.4.1

Implements SMOTE (Synthetic Minority Over-sampling TEchnique)

Java

KEEL

Website: https://sci2s.ugr.es/keel/description.php

Github:https://github.com/SCI2SUGR/KEEL

Paper: https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/0758_Alcalaetal-SoftComputing-Keel1.0.pdf

KEEL provides a simple graphical interface to design experiments with different data sets based on data flows, using algorithms from different intelligent-computation paradigms (with a special focus on evolutionary algorithms), in order to assess the behavior of the algorithms. The tool includes many widely used imbalanced learning methods, such as over-sampling and under-sampling, cost-sensitive learning, algorithm modification, and ensemble learning methods.

🉑 contains a variety of algorithms, including classical classification, regression, and preprocessing algorithms

Scala

undersampling

Website: https://github.com/NestorRV/undersampling

Documents: https://nestorrv.github.io/

Github:https://github.com/NestorRV/undersampling

Implements under-sampling methods and ensemble variants of them.

Julia

smote_variants

Website: https://smote-variants.readthedocs.io/en/latest/

Documents: https://smote-variants.readthedocs.io/en/latest/

Github: https://github.com/analyticalmindsltd/smote_variants

A collection of 85 over-sampling techniques for imbalanced learning, including multi-class over-sampling and model-selection features (with R and Julia support)
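
For reference, a minimal sketch of the smote_variants Python API (the package itself is written in Python with R and Julia interfaces; the class name distance_SMOTE follows the project README and may vary across versions):

```python
# Minimal sketch of the smote_variants API; distance_SMOTE is one of the
# ~85 implemented oversamplers (name taken from the project README).
import smote_variants as sv
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

oversampler = sv.distance_SMOTE()
X_samp, y_samp = oversampler.sample(X, y)   # over-sampled features and labels
```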


Papers

  • Learning from Imbalanced Data (2009, 4700+ citations), a classic paper. It systematically reviews the popular solutions, evaluation criteria, and the challenges and open problems for future research (as of 2009).

🉑 Classic work

  • Learning from Imbalanced Data: Open Challenges and Future Directions (2016, 400+ citations), a paper focusing on the open problems and challenges of imbalanced learning, such as extreme class imbalance, imbalanced online/streaming learning, multi-class imbalanced learning, and semi-supervised or unsupervised imbalanced learning.
  • Learning from Class-Imbalanced Data: Review of Methods and Applications (2017, 400+ citations), a very detailed review of imbalanced learning methods and applications covering a total of 527 related papers. It provides a detailed taxonomy of existing methods and also surveys recent research trends in the field.

🉑 a systematic and detailed review, with a taxonomy of existing methods

Deep learning

  • Review

    • A systematic study of the class imbalance problem in convolutional neural networks (2018, 330+ citations)
    • A Survey on Deep Learning with Class Imbalance (2019, 50+ citations)

    🉑 recent comprehensive paper on class imbalance in deep learning

  • Hard example mining

    • Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016, 840+ citations): in the late stage of training, back-propagate gradients only for the “hard” examples (e.g., examples with large loss values)
  • Loss function engineering

    • Training deep neural networks on imbalanced data sets (IJCNN 2016, 110+ citations), a mean (squared) false error loss that captures classification errors from the majority class and the minority class equally
    • Focal Loss for Dense Object Detection [Code (Unofficial)] (ICCV 2017, 2600+ citations), a uniform loss function that focuses training on a sparse set of hard examples and prevents the large number of easy negatives from overwhelming the detector during training (see the sketch after this list)

    🉑 Elegant solution, high impact

    • Deep imbalanced attribute classification using visual attention aggregation [Code] (ECCV 2018, 30+ citations)
    • Imbalanced Deep Learning by Minority Class Incremental Rectification (TPAMI 2018, 60+ citations): in iterative batch-wise learning, minimizes the dominant effect of the majority classes by discovering the sparsely sampled boundaries of the minority classes
    • Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss [Code] (NIPS 2019, 10+ citations), a theoretically principled label-distribution-aware margin loss (LDAM), motivated by minimizing a margin-based generalization bound
    • Gradient Harmonized Single-stage Detector [Code] (AAAI 2019, 40+ citations): compared with Focal Loss, which only down-weights “easy” negative samples, GHM also down-weights “hard” samples that are likely to be outliers

    🉑 Interesting idea: harmonizing sample contributions according to gradient distribution

    • Class-Balanced Loss Based on Effective Number of Samples (CVPR 2019, 70+ citations), a simple and general mechanism for class re-weighting based on the effective number of samples (see the sketch after this list).
  • Meta-learning

    • Learning to Model the Tail (NIPS 2017, 70+ citations): transfers meta-knowledge from the data-rich classes at the head of the distribution to the data-poor classes at the tail;

    • Learning to Reweight Examples for Robust Deep Learning [Code] (ICML 2018, 150+ citations): implicitly learns a weighting function to re-weight samples during the gradient updates of a deep neural network.

      🉑 representative work on solving the problem of class imbalance through meta-learning.

    • Meta-Weight-Net: Learning an Explicit Mapping for Sample Weighting [Code] (NIPS 2019): explicitly learns a weighting function (with a multi-layer perceptron as the function approximator) to re-weight samples;

    • Learning Data Manipulation for Augmentation and Weighting [Code] (NIPS 2019)

    • Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks [Code] (ICLR 2020)

  • Representation learning

    • Learning Deep Representation for Imbalanced Classification (CVPR 2016, 220+ citations)
    • Supervised Class Distribution Learning for GANs-Based Imbalanced Classification (ICDM 2019)
    • Decoupling Representation and Classifier for Long-tailed Recognition (ICLR 2020)
  • Curriculum learning

    • Dynamic Curriculum Learning for Imbalanced Data Classification (ICCV 2019)
  • Two-stage learning

    • Brain tumor segmentation with Deep Neural Networks (2017, 1200+ citations): first pre-train on a class-balanced data set, then fine-tune the final output layer (before the softmax) on the original, class-imbalanced data set;
  • Network architecture

    • BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition (CVPR 2020)
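
To make the two 🉑 loss-engineering ideas above concrete, here is a short hedged sketch I added (my own PyTorch illustration, not the authors' reference code) of the binary focal loss and of class-balanced re-weighting by the effective number of samples:

```python
# Illustrative PyTorch sketch (not the papers' official code):
# focal loss (Lin et al., ICCV 2017) and class-balanced weights based on
# the effective number of samples (Cui et al., CVPR 2019).
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); targets are 0/1 floats."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Weights proportional to (1 - beta) / (1 - beta^n_c), normalized to sum to C."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float)
    effective_num = 1.0 - torch.pow(torch.tensor(beta), n)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(samples_per_class)

# Usage: weight the standard cross-entropy by the effective number of samples.
logits = torch.randn(8, 3)                 # 8 samples, 3 classes
labels = torch.randint(0, 3, (8,))
w = class_balanced_weights([900, 90, 10])  # class counts of an imbalanced set
loss = F.cross_entropy(logits, labels, weight=w)
```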

Data resampling

  • Oversampling

    • ROS [Code], random over-sampling

    • SMOTE [Code] (2002, 9800+ citations), Synthetic Minority Over-sampling TEchnique (a from-scratch sketch follows at the end of this section).

      🉑 Classic work

    • Borderline-SMOTE [Code] (2005, 1400+ citations), over-sampling of borderline minority examples with SMOTE;

    • ADASYN [Code] (2008, 1100+ citations), ADAptive SYNthetic Sampling;

    • SPIDER [Code (Java)] (2008, 150+ citations), selective preprocessing of imbalanced data;

    • Safe-Level-SMOTE [Code (Java)] (2009, 370+ citations), safe-level-directed minority over-sampling;

    • SVM-SMOTE [Code] (2009, 120+ citations), SMOTE based on an SVM

    • SMOTE-IPF (2015, 180+ citations), SMOTE with an Iterative-Partitioning noise Filter

  • Undersampling

    • RUS [Code], random under-sampling;
    • CNN [Code] (1968, 2100+ citations), Condensed Nearest Neighbor;
    • ENN [Code] (1972, 1500+ citations), Edited Nearest Neighbor;
    • TomekLink [Code] (1976, 870+ citations), Tomek’s modification of Condensed Nearest Neighbor;
    • NCR [Code] (2001, 500+ citations), Neighborhood Cleaning Rule;
    • NearMiss-1 & 2 & 3 [Code] (2003, 420+ citations), several kNN-based under-sampling methods for imbalanced data distributions;
    • CNN with TomekLink [Code (Java)] (2004, 2000+ citations), combining Condensed Nearest Neighbor and TomekLink;
    • OSS [Code] (2007, 2100+ citations): One-Sided Selection;
    • EUS (2009, 290+ citations): Evolutionary Under-Sampling;
    • IHT [Code] (2014, 130+ citations): Instance Hardness Threshold;
  • Hybrid sampling

    • SMOTE-Tomek & SMOTE-ENN (2004, 2000+ citations) [Code (SMOTE-Tomek)] [Code (SMOTE-ENN)], combining SMOTE minority over-sampling with Tomek-link removal or Edited Nearest Neighbor data cleaning;

      🉑 extensive experimental evaluation involving 10 different over-/under-sampling methods.

    • SMOTE-RSB (2012, 210+ citations), hybrid preprocessing using SMOTE and rough set theory;
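
To make the SMOTE idea referenced throughout this section concrete, here is a from-scratch sketch I added (illustrative only; in practice use imbalanced-learn's SMOTE, or SMOTETomek / SMOTEENN from imblearn.combine for the hybrid variants above):

```python
# From-scratch SMOTE sketch: each synthetic point interpolates between a
# random minority sample and one of its k nearest minority neighbors
# (Chawla et al., 2002). Illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # k+1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    seeds = rng.integers(0, len(X_min), size=n_new)       # random minority seeds
    picks = neighbor_idx[seeds, rng.integers(1, k + 1, size=n_new)]
    gaps = rng.random((n_new, 1))                         # interpolation factors in [0, 1)
    return X_min[seeds] + gaps * (X_min[picks] - X_min[seeds])

# Example: grow a 20-sample minority class by 80 synthetic points.
X_min = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote_sample(X_min, n_new=80)
print(X_new.shape)   # (80, 4)
```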

Cost-sensitive learning

  • CSC4.5 [Code (Java)] (2002, 420+ citations), an instance-weighting method to induce cost-sensitive trees;
  • CSSVM [Code (Java)] (2008, 710+ citations), cost-sensitive SVMs for highly imbalanced classification;
  • [CSNN](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2006-IEEE_TKDE-Zhou_Liu.pdf) [Code (Java)] (2005, 950+ citations), training cost-sensitive neural networks with methods that address the class imbalance problem.
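
A simple way to experiment with these ideas in practice (a hedged sketch I added, not the papers' exact algorithms) is the class_weight parameter that scikit-learn exposes for trees and SVMs:

```python
# Sketch: cost-sensitive learning via per-class misclassification costs in
# scikit-learn (class_weight); analogous in spirit to CSC4.5 / CSSVM above.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

costs = {0: 1, 1: 19}   # errors on the minority class (1) cost 19x more
cost_sensitive_tree = DecisionTreeClassifier(class_weight=costs).fit(X, y)
cost_sensitive_svm = SVC(class_weight=costs).fit(X, y)

# class_weight="balanced" sets costs inversely proportional to class frequency.
balanced_tree = DecisionTreeClassifier(class_weight="balanced").fit(X, y)
```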

Ensemble learning

  • Boosting-based

    • [AdaBoost](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/1997-JCSS-Schapire-A%20Decision-Theoretic%20Generalization%20of%20Online%20Learning%20%28AdaBoost%29.pdf) [Code] (1995, 18700+ citations)

    • DataBoost (2004, 570+ citations)

    • SMOTEBoost [Code] (2003, 1100+ citations)

      🉑 Classic work

    • [MSMOTEBoost](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2011-IEEE%20TSMC%20partC-%20GalarFdezBarrenecheaBustinceHerrera.pdf) (2011, 1300+ citations)

    • RAMOBoost [Code] (2010, 140+ citations)

    • [RUSBoost](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2010-IEEE%20TSMCpartA-RUSBoost%20A%20Hybrid%20Approach%20to%20Alleviating%20Class%20Imbalance.pdf) [Code] (2009, 850+ citations)

      🉑 Classic work

    • AdaBoostNC (2012, 350+ citations)

    • EUSBoost (2013, 210+ citations)

  • Bagging-based

    • [Bagging](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/1996-ML-Breiman-Bagging%20Predictors.pdf) [Code] (1996, 23100+ citations), Bagging predictors;
    • [OverBagging & UnderOverBagging & SMOTEBagging & MSMOTEBagging](https://sci2s.ugr.es/keel/pdf/algorithm/congreso/2009-IEEE%20CIDM-WangYao.pdf) [Code (SMOTEBagging)] (2009), Bagging-based variants using random over-sampling / random hybrid resampling / SMOTE / modified SMOTE;
    • [UnderBagging](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2003-PAA-%20New%20Applications%20of%20Ensembles%20of.pdf) [Code] (2003, 170+ citations), random under-sampling based on Bagging;
  • Other ensemble methods

    • [EasyEnsemble & BalanceCascade](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2009-IEEE%20TSMCpartB%20Exploratory.pdf) [Code (EasyEnsemble)] [Code (BalanceCascade)] (2008, 1300+ citations): parallel ensemble training with RUS (EasyEnsemble) / cascaded ensemble training with RUS while iteratively dropping well-classified examples (BalanceCascade); see the sketch at the end of this section

      🉑 simple but effective way

    • Self-paced Ensemble [Code] (ICDE 2020), an effective ensemble for imbalanced data, trained by self-paced harmonizing of the classification hardness;

      🉑 high performance and computational efficiency, widely applicable to different classifiers.
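
Several of the methods above have ready-made implementations in imbalanced-learn's ensemble module; a minimal sketch I added (assuming imbalanced-learn >= 0.4, where EasyEnsembleClassifier mirrors EasyEnsemble and RUSBoostClassifier mirrors RUSBoost):

```python
# Sketch: resampling-based ensembles in imbalanced-learn.
from imblearn.ensemble import EasyEnsembleClassifier, RUSBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

for clf in (EasyEnsembleClassifier(n_estimators=10, random_state=0),
            RUSBoostClassifier(random_state=0)):
    auc = cross_val_score(clf, X, y, scoring="roc_auc").mean()
    print(type(clf).__name__, round(auc, 3))
```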

Anomaly detection

  • Anomaly Detection Learning Resources: books, papers, videos, and toolkits for anomaly detection.
  • Review
    • Anomaly Detection: A Survey (2009, 7300+ citations)
    • A Survey of Network Anomaly Detection Techniques (2017, 210+ citations)
  • Classification-based
    • One-class SVMs for Document Classification (2001, 1300+ citations)
    • One-class Collaborative Filtering (2008, 830+ citations)
    • Isolation Forest (2008, 1000+ citations)
    • Anomaly Detection using One-Class Neural Networks (2018, 70+ citations)
    • Anomaly Detection with Robust Deep Autoencoders (KDD 2017, 170+ citations)
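
As a quick illustration of the classification-based detectors above, here is a minimal Isolation Forest sketch using scikit-learn (a toy example I added; treating the rare class as "anomalies" is one way to handle extreme imbalance):

```python
# Sketch: unsupervised anomaly detection with Isolation Forest (scikit-learn).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(950, 2))    # the bulk of the data
X_anomaly = rng.normal(5.0, 1.0, size=(50, 2))    # rare, far-away points
X = np.vstack([X_normal, X_anomaly])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
pred = iso.predict(X)                             # +1 = inlier, -1 = anomaly
print(int((pred == -1).sum()), "points flagged as anomalies")
```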

Other

Imbalanced datasets

ID  Name            Repository & Target              Ratio   #S      #F
1   ecoli           UCI, target: imU                 8.6:1   336     7
2   optical_digits  UCI, target: 8                   9.1:1   5620    64
3   satimage        UCI, target: 4                   9.3:1   6435    36
4   pen_digits      UCI, target: 5                   9.4:1   10992   16
5   abalone         UCI, target: 7                   9.7:1   4177    10
6   sick_euthyroid  UCI, target: sick euthyroid      9.8:1   3163    42
7   spectrometer    UCI, target: >=44                11:1    531     93
8   car_eval_34     UCI, target: good, v good        12:1    1728    21
9   isolet          UCI, target: A, B                12:1    7797    617
10  us_crime        UCI, target: >0.65               12:1    1994    100
11  yeast_ml8       LIBSVM, target: 8                13:1    2417    103
12  scene           LIBSVM, target: >one label       13:1    2407    294
13  libras_move     UCI, target: 1                   14:1    360     90
14  thyroid_sick    UCI, target: sick                15:1    3772    52
15  coil_2000       KDD, CoIL, target: minority      16:1    9822    85
16  arrhythmia      UCI, target: 06                  17:1    452     278
17  solar_flare_m0  UCI, target: M->0                19:1    1389    32
18  oil             UCI, target: minority            22:1    937     49
19  car_eval_4      UCI, target: vgood               26:1    1728    21
20  wine_quality    UCI, wine, target: <=4           26:1    4898    11
21  letter_img      UCI, target: Z                   26:1    20000   16
22  yeast_me2       UCI, target: ME2                 28:1    1484    8
23  webpage         LIBSVM, w7a, target: minority    33:1    34780   300
24  ozone_level     UCI, ozone, data                 34:1    2536    72
25  mammography     UCI, target: minority            42:1    11183   6
26  protein_homo    KDD CUP 2004, minority           111:1   145751  74
27  abalone_19      UCI, target: 19                  130:1   4177    10

The data sets above are collected from imblearn.datasets.fetch_datasets.
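
A minimal sketch I added of loading one of these benchmarks (fetch_datasets downloads the data on first use and returns scikit-learn-style Bunch objects):

```python
# Sketch: loading the benchmark datasets listed above with imbalanced-learn.
from collections import Counter
from imblearn.datasets import fetch_datasets

datasets = fetch_datasets()     # dict-like: name -> Bunch with .data / .target
ecoli = datasets["ecoli"]
print(ecoli.data.shape)         # (336, 7)
print(Counter(ecoli.target))    # roughly an 8.6:1 majority:minority ratio
```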

Other resources

  • Paper-list-on-Imbalanced-Time-series-Classification-with-Deep-Learning
  • acm_imbalanced_learning, slides and code from the ACM imbalanced learning talk given on April 27, 2016 in Austin, TX;
  • imbalanced-algorithms, Python implementations of algorithms for learning from imbalanced data;
  • imbalanced-dataset-sampler, a (PyTorch) imbalanced dataset sampler for over-sampling low-frequency classes and under-sampling high-frequency classes;
  • class_imbalance, Jupyter notebooks illustrating class imbalance in binary classification;

Finally, the GitHub address is:

https://github.com/ZhiningLiu1998/awesome-imbalanced-learning

In addition, my knowledge is limited, so the translation of some technical terms may not be correct; it is far from perfect, so please bear with me. Thank you!


Welcome to follow my WeChat official account, "Growth of an Algorithm Ape", or scan the QR code below, so we can communicate, learn, and make progress together!