This GitHub project collects papers, code, frameworks, libraries, and other resources for imbalanced learning:

https://github.com/ZhiningLiu1998/awesome-imbalanced-learning

Preface

Class imbalance, also known as the long-tail problem, refers to classification problems in which the classes are not equally represented in the data set: some classes have a very large number of samples while others have very few. This is a very common problem in practical applications, for example fraud detection, prediction of rare adverse drug reactions, and gene family prediction. Class imbalance biases the classifier toward the majority class and degrades the performance of the classification model. The goal of imbalanced learning is therefore to overcome the class imbalance and learn an unbiased model from imbalanced data.

The directory is as follows:

  • Code base/framework
    • Python
    • R
    • Java
    • Scala
    • Julia
  • Papers
    • Review
    • Deep learning
    • Data resampling
    • Cost-sensitive learning
    • Ensemble learning
    • Anomaly detection
  • Other
    • Imbalanced datasets
    • Other resources

In addition, papers and frameworks marked with 🉑 are those the author specifically recommends as important or of high quality.


Code base/framework

Python

imbalanced-learn

Website: https://imbalanced-learn.org/stable/

Github:https://github.com/scikit-learn-contrib/imbalanced-learn

The official document: https://imbalanced-learn.readthedocs.io/en/stable/

Paper: http://www.jmlr.org/papers/volume18/16-365/16-365.pdf

imbalanced-learn is a Python library that provides resampling techniques commonly used on imbalanced data sets. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

🉑 written in Python, easy to use
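
As a quick illustration, here is a minimal sketch I added (not from the original post), assuming a recent imbalanced-learn version with the fit_resample API; resamplers follow the scikit-learn estimator style:

```python
# Minimal sketch: resampling with imbalanced-learn (assumes imbalanced-learn
# >= 0.4, where resamplers expose fit_resample).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# A toy binary problem with roughly a 9:1 class ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))                       # roughly {0: 900, 1: 100}

# Over-sample the minority class with SMOTE...
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_over))                  # classes are now balanced

# ...or randomly under-sample the majority class.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))                 # classes are now balanced
```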

R

  • smote_variants

Website: https://smote-variants.readthedocs.io/en/latest/

Documents: https://smote-variants.readthedocs.io/en/latest/

Github: https://github.com/analyticalmindsltd/smote_variants

A collection of 85 over-sampling techniques for imbalanced learning, including multi-class over-sampling and model-selection features (with R and Julia support)

  • caret

Website: https://cran.r-project.org/web/packages/caret/index.html

Documents: http://topepo.github.io/caret/index.html

Github:https://github.com/topepo/caret

Implements random under-sampling and over-sampling methods

  • ROSE

Website: https://cran.r-project.org/web/packages/ROSE/index.html

Documents: https://www.rdocumentation.org/packages/ROSE/versions/0.0-3

Implements ROSE (Random Over-Sampling Examples)

  • DMwR

Website: https://cran.r-project.org/web/packages/DMwR/index.html

Documents: https://www.rdocumentation.org/packages/DMwR/versions/0.4.1

Implements SMOTE (Synthetic Minority Over-sampling TEchnique)

Java

KEEL

Website: https://sci2s.ugr.es/keel/description.php

Github:https://github.com/SCI2SUGR/KEEL

Paper: https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/0758_Alcalaetal-SoftComputing-Keel1.0.pdf

KEEL provides a simple graphical interface to design experiments with different data sets based on data flows, using algorithms from different intelligent-computation paradigms (with a special focus on evolutionary algorithms), in order to assess the behavior of the algorithms. The tool includes many widely used imbalanced learning methods, such as over-sampling and under-sampling, cost-sensitive learning, algorithm modification, and ensemble learning methods.

🉑 contains a variety of algorithms, including classical classification, regression, and preprocessing algorithms

Scala

undersampling

Website: https://github.com/NestorRV/undersampling

Documents: https://nestorrv.github.io/

Github:https://github.com/NestorRV/undersampling

Implements under-sampling methods and ensemble variants of them.

Julia

smote_variants

Website: https://smote-variants.readthedocs.io/en/latest/

Documents: https://smote-variants.readthedocs.io/en/latest/

Github: https://github.com/analyticalmindsltd/smote_variants

A collection of 85 over-sampling techniques for imbalanced learning, including multi-class over-sampling and model-selection features (with R and Julia support)
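
For reference, a minimal sketch of the smote_variants Python API (the package itself is written in Python with R and Julia interfaces; the class name distance_SMOTE follows the project README and may vary across versions):

```python
# Minimal sketch of the smote_variants API; distance_SMOTE is one of the
# ~85 implemented oversamplers (name taken from the project README).
import smote_variants as sv
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

oversampler = sv.distance_SMOTE()
X_samp, y_samp = oversampler.sample(X, y)   # over-sampled features and labels
```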


Papers

  • Learning from Imbalanced Data (2009, 4700+ citations), a classic paper. It systematically reviews the popular solutions, evaluation criteria, and the challenges and open problems for future research (as of 2009).

🉑 Classic work

  • Learning from Imbalanced Data: Open Challenges and Future Directions (2016, 400+ citations), a paper focusing on the open problems and challenges of imbalanced learning, such as extreme class imbalance, imbalanced online/streaming learning, multi-class imbalanced learning, and semi-supervised or unsupervised imbalanced learning.
  • Learning from Class-Imbalanced Data: Review of Methods and Applications (2017, 400+ citations), a very detailed review of imbalanced learning methods and applications covering a total of 527 related papers. It provides a detailed taxonomy of existing methods and also surveys recent research trends in the field.

🉑 a systematic and detailed review, with a taxonomy of existing methods

Deep learning

  • Review

    • A systematic study of the class imbalance problem in convolutional neural networks (2018, 330+ citations)
    • A Survey on Deep Learning with Class Imbalance (2019, 50+ citations)

    🉑 recent comprehensive paper on class imbalance in deep learning

  • Hard example mining

    • Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016, 840+ citations): in the late stage of training, back-propagate gradients only for the “hard” examples (e.g., examples with large loss values)
  • Loss function engineering

    • Training deep neural networks on imbalanced data sets (IJCNN 2016, 110+ citations), a mean (squared) false error loss that captures classification errors from the majority class and the minority class equally
    • Focal Loss for Dense Object Detection [Code (Unofficial)] (ICCV 2017, 2600+ citations), a uniform loss function that focuses training on a sparse set of hard examples and prevents the large number of easy negatives from overwhelming the detector during training (see the sketch after this list)

    🉑 Elegant solution, high impact

    • Deep imbalanced attribute classification using visual attention aggregation [Code] (ECCV 2018, 30+ citations)
    • Imbalanced Deep Learning by Minority Class Incremental Rectification (TPAMI 2018, 60+ citations): in iterative batch-wise learning, minimizes the dominant effect of the majority classes by discovering the sparsely sampled boundaries of the minority classes
    • Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss [Code] (NIPS 2019, 10+ citations), a theoretically principled label-distribution-aware margin loss (LDAM), motivated by minimizing a margin-based generalization bound
    • Gradient Harmonized Single-stage Detector [Code] (AAAI 2019, 40+ citations): compared with Focal Loss, which only down-weights “easy” negative samples, GHM also down-weights “hard” samples that are likely to be outliers

    🉑 Interesting idea: harmonizing sample contributions according to gradient distribution

    • Class-Balanced Loss Based on Effective Number of Samples (CVPR 2019, 70+ citations), a simple and general mechanism for class re-weighting based on the effective number of samples (see the sketch after this list).
  • Meta-learning

    • Learning to Model the Tail (NIPS 2017, 70+ citations): transfers meta-knowledge from the data-rich classes at the head of the distribution to the data-poor classes at the tail;

    • Learning to Reweight Examples for Robust Deep Learning [Code] (ICML 2018, 150+ citations): implicitly learns a weighting function to re-weight samples during the gradient updates of a deep neural network.

      🉑 representative work on solving the problem of class imbalance through meta-learning.

    • Meta-Weight-Net: Learning an Explicit Mapping for Sample Weighting [Code] (NIPS 2019): explicitly learns a weighting function (with a multi-layer perceptron as the function approximator) to re-weight samples;

    • Learning Data Manipulation for Augmentation and Weighting [Code] (NIPS 2019)

    • Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks [Code] (ICLR 2020)

  • Representation learning

    • Learning Deep Representation for Imbalanced Classification (CVPR 2016, 220+ citations)
    • Supervised Class Distribution Learning for GANs-Based Imbalanced Classification (ICDM 2019)
    • Decoupling Representation and Classifier for Long-tailed Recognition (ICLR 2020)
  • Curriculum learning

    • Dynamic Curriculum Learning for Imbalanced Data Classification (ICCV 2019)
  • Two-stage learning

    • Brain tumor segmentation with Deep Neural Networks (2017, 1200+ citations): first pre-train on a class-balanced data set, then fine-tune the final output layer (before the softmax) on the original, class-imbalanced data set;
  • Network architecture

    • BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition (CVPR 2020)
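
To make the two 🉑 loss-engineering ideas above concrete, here is a short hedged sketch I added (my own PyTorch illustration, not the authors' reference code) of the binary focal loss and of class-balanced re-weighting by the effective number of samples:

```python
# Illustrative PyTorch sketch (not the papers' official code):
# focal loss (Lin et al., ICCV 2017) and class-balanced weights based on
# the effective number of samples (Cui et al., CVPR 2019).
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); targets are 0/1 floats."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Weights proportional to (1 - beta) / (1 - beta^n_c), normalized to sum to C."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float)
    effective_num = 1.0 - torch.pow(torch.tensor(beta), n)
    weights = (1.0 - beta) / effective_num
    return weights / weights.sum() * len(samples_per_class)

# Usage: weight the standard cross-entropy by the effective number of samples.
logits = torch.randn(8, 3)                 # 8 samples, 3 classes
labels = torch.randint(0, 3, (8,))
w = class_balanced_weights([900, 90, 10])  # class counts of an imbalanced set
loss = F.cross_entropy(logits, labels, weight=w)
```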

Data resampling

  • Oversampling

    • ROS [Code], random over-sampling

    • SMOTE [Code] (2002, 9800+ citations), Synthetic Minority Over-sampling TEchnique (a from-scratch sketch follows at the end of this section).

      🉑 Classic work

    • Borderline-SMOTE [Code] (2005, 1400+ citations), over-sampling of borderline minority examples with SMOTE;

    • ADASYN [Code] (2008, 1100+ citations), ADAptive SYNthetic Sampling;

    • SPIDER [Code (Java)] (2008, 150+ citations), selective preprocessing of imbalanced data;

    • Safe-Level-SMOTE [Code (Java)] (2009, 370+ citations), safe-level-directed minority over-sampling;

    • SVM-SMOTE [Code] (2009, 120+ citations), SMOTE based on an SVM

    • SMOTE-IPF (2015, 180+ citations), SMOTE with an Iterative-Partitioning noise Filter

  • Undersampling

    • RUS [Code], random under-sampling;
    • CNN [Code] (1968, 2100+ citations), Condensed Nearest Neighbor;
    • ENN [Code] (1972, 1500+ citations), Edited Nearest Neighbor;
    • TomekLink [Code] (1976, 870+ citations), Tomek’s modification of Condensed Nearest Neighbor;
    • NCR [Code] (2001, 500+ citations), Neighborhood Cleaning Rule;
    • NearMiss-1 & 2 & 3 [Code] (2003, 420+ citations), several kNN-based under-sampling methods for imbalanced data distributions;
    • CNN with TomekLink [Code (Java)] (2004, 2000+ citations), combining Condensed Nearest Neighbor and TomekLink;
    • OSS [Code] (2007, 2100+ citations): One-Sided Selection;
    • EUS (2009, 290+ citations): Evolutionary Under-Sampling;
    • IHT [Code] (2014, 130+ citations): Instance Hardness Threshold;
  • Hybrid sampling

    • SMOTE-Tomek & SMOTE-ENN (2004, 2000+ citations) [Code (SMOTE-Tomek)] [Code (SMOTE-ENN)], combining SMOTE minority over-sampling with Tomek-link removal or Edited Nearest Neighbor data cleaning;

      🉑 extensive experimental evaluation involving 10 different over-/under-sampling methods.

    • SMOTE-RSB (2012, 210+ citations), hybrid preprocessing using SMOTE and rough set theory;
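
To make the SMOTE idea referenced throughout this section concrete, here is a from-scratch sketch I added (illustrative only; in practice use imbalanced-learn's SMOTE, or SMOTETomek / SMOTEENN from imblearn.combine for the hybrid variants above):

```python
# From-scratch SMOTE sketch: each synthetic point interpolates between a
# random minority sample and one of its k nearest minority neighbors
# (Chawla et al., 2002). Illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # k+1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    seeds = rng.integers(0, len(X_min), size=n_new)       # random minority seeds
    picks = neighbor_idx[seeds, rng.integers(1, k + 1, size=n_new)]
    gaps = rng.random((n_new, 1))                         # interpolation factors in [0, 1)
    return X_min[seeds] + gaps * (X_min[picks] - X_min[seeds])

# Example: grow a 20-sample minority class by 80 synthetic points.
X_min = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote_sample(X_min, n_new=80)
print(X_new.shape)   # (80, 4)
```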

Cost-sensitive learning

  • CSC4.5 [Code (Java)] (2002, 420+ citations), an instance-weighting method to induce cost-sensitive trees;
  • CSSVM [Code (Java)] (2008, 710+ citations), cost-sensitive SVMs for highly imbalanced classification;
  • [CSNN](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2006-IEEE_TKDE-Zhou_Liu.pdf) [Code (Java)] (2005, 950+ citations), training cost-sensitive neural networks with methods that address the class imbalance problem.
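
A simple way to experiment with these ideas in practice (a hedged sketch I added, not the papers' exact algorithms) is the class_weight parameter that scikit-learn exposes for trees and SVMs:

```python
# Sketch: cost-sensitive learning via per-class misclassification costs in
# scikit-learn (class_weight); analogous in spirit to CSC4.5 / CSSVM above.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

costs = {0: 1, 1: 19}   # errors on the minority class (1) cost 19x more
cost_sensitive_tree = DecisionTreeClassifier(class_weight=costs).fit(X, y)
cost_sensitive_svm = SVC(class_weight=costs).fit(X, y)

# class_weight="balanced" sets costs inversely proportional to class frequency.
balanced_tree = DecisionTreeClassifier(class_weight="balanced").fit(X, y)
```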

Ensemble learning

  • Boosting-based

    • [AdaBoost](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/1997-JCSS-Schapire-A%20Decision-Theoretic%20Generalization%20of%20Online%20Learning%20%28AdaBoost%29.pdf) [Code] (1995, 18700+ citations)

    • DataBoost (2004, 570+ citations)

    • SMOTEBoost [Code] (2003, 1100+ citations)

      🉑 Classic work

    • [MSMOTEBoost](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2011-IEEE%20TSMC%20partC-%20GalarFdezBarrenecheaBustinceHerrera.pdf) (2011, 1300+ citations)

    • RAMOBoost [Code] (2010, 140+ citations)

    • [RUSBoost](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2010-IEEE%20TSMCpartA-RUSBoost%20A%20Hybrid%20Approach%20to%20Alleviating%20Class%20Imbalance.pdf) [Code] (2009, 850+ citations)

      🉑 Classic work

    • AdaBoostNC (2012, 350+ citations)

    • EUSBoost (2013, 210+ citations)

  • Bagging-based

    • [Bagging](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/1996-ML-Breiman-Bagging%20Predictors.pdf) [Code] (1996, 23100+ citations), Bagging predictors;
    • [OverBagging & UnderOverBagging & SMOTEBagging & MSMOTEBagging](https://sci2s.ugr.es/keel/pdf/algorithm/congreso/2009-IEEE%20CIDM-WangYao.pdf) [Code (SMOTEBagging)] (2009), Bagging-based variants using random over-sampling / random hybrid resampling / SMOTE / modified SMOTE;
    • [UnderBagging](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2003-PAA-%20New%20Applications%20of%20Ensembles%20of.pdf) [Code] (2003, 170+ citations), random under-sampling based on Bagging;
  • Other ensemble methods

    • [EasyEnsemble & BalanceCascade](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/2009-IEEE%20TSMCpartB%20Exploratory.pdf) [Code (EasyEnsemble)] [Code (BalanceCascade)] (2008, 1300+ citations): parallel ensemble training with RUS (EasyEnsemble) / cascaded ensemble training with RUS while iteratively dropping well-classified examples (BalanceCascade); see the sketch at the end of this section

      🉑 simple but effective way

    • Self-paced Ensemble [Code] (ICDE 2020), an effective ensemble for imbalanced data, trained by self-paced harmonizing of the classification hardness;

      🉑 high performance and computational efficiency, widely applicable to different classifiers.
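
Several of the methods above have ready-made implementations in imbalanced-learn's ensemble module; a minimal sketch I added (assuming imbalanced-learn >= 0.4, where EasyEnsembleClassifier mirrors EasyEnsemble and RUSBoostClassifier mirrors RUSBoost):

```python
# Sketch: resampling-based ensembles in imbalanced-learn.
from imblearn.ensemble import EasyEnsembleClassifier, RUSBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

for clf in (EasyEnsembleClassifier(n_estimators=10, random_state=0),
            RUSBoostClassifier(random_state=0)):
    auc = cross_val_score(clf, X, y, scoring="roc_auc").mean()
    print(type(clf).__name__, round(auc, 3))
```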

Anomaly detection

  • Anomaly Detection Learning Resources: books, papers, videos, and toolkits for anomaly detection.
  • Review
    • Anomaly Detection: A Survey (2009, 7300+ citations)
    • A Survey of Network Anomaly Detection Techniques (2017, 210+ citations)
  • Classification-based
    • One-class SVMs for Document Classification (2001, 1300+ citations)
    • One-class Collaborative Filtering (2008, 830+ citations)
    • Isolation Forest (2008, 1000+ citations)
    • Anomaly Detection using One-Class Neural Networks (2018, 70+ citations)
    • Anomaly Detection with Robust Deep Autoencoders (KDD 2017, 170+ citations)
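
As a quick illustration of the classification-based detectors above, here is a minimal Isolation Forest sketch using scikit-learn (a toy example I added; treating the rare class as "anomalies" is one way to handle extreme imbalance):

```python
# Sketch: unsupervised anomaly detection with Isolation Forest (scikit-learn).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(950, 2))    # the bulk of the data
X_anomaly = rng.normal(5.0, 1.0, size=(50, 2))    # rare, far-away points
X = np.vstack([X_normal, X_anomaly])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
pred = iso.predict(X)                             # +1 = inlier, -1 = anomaly
print(int((pred == -1).sum()), "points flagged as anomalies")
```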

Other

Imbalanced datasets

ID  Name            Repository & Target              Ratio   #S      #F
1   ecoli           UCI, target: imU                 8.6:1   336     7
2   optical_digits  UCI, target: 8                   9.1:1   5620    64
3   satimage        UCI, target: 4                   9.3:1   6435    36
4   pen_digits      UCI, target: 5                   9.4:1   10992   16
5   abalone         UCI, target: 7                   9.7:1   4177    10
6   sick_euthyroid  UCI, target: sick euthyroid      9.8:1   3163    42
7   spectrometer    UCI, target: >=44                11:1    531     93
8   car_eval_34     UCI, target: good, v good        12:1    1728    21
9   isolet          UCI, target: A, B                12:1    7797    617
10  us_crime        UCI, target: >0.65               12:1    1994    100
11  yeast_ml8       LIBSVM, target: 8                13:1    2417    103
12  scene           LIBSVM, target: >one label       13:1    2407    294
13  libras_move     UCI, target: 1                   14:1    360     90
14  thyroid_sick    UCI, target: sick                15:1    3772    52
15  coil_2000       KDD, CoIL, target: minority      16:1    9822    85
16  arrhythmia      UCI, target: 06                  17:1    452     278
17  solar_flare_m0  UCI, target: M->0                19:1    1389    32
18  oil             UCI, target: minority            22:1    937     49
19  car_eval_4      UCI, target: vgood               26:1    1728    21
20  wine_quality    UCI, wine, target: <=4           26:1    4898    11
21  letter_img      UCI, target: Z                   26:1    20000   16
22  yeast_me2       UCI, target: ME2                 28:1    1484    8
23  webpage         LIBSVM, w7a, target: minority    33:1    34780   300
24  ozone_level     UCI, ozone, data                 34:1    2536    72
25  mammography     UCI, target: minority            42:1    11183   6
26  protein_homo    KDD CUP 2004, minority           111:1   145751  74
27  abalone_19      UCI, target: 19                  130:1   4177    10

The data sets above are collected from imblearn.datasets.fetch_datasets.
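
A minimal sketch I added of loading one of these benchmarks (fetch_datasets downloads the data on first use and returns scikit-learn-style Bunch objects):

```python
# Sketch: loading the benchmark datasets listed above with imbalanced-learn.
from collections import Counter
from imblearn.datasets import fetch_datasets

datasets = fetch_datasets()     # dict-like: name -> Bunch with .data / .target
ecoli = datasets["ecoli"]
print(ecoli.data.shape)         # (336, 7)
print(Counter(ecoli.target))    # roughly an 8.6:1 majority:minority ratio
```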

Other resources

  • Paper-list-on-Imbalanced-Time-series-Classification-with-Deep-Learning
  • acm_imbalanced_learning, slides and code from the ACM imbalanced learning talk given on April 27, 2016 in Austin, TX;
  • imbalanced-algorithms, Python implementations of algorithms for learning from imbalanced data;
  • imbalanced-dataset-sampler, a (PyTorch) imbalanced dataset sampler for over-sampling low-frequency classes and under-sampling high-frequency classes;
  • class_imbalance, Jupyter notebooks illustrating class imbalance in binary classification;

Finally, the GitHub address is:

https://github.com/ZhiningLiu1998/awesome-imbalanced-learning

In addition, my knowledge is limited, so the translation of some technical terms may not be correct; it is far from perfect, so please bear with me. Thank you!


Welcome to follow my WeChat official account, "Growth of an Algorithm Ape", or scan the QR code below, so we can communicate, learn, and make progress together!