Abstract: To improve model performance, this article introduces state-of-the-art noisy-label learning methods for optimizing neural networks in imperfect, real-world scenarios.

Learning from Noisy Labels with Deep Neural Networks

Introduction:

The success of neural networks rests on large amounts of clean data and deep network models. In real-world scenarios, however, neither the data nor the model is ideal: at the data level, labels may be incorrect, for example a dog labeled as a wolf; in addition, production scenarios emphasize latency, so the network often cannot be particularly deep. We set out to iterate effective training methods for neural networks under data and model imperfections, using noisy-label learning techniques to handle noisy data during training; these techniques have already been deployed in our team's production scenarios. Overall model robustness is improved by optimizing several modules, including the loss function, network architecture, model regularization, loss adjustment, sample selection, and label correction, and the approaches are not limited to fully supervised learning: semi-supervised and self-supervised methods are used as well.

Framework:

【Robust Loss Function】

The core idea is this: when the data is mostly clean, the standard cross-entropy (CE) loss can tolerate the small number of mislabeled samples and still yield a robust model. When the noise level is high, however, CE is biased toward the noisy data, so the loss function needs to be modified so that every sample carries roughly equal weight during training. A natural choice is GCE Loss, which uses a hyperparameter to interpolate between CE Loss and MAE Loss; a sketch of such a loss is given after the references below.

  • A. Ghosh, H. Kumar, and P. Sastry, “Robust Loss Functions under Label Noise for Deep Neural Networks,” in Proc. AAAI, 2017

  • Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels, NeurIPS 2018
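
As a rough illustration, here is a minimal PyTorch sketch of a GCE-style loss; the class name and the default q = 0.7 are illustrative choices for this article, not details taken from the papers above.

```python
import torch
import torch.nn.functional as F

class GCELoss(torch.nn.Module):
    """Generalized Cross Entropy: L_q = (1 - p_y^q) / q.
    q -> 0 recovers CE, q = 1 gives MAE."""
    def __init__(self, q: float = 0.7):
        super().__init__()
        self.q = q

    def forward(self, logits, targets):
        # probability the model assigns to the labeled class
        probs = F.softmax(logits, dim=1)
        p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp(min=1e-7)
        loss = (1.0 - p_y.pow(self.q)) / self.q
        return loss.mean()
```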

Another line of work borrows the idea behind KL divergence. The authors argue that in the usual cross entropy H(q, p), taking q as the true data distribution and p as the model's prediction is fine on relatively clean data, but when the noise level is high, q may no longer represent the true distribution, and p may in fact be closer to it. They therefore propose the Symmetric Cross Entropy loss, which adds a reverse cross-entropy term to the standard CE; a sketch follows the reference below.

  • Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric Cross Entropy for Robust Learning with Noisy Labels,” in Proc. ICCV, 2019, pp. 322-330
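
A minimal sketch of what a symmetric cross-entropy loss can look like; the hyperparameters alpha and beta and the clamping constants are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0):
    """SCE = alpha * CE(q, p) + beta * RCE(p, q): the reverse term swaps the
    roles of the label distribution q and the prediction p."""
    ce = F.cross_entropy(logits, targets)
    pred = F.softmax(logits, dim=1).clamp(min=1e-7, max=1.0)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    # log(0) on the off-label entries is replaced by a finite value via clamping
    rce = (-pred * torch.log(one_hot.clamp(min=1e-4))).sum(dim=1).mean()
    return alpha * ce + beta * rce
```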

【 Robust Architecture 】

In this part, model robustness is gradually improved by adopting a clever network structure and letting the models themselves select a batch of relatively clean data during training. The Co-teaching framework comes first: two models select data for each other and feed it to the peer network to compute the loss. The data handed to the peer consists of the lowest-loss samples in each mini-batch, and at the end of every epoch the data is shuffled so that no sample is permanently forgotten. A sketch of one training step is given after the references below.

  • How does Disagreement Help Generalization against Label Corruption? ICML 2019

  • Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels, NeurIPS 2018
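
A minimal sketch of the sample exchange in one co-teaching training step; the function signature and the fixed remember_rate are assumptions for illustration (in practice the rate is usually scheduled).

```python
import torch
import torch.nn.functional as F

def co_teaching_step(model_a, model_b, opt_a, opt_b, x, y, remember_rate=0.8):
    """Each network picks the lowest-loss samples in the mini-batch and hands
    them to its peer for the parameter update."""
    with torch.no_grad():
        loss_a = F.cross_entropy(model_a(x), y, reduction="none")
        loss_b = F.cross_entropy(model_b(x), y, reduction="none")

    k = int(remember_rate * len(y))
    idx_a = torch.argsort(loss_a)[:k]   # samples model A considers clean
    idx_b = torch.argsort(loss_b)[:k]   # samples model B considers clean

    # update A on B's selection, and B on A's selection
    opt_a.zero_grad()
    F.cross_entropy(model_a(x[idx_b]), y[idx_b]).backward()
    opt_a.step()

    opt_b.zero_grad()
    F.cross_entropy(model_b(x[idx_a]), y[idx_a]).backward()
    opt_b.step()
```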

Another idea is to score clean and noisy samples with an attention mechanism, which the paper calls Attention Feature Mixup. The final loss has two parts: one is the cross-entropy loss computed between each image and the label of its class; the other is the loss computed from the new sample x' obtained by feature mixup and its mixed label y'. A sketch of the mixup part is given below.
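
A minimal sketch of the mixup half of such a loss, assuming the attention scores over a group of same-class images are computed elsewhere; the function signature and shapes are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def attention_mixup_loss(features, labels, attn_scores, classifier, num_classes):
    """Mix the features of one group weighted by their attention scores, then
    compute a soft cross-entropy against the correspondingly mixed label y'."""
    w = torch.softmax(attn_scores, dim=0)                              # attention weights
    mixed_feat = (w.unsqueeze(1) * features).sum(dim=0, keepdim=True)  # x'
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_label = (w.unsqueeze(1) * one_hot).sum(dim=0, keepdim=True)  # y'
    log_probs = F.log_softmax(classifier(mixed_feat), dim=1)
    return -(mixed_label * log_probs).sum(dim=1).mean()
```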

【 Robust Regularization 】

This part adds regularization tricks to keep the model from over-fitting the noisy data. Commonly used methods include label smoothing, L1/L2 regularization, MixUp, and so on; a sketch combining two of them is given below.
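
A minimal sketch that combines two of these, MixUp and label smoothing; the smoothing factor and the Beta parameter are illustrative (the label_smoothing argument of F.cross_entropy requires PyTorch 1.10+).

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def mixup_label_smoothing_loss(model, x, y, smoothing=0.1, alpha=0.2):
    """MixUp two random samples, then apply a label-smoothed cross entropy."""
    lam = Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]

    logits = model(x_mix)
    loss_a = F.cross_entropy(logits, y, label_smoothing=smoothing)
    loss_b = F.cross_entropy(logits, y[perm], label_smoothing=smoothing)
    return lam * loss_a + (1 - lam) * loss_b
```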

【 Loss Adjustment 】

This part also consists of training tricks and is closely tied to the loss-function improvements above, so the individual tricks are not described in detail here.

【 Sample Selection 】

This module is mainly about how to select better data. One method, the Area Under the Margin (AUM) metric, is the approach behind our winning entry in last year's CVPR WebVision 2020 competition (a top-level competition in image recognition and the successor to ImageNet). It screens the data while the model is training. The specific idea: for each image in a mini-batch, compute the margin between the logit of the assigned label and the largest logit among the other classes; averaging this value over several epochs gives the AUM of each image. The experiments show that for relatively clean data the margin is comparatively large, while for mislabeled data it is small or even negative, so the clean and noisy data within a class can be separated by this value. The paper also points out at the end that a threshold at the 99th percentile works best for separating clean data from noisy data. A sketch of the margin bookkeeping is given after the reference below.

  • Pleiss, Geoff, et al. “Identifying Mislabeled Data Using the Area Under the Margin Ranking.” NeurIPS 2020.
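
A minimal sketch of how the margin statistic can be accumulated during training; the class name and bookkeeping are assumptions, not the authors' implementation.

```python
import torch

class AUMTracker:
    """Accumulates, per sample, the margin between the assigned-label logit
    and the largest other logit; the running mean is the AUM value."""
    def __init__(self, num_samples):
        self.sum = torch.zeros(num_samples)
        self.count = torch.zeros(num_samples)

    def update(self, sample_ids, logits, targets):
        assigned = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
        masked = logits.clone()
        masked.scatter_(1, targets.unsqueeze(1), float("-inf"))
        largest_other = masked.max(dim=1).values
        margin = (assigned - largest_other).detach().cpu()
        self.sum[sample_ids] += margin
        self.count[sample_ids] += 1

    def aum(self):
        # large values suggest clean samples; small or negative values suggest noise
        return self.sum / self.count.clamp(min=1)
```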

Another paper partitions the data of each class into an easy subset, a semi-hard subset, and a hard subset by density clustering. Noisy data is generally the data that is hard to train on, and each image is assigned a weight according to its subset, with 1.0, 0.5, and 0.5 recommended; the model is then trained following the idea of curriculum learning. A sketch of the weighted loss is given after the reference below.

  • Guo, Sheng, et al. “CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images.” Proceedings of the European Conference on Computer Vision (ECCV), 2018.
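
A minimal sketch of the per-subset weighting; the subset assignment from density clustering is assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

# recommended weights: easy 1.0, semi-hard 0.5, hard 0.5
SUBSET_WEIGHTS = {"easy": 1.0, "semi_hard": 0.5, "hard": 0.5}

def curriculum_weighted_loss(logits, targets, subsets):
    """subsets: one of 'easy' / 'semi_hard' / 'hard' per sample, produced by
    density clustering of the image features."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.tensor([SUBSET_WEIGHTS[s] for s in subsets],
                           device=per_sample.device)
    return (weights * per_sample).mean()
```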

【 Semi-supervised Learning 】

First, the DivideMix method is introduced, which is essentially co-teaching; the difference is that after clean and noisy samples are separated, the noisy samples are treated as unlabeled and trained with a semi-supervised method. At present the SOTA for semi-supervised image classification should still be FixMatch, which gets close to fully supervised results even with 90% of the samples unlabeled. So the current path to high accuracy basically goes through semi-supervision combined with separating out the noise as cleanly as possible.

The whole pipeline is divided into two parts: co-divide and semi-supervised learning.

In the co-divide step, the pre-trained model computes the loss for each of the N samples. The assumption is that these N loss values are generated by a mixture of two Gaussian distributions: the component with the larger mean corresponds to noisy samples and the one with the smaller mean to clean samples. From the loss of each sample we can compute the probability w_i that it belongs to the clean component; with a set threshold on w_i, the training data is divided into a labeled set and an unlabeled set, which are then trained with an SSL method. A sketch of this GMM split is given below.
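
A minimal sketch of the GMM split using scikit-learn; the loss normalization and the 0.5 threshold are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def co_divide(losses, threshold=0.5):
    """Fit a 2-component GMM to per-sample losses; the component with the
    smaller mean is treated as clean. Returns the clean probability w_i and
    a boolean mask for the labeled (clean) split."""
    losses = np.asarray(losses, dtype=np.float64).reshape(-1, 1)
    # rescale losses to [0, 1] so the fit is insensitive to their scale
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
    gmm = GaussianMixture(n_components=2, max_iter=100)
    gmm.fit(losses)
    clean_component = gmm.means_.argmin()
    w = gmm.predict_proba(losses)[:, clean_component]   # prob. of being clean
    return w, w > threshold
```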

Note that, to let the model converge, we first train for several epochs on all the data before dividing it, as a kind of "warm-up". However, warm-up causes the model to over-fit asymmetric noise, so the noisy samples also end up with small losses, which makes them hard for the GMM to separate and hurts the subsequent training. To address this, an extra regularization term -H can be added to the original cross-entropy loss during warm-up. This term is a negative entropy: it penalizes samples whose predicted probability distribution is too sharp and flattens the distribution, preventing the model from becoming over-"confident". A sketch of this warm-up loss is given below.
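
A minimal sketch of the warm-up loss with the negative-entropy penalty described above; any weighting coefficient on the penalty is omitted here.

```python
import torch
import torch.nn.functional as F

def warmup_loss(logits, targets):
    """CE plus the confidence penalty -H: minimizing CE - H pushes the
    predicted distribution to stay flat, discouraging overly sharp predictions."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp(min=1e-7))).sum(dim=1).mean()  # H
    return ce - entropy   # i.e. CE + (-H)
```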

After partitioning the training data, existing semi-supervised learning methods can be used to train the model. The paper uses MixMatch, a commonly used method, but applies co-refinement and co-guessing to the labels before MixMatch is run; a sketch of the refinement step is given after the reference below.

  • DivideMix: Learning with Noisy Labels as Semi-supervised Learning. ICLR 2020
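
A minimal sketch of what the co-refinement step can look like; the blending formula and temperature sharpening follow common DivideMix implementations and are not spelled out in this article.

```python
import torch

def co_refine(w, y_onehot, probs, T=0.5):
    """For labeled (clean) samples: blend the ground-truth one-hot label with
    the model's prediction using the clean probability w, then sharpen."""
    refined = w.unsqueeze(1) * y_onehot + (1 - w).unsqueeze(1) * probs
    sharpened = refined ** (1.0 / T)                      # temperature sharpening
    return sharpened / sharpened.sum(dim=1, keepdim=True) # renormalize to a distribution
```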

【 Label correction 】

The idea behind label correction is very simple: it amounts to producing a new pseudo label, but completely discarding the original label would be too drastic. In this ICCV 2019 paper, during the "label correction phase" several images are randomly sampled from each class and passed through a pre-trained model, and clustering is used to obtain the prototype centers of each class. The pseudo label of an image is then obtained from the distances between its feature vector and the class prototypes. The final loss is a weighted sum of the cross-entropy loss computed with the original label and the one computed with the pseudo label. A sketch is given after the reference below.

  • Han, Jiangfan, Ping Luo, and Xiaogang Wang. “Deep Self-Learning from Noisy Labels.” ICCV 2019
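
A minimal sketch of prototype-based pseudo-labeling and the combined loss; the use of cosine similarity and the mixing weight alpha are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(features, prototypes):
    """Assign each image the class of its nearest prototype.
    prototypes: tensor of shape (num_classes, feat_dim) obtained by clustering."""
    sims = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t()
    return sims.argmax(dim=1)

def corrected_loss(logits, original_labels, pseudo_labels, alpha=0.5):
    """Weighted sum of CE with the original label and CE with the pseudo label."""
    return (1 - alpha) * F.cross_entropy(logits, original_labels) \
           + alpha * F.cross_entropy(logits, pseudo_labels)
```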

Results and Conclusion:

Research on noisy-label learning is very meaningful. We have verified these methods in our own scenario, and all of them bring a solid improvement, from 2-3 points at minimum up to 10 points at best. Of course, verification in a single scenario cannot fully demonstrate a method's effectiveness. We also found that combining several methods sometimes did not yield a compounding gain but instead reduced the final result.

We hope to use AutoML to select the optimal combination of methods, and we also hope noisy-label learning will become more flexible; after all, most of it is still focused on classification tasks. Later we will explore meta-learning in the field of noisy-label learning. At the same time, we will keep updating the latest methods for each module and improving them in MMClassification. You are welcome to get in touch and discuss.
