
The long-tail distribution, no doubt familiar to you, refers to a situation where a few categories have a large number of samples and most categories have only a small number of samples, as shown in the figure below

Usually, when we discuss long-tail distributions or text classification, we only consider the single-label case, where each sample corresponds to exactly one label. In practice, however, multi-label data is also very common. For example, suppose the set of personal hobbies has 6 elements: sports, travel, reading, work, sleep, and food. A person's hobbies normally include one or more of these, which makes predicting them a typical multi-label classification task.

At EMNLP 2021 there is a paper called Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution, which discusses in detail the effect of various balancing loss functions on multi-label classification problems, from the original BCE loss to Focal Loss and beyond. The paper feels more like a survey of balancing loss functions. The source code is available at Roche/BalancedLossNLP.

Loss Functions

In the NLP domain, Binary Cross Entropy (BCE) loss is often used to deal with multi-label text classification problems. Given a training set $\{(x^1,y^1),\ldots,(x^N,y^N)\}$ containing $N$ samples, where $y^k = [y_1^k,\ldots,y_C^k] \in \{0,1\}^C$ and $C$ is the number of categories, and assuming the model's output for a sample is $z^k = [z_1^k,\ldots,z_C^k] \in \mathbb{R}^C$, the BCE loss is defined as follows:


$$\mathcal{L}_{\text{BCE}} = \begin{cases}-\log (p_i^k)\quad &\text{if } y_i^k =1\\ -\log (1-p^k_i)\quad &\text{otherwise} \end{cases}$$

where $p_i^k = \sigma(z_i^k)$. For multi-label classification problems, we need to squash the model's output values into $(0, 1)$, so we use the sigmoid function.

For the original single-label problem, the ground truth $y^k$ is a one-hot vector, while for the multi-label problem, the ground truth $y^k$ is a one-hot vector with some additional ones (a multi-hot vector), e.g. $[0,1,0,1]$, indicating that the sample belongs to two classes at the same time.
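As a concrete illustration, here is a minimal PyTorch sketch of multi-label BCE over raw logits. The batch values are made up for illustration, and `binary_cross_entropy_with_logits` applies the sigmoid internally, so it matches the formula above; this is just a plain-BCE baseline, not code from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: N = 2 samples, C = 4 classes
logits = torch.tensor([[ 1.2, -0.3,  0.8, -2.0],
                       [-0.5,  2.1, -1.0,  0.4]])   # z^k, raw model outputs
targets = torch.tensor([[1., 0., 1., 0.],           # y^k, multi-hot labels
                        [0., 1., 0., 1.]])

# Sigmoid is applied per class, so each label is an independent binary problem
probs = torch.sigmoid(logits)                        # p_i^k = sigma(z_i^k)

# Equivalent to -y*log(p) - (1-y)*log(1-p), averaged over all sample-label pairs
loss = F.binary_cross_entropy_with_logits(logits, targets)
print(probs, loss)
```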

This naive BCE is very susceptible to label imbalance: because head samples are so numerous, the total loss of all head samples may be 100 while the total loss of all tail samples is no more than 10. Below we introduce three alternative methods that address the class-imbalance problem of long-tailed data in multi-label text classification. The main idea of these balancing methods is to re-weight BCE so that rare sample-label pairs get reasonable “attention”.

Focal Loss (FL)

By multiplying the BCE by a modulating factor with an adjustable focusing parameter $\gamma \ge 0$, Focal Loss places a higher loss weight on “hard-to-classify” samples, i.e. those for which the predicted probability of the true label is low. For multi-label classification tasks, Focal Loss is defined as follows:


$$\mathcal{L}_{\text{FL}} = \begin{cases} -(1-p_i^k)^\gamma \log (p_i^k)\quad &\text{if } y_i^k =1\\ -(p_i^k)^\gamma \log (1-p_i^k)\quad &\text{otherwise} \end{cases}$$
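For reference, here is a minimal PyTorch sketch of this multi-label Focal Loss; `gamma = 2.0` is just a commonly used default, not a value prescribed by the paper.

```python
import torch

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Multi-label Focal Loss over raw logits (minimal sketch)."""
    p = torch.sigmoid(logits)                     # p_i^k
    # p_t is the probability assigned to the true outcome of each sample-label pair
    p_t = torch.where(targets == 1, p, 1 - p)
    # (1 - p_t)^gamma down-weights easy, well-classified sample-label pairs
    loss = -((1 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()
```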

In fact, that is about all the paper says about Focal Loss. If you want a more detailed description of the Focal Loss parameters, you can refer to my article Focal Loss in detail.

Class-balanced focal loss (CB)

By estimating the effective number of samples, CB loss further re-weights Focal Loss to capture the diminishing marginal effect of data and reduce the redundant information in head samples. For multi-label tasks, we first compute the frequency $n_i$ of each class; then for each class there is a balancing term $r_{\text{CB}}$


$$r_{\text{CB}} = \frac{1-\beta}{1-\beta^{n_i}}$$

where $\beta \in [0,1)$ controls the growth rate of the effective number of samples, and the loss function becomes


$$\mathcal{L}_{\text{CB}} = \begin{cases} -r_{\text{CB}} (1-p_i^k)^\gamma \log (p_i^k) \quad &\text{if } y_i^k =1\\ -r_{\text{CB}} (p_i^k)^\gamma \log (1-p_i^k) \quad &\text{otherwise} \end{cases}$$
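The following is a minimal sketch of this class-balanced re-weighting on top of the focal term; `class_freq` (a length-$C$ tensor holding the per-class counts $n_i$) and the defaults `beta = 0.9999`, `gamma = 2.0` are illustrative assumptions rather than settings from the paper.

```python
import torch

def cb_focal_loss(logits, targets, class_freq, beta: float = 0.9999, gamma: float = 2.0):
    """Class-balanced focal loss (minimal sketch).

    class_freq: tensor of shape (C,) with n_i, the number of samples per class.
    """
    # r_CB = (1 - beta) / (1 - beta^{n_i}), one weight per class
    r_cb = (1 - beta) / (1 - torch.pow(beta, class_freq))

    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)
    focal = -((1 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))

    # Broadcast the per-class weight over the batch dimension
    return (r_cb.unsqueeze(0) * focal).mean()
```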

Distribution-balanced loss (DB)

By integrating rebalancing weights and negative-tolerant regularization (NTR), Distribution-Balanced loss first reduces the redundant information caused by label co-occurrence (which is critical in the multi-label setting), and then assigns lower weights to “easily classified” samples (the head samples).

First, for the rebalancing weight: in the single-label case, a sample can be weighted by its sampling probability $P_i^C = \frac{1}{C}\frac{1}{n_i}$, but in the multi-label case, if the same strategy is used, a sample with multiple labels will be over-sampled with probability $P^I = \frac{1}{C}\sum_{y_i^k=1}\frac{1}{n_i}$. Therefore, we need to combine the two to rebalance the weight


$$r_{\text{DB}} = P_i^C / P^I$$

We can make the above weight a little smoother (and bounded)


$$\hat{r}_{\text{DB}} = \alpha + \sigma(\beta \times (r_{\text{DB}} - \mu))$$

At this point, the range of $\hat{r}_{\text{DB}}$ is $[\alpha, \alpha+1]$. The Rebalanced-FL (R-FL) loss function is


$$\mathcal{L}_{\text{R-FL}} = \begin{cases} -\hat{r}_{\text{DB}} (1-p_i^k)\log (p^k_i) \quad &\text{if } y_i^k =1\\ -\hat{r}_{\text{DB}} (p_i^k)\log (1-p^k_i) \quad &\text{otherwise} \end{cases}$$
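Under the same assumptions, here is a sketch of how the rebalanced weight $\hat{r}_{\text{DB}}$ and the R-FL term could be computed; `alpha`, `beta`, and `mu` are illustrative hyperparameters (not values from the paper), and `class_freq` again holds the per-class counts $n_i$.

```python
import torch

def rebalanced_weight(targets, class_freq, alpha=0.1, beta=10.0, mu=0.3):
    """Smoothed rebalancing weight r_hat_DB per sample-label pair (sketch)."""
    C = class_freq.shape[0]
    # P_i^C = (1/C) * (1/n_i), one value per class
    p_class = (1.0 / C) / class_freq                                   # shape (C,)
    # P^I = (1/C) * sum over the sample's positive labels of 1/n_i
    p_instance = (targets / class_freq).sum(dim=1, keepdim=True) / C   # shape (N, 1)
    r_db = p_class.unsqueeze(0) / p_instance.clamp(min=1e-12)          # shape (N, C)
    # Smooth and bound the weight into [alpha, alpha + 1]
    return alpha + torch.sigmoid(beta * (r_db - mu))

def r_fl_loss(logits, targets, class_freq):
    """Rebalanced-FL as written above (no extra focusing exponent)."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)
    r_hat = rebalanced_weight(targets, class_freq)
    return (-r_hat * (1 - p_t) * torch.log(p_t.clamp(min=1e-8))).mean()
```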

NTR then treats the positive and negative instances of the same label differently, introducing a scaling factor $\lambda$ and an intrinsic class-specific bias $v_i$ to lower the threshold for tail categories and avoid their excessive suppression


$$\mathcal{L}_{\text{NTR-FL}} = \begin{cases} -(1-q_i^k)\log (q^k_i) \quad &\text{if } y_i^k =1\\ -\frac{1}{\lambda} (q_i^k)\log (1-q^k_i) \quad &\text{otherwise} \end{cases}$$

For positive instances ($y_i^k = 1$), $q_i^k = \sigma(z_i^k - v_i)$; for negative instances, $q_i^k = \sigma(\lambda(z_i^k - v_i))$. $v_i$ can be estimated by minimizing the loss function at the beginning of training; with a scaling coefficient $\kappa$ and the class prior $p_i = n_i / N$, we have


$$\hat{b}_i = -\log \left(\frac{1}{p_i} - 1\right), \quad v_i = -\kappa \times \hat{b}_i$$
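A minimal sketch of this bias estimation; `class_freq` holds $n_i$, `num_samples` is $N$, and `kappa = 0.05` is an illustrative choice of the scaling coefficient.

```python
import torch

def init_class_bias(class_freq, num_samples, kappa: float = 0.05):
    """Estimate the class-specific bias v_i from the class prior p_i = n_i / N (sketch)."""
    prior = class_freq / num_samples              # p_i
    b_hat = -torch.log(1.0 / prior - 1.0)         # b_hat_i = -log(1/p_i - 1)
    return -kappa * b_hat                         # v_i = -kappa * b_hat_i
```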

Finally, by integrating the rebalancing weight and NTR, the Distribution-Balanced loss is


$$\mathcal{L}_{\text{DB}} = \begin{cases} -\hat{r}_{\text{DB}}(1-q_i^k)\log (q^k_i) \quad &\text{if } y_i^k =1\\ -\hat{r}_{\text{DB}}\frac{1}{\lambda} (q_i^k)\log (1-q^k_i) \quad &\text{otherwise} \end{cases}$$
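Putting the pieces together, here is a minimal self-contained sketch of the full Distribution-Balanced loss (rebalanced weight plus NTR) as written above; all hyperparameter values are illustrative assumptions, and the authors' official implementation in Roche/BalancedLossNLP should be treated as the reference.

```python
import torch

def db_loss(logits, targets, class_freq, num_samples,
            alpha=0.1, beta=10.0, mu=0.3, lam=5.0, kappa=0.05):
    """Distribution-Balanced loss: rebalancing weight + NTR (minimal sketch)."""
    C = class_freq.shape[0]

    # --- rebalancing weight r_hat_DB ---
    p_class = (1.0 / C) / class_freq                                    # P_i^C
    p_instance = (targets / class_freq).sum(dim=1, keepdim=True) / C    # P^I
    r_db = p_class.unsqueeze(0) / p_instance.clamp(min=1e-12)
    r_hat = alpha + torch.sigmoid(beta * (r_db - mu))

    # --- class-specific bias v_i from the class prior p_i = n_i / N ---
    prior = class_freq / num_samples
    b_hat = -torch.log(1.0 / prior - 1.0)
    v = -kappa * b_hat

    # --- NTR: shift logits by v_i; scale negative instances by lambda ---
    shifted = logits - v
    q_pos = torch.sigmoid(shifted)           # q for positive instances
    q_neg = torch.sigmoid(lam * shifted)     # q for negative instances

    pos_term = -(1 - q_pos) * torch.log(q_pos.clamp(min=1e-8))
    neg_term = -(1.0 / lam) * q_neg * torch.log((1 - q_neg).clamp(min=1e-8))

    loss = torch.where(targets == 1, pos_term, neg_term)
    return (r_hat * loss).mean()
```

A hypothetical call would look like `db_loss(logits, targets, class_freq=torch.tensor([500., 50., 5., 1.]), num_samples=500)`, with `class_freq` computed once from the training set.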

Results

The two datasets used in the authors' experiments are as follows

An SVM model was used to compare the effects of the different loss functions

Personal summary

This paper is innovative yet not really innovative: all the loss functions were proposed by others, and the authors' own work is essentially running them on multi-label datasets and doing a comparison. At last, the pure love warrior showed great concern.