Dual Contrastive Learning: How to Apply Contrastive Learning to Supervised Text Classification

Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation

Paper link: arxiv.org/abs/2201.08…

Code link: github.com/hiyouga/dua…

Qianben Chen, School of Earth Science

Zhihu notes: zhuanlan.zhihu.com/p/466685216

Abstract

Contrastive learning has achieved remarkable success in self-supervised representation learning in the unsupervised setting. However, effectively adapting contrastive learning to supervised learning tasks remains a challenge in practice. In this work, a dual contrastive learning (DualCL) framework is proposed, which simultaneously learns the features of the input samples and the parameters of a classifier in the same space. Specifically, DualCL treats the classifier's parameters as augmented samples associated with the different labels, and then performs contrastive learning between the input samples and these augmented samples. Experiments on five benchmark text classification datasets and their low-resource versions show that DualCL clearly improves classification accuracy, and confirm that DualCL learns discriminative sample representations.

DualCL overview

Representation learning is at the core of current deep learning. In the unsupervised setting, contrastive learning has recently been shown to be an effective method for obtaining general representations for downstream tasks. Simply put, unsupervised contrastive learning employs a loss function that forces the representation vectors of different "views" of the same sample to be similar, while pushing apart the representation vectors of different samples. The effectiveness of contrastive learning has recently been shown to stem from the fact that it achieves both alignment and uniformity.

Contrastive learning is also applicable to supervised representation learning, and a similar contrastive loss has been used in previous studies. The basic principle is to insist that representations of samples in the same class are similar, while representations of samples from different classes are dissimilar. However, despite its success, this approach appears much less principled than unsupervised contrastive learning. For example, uniformity of the representation no longer holds: in general the features are no longer evenly distributed in the space, so the standard supervised contrastive learning method is not a natural fit for supervised representation learning. In addition, the result of this contrastive learning method does not directly give us a classifier, so a separate classification algorithm still has to be developed to solve the classification task.

Next we turn to DualCL's motivation for a more natural approach to contrastive learning under supervised tasks. The authors' key observation is that supervised representation learning should involve learning two kinds of parameters: one is a feature $z$ of the input $x$ in an appropriate space, which serves the classification task; the other is the parameters of a classifier in that space, which we call the "one-example" classifier for $x$. From this point of view, it is natural to associate each sample $x$ with two representations: a feature $z \in \mathbb{R}^d$ of dimension $d$, and a classifier parameter $\theta \in \mathbb{R}^{d \times K}$, where $K$ is the total number of classes. Supervised representation learning can then be thought of as generating the pair $(z, \theta)$ for each input sample $x$.
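To make the two kinds of parameters concrete, here is a minimal sketch (not from the paper or its code; the dimension and class count are hypothetical values) of what such a dual representation for a single sample looks like:

import torch

d, K = 768, 2                         # hypothetical feature dimension and number of classes
z = torch.randn(d)                    # feature representation z of sample x
theta = torch.randn(d, K)             # per-sample ("one-example") classifier parameters for x

logits = theta.T @ z                  # (K,) class scores theta^T z for this single sample
probs = torch.softmax(logits, dim=-1) # softmax-normalized prediction for x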

To ensure that the classifier $\theta$ is valid for the feature $z$, it suffices to ensure that $\theta^{\top} z$ is aligned with the label of sample $x$, which can be enforced through softmax-normalized probabilities and cross entropy. In addition, contrastive learning can be used to impose constraints on these $(z, \theta)$ representations. Specifically, denote by $\theta^*$ the column of $\theta$ corresponding to the true label of sample $x$. Two contrastive losses can then be designed. The first loss contrasts $(z, \theta^*)$ against multiple $(z', \theta^*)$, where $z'$ denotes the feature of a sample from a different class than $x$; the second loss contrasts $(z, \theta^*)$ against multiple $(z, \theta'^*)$, where $\theta'^*$ denotes the classifier parameter corresponding to a sample from a different class. This learning framework is called Dual Contrastive Learning (DualCL).

On top of contrastive learning, DualCL can also be viewed as a unique data augmentation method, as the title of the paper suggests. Specifically, for each sample $x$, each column of its $\theta$ can be regarded as a "label-aware input representation", that is, a view of $x$ augmented by injecting label information into the feature space. The power of this approach is illustrated in Table 1: the two plots on the left show that standard contrastive learning does not exploit label information, while the two plots on the right show that DualCL effectively uses label information to cluster the input samples by class.

In this paper, the validity of DualCL is verified on five benchmark text classification datasets. By fine-tuning pre-trained language models (BERT and RoBERTa) with the dual contrastive loss, DualCL achieves the best performance compared with existing supervised contrastive learning baselines. The authors also find that DualCL improves classification accuracy, especially in low-resource scenarios. In addition, some interpretable analyses of DualCL are given by visualizing the learned representations and attention maps.

Contributions can be summarized as follows:

  • 1) Dual contrastive learning (DualCL) is proposed, which naturally combines the contrastive loss with supervised tasks;
  • 2) Label-aware data augmentation is introduced to obtain multiple views of the input samples for DualCL training;
  • 3) The effectiveness of the DualCL framework is verified on five benchmark text classification datasets.

DualCL principle

"Dual" refers to the two goals of this supervised contrastive learning method: first, to learn a discriminative representation of the input for the classification task in an appropriate space; second, to construct a classifier for the supervised task and learn its parameters in the classifier space. Let us now look at the core of DualCL.

Label-aware data augmentation

To obtain different views of the training samples, the authors use the idea of data augmentation to obtain both the feature representation $z_i$ and the classifier representation $\theta_i$. Specifically, the column of the classifier $\theta_i$ corresponding to class $k$ is treated as a distinct representation of $x_i$, denoted $\theta_i^k$ and called the label-aware input representation: the information of label $k$ is injected into $x_i$ as an additional augmented view.

In practice, the label set $\{1, \dots, K\}$ is inserted into the input sequence $x_i$ to obtain a new input sequence $r_i \in \mathbb{R}^{L+K}$, and a PLM (BERT or RoBERTa) is used as the encoder $f$ to obtain the feature of each token of the input sequence. The [CLS] feature serves as the feature $z_i$ of sample $x_i$, while the features at the inserted label positions serve as the label-aware input representations $\theta_i^k$. The label names themselves (e.g., "positive" and "negative") are used as the inserted tokens when forming the sequence $r_i$. For labels consisting of multiple words, average pooling over their token features is used to obtain the label-aware input representation. This operation is similar to a previous post, which may be of interest: "BERT can also be used like this: merging label vectors into BERT".
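As a rough sketch of this step (my own simplification, not the authors' preprocessing code; it uses HuggingFace transformers with bert-base-uncased as a stand-in encoder and assumes each label name maps to a single token), the label-augmented input and the two kinds of representations could be obtained like this:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

label_names = ["positive", "negative"]            # K = 2 for a sentiment task
sentence = "the film is heart warming"

# r_i: label tokens inserted in front of the sentence, i.e. "[CLS] positive negative the film ... [SEP]"
text = " ".join(label_names) + " " + sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, K + L + special tokens, d)

z_i = hidden[0, 0]                                # [CLS] feature -> z_i
# Assuming each label name is a single wordpiece, positions 1..K hold theta_i^k;
# multi-word labels would instead need average pooling over their token features.
theta_i = hidden[0, 1:1 + len(label_names)]       # (K, d) label-aware input representations
scores = theta_i @ z_i                            # (K,) dot products theta_i^{kT} z_i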

Dual contrastive loss

The input sample $x_i$ is thus associated with a feature representation $z_i$ and a classifier representation $\theta_i$. DualCL's goal is to align the softmax-normalized probability of $\theta_i^{\top} z_i$ with the label of $x_i$. Denote by $\theta_i^*$ the column of $\theta_i$ corresponding to the true label of $x_i$; DualCL expects the dot product $\theta_i^{*\top} z_i$ to be maximized. To learn better $z_i$ and $\theta_i$, DualCL defines a dual contrastive loss that exploits the relations between different training samples: if $x_j$ has the same label as $x_i$, it tries to maximize $\theta_i^{*\top} z_j$; if $x_j$ has a different label from $x_i$, it minimizes $\theta_i^{*\top} z_j$.

Given an input sample $x_i$, take $z_i$ as the anchor, $\{\theta_j^*\}_{j \in P_i}$ as the set of positive samples and $\{\theta_j^*\}_{j \in A_i \setminus P_i}$ as the set of negative samples. The contrastive loss for $z$ is then defined as follows:
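The equation itself appears as an image in the original post and is not reproduced above; a SupCon-style form consistent with this description and with the released code (dot products scaled by a temperature $\tau$, averaged over the positive set $P_i$) would roughly be (my reconstruction, not quoted from the paper):

$$\mathcal{L}_z = -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|P_i|}\sum_{j \in P_i}\log\frac{\exp(\theta_j^{*\top} z_i / \tau)}{\sum_{a \in A_i}\exp(\theta_a^{*\top} z_i / \tau)}$$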

Similarly, given an input sample $x_i$, take $\theta_i^*$ as the anchor, $\{z_j\}_{j \in P_i}$ as the set of positive samples and $\{z_j\}_{j \in A_i \setminus P_i}$ as the set of negative samples. The contrastive loss for $\theta$ is then defined as follows:
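Again the original equation image is omitted; by symmetry with $\mathcal{L}_z$, a plausible reconstruction (not quoted from the paper) is:

$$\mathcal{L}_\theta = -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|P_i|}\sum_{j \in P_i}\log\frac{\exp(\theta_i^{*\top} z_j / \tau)}{\sum_{a \in A_i}\exp(\theta_i^{*\top} z_a / \tau)}$$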

The dual contrastive loss is the sum of these two contrastive loss terms: $\mathcal{L}_{\text{Dual}} = \mathcal{L}_z + \mathcal{L}_\theta$.

Contrastive training with supervised prediction

To take full advantage of the supervision signal, DualCL also expects $\theta_i$ to be a good classifier for $z_i$. Therefore, the authors use an improved version of the cross-entropy loss to maximize $\theta_i^{*\top} z_i$ for each input sample $x_i$:
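The cross-entropy formula is also an image in the original post; one plausible form, treating $\theta_i$ as a per-sample softmax classifier over the $K$ label-aware scores (my reconstruction, not quoted from the paper), is:

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\theta_i^{*\top} z_i)}{\sum_{k=1}^{K}\exp(\theta_i^{k\top} z_i)}$$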

Finally, the two training objectives are jointly minimized to train the encoder $f$. Together they improve both the quality of the feature representations and the quality of the classifier representations. The total loss is $\mathcal{L}_{\text{overall}} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{\text{Dual}}$, where $\lambda$ is a hyperparameter controlling the weight of the dual contrastive loss term.

At classification time, the trained encoder $f$ generates the feature representation $z_i$ of the input sentence $x_i$ and the classifier $\theta_i$. Here $\theta_i$ can be regarded as a "one-example" classifier for $x_i$: the argmax of $\theta_i^{\top} z_i$ is used as the model prediction, $\hat{y}_i = \arg\max_k (\theta_i^{k\top} z_i)$.
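As a tiny illustration (reusing the hypothetical z_i and theta_i variables from the earlier sketch), the prediction step reduces to an argmax over the per-sample classifier scores:

scores = theta_i @ z_i              # (K,) scores theta_i^{kT} z_i
y_hat = int(torch.argmax(scores))   # predicted class index for x_i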

Figure 1 illustrates the framework of dual contrastive learning, where $e_{CLS}$ is the feature representation and $e_{POS}$ and $e_{NEG}$ are the classifier representations. In this specific example, the target sample, whose class is "positive", serves as the anchor; there is one positive sample with the same class label and one negative sample with a different class label. The dual contrastive loss simultaneously pulls the feature representation toward the classifier representation of the positive sample and pushes it away from the classifier representation of the negative sample.

Duality between the representations

The contrastive loss uses the dot product as the similarity measure between representations, which gives the feature representation $z$ and the classifier representation $\theta$ in DualCL a dual relationship. The relationship between input features and parameters in a linear classifier exhibits the same pattern. We can therefore regard $\theta$ as the parameters of a linear classifier, so that the pre-trained encoder $f$ generates a linear classifier for each input sample. In other words, DualCL naturally learns how to generate a per-sample linear classifier to perform the classification task.

The experimental setup

Datasets

Five datasets are used in this paper: SST-2, SUBJ, TREC, PC, and CR. Their statistics are as follows:

The experimental results

The results show that DualCL achieves the best classification performance with both BERT and RoBERTa encoders in almost all settings, the exception being the TREC dataset with RoBERTa. With the full training data, DualCL improves over CE+CL by an average of 0.46% with BERT and 0.39% with RoBERTa. Moreover, with only 10% of the training data, DualCL outperforms CE+CL significantly, by 0.74% with BERT and 0.51% with RoBERTa. Meanwhile, CE and CE+SCL cannot surpass DualCL, because the CE method ignores the relationships between samples, while the CE+SCL method cannot directly learn a classifier for the classification task.

In addition, the dual contrastive loss term helps the model achieve better performance across all five datasets, which shows that exploiting the relationships between samples helps the model learn better representations in contrastive learning.

Case analysis

To verify whether DualCL captures informative features, the authors also compute attention scores between the feature of the [CLS] token and each word in the sentence. The RoBERTa encoder is first fine-tuned on the entire training set; the $L_2$ distances between the features are then calculated and the attention maps are visualized in Figure 4. The results show that different features are attended to when classifying sentiment. In the example from the SST-2 dataset, the model pays more attention to "predictably heart warming" in a sentence expressing "positive" sentiment. In the example from the CR dataset, the model pays more attention to "small" in a sentence expressing "negative" sentiment. In contrast, the CE method does not focus on these discriminative features. The results show that DualCL successfully attends to informative keywords in sentences.

Paper summary

  • In this work, a dual contrastive learning method, DualCL, is proposed to solve supervised learning tasks, approached from the perspective of text classification.
  • In DualCL, the authors use PLMs to learn two representations simultaneously: a discriminative feature of the input sample and a classifier for that sample. Label-aware data augmentation is introduced to generate different views of the input sample, covering both the feature and the classifier. A dual contrastive loss is then designed to make the classifier effective for the input feature.
  • The dual contrastive loss uses the supervision signal between training samples to learn better representations. The effectiveness of dual contrastive learning is verified by extensive experiments.

The core code

For the Dual-Contrastive-Learning implementation, you can view the open source code:

github.com/hiyouga/Dua…

# Excerpt from the repository (lightly reformatted); it assumes:
# import torch
# import torch.nn.functional as F

def _contrast_loss(self, cls_feature, label_feature, labels):
    normed_cls_feature = F.normalize(cls_feature, dim=-1)
    normed_label_feature = F.normalize(label_feature, dim=-1)
    list_con_loss = []
    BS, LABEL_CLASS, HS = normed_label_feature.shape
    # gather the label-aware representation of each sample's true label, i.e. theta_i^*
    normed_positive_label_feature = torch.gather(
        normed_label_feature, dim=1,
        index=labels.reshape(-1, 1, 1).expand(-1, 1, HS)).squeeze(1)  # (bs, 768)
    if "1" in self.opt.contrast_mode:
        loss1 = self._calculate_contrast_loss(normed_positive_label_feature, normed_cls_feature, labels)
        list_con_loss.append(loss1)
    if "2" in self.opt.contrast_mode:
        loss2 = self._calculate_contrast_loss(normed_cls_feature, normed_positive_label_feature, labels)
        list_con_loss.append(loss2)
    if "3" in self.opt.contrast_mode:
        loss3 = self._calculate_contrast_loss(normed_positive_label_feature, normed_positive_label_feature, labels)
        list_con_loss.append(loss3)
    if "4" in self.opt.contrast_mode:
        loss4 = self._calculate_contrast_loss(normed_cls_feature, normed_cls_feature, labels)
        list_con_loss.append(loss4)
    return list_con_loss

def _calculate_contrast_loss(self, anchor, target, labels, mu=1.0):
    BS = len(labels)
    with torch.no_grad():
        labels = labels.reshape(-1, 1)
        mask = torch.eq(labels, labels.T)  # (bs, bs), True where two samples share a label
        # compute temperature using mask
        temperature_matrix = torch.where(mask == True, mu * torch.ones_like(mask),
                                         1 / self.opt.temperature * torch.ones_like(mask)).to(self.opt.device)
        # # mask-out self-contrast cases
        # logits_mask = torch.scatter(
        #     torch.ones_like(mask),
        #     1,
        #     torch.arange(BS).view(-1, 1).to(self.opt.device),
        #     0
        # )
        # mask = mask * logits_mask
    # compute logits
    anchor_dot_target = torch.multiply(torch.matmul(anchor, target.T), temperature_matrix)  # (bs, bs)
    # for numerical stability
    logits_max, _ = torch.max(anchor_dot_target, dim=1, keepdim=True)
    logits = anchor_dot_target - logits_max.detach()  # (bs, bs)
    # compute log_prob
    exp_logits = torch.exp(logits)  # (bs, bs)
    exp_logits = exp_logits - torch.diag_embed(torch.diag(exp_logits))  # zero out self-contrast terms
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)  # (bs, bs)
    # in case that mask.sum(1) has no zero
    mask_sum = mask.sum(dim=1)
    mask_sum = torch.where(mask_sum == 0, torch.ones_like(mask_sum), mask_sum)
    # compute mean of log-likelihood over positives
    mean_log_prob_pos = (mask * log_prob).sum(dim=1) / mask_sum.detach()
    loss = - mean_log_prob_pos.mean()
    return loss

References

ICML 2020: Understanding Contrastive Representation Learning through Alignment and Uniformity (notes): blog.csdn.net/c2a2o2/arti…