
Original paper | ACL 2022

The original author | Mathor



The paper $\mathcal{Y}$-Tuning: An Efficient Tuning Paradigm for Large-scale Pre-trained Models via Label Representation Learning caught my attention from the title alone. Unlike fine-tuning, Adapter-tuning, and prompt-tuning, the innovation of this paper is that it adjusts neither the input text features nor the parameters of the large-scale pre-trained model; it only learns representations of the labels, which is rarely seen in previous papers. While the end result may still not be comparable to fine-tuning, it has clear advantages in computational cost, and there is room for further performance gains through follow-up research.

PRELIMINARIES OF TUNING PTMS

For NLP tasks, there is usually an input text $x \in \mathcal{X}$ and a label $y \in \mathcal{Y}$, where the feature space of $\mathcal{X}$ is discrete (e.g. one-hot). Taking the Sentiment Analysis (SA) task as an example, the input is a sentence and the label set is $\mathcal{Y} = \{\text{positive}, \text{negative}\}$, with $y = \text{positive}$ as the true label.

Define $\phi : \mathcal{X} \to \mathcal{Z}$ as the mapping from input sentences to a high-dimensional dense vector space, and $f : \mathcal{Z} \to \mathcal{Y}$ as the mapping from that vector space to the label space. Given a training set $\mathcal{D}$, we can define a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ and find the best $f$ and $\phi$ by

$$f^{\star}, \phi^{\star} = \mathop{\arg\min}_{f, \phi} \sum_{(x, y) \in \mathcal{D}} \ell(f(\phi(x)), y)$$

In general, even if the classifier $f$ is simple, with a good feature extractor $\phi(x)$ it performs no worse on downstream tasks.

In practice, $\phi$ can be thought of as BERT, and $f$ is the task-specific layer on top of BERT; for text classification, for example, $f$ is a linear layer.
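To make the decomposition $f(\phi(x))$ concrete, here is a minimal numpy sketch. All names and sizes are invented for illustration; a fixed random projection stands in for a frozen BERT as $\phi$, and `f` is the simple linear head:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D, C = 4, 8, 2                 # input dim, feature dim, number of classes

# phi: stand-in for a frozen pre-trained encoder such as BERT
W_phi = rng.standard_normal((D_in, D))
def phi(x):
    return np.tanh(x @ W_phi)

# f: the simple task head -- a single linear layer mapping features to logits
W_f = rng.standard_normal((D, C))
def f(z):
    return z @ W_f

x = rng.standard_normal(D_in)        # dummy "sentence" representation
logits = f(phi(x))
pred = int(np.argmax(logits))        # predicted class index
print(logits.shape, pred)
```

In fine-tuning, both `W_phi` and `W_f` would be trained; the point of the decomposition is that a strong frozen $\phi$ lets a very small $f$ do the task-specific work.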


$\mathcal{Y}$-TUNING

We define $(x, y)$ as a labeled training sample and $\phi^{\star}$ as a model pre-trained on a large-scale corpus; the parameters of $\phi^{\star}$ remain frozen throughout. The traditional approach is to fine-tune the parameters of $\phi^{\star}$ so that its output moves closer to the true labels. $\mathcal{Y}$-Tuning instead keeps $\phi^{\star}$ fixed and tunes the parameters of a label feature extractor $\psi$, then uses Cross-Attention to fuse the features $\phi^{\star}(x)$ and $\psi(\mathcal{Y})$, as shown in the figure below.

The loss function is a Triplet Loss, where $[x]_+ = \max(x, 0)$ and $\alpha$ is a margin hyperparameter controlling the distance between positive and negative samples. During training, given the training set $\mathcal{D}$, we find the best model parameters by minimizing this loss.

In the inference stage, we obtain the prediction by choosing the label with the highest score.
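In other words (a sketch, since the paper's exact scoring function appears in the omitted formula), prediction reduces to an argmax over per-label scores:

```python
import numpy as np

# Hypothetical per-label scores produced by the model's label side
labels = ["positive", "negative"]
scores = np.array([0.9, 0.2])        # made-up values for illustration
y_hat = labels[int(np.argmax(scores))]
print(y_hat)                         # -> positive
```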

AN IMPLEMENTATION OF $\mathcal{Y}$-TUNING

The model architecture in the figure consists of three main parts:

  1. $\phi$ extracts text features; this part is generally an encoder-class model such as BERT.
  2. $\psi$ extracts label features; this part generally adopts a Transformer Decoder structure, because a cross-attention module is needed to let the label features interact with the text features.
  3. The Label Pointer predicts the category; this part is relatively simple, using average or max pooling to convert a high-dimensional vector into a low-dimensional one.
Label Embedding

Given a label set $\mathcal{Y}$, we first map each label $y \in \mathcal{Y}$ to one or more continuous vectors. Besides the labels, we also need to map task-related information to vectors; for the sentiment analysis task, for example, we prepend an SA flag.

This is a bit like mBART, where machine translation is done by adding the corresponding flags of the language (such as ZH, JP, EN, etc.) to the beginning of the sentence

Therefore, the initial label feature concatenates a task embedding with the category embeddings, where $e_T$ denotes the task-specific embedding, $e^{c}$ denotes the embedding of the $c$-th category, and $N$ and $D$ denote the number of samples and the dimension of the label representations respectively. In fact, each label can be represented by multiple vectors; the author also ran a comparative experiment to study how the number of vectors per label affects the results. There are many ways to map a label $y$ to a vector, such as sampling from the vocabulary, a uniform distribution, token embeddings, etc.
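As a sketch of how such an initial label feature matrix might be assembled (sizes and the number of vectors per label are assumed, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16        # label-representation dimension (assumed)
C = 2         # number of categories, e.g. {positive, negative}
K = 3         # vectors per label -- the quantity the ablation varies

e_T = rng.standard_normal((1, D))       # task-specific embedding (e.g. the SA flag)
e_c = rng.standard_normal((C * K, D))   # K embeddings for each of the C categories

# Initial label feature: task embedding prepended to the category embeddings
Y0 = np.concatenate([e_T, e_c], axis=0)
print(Y0.shape)                          # -> (7, 16)
```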

Self-Attention and Cross-Attention

We first use self-attention to enhance information interaction between the different labels:

$$\text{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{D_k}}\right)\mathbf{V}$$

where $\mathbf{Q} \in \mathbb{R}^{N \times D_k}$, $\mathbf{K} \in \mathbb{R}^{M \times D_k}$, $\mathbf{V} \in \mathbb{R}^{M \times D_v}$. In self-attention, $N = M$; in cross-attention, $N$ represents the length of the input sentence and $M$ represents the length of the label input, where $\mathbf{X}$ is the high-dimensional representation of the input sentence after passing through the PTM.
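The two attention steps can be sketched in numpy as follows (dimensions are made up; in this sketch the label features act as queries over the frozen text features, one way of realizing the fusion described above):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D_k)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n_labels, n_tokens, Dk = 5, 7, 16

Y = rng.standard_normal((n_labels, Dk))   # label features psi(Y)
X = rng.standard_normal((n_tokens, Dk))   # frozen text features phi*(x)

Y_self = attention(Y, Y, Y)               # self-attention among labels
Y_fused = attention(Y_self, X, X)         # cross-attention: labels attend to text
print(Y_self.shape, Y_fused.shape)        # -> (5, 16) (5, 16)
```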

Label Pointer

Once all the computation is done, we obtain the output vectors, where $\mathbf{h}_T$ is the task-related descriptive feature and $\mathbf{h}_c$ is the label feature of category $c$. The Triplet Loss is then defined over the label scores, where $c^{\star}$ denotes the index of the correct label.
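A generic margin-based reading of this loss (the paper's exact formula appears in the omitted equation and may differ in detail) can be sketched as:

```python
import numpy as np

def triplet_loss(scores, c_star, alpha=1.0):
    """Hinge terms [alpha - s(c*) + s(c)]_+ summed over the wrong labels c."""
    pos = scores[c_star]
    return float(sum(max(alpha - pos + s, 0.0)
                     for i, s in enumerate(scores) if i != c_star))

scores = np.array([2.0, 0.5, 1.4])   # hypothetical per-label scores
loss = triplet_loss(scores, c_star=0, alpha=1.0)
print(loss)   # ~0.4: only the 1.4 score violates the margin against 2.0
```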

MODEL ANALYSIS

Suppose we have a pre-trained model with $L$ layers, whose complexity is $\mathcal{O}(LM^2)$, where $M$ is the input sentence length. A continuous prompt of length $P$ raises this to $\mathcal{O}(L(M+P)^2)$. For $\mathcal{Y}$-Tuning, the complexities of self-attention and cross-attention are $\mathcal{O}(N^2)$ and $\mathcal{O}(MN)$ respectively, where $N$ is the size of the label set. Because $\mathcal{Y}$-Tuning fixes the parameters of the pre-trained model and does not train them, the pre-trained model consumes no computing resources for training (in particular, none during back-propagation).
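A quick back-of-envelope comparison with assumed sizes makes the gap concrete:

```python
# Illustrative attention-cost comparison (all sizes assumed, not from the paper)
L, M, P, N = 12, 128, 20, 4   # layers, sentence length, prompt length, label-set size

fine_tune   = L * M**2        # O(L M^2): full pre-trained model
prompt_tune = L * (M + P)**2  # O(L (M+P)^2): continuous prompt of length P
y_tuning    = N**2 + M * N    # O(N^2 + MN): only the label-side attention

print(fine_tune, prompt_tune, y_tuning)  # -> 196608 262848 528
```

Even before accounting for the frozen backbone skipping back-propagation, the label-side attention is orders of magnitude cheaper because $N \ll M$.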

RESULT

From the experimental results, the performance is "very competitive". Of course we cannot compare it with traditional fine-tuning; after all, the number of trainable parameters is so small, and the compute required for training is not in the same order of magnitude.

Personal summary

The $\mathcal{Y}$-Tuning approach proposed in this paper is very interesting. The traditional approach learns from the input sentence so that its output vector approaches the distribution of the labels; this paper does just the opposite and learns about the labels. What surprised me a bit is that the loss function is not the traditional CrossEntropyLoss, because it seems to me you could simply project the output vector to the right dimension and compare it against the true label. Instead, the paper uses a Triplet Loss; I wonder why the author made this choice.