10.13 Update: There is a new state-of-the-art pre-training model; portal:

[NLP] Google BERT detailed explanation (zhuanlan.zhihu.com)

1. Introduction

For a long time, word vectors have been the main representation technique in NLP tasks. Following a series of technological breakthroughs in late 2017 and early 2018, research has demonstrated that pre-trained language representations, after fine-tuning, can perform better on a wide range of NLP tasks. Currently, there are two approaches to pre-training:

  1. Feature-based: the pre-trained representation is used as a feature for downstream tasks, e.g. word vectors, sentence vectors, segment vectors and text vectors. The newer ELMo also falls into this category, but the representation of the input has to be recomputed after transfer.
  2. Fine-tuning: the idea is to add a few task-specific layers on top of a pre-trained model and then fine-tune the last several layers. The newer ULMFiT and OpenAI GPT fall into this category.

This article introduces three pre-trained language models: ELMo, ULMFiT and OpenAI GPT.

2. ELMo

2.1 Model principle and architecture

“Deep contextualized word representations”

ELMo is an embedding extracted from a bidirectional language model (biLM). Given a sequence of N tokens (t_1, t_2, ..., t_N), the biLM combines a forward and a backward LSTM language model, and the training goal is to maximize the joint log-likelihood of both directions:

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

For each token t_k, an L-layer biLM computes a set of 2L+1 representations:

R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \ldots, L \}

where x_k^{LM} = h_{k,0}^{LM} is the context-independent encoding of the token (here produced by a character CNN) and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}] is the output of the j-th biLSTM layer.

For a downstream task, ELMo collapses the set of representations R_k into a single vector ELMo_k = E(R_k; \Theta_e). The simplest collapse takes the top layer as the token representation, E(R_k) = h_{k,L}^{LM}; more commonly, all layers are combined with a small number of task-specific parameters:

ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized layer weights and \gamma^{task} is a task-dependent scale parameter, which is important during optimization. Because the output distribution of each biLM layer is different, layer normalization can also be applied to each layer before weighting.
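As a minimal sketch of this layer combination (not the authors' code; the layer outputs, s^{task} logits and \gamma^{task} below are illustrative placeholders), the weighted sum can be written in a few lines of NumPy:

import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

def elmo_combine(layer_outputs, s_logits, gamma):
    """layer_outputs: list of (seq_len, dim) arrays, one per biLM layer (L+1 of them).
    s_logits: unnormalized layer weights, shape (L+1,).
    gamma: task-dependent scale parameter."""
    s = softmax(s_logits)                           # softmax-normalized layer weights s^task
    stacked = np.stack(layer_outputs, axis=0)       # (L+1, seq_len, dim)
    return gamma * np.tensordot(s, stacked, axes=1) # weighted sum over layers

# Toy usage: a 3-layer representation (L=2) for a 5-token sentence with 1024-dim vectors
layers_out = [np.random.randn(5, 1024) for _ in range(3)]
elmo_vecs = elmo_combine(layers_out, s_logits=np.zeros(3), gamma=1.0)
print(elmo_vecs.shape)  # (5, 1024)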

The paper follows the pre-trained biLM setup of Jozefowicz et al. The final model is a two-layer biLSTM (4096 units, 512-dimensional projections) with a residual connection between the first and second layers. A character-level CNN followed by a two-layer Highway network provides the context-independent encoding of tokens. In the end, the model outputs a three-layer vector representation for each token.

2.2 Precautions for model training

– Regularization:

1. Dropout

2. Add an L2 weight penalty \lambda \|w\|^2 to the loss (experimental results show that ELMo works better with a smaller \lambda)

– TF version source code parsing:

1. The model architecture code is mainly in the LanguageModel class of the training module and is built in two steps: first, create the word or character embedding layer (CNN + Highway); second, create the biLSTM layers.

2. Loading a trained model for downstream use is handled by the BidirectionalLanguageModel class in the model module.

2.3 Use of the model

  1. Concatenate the ELMo vector ELMo_k^{task} with the traditional word vector x_k into [x_k; ELMo_k^{task}], and feed the result into the RNN of the specific task.
  2. Add the ELMo vector at the output side of the model, concatenating it with the output h_k of the task-specific RNN into [h_k; ELMo_k^{task}].
  3. Keras code example
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K
import keras.layers as layers
from keras.models import Model

# Initialize session
sess = tf.Session()
K.set_session(sess)

# Instantiate the elmo model
elmo_model = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

# We create a function to integrate the tensorflow model with a Keras model
# This requires explicitly casting the tensor to a string, because of a Keras quirk
def ElmoEmbedding(x):
    return elmo_model(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]

input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
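A hedged usage sketch for the model above (the sentences and labels are toy placeholders, not data from the original example); because the ELMo hub module tokenizes raw strings itself, the model can be fit directly on arrays of sentences:

import numpy as np

# Hypothetical toy data: raw sentences with binary sentiment labels
train_text = np.array(["this movie was great", "terrible plot and acting"], dtype=object)[:, np.newaxis]
train_labels = np.array([1, 0])

model.fit(train_text, train_labels, epochs=1, batch_size=2)
preds = model.predict(np.array([["a surprisingly good film"], ["not worth watching"]], dtype=object))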

2.4 Advantages and disadvantages of the model

Advantages:

  1. The results are good, with improvements over traditional models on most tasks. Compared with plain word vectors, experiments show that ELMo captures information better at both the syntactic and the semantic level.
  2. Traditional pre-trained word vectors only provide a single layer of representation, and their vocabulary is limited. ELMo provides character-level representations with no restriction on vocabulary.

Disadvantages:

Inference is slow: the representation of every token has to be computed by running the language model.

2.5 Applicable Tasks

  • Question Answering
  • Textual entailment
  • Semantic role labeling
  • Coreference resolution
  • Named entity extraction
  • Sentiment analysis

3. ULMFiT

3.1 Model principle and architecture

Universal Language Model Fine-Tuning for Text Classification

ULMFiT is an effective transfer learning method for NLP. The core idea is to solve other NLP tasks by fine-tuning a pre-trained language model. The language model used in the paper is the AWD-LSTM of Merity et al. (2017a), i.e. a three-layer LSTM without attention or shortcut connections.

The ULMFiT process is divided into three steps:



1. General-domain LM pre-train

  • The language model is pre-trained on Wikitext-103.
  • Requirements for the pre-training corpus: large, and capturing general properties of language
  • Pre-training is most useful for small datasets; with it, only a small number of labeled samples is needed for the model to generalize.

2. Target task LM fine-tuning

Two fine-tuning methods are presented in this paper:

  • Discriminative fine-tuning

Because different layers of the network capture different types of information, they should also use different learning rates during fine-tuning. The authors assign a learning rate \eta^l to each layer l. Experiments show that it works best to first choose the learning rate \eta^L of the last layer L by fine-tuning it alone, and then set the lower layers recursively with \eta^{l-1} = \eta^l / 2.6 (see the sketch after this list).

  • Slanted triangular learning rates (STLR)

In order to adapt the parameters to task-specific features, ideally the parameters should converge quickly to a suitable region at the start of training and then be refined. To achieve this, the authors propose the STLR schedule, in which the learning rate first increases briefly and then decays, as shown in the figure in the paper. The specific formula is:

cut = \lfloor T \cdot cut\_frac \rfloor
p = t / cut,  if t < cut;  otherwise  p = 1 - (t - cut) / (cut \cdot (1 / cut\_frac - 1))
\eta_t = \eta_{max} \cdot (1 + p \cdot (ratio - 1)) / ratio

    • T: number of training iterations
    • cut_frac: fraction of iterations we increase the LR
    • cut: the iteration when we switch from increasing to decreasing the LR
    • p: the fraction of the number of iterations we have increased or will decrease the LR respectively
    • ratio: specifies how much smaller the lowest LR is than the maximum LR \eta_{max}
    • \eta_t: the LR at iteration t

In the paper, the authors generally use cut_frac = 0.1, ratio = 32 and \eta_{max} = 0.01.
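Both schedules are easy to implement. The following is a minimal, self-contained sketch (not the authors' fastai code; the function names, layer count and iteration count are illustrative) of the discriminative per-layer learning rates and the slanted triangular schedule defined above:

import math

def discriminative_lrs(eta_last, n_layers, factor=2.6):
    """Per-layer learning rates: eta^{l-1} = eta^l / factor, starting from the last layer."""
    lrs = [eta_last]
    for _ in range(n_layers - 1):
        lrs.append(lrs[-1] / factor)
    return list(reversed(lrs))  # index 0 = lowest layer

def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    """Slanted triangular learning rate at iteration t (0-based), for T total iterations."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

# Example: a 3-layer LSTM whose last layer is fine-tuned at 0.01, and the schedule over 1000 iterations
print(discriminative_lrs(0.01, 3))                        # lowest -> highest layer
print(stlr(0, 1000), stlr(100, 1000), stlr(999, 1000))    # ramps up quickly, then decays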

3. Target task classifier fine-tuning

To fine-tune for the classification task, the authors add two linear blocks on top of the last layer, each with batch normalization and dropout, with ReLU as the activation of the intermediate layer, and finally a softmax outputs the probability distribution over classes. The final fine-tuning involves the following points:

  • Concat pooling

    The input to the first linear layer is a pooling of the hidden states of the last layer. Because the key information for text classification can appear anywhere in the text, using only the output of the last time step is not enough. The authors concatenate the hidden state of the last time step h_T with a max-pooling and a mean-pooling over as many time steps as fit in memory, h_c = [h_T, maxpool(H), meanpool(H)], and use that as the input (see the PyTorch sketch after this list).

  • Gradual unfreezing: because fine-tuning all layers at once risks catastrophically forgetting the information learned during pre-training, the authors propose unfreezing the network gradually: first unfreeze and fine-tune only the last layer, then successively unfreeze and fine-tune the lower layers until all layers are tuned.
  • To fine-tune on large documents, the authors divide each document into fixed-length batches; mean and max pooling are tracked across batches during training, and gradients are back-propagated only to the batches whose hidden states contributed to the final prediction.
  • In the experiments, a bidirectional setup is used: the forward and backward LMs are fine-tuned independently and the predictions of the two classifiers are averaged. Combining the two gives a boost of about 0.5-0.7.
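As a minimal PyTorch sketch of the concat pooling step referenced above (the shapes and the single linear head are illustrative assumptions, not the fastai implementation):

import torch
import torch.nn as nn

class ConcatPoolClassifier(nn.Module):
    """Classifier head that concatenates [h_T, maxpool(H), meanpool(H)] as in ULMFiT."""
    def __init__(self, hidden_dim, n_classes):
        super().__init__()
        self.head = nn.Linear(3 * hidden_dim, n_classes)

    def forward(self, H):
        # H: (batch, seq_len, hidden_dim) -- hidden states of the last LSTM layer
        h_T = H[:, -1, :]               # last time step
        h_max = H.max(dim=1).values     # max pooling over time
        h_mean = H.mean(dim=1)          # mean pooling over time
        h_c = torch.cat([h_T, h_max, h_mean], dim=1)
        return self.head(h_c)

# Toy usage: batch of 4 sequences, 20 time steps, 400-dim hidden states, 2 classes
logits = ConcatPoolClassifier(400, 2)(torch.randn(4, 20, 400))
print(logits.shape)  # torch.Size([4, 2])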

3.2 Precautions for model training

– Source parsing for PyTorch (FastAI Lesson 10)

# location: fastai/lm_rnn.py

def get_language_model(n_tok, emb_sz, n_hid, n_layers, pad_token, dropout=0.4, dropouth=0.3,
                       dropouti=0.5, dropoute=0.1, wdrop=0.5, tie_weights=True, qrnn=False, bias=False):
    """Returns a SequentialRNN model.

    A RNN_Encoder layer is instantiated using the parameters provided.
    This is followed by the creation of a LinearDecoder layer.
    Also by default (i.e. tie_weights=True), the embedding matrix used in the RNN_Encoder
    is used to instantiate the weights for the LinearDecoder layer.
    The SequentialRNN layer is the native torch's Sequential wrapper that puts the RNN_Encoder
    and LinearDecoder layers sequentially in the model.

    Args:
        n_tok (int): number of unique vocabulary words (or tokens) in the source dataset
        emb_sz (int): the embedding size to use to encode each token
        n_hid (int): number of hidden activation per LSTM layer
        n_layers (int): number of LSTM layers to use in the architecture
        pad_token (int): the int value used for padding text.
        dropouth (float): dropout to apply to the activations going from one LSTM layer to another
        dropouti (float): dropout to apply to the input layer.
        dropoute (float): dropout to apply to the embedding layer.
        wdrop (float): dropout used for a LSTM's internal (or hidden) recurrent weights.
        tie_weights (bool): decide if the weights of the embedding matrix in the RNN encoder
            should be tied to the weights of the LinearDecoder layer.
        qrnn (bool): decide if the model is composed of LSTMS (False) or QRNNs (True).
        bias (bool): decide if the decoder should have a bias layer or not.
    Returns:
        A SequentialRNN model
    """
    rnn_enc = RNN_Encoder(n_tok, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                          dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    enc = rnn_enc.encoder if tie_weights else None
    return SequentialRNN(rnn_enc, LinearDecoder(n_tok, emb_sz, dropout, tie_encoder=enc, bias=bias))


def get_rnn_classifier(bptt, max_seq, n_class, n_tok, emb_sz, n_hid, n_layers, pad_token, layers, drops,
                       bidir=False, dropouth=0.3, dropouti=0.5, dropoute=0.1, wdrop=0.5, qrnn=False):
    rnn_enc = MultiBatchRNN(bptt, max_seq, n_tok, emb_sz, n_hid, n_layers, pad_token=pad_token, bidir=bidir,
                            dropouth=dropouth, dropouti=dropouti, dropoute=dropoute, wdrop=wdrop, qrnn=qrnn)
    return SequentialRNN(rnn_enc, PoolingLinearClassifier(layers, drops))
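A hedged usage sketch of the two builders reproduced above (the hyperparameter values are illustrative assumptions for a 30k-token vocabulary and binary classification, not values prescribed by the lesson):

# Language model for stages 1-2 (general-domain pre-training, then target-task LM fine-tuning)
lm = get_language_model(n_tok=30000, emb_sz=400, n_hid=1150, n_layers=3, pad_token=1)

# Classifier for stage 3: the same encoder, with the pooling classifier head on top.
# layers=[3*400, 50, 2] reflects concat pooling ([h_T, maxpool(H), meanpool(H)]) feeding two linear blocks.
clf = get_rnn_classifier(bptt=70, max_seq=1400, n_class=2, n_tok=30000, emb_sz=400, n_hid=1150,
                         n_layers=3, pad_token=1, layers=[3*400, 50, 2], drops=[0.4, 0.1])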

3.3 Advantages and disadvantages of the model

Advantages:

Compared with other transfer learning methods (such as ELMo), ULMFiT is better suited to the following settings:

– Non-English languages, where labeled training data is scarce

– New NLP tasks for which no state-of-the-art architecture exists

– Tasks where only part of the data is labeled

Disadvantages:

Classification and sequence labeling tasks transfer easily, but more complex tasks (question answering, etc.) require new fine-tuning methods.

3.4 Applicable Tasks

  • Classification
  • Sequence labeling

4. OpenAI GPT

4.1 Model principle and architecture

Original paper: Improving Language Understanding by Generative Pre-Training

OpenAI Transformer is a Transformer-based language model that can be transferred to multiple NLP tasks. Its basic idea is the same as ULMFiT: apply a pre-trained language model to a variety of tasks while changing the model structure as little as possible. The difference is that OpenAI Transformer uses a Transformer architecture, whereas ULMFiT uses an RNN-based language model. The network structure used in the paper is as follows:



The training process of the model is divided into two steps:

1. Unsupervised pre-training

The goal of the first stage is to pre-train the language model. Given an unlabeled corpus of tokens \mathcal{U} = \{u_1, \ldots, u_n\}, the objective is to maximize the log-likelihood

L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

where k is the size of the context window.

The model applies multi-headed self-attention followed by position-wise feed-forward layers, and finally outputs a distribution over tokens:

h_0 = U W_e + W_p
h_l = transformer\_block(h_{l-1}), \quad l = 1, \ldots, n
P(u) = softmax(h_n W_e^T)

where U = (u_{-k}, \ldots, u_{-1}) is the context vector of tokens, W_e the token embedding matrix and W_p the position embedding matrix.

2. Supervised fine-tuning

With the pre-trained language model, for a labeled dataset \mathcal{C} in which each example consists of an input sequence x^1, \ldots, x^m and a label y, the inputs are passed through the pre-trained model to obtain the final transformer block's activation h_l^m, which is fed into an added linear output layer with parameters W_y to make the prediction:

P(y \mid x^1, \ldots, x^m) = softmax(h_l^m W_y)

The fine-tuning objective is then:

L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)

The objective for the whole task adds the language-model loss as an auxiliary objective:

L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
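As a minimal sketch of how this combined objective is typically assembled (a PyTorch-style illustration, not the OpenAI implementation; the tensors, shapes and lm_coef below are placeholders):

import torch
import torch.nn.functional as F

def gpt_finetune_loss(lm_logits, lm_targets, clf_logits, clf_targets, lm_coef=0.5):
    """Combined objective L3 = L2 + lambda * L1 used during supervised fine-tuning.

    lm_logits:   (batch, seq_len, vocab)  next-token predictions from the transformer
    lm_targets:  (batch, seq_len)         shifted input tokens
    clf_logits:  (batch, n_classes)       predictions from the added linear head W_y
    clf_targets: (batch,)                 task labels
    """
    # L1: auxiliary language-modeling loss (negative log-likelihood of the next tokens)
    l1 = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)), lm_targets.reshape(-1))
    # L2: supervised classification loss from the output layer
    l2 = F.cross_entropy(clf_logits, clf_targets)
    return l2 + lm_coef * l1  # lm_coef plays the role of lambda

# Toy shapes: batch of 2, sequence length 5, vocabulary of 100, 2 classes
loss = gpt_finetune_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)),
                         torch.randn(2, 2), torch.tensor([0, 1]))
print(loss)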

4.2 Precautions for model training

– TF version source code parsing:

# location: finetune-transformer-lm/train.py

def model(X, M, Y, train=False, reuse=False):
    with tf.variable_scope('model', reuse=reuse):
        # n_special = 3 (start / delimiter / classify tokens), n_ctx = context length
        we = tf.get_variable("we", [n_vocab+n_special+n_ctx, n_embd],
                             initializer=tf.random_normal_initializer(stddev=0.02))
        we = dropout(we, embd_pdrop, train)

        X = tf.reshape(X, [-1, n_ctx, 2])
        M = tf.reshape(M, [-1, n_ctx])

        # 1. Embedding
        h = embed(X, we)

        # 2. Transformer blocks
        for layer in range(n_layer):
            h = block(h, 'h%d'%layer, train=train, scale=True)

        # 3. Language-model loss
        lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
        lm_logits = tf.matmul(lm_h, we, transpose_b=True)
        lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=lm_logits,
                                                                   labels=tf.reshape(X[:, 1:, 0], [-1]))
        lm_losses = tf.reshape(lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
        lm_losses = tf.reduce_sum(lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

        # 4. Classifier loss
        clf_h = tf.reshape(h, [-1, n_embd])
        pool_idx = tf.cast(tf.argmax(tf.cast(tf.equal(X[:, :, 0], clf_token), tf.float32), 1), tf.int32)
        clf_h = tf.gather(clf_h, tf.range(shape_list(X)[0], dtype=tf.int32)*n_ctx+pool_idx)

        clf_h = tf.reshape(clf_h, [-1, 2, n_embd])
        if train and clf_pdrop > 0:
            shape = shape_list(clf_h)
            shape[1] = 1
            clf_h = tf.nn.dropout(clf_h, 1-clf_pdrop, shape)
        clf_h = tf.reshape(clf_h, [-1, n_embd])
        clf_logits = clf(clf_h, 1, train=train)
        clf_logits = tf.reshape(clf_logits, [-1, 2])

        clf_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=clf_logits, labels=Y)
        return clf_logits, clf_losses, lm_losses

4.3 Advantages and disadvantages of the model

Advantages:

  1. Compared with recurrent neural networks, the Transformer can capture much longer-range dependencies.
  2. Computation is faster than with recurrent networks and easy to parallelize.
  3. Experimental results show that the Transformer works better than ELMo and LSTM networks.

Disadvantages:

Some types of tasks require adjustments to the structure of the input data

4.4 Applicable Tasks

  • Natural Language Inference
  • Question Answering and commonsense reasoning
  • Classification
  • Semantic Similarity

5. Summary

From word embeddings to the OpenAI Transformer, transfer learning in NLP has moved from using Word2Vec and GloVe for word vector representations, to ELMo providing contextual representations from the first few pre-trained layers, to fine-tuning entire pre-trained models such as ULMFiT and OpenAI Transformer, which greatly improves performance on basic NLP tasks. At the same time, a number of studies have shown that a language model used as a pre-trained model captures not only the syntactic information between words but also semantic information, providing higher-level abstract information to the subsequent network layers. In addition, the Transformer model outperforms the RNN model in several respects.

Finally, for a specific task, various approaches should be tried: the methods above can be used to build a baseline model, after which the network structure can be adjusted to improve the results.