F(X) Team – Star and Ali Tao

BERT — State of the Art Language Model for NLP

BERT (Bidirectional Encoder Representations from Transformers) is a paper published by researchers at Google AI Language. It has caused a stir in the machine learning community because it achieves state-of-the-art results on many NLP tasks, including question answering (SQuAD v1.1), natural language inference (MNLI), and others.

BERT’s key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modeling. This contrasts with previous approaches, which read text sequences either left-to-right or as a combination of left-to-right and right-to-left training. The results show that a bidirectionally trained language model has a deeper sense of context than a one-directional language model. In the paper, the researchers detail a new technique called Masked LM (MLM), which makes bidirectional training possible.

Background

In computer vision, researchers have repeatedly demonstrated the value of transfer learning: pre-training a neural network on a known task, such as ImageNet, and then fine-tuning it, using the trained network as the basis for a new, purpose-specific model. In recent years, researchers have shown that a similar technique is effective for many natural language tasks. Beyond transfer learning, another popular approach in NLP is feature-based training, validated in the recent ELMo paper. In this approach, a pre-trained neural network produces word embeddings, which are then used as features in a natural language processing model.

How BERT Works

BERT uses the Transformer’s attention mechanism to learn contextual relationships between words in text. In its general form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces the task prediction. Since BERT’s goal is to generate a language model, only the encoder mechanism is required. A paper by Google describes how the Transformer works in detail.

Unlike directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. It is therefore considered bidirectional, though it would be more accurate to say it is non-directional. This property allows the model to learn the context of a word from its entire surroundings (both left and right).

The following figure is an abstract description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed by the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to the input token at the same index.
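
As a quick illustration of this input/output relationship, the short script below is a minimal sketch that checks it numerically. It uses the Hugging Face transformers library, which is our own choice of tooling and is not part of the paper: a sequence of N input tokens produces N output vectors of hidden size H (768 for BERT-Base).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The child came home from school", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One output vector per input token, each of size H (768 for BERT-Base).
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 8, 768])
```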

When training language models, it is difficult to define a sensible prediction objective. Many models predict the next word in sequence (for example, “The child came home from ___”), a one-directional approach that limits how much context the model can learn. To overcome this shortcoming, BERT employs two training strategies:

Masked LM (MLM)

Before word sequences are fed into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the masked words based on the context provided by the other, unmasked words in the sequence. In technical terms, predicting the output words requires the following (a toy sketch follows the list):

  1. Adding a classification layer on top of the encoder output.

  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.

  3. Calculating the probability of each word in the vocabulary with softmax.
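
The following is a toy PyTorch sketch of these three steps, not the actual BERT implementation; the layer sizes and the reuse of the input embedding matrix as the projection to the vocabulary are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLMHead(nn.Module):
    """Toy MLM head: maps encoder outputs back onto the vocabulary."""

    def __init__(self, hidden_size, token_embedding: nn.Embedding):
        super().__init__()
        # Step 1: a classification layer on top of the encoder output.
        self.transform = nn.Linear(hidden_size, hidden_size)
        # Step 2: reuse the input embedding matrix to project to vocabulary size.
        self.token_embedding = token_embedding

    def forward(self, encoder_output):
        # encoder_output: [batch, seq_len, hidden_size]
        h = torch.tanh(self.transform(encoder_output))
        logits = h @ self.token_embedding.weight.t()  # -> [batch, seq_len, vocab_size]
        # Step 3: softmax gives a probability for every word in the vocabulary.
        return F.softmax(logits, dim=-1)

vocab_size, hidden_size = 30522, 768          # BERT-Base sizes, used here as an example
embedding = nn.Embedding(vocab_size, hidden_size)
head = MaskedLMHead(hidden_size, embedding)
fake_encoder_output = torch.randn(1, 10, hidden_size)
print(head(fake_encoder_output).shape)        # torch.Size([1, 10, 30522])
```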

The BERT loss function considers only the predictions for the masked words and ignores the predictions for the unmasked words. As a consequence, the model converges more slowly than an ordinary directional model, but given the enhanced context awareness this approach provides, the advantages outweigh the drawbacks.

Note: In practice, the BERT implementation is slightly more complex and does not replace all of the selected 15% of words with [MASK]: about 80% of them are replaced with [MASK], 10% with a random token, and 10% are left unchanged.
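
A toy sketch of this masking rule (our own illustration, not the reference implementation) might look like the following; the vocabulary and token handling are deliberately simplified.

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["dog", "house", "school", "ran", "the"]  # stand-in vocabulary

def mask_tokens(tokens, select_prob=0.15):
    """Select ~15% of tokens; of those, 80% -> [MASK], 10% -> random, 10% -> unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = tok                          # only these positions are predicted
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN               # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(TOY_VOCAB) # 10%: replace with a random token
            # remaining 10%: keep the original token
    return masked, labels

print(mask_tokens("the child came home from school".split()))
```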

Next Sentence Prediction (NSP)

During BERT training, the model receives pairs of sentences as input and learns to predict whether the second sentence is the subsequent sentence in the original text. During training, in 50% of the inputs the second sentence is the actual follow-up sentence from the training text, while in the other 50% a sentence is chosen at random from the corpus as the second sentence. The assumption is that the random sentence will be unrelated to the first sentence. To help the model distinguish between the two sentences, the input is processed in the following way before entering the model (a sketch of this input construction follows the list):

  1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token at the end of each sentence.

  2. A sentence embedding indicating sentence A or sentence B is added to each token. Sentence embeddings are conceptually similar to word embeddings.

  3. A positional embedding is added to each token to indicate its position in the sequence.
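
The sketch below is our own toy illustration (not the reference implementation) of how such an input could be assembled for a pre-tokenized sentence pair.

```python
def build_sentence_pair_input(sentence_a, sentence_b):
    """Toy construction of a sentence-pair input for NSP."""
    # 1. [CLS] at the start, [SEP] at the end of each sentence.
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # 2. Segment ids: 0 for sentence A (and its special tokens), 1 for sentence B.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    # 3. Position ids: the index of each token in the sequence.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segments, positions = build_sentence_pair_input(
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
)
print(tokens)
print(segments)
print(positions)
```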

To predict whether the second sentence is indeed connected to the first, the following steps are performed (a toy classification head is sketched after the list):

  1. The entire input sequence is fed into the Transformer model.

  2. The output corresponding to the [CLS] token is transformed into a 2×1 shaped vector using a simple classification layer (with learned weight and bias parameters).

  3. Softmax is used to compute the probability that the second sentence follows the first.
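
A toy PyTorch version of this classification head (illustrative only; the hidden size of 768 corresponds to BERT-Base) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextSentenceHead(nn.Module):
    """Toy NSP head: a simple classification layer over the [CLS] output."""

    def __init__(self, hidden_size):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # learned weights and bias, 2 classes

    def forward(self, encoder_output):
        cls_vector = encoder_output[:, 0, :]         # output corresponding to the [CLS] token
        logits = self.classifier(cls_vector)         # -> [batch, 2]
        return F.softmax(logits, dim=-1)             # probability of IsNext vs. NotNext

head = NextSentenceHead(hidden_size=768)             # 768 = BERT-Base hidden size
fake_encoder_output = torch.randn(1, 14, 768)
print(head(fake_encoder_output))
```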

MLM and NSP are trained together when training the BERT model, with the goal of minimizing the combined loss function of the two strategies.
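
The sketch below, with dummy tensors standing in for real model outputs, illustrates how such a combined objective can be computed: the MLM cross-entropy is taken over the masked positions only, as described above, and simply added to the NSP cross-entropy.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 30522, 10

# Dummy tensors standing in for real model outputs and labels.
mlm_logits = torch.randn(seq_len, vocab_size)           # per-token vocabulary scores
mlm_labels = torch.randint(0, vocab_size, (seq_len,))   # original tokens at each position
masked_positions = torch.tensor([1, 4])                 # indices that were masked
nsp_logits = torch.randn(1, 2)                          # IsNext / NotNext scores from [CLS]
nsp_label = torch.tensor([0])                           # 0 = IsNext

# MLM loss is computed on the masked positions only; NSP loss on the [CLS] output.
mlm_loss = F.cross_entropy(mlm_logits[masked_positions], mlm_labels[masked_positions])
nsp_loss = F.cross_entropy(nsp_logits, nsp_label)
total_loss = mlm_loss + nsp_loss
print(total_loss)
```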

How to Use BERT (Fine-Tuning)

Using BERT for a wide variety of language tasks is relatively straightforward, requiring only a small additional layer on top of the core model:

  1. Sentence classification tasks, such as sentiment analysis, are handled similarly to NSP, by adding a classification layer on top of the Transformer output for the [CLS] token (see the sketch after this list).

  2. In question answering tasks (such as SQuAD v1.1), the system receives a question about a text sequence and must mark the answer within that sequence. With BERT, a question answering model can be trained by learning two extra vectors that mark the beginning and end of the answer in the text.

  3. In Named Entity Recognition (NER), the system receives a text sequence and must mark the various entities (such as persons, organizations, dates, etc.) that appear in it. With BERT, an NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
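
As an illustration of the first case, here is a minimal fine-tuning sketch for sentence classification using the Hugging Face transformers library (our own choice of tooling, not mentioned in the paper); the model name, labels, and learning rate are illustrative, and a real run would loop over a full dataset.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("I really enjoyed this movie!", return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive, 0 = negative (illustrative labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step: the small classification layer on top of
# [CLS] and the pre-trained encoder are updated together.
model.train()
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```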

In fine-tuning, most hyperparameters stay the same as in BERT pre-training, and the paper gives specific guidance (Section 3.5) on the hyperparameters that do require tuning. The BERT team has used this technique to achieve state-of-the-art results on a wide variety of challenging natural language tasks, detailed in Section 4 of the paper.

Note: BERT’s pre-trained model can also be used to generate text embeddings, similarly to many other feature-based models such as doc2vec and ELMo. The best embeddings were found to be obtained by concatenating the last four layers of the encoder.
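
A minimal sketch of this feature-based use, again assuming the Hugging Face transformers library (not part of the original work), concatenates the last four hidden layers for every token:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("BERT produces contextual embeddings", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds the embedding layer output plus one entry per encoder layer;
# concatenating the last four gives one 4 * 768 dimensional vector per token.
token_embeddings = torch.cat(outputs.hidden_states[-4:], dim=-1)
print(token_embeddings.shape)  # e.g. torch.Size([1, num_tokens, 3072])
```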

Conclusion

BERT is undoubtedly a breakthrough in the use of machine learning for natural language processing. The fact that it can be fine-tuned quickly and is easy to pick up means it is likely to see widespread practical use in the future. In this article, we have tried to describe the main ideas of the paper without drowning in excessive technical detail. For those wishing to dig deeper, we strongly recommend reading the full paper and the papers it cites. Another useful reference is the research team’s open-source BERT code and pre-trained models, which cover 103 languages and can be used directly in real-world scenarios.