To download the full text, see the GitHub repository BERT_Paper_Chinese_Translation.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on 11 NLP tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement), and the SQuAD v1.1 question answering test F1 to 93.2 (1.5 points absolute improvement), two points higher than human performance.

1. Introduction

Language model pre-training has been shown to significantly improve many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018). These tasks include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationship between sentences by analyzing them as a whole, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where the model must produce fine-grained output at the token level.

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses a task-specific architecture that includes the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters. In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT the authors use a left-to-right architecture, where every token can only attend to the tokens that precede it in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, but can be very harmful when applying fine-tuning based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where incorporating context from both directions is crucial.

In this paper, we improve the fine-tuning approach by proposing BERT: Bidirectional Encoder Representations from Transformers. Inspired by the cloze task, BERT addresses the previously mentioned unidirectionality constraint with a new pre-training objective: the “masked language model” (MLM) (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of each masked token based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which lets us pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.

The contributions of this paper are as follows:

  • We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al., 2018, which uses a unidirectional language model for pre-training, BERT uses a masked language model to enable pre-trained deep bidirectional representations. This also stands in contrast to Peters et al., 2018, which uses a shallow concatenation of independently trained left-to-right and right-to-left language models.
  • We show that pre-trained representations eliminate the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.
  • BERT advances the state of the art for 11 NLP tasks. We also report extensive ablation studies demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained models will be available at goo.gl/language/be…

2. Related Work

Pre-training general language representations has a long history, and in this section we briefly review the most popular approaches.

2.1 Feature-based approach

Learning broadly applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are considered an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are typically used as input features in downstream models.

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension, proposing to extract context-sensitive features from a language model. When contextual embeddings are integrated with task-specific architectures, ELMo provides state-of-the-art results on several major NLP benchmarks (Peters et al., 2018), including question answering on SQuAD (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).

2.2 Methods based on fine tuning

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on an LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. Thanks, at least in part, to this advantage, OpenAI GPT (Radford et al., 2018) achieved the previous state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).

2.3 Transfer learning from supervised data

Although the advantage of unsupervised pre-training is the nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).

3. BERT

This section introduces BERT and its detailed implementation. We first describe the BERT model architecture and the input representation. We then introduce the pre-training tasks, the core innovation of this paper, in Section 3.3. The pre-training and fine-tuning procedures are described in detail in Sections 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.

3.1 Model Structure

BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has recently become ubiquitous and our implementation is effectively identical to the original, we omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer”.

In this work, we denote the number of layers (that is, Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases we set the feed-forward/filter size to 4H, i.e., 3072 when H = 768 and 4096 when H = 1024. We primarily report results for two model sizes: BERT_BASE (L=12, H=768, A=12, total parameters 110M) and BERT_LARGE (L=24, H=1024, A=16, total parameters 340M).

For the sake of comparison, BERT_BASE was chosen to have the same model size as OpenAI GPT. Critically, however, the BERT Transformer uses bidirectional self-attention, whereas the GPT Transformer uses constrained self-attention in which every token can only attend to the context to its left. We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder”, while the left-context-only version is referred to as a “Transformer decoder”, since it can be used for text generation. A comparison between BERT, OpenAI GPT, and ELMo is shown in Figure 1.

Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only the BERT representations are jointly conditioned on both left and right context in all layers.

3.2 Input Representation

Our input representation is able to unambiguously represent either a single text sentence or a pair of text sentences (for example, [Question, Answer]) in one token sequence. (Note: throughout this work, a “sentence” can be an arbitrary span of contiguous text rather than a sentence in the actual linguistic sense. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.) The input representation of a given token is constructed by summing its corresponding token embedding, sentence (segment) embedding, and position embedding. Figure 2 gives a visual representation. The details are:

  • We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. We denote split word pieces with ##.
  • We use learned position embeddings that support sequences of up to 512 tokens.
  • The first token of every sequence is always the special classification embedding ([CLS]). The final hidden state corresponding to this token (that is, the output of the Transformer) is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.
  • Sentence pairs are packed together into a single sequence. We distinguish the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned sentence A embedding to every token of the first sentence and a learned sentence B embedding to every token of the second sentence.
  • For single-sentence inputs, we only use the sentence A embeddings.

Figure 2: BERT input representation. The input embedding is the sum of the token embeddings (word embeddings), the sentence (segment) embeddings, and the position embeddings.
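To make the three-way sum above concrete, here is a minimal PyTorch-style sketch of the input embedding layer. The dimensions follow the sizes given in this section (30,000 WordPiece vocabulary, 512 positions, hidden size H), but the class and attribute names are illustrative assumptions rather than the authors’ code.

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Sketch: sum of token, sentence (A/B segment), and learned position embeddings."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)     # WordPiece token embeddings
        self.segment_emb = nn.Embedding(2, hidden)            # sentence A = 0, sentence B = 1
        self.position_emb = nn.Embedding(max_len, hidden)     # learned position embeddings

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: LongTensors of shape (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))
```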

3.3 Pre-training Tasks

3.3.1 Task #1: Masked LM

Intuitively, there is reason to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right model and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself” across multiple layers of context.

To train a deep bidirectional representation, we take the straightforward approach of randomly masking some proportion of the input tokens and then predicting only those masked tokens. We refer to this procedure as a “masked language model” (MLM), although it is often referred to in the literature as a cloze task (Taylor, 1953). In this case, the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary (that is, each masked token is predicted as a word in the vocabulary), as in a standard language model. In all of our experiments, we randomly mask 15% of the tokens in each sequence. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked tokens rather than reconstructing the entire input.

While this does allow us to obtain a bidirectional pre-trained model, the approach has two downsides. The first is that we create a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning. To mitigate this, we do not always replace the chosen word with the actual [MASK] token. Instead, the training data generator chooses 15% of the tokens at random; for example, in the sentence “my dog is hairy” it might choose “hairy”.

Rather than always replacing the chosen word with [MASK], the data generator then does the following:

  • 80% of the time: replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
  • 10% of the time: replace the word with a random word, e.g., my dog is hairy → my dog is apple
  • 10% of the time: keep the word unchanged, e.g., my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actually observed word.

Because the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, it is forced to maintain a distributional contextual representation of every input token. In addition, because random replacement only happens for 1.5% of all tokens (that is, 10% of 15%), it does not appear to harm the model’s language understanding ability.
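The 80%/10%/10% replacement rule described above can be summarized in a short sketch. This is an illustrative re-implementation, not the authors’ data generator; the toy vocabulary and the function name are assumptions made for the example.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["apple", "house", "run", "blue"]  # stand-in vocabulary for the random-word case

def mask_tokens(tokens, mask_prob=0.15):
    """Sketch of the masking rule: returns the corrupted tokens and the positions to predict."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:       # only ~15% of positions are chosen
            continue
        targets[i] = tok                       # the model must predict the original token here
        r = random.random()
        if r < 0.8:                            # 80% of the time: replace with [MASK]
            masked[i] = MASK
        elif r < 0.9:                          # 10% of the time: replace with a random word
            masked[i] = random.choice(TOY_VOCAB)
        # remaining 10% of the time: keep the original word unchanged
    return masked, targets

print(mask_tokens("my dog is hairy".split()))
```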

The second downside is that only 15% of the tokens in each batch are predicted, which suggests that more pre-training steps may be needed for the model to converge. In Section 5.3 we show that the masked language model does converge marginally slower than a left-to-right model (which predicts every token), but that its empirical improvements far outweigh the increased pre-training cost.

3.3.2 Task #2: Next Sentence Prediction

Many important downstream tasks, such as question answering (QA) and natural language inference (NLI), are based on understanding the relationship between two text sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, we pre-train a binary next-sentence-prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual sentence that follows A, and 50% of the time it is a random sentence from the corpus.

We choose the non-next sentences completely at random, and the final pre-trained model achieves 97% to 98% accuracy on this task. Despite its simplicity, we show in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.
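Below is a minimal sketch of how such sentence pairs could be assembled from a corpus; the function name and inputs are illustrative assumptions, not the authors’ data pipeline.

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Sketch: build one (sentence A, sentence B, is_next) pre-training example."""
    i = random.randrange(len(doc_sentences) - 1)    # needs at least two sentences in the document
    sent_a = doc_sentences[i]
    if random.random() < 0.5:                       # 50%: B really follows A (label IsNext)
        return sent_a, doc_sentences[i + 1], 1
    return sent_a, random.choice(all_sentences), 0  # 50%: random sentence (label NotNext)
```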

3.4 Pre-training process

The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can also be shorter). The first sentence receives the A embedding and the second receives the B embedding. 50% of the time B is the actual next sentence that follows A and 50% of the time it is a random sentence, which is done for the next-sentence-prediction task. The two sentences are sampled such that their combined length is 512 tokens or fewer. The language model masking is applied after tokenizing the sentences with WordPiece, with a uniform masking rate of 15% and no special consideration given to partial word pieces (tokens prefixed with ##, produced when WordPiece splits a word).
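As a small illustration of how two sampled “sentences” could be packed into one training sequence with the special tokens and A/B segment ids from Section 3.2, here is a sketch; the function name and the simple length assertion are illustrative simplifications (the actual sampling enforces the length budget while drawing the spans).

```python
def pack_pair(sent_a_tokens, sent_b_tokens, max_len=512):
    """Sketch: pack two 'sentences' into one sequence with [CLS]/[SEP] and A/B segment ids."""
    tokens = ["[CLS]"] + sent_a_tokens + ["[SEP]"] + sent_b_tokens + ["[SEP]"]
    segment_ids = [0] * (len(sent_a_tokens) + 2) + [1] * (len(sent_b_tokens) + 1)
    assert len(tokens) <= max_len, "combined length must be 512 tokens or fewer"
    return tokens, segment_ids

tokens, segments = pack_pair("my dog is hairy".split(), "he likes play ##ing".split())
```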

We train with a batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use the Adam optimizer with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate afterwards. We use a dropout probability of 0.1 on all layers. Like OpenAI GPT, we use the gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu. The training loss is the sum of the mean masked language model likelihood and the mean next-sentence-prediction likelihood.
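The warmup-then-linear-decay learning rate schedule described above can be sketched as a simple function of the step count; the helper name is an illustrative assumption and this is only an approximation of the schedule, not the authors’ training code.

```python
def lr_at_step(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Sketch: linear warmup to peak_lr over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lr_at_step(5_000), lr_at_step(10_000), lr_at_step(505_000))  # ramping up, at peak, decaying
```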

Training of BERT_BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips in total), and training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips in total). Each pre-training run took 4 days to complete.

3.5 Fine-tuning Procedure

For sequence-level classification tasks, BERT fine-tuning is straightforward. To obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (that is, the output of the Transformer) corresponding to the special [CLS] token. We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are the weights of a classification layer W ∈ R^{K×H}, where K is the number of classification labels. The label probabilities P ∈ R^K are computed with a standard softmax, P = softmax(C W^T). All of the parameters of BERT and W are fine-tuned jointly to maximize the log-probability of the correct label. For span-level and token-level prediction tasks, the above procedure must be modified slightly in a task-specific manner; see Section 4 for details.
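As a concrete illustration of the classification head above, here is a minimal PyTorch sketch of computing P = softmax(C W^T) and the corresponding loss; the batch size, label count, and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden, num_labels = 768, 3                         # H and K (e.g., K = 3 for MNLI)
C = torch.randn(8, hidden)                          # [CLS] final hidden states for a batch of 8
W = nn.Parameter(torch.empty(num_labels, hidden))   # the only new fine-tuning parameters
nn.init.normal_(W, std=0.02)

logits = C @ W.T                                    # shape (8, K)
probs = torch.softmax(logits, dim=-1)               # P = softmax(C W^T)
labels = torch.randint(num_labels, (8,))            # placeholder gold labels
loss = nn.functional.cross_entropy(logits, labels)  # maximizes log-probability of correct label
```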

For fine-tuning, most of the model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability is always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following ranges of possible values to work well across all tasks:

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 3, 4

We also observed that large datasets (e.g., 100k+ labeled training examples) are far less sensitive to hyperparameter choice than small datasets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the validation set.

3.6 Comparison between BERT and OpenAI GPT

Among existing pre-training methods, the one most similar to BERT is OpenAI GPT, which trains a left-to-right Transformer language model on a large text corpus. In fact, many of the design decisions in BERT were intentionally chosen to be as close to GPT as possible so that the two methods could be compared directly. The core argument of our work is that the two novel pre-training tasks presented in Section 3.3 account for the majority of the empirical improvement, but we note several other differences in how BERT and GPT were trained:

  • GPT is trained on BooksCorpus (800M words); BERT is trained on BooksCorpus (800M words) and Wikipedia (2,500M words).
  • GPT uses a sentence separator ([SEP]) and classification token ([CLS]) that are only introduced at fine-tuning time; BERT learns [SEP], [CLS], and the sentence A/B embeddings during pre-training.
  • GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.
  • GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate that performs best on the validation set.

To isolate the effect of these differences, we perform ablation experiments for each of them in Section 5.1, which show that the majority of the improvement in fact comes from the new pre-training tasks (the masked language model and the next-sentence-prediction task).

Figure 3: Our task-specific models are formed by adding one additional output layer to BERT, so only a minimal number of parameters need to be learned from scratch. Here, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, E denotes the input embedding, T_i denotes the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol used to separate non-consecutive token sequences (i.e., to separate two sentences).

4. Experiments

In this section, we present BERT fine-tuning results on 11 natural language processing tasks.

4.1 GLUE data sets

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have existed for a number of years, but the purpose of GLUE is (1) to distribute these datasets with canonical training, validation, and test splits, and (2) to set up an evaluation server to mitigate issues with inconsistent evaluation and test-set overfitting. GLUE does not distribute labels for the test sets; users must upload their predictions to the GLUE server for evaluation, with limits on the number of submissions.

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018):

MNLI Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, a contradiction, or neutral with respect to the first.

QQP Quora Question Pairs is a binary classification task where the goal is to determine whether two questions asked on Quora are semantically equivalent (Chen et al., 2018).

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) that has been converted to a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs that contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph that do not contain the answer.

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews and annotated by humans for their sentiment (Socher et al., 2013).

CoLA The Corpus of Linguistic Acceptability is a single-sentence classification task whose goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). The pairs are annotated with a score from 1 to 5 indicating how semantically similar the two sentences are.

MRPC The Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations indicating whether the sentences in each pair are semantically equivalent (Dolan and Brockett, 2005).

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).

WNLI Winograd NLI is a small natural language inference dataset derived from Levesque et al., 2011. The GLUE webpage notes that there are issues with the construction of this dataset: every trained system submitted to GLUE has performed worse than the 65.1 baseline accuracy of simply predicting the majority class. We therefore exclude this dataset, in fairness to OpenAI GPT. For our GLUE submission we always predicted the majority class.

4.1.1 GLUE Results

To fine-tune on GLUE, we represent the input sentence or sentence pair as described in Section 3 and use the final hidden vector C ∈ R^H corresponding to the first input token ([CLS]) as the aggregate representation of the sequence, as shown in Figure 3 (a) and (b). The only new parameters introduced during fine-tuning are the classification layer weights W ∈ R^{K×H}, where K is the number of labels. We compute a standard classification loss with C and W, in other words log(softmax(C W^T)).

We use a batch size of 32 and 3 epochs over the data for all GLUE tasks. For each task, we fine-tune with learning rates of 5e-5, 4e-5, 3e-5, and 2e-5 and select the one that performs best on the validation set. In addition, for BERT_LARGE we found that fine-tuning was sometimes unstable on small datasets (in other words, some runs would produce degenerate results), so we ran several random restarts and selected the model that performed best on the validation set. For the random restarts we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization. We note that the GLUE distributed datasets do not include the test labels, so we made a single GLUE evaluation server submission for each of BERT_BASE and BERT_LARGE.

The results are shown in Table 1. Both BERT_BASE and BERT_LARGE outperform the existing systems on all tasks, achieving average improvements of 4.4% and 6.7%, respectively, over the prior state of the art. Note that, apart from the attention masking, BERT_BASE is almost identical to OpenAI GPT in terms of model architecture. For MNLI, the largest and most widely used GLUE task, BERT obtains a 4.7% absolute improvement over the current best model. On the official GLUE leaderboard, BERT_LARGE obtains a score of 80.4, whereas the previous leader, OpenAI GPT, obtains 72.8 as of the date of writing.

Interestingly, BERT_LARGE significantly outperforms BERT_BASE across all tasks, even those with very little training data. The effect of BERT model size is explored more thoroughly in Section 5.2.

Table 1: GLUE test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The “Average” column differs slightly from the official GLUE score because we exclude the problematic WNLI dataset. OpenAI GPT = (L=12, H=768, A=12); BERT_BASE = (L=12, H=768, A=12); BERT_LARGE = (L=24, H=1024, A=16). Both BERT and OpenAI GPT are single-model, single-task. All results were obtained from gluebenchmark.com/leaderboard and blog.openai.com/language-un…

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a paragraph from Wikipedia containing the answer, the task is to predict the answer text span within the paragraph. For example:

  • Input Question: Where do water droplets collide with ice crystals to form precipitation?
  • Input Paragraph: … Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. …
  • Output Answer: within a cloud

This span prediction task is quite different from the sequence classification tasks of GLUE, but we are able to adapt BERT to run on SQuAD in a straightforward manner. As in GLUE, we represent the input question and paragraph as a single packed sequence, with the question using the A embedding and the paragraph using the B embedding. The only new parameters learned during fine-tuning are a span start vector S ∈ R^H and a span end vector E ∈ R^H. Let the final hidden vector from BERT for the i-th input token be denoted as T_i ∈ R^H; see Figure 3 (c) for a visual representation. The probability of word i being the start of the answer span is computed as the dot product between T_i and S followed by a softmax over all of the words in the paragraph: P_i = e^{S·T_i} / Σ_j e^{S·T_j}.

The same formula, with the end vector E, is used to compute the probability of a word being the end of the answer span, and the maximum scoring span is used as the prediction. The training objective is the log-likelihood of the correct start and end positions.
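The start/end scoring just described can be sketched in a few lines of PyTorch; the random tensors stand in for BERT outputs, and the span selection already applies the end-after-start constraint mentioned below. This is illustrative code, not the authors’ implementation.

```python
import torch

H, seq_len = 768, 384
T = torch.randn(seq_len, H)                 # final hidden vectors T_i for the paragraph tokens
S, E = torch.randn(H), torch.randn(H)       # start and end vectors learned during fine-tuning

start_probs = torch.softmax(T @ S, dim=0)   # P_i = exp(S·T_i) / sum_j exp(S·T_j)
end_probs = torch.softmax(T @ E, dim=0)

# Score every (start, end) pair and keep only spans where the end is not before the start.
scores = start_probs.unsqueeze(1) * end_probs.unsqueeze(0)
scores = torch.triu(scores)
start_idx, end_idx = divmod(int(scores.argmax()), seq_len)
```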

We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32. At inference time, since the end prediction is not conditioned on the start, we add the constraint that the end must come after the start, but no other heuristics are used. For evaluation, the tokenized span is projected back to the original untokenized input.

The results are presented in Table 2. SQuAD uses a highly rigorous testing procedure in which the submitter must manually contact the SQuAD organizers to run their system on a hidden test set, so we submitted only our best model for testing. The result shown in the table is our first and only test submission to SQuAD. We note that the top results on the SQuAD leaderboard do not have up-to-date public system descriptions and are allowed to use any public data when training their systems. We therefore use very modest data augmentation in our submitted model, by training jointly on SQuAD and TriviaQA (Joshi et al., 2017).

Our best performing system outperforms the top-ranked ensemble system by +1.5 F1 and the top-ranked single system by +1.7 F1. In fact, our single BERT model outperforms the best ensemble system in terms of F1 score. Even when fine-tuning on only the SQuAD data (without TriviaQA) we lose just 0.1-0.4 F1, and our model still outperforms the existing systems by a wide margin.

Table 2: SQuAD results. The BERT ensemble is a 7x system that uses different pre-training checkpoints and fine-tuning seeds.

4.3 Named Entity Recognition

To evaluate performance on a token tagging task, we fine-tune BERT on the CoNLL 2003 Named Entity Recognition (NER) dataset. This dataset consists of 200k training words annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).

For fine-tuning, we feed the final hidden representation T_i ∈ R^H of each token i into a classification layer over the NER label set. The prediction for each word is not conditioned on the surrounding predictions (in other words, it is non-autoregressive and uses no CRF). To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as the input to the classifier, with no prediction made for the remaining sub-tokens, which are labeled X.
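Below is a minimal sketch of that label alignment (predict only on the first word piece, mark the rest with X); the toy tokenizer and the function name are illustrative assumptions rather than the actual WordPiece implementation.

```python
def align_labels_to_wordpieces(words, labels, wordpiece_tokenize):
    """Sketch: keep each word's label on its first piece; mark the other pieces as 'X'."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub = wordpiece_tokenize(word)                         # e.g. "Henson" -> ["Hens", "##on"]
        pieces.extend(sub)
        piece_labels.extend([label] + ["X"] * (len(sub) - 1))  # 'X' pieces get no prediction
    return pieces, piece_labels

# Toy tokenizer stub used only for this illustration.
toy = lambda w: [w[:4], "##" + w[4:]] if len(w) > 4 else [w]
print(align_labels_to_wordpieces(["Jim", "Henson"], ["B-PER", "I-PER"], toy))
```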

Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test. The results are shown in Table 3. BERT_LARGE outperforms the existing best model, Cross-View Training with multi-task learning (Clark et al., 2018), by +0.2 F1 on the CoNLL-2003 NER test set.

Table 3: CoNLL-2003 Named Entity Recognition results. The model hyperparameters were selected on the validation set, and the reported validation and test scores are averaged over 5 random restarts using those hyperparameters.

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair examples that evaluate grounded commonsense inference (Zellers et al., 2018).

Given a sentence from a video captioning dataset, the task is to choose the most plausible continuation among four options.

We adapt BERT to the SWAG dataset in a manner similar to the adaptation for GLUE. For each example, we construct four input sequences, each of which concatenates the given sentence (sentence A) with a possible continuation (sentence B). The only task-specific parameter we introduce is a vector V ∈ R^H, whose dot product with the final aggregate representation C_i ∈ R^H produces a score for each choice i. The probability distribution over the four choices is given by a softmax: P_i = e^{V·C_i} / Σ_{j=1}^{4} e^{V·C_j}.
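A minimal sketch of this choice scoring follows, with random tensors standing in for the aggregate representations C_i; the variable names mirror the notation in the text, but the code is illustrative rather than the authors’ implementation.

```python
import torch

H, num_choices = 768, 4
C = torch.randn(num_choices, H)        # aggregate [CLS] representation C_i for each choice
V = torch.randn(H)                     # the only task-specific parameter vector

probs = torch.softmax(C @ V, dim=0)    # P_i = exp(V·C_i) / sum_j exp(V·C_j)
pred = int(probs.argmax())             # index of the most plausible continuation
```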

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. The results are shown in Table 4. BERT_LARGE outperforms the authors’ baseline ESIM+ELMo system by 27.1%.

Table 4: SWAG validation and test accuracies. The test results were scored against the hidden labels by the SWAG authors. Human performance was measured on a sample of 100 examples, as reported in the SWAG paper.

5. Ablation Studies

While we have demonstrated very strong empirical results, the results presented so far do not isolate the specific contributions of each component of the BERT framework. In this section, we perform ablation experiments over a number of facets of BERT in order to better understand the relative importance of each part.

5.1 Impact of pre-training tasks

One of our core claims is that BERT’s deep bidirectionality, enabled by masked language model pre-training, is its single most important improvement over previous work. To demonstrate this, we evaluate two new models that use exactly the same pre-training data, fine-tuning scheme, and Transformer hyperparameters as BERT_BASE:

  1. No NSP: a model that is trained using the “masked language model” (MLM) but without the “next sentence prediction” (NSP) task.
  2. LTR & No NSP: a model that is trained using a left-to-right (LTR) language model rather than an MLM. In this case we predict every input word and do not apply any masking. The left-only constraint was also applied at fine-tuning, because we found that pre-training with left-only context and fine-tuning with bidirectional context was always worse. Additionally, this model was pre-trained without the NSP task. It is directly comparable to OpenAI GPT, but uses our larger training dataset, our input representation, and our fine-tuning scheme.

The results are shown in Table 5. We first analyze the impact of the NSP task. We can see that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD. These results suggest that our pre-training method is critical to obtaining the strong empirical results presented above.

We then assess the effect of training bidirectional representations by comparing “No NSP” with “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with particularly large drops on MRPC and SQuAD. For SQuAD it is intuitively clear that an LTR model will perform poorly at span and token prediction, since the token-level hidden states have no right-side context. For MRPC it is unclear whether the poor performance is due to the small data size or to the nature of the task, but we found this poor performance to be consistent across a full hyperparameter sweep with many random restarts.

To strengthen the LTR system, we tried adding a randomly initialized BiLSTM on top of it for fine-tuning. This does significantly improve results on SQuAD, but the results remain far worse than those of the pre-trained bidirectional model. It also hurts performance on all four GLUE tasks.

We note that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two model representations, as ELMo does. However: (a) this uses twice as many parameters as a single bidirectional model; (b) it is non-intuitive for tasks like QA, since the RTL model cannot condition the answer on the question; and (c) it is strictly less powerful than a deep bidirectional model, which can choose to use either left or right context at every layer.

Table 5: Ablation over the pre-training tasks using the BERT_BASE architecture. “No NSP” means the model is trained without the next-sentence-prediction task. “LTR & No NSP” means the model is trained as a left-to-right language model without the next-sentence-prediction task, as in OpenAI GPT. “+ BiLSTM” means adding a randomly initialized BiLSTM on top of the “LTR & No NSP” model during fine-tuning.

5.2 Influence of model size

In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with different numbers of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure described earlier.

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average validation-set accuracy over 5 random restarts of fine-tuning. We can see that larger models lead to a clear accuracy improvement across all four selected datasets, even for MRPC, which has only 3,600 labeled training examples and is substantially different from the pre-training tasks. It is perhaps surprising that such significant improvements are possible on top of models that are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we know of in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERT_BASE contains 110M parameters and BERT_LARGE contains 340M parameters.

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, as demonstrated by the language model perplexity on held-out training data shown in Table 6. However, we believe this is the first work to demonstrate that scaling to extreme model sizes also leads to large improvements on very small-scale tasks, provided the model has been sufficiently pre-trained.

Table 6: Ablation over BERT model size. #L = number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked language model perplexity on held-out training data.

5.3 Influence of training steps

Figure 4 shows MNLI validation accuracy after fine-tuning from a checkpoint that has been pre-trained for k steps. This allows us to answer the following questions:

  1. Q: Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy? A: Yes. BERT_BASE achieves almost 1.0% additional accuracy on MNLI when pre-trained for 1M steps compared to 500k steps.
  2. Q: Does masked language model pre-training converge more slowly than LTR pre-training, since only 15% of the words in each batch are predicted rather than every word? A: The masked language model does converge marginally slower than the LTR model. However, in terms of absolute accuracy, the masked language model begins to outperform the LTR model almost immediately.

Figure 4: Ablation over the number of training steps. The figure shows the fine-tuned MNLI accuracy starting from model parameters that have been pre-trained for k steps. The x-axis is the value of k.

5.4 Use the BERT feature-based approach

All of the BERT results presented so far have used the fine-tuning approach, in which a simple classification layer is added to the pre-trained model and all parameters are jointly fine-tuned on the downstream task. However, the feature-based approach, in which fixed features are extracted from the pre-trained model, has certain advantages. First, not every NLP task can be easily represented by a Transformer encoder architecture, so a task-specific model architecture may need to be added. Second, there are major computational benefits in pre-computing an expensive representation of the training data once and then running many experiments on top of that representation with cheaper models.

In this section, we evaluate how well BERT performs in the feature-based approach by generating ELMo-like pre-trained contextual representations for the CoNLL-2003 named entity recognition task. To do this, we use the same input representation as in Section 4.3, but use the activations from one or more layers without fine-tuning any of BERT’s parameters. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM placed before the classification layer.
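To make this setup concrete, here is a minimal PyTorch sketch of feeding frozen, concatenated BERT layer activations into a two-layer BiLSTM followed by a per-token classification layer. The tensor shapes, the per-direction hidden size, and the tag count are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

hidden, num_tags, seq_len = 768, 9, 128
# Concatenated activations from the top four BERT layers, kept frozen (no backprop into BERT).
features = torch.randn(1, seq_len, 4 * hidden)

bilstm = nn.LSTM(input_size=4 * hidden, hidden_size=hidden // 2,
                 num_layers=2, bidirectional=True, batch_first=True)
classifier = nn.Linear(hidden, num_tags)   # per-token NER tag scores

out, _ = bilstm(features)                  # (1, seq_len, hidden) after concatenating directions
tag_logits = classifier(out)
```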

The results are shown in Table 7. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This shows that BERT is effective for both the fine-tuning and feature-based approaches.

Table 7: Ablation over the BERT layers used in the feature-based approach on CoNLL-2003 named entity recognition. The activations from the specified layers are combined and fed into a two-layer BiLSTM, without back-propagation into BERT.

6. Conclusion

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks (those with few labeled examples) to benefit from very deep unidirectional architectures. Our major contribution is to further generalize these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

While these empirical results are compelling and in some cases even exceed human performance, important future work is to study what linguistic phenomena BERT may or may not capture.

References

All references are listed in the order in which they are cited in each section of the paper; references cited multiple times appear multiple times in the lists below.

References in the Abstract

Citation in BERT / Original paper title / Other
Peters et al., 2018 Deep contextualized word representations ELMo
Radford et al., 2018 Improving Language Understanding with Unsupervised Learning OpenAI GPT

1. References in the Introduction

Citation in BERT / Original paper title / Other
Peters et al., 2018 Deep contextualized word representations ELMo
Radford et al., 2018 Improving Language Understanding with Unsupervised Learning OpenAI GPT
Dai and Le, 2015 Semi-supervised sequence learning Andrew M Dai and Quoc V Le. 2015. In Neural Information Processing Systems, pages 3079-3087.
Howard and Ruder, 2018 Universal Language Model Fine-tuning for Text Classification ULMFiT; Jeremy Howard and Sebastian Ruder.
Bowman et al., 2015 A large annotated corpus for learning natural language inference Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning.
Williams et al., 2018 A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference Adina Williams, Nikita Nangia, and Samuel R Bowman.
Dolan and Brockett, 2005 Automatically constructing a corpus of sentential paraphrases William B Dolan and Chris Brockett. 2005.
Tjong Kim Sang and De Meulder, 2003 Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition Erik F Tjong Kim Sang and Fien De Meulder. 2003.
Rajpurkar et al., 2016 SQuAD: 100,000+ Questions for Machine Comprehension of Text SQuAD
Taylor, 1953 “Cloze Procedure”: A New Tool For Measuring Readability Wilson L Taylor. 1953.

2. References in Related Work

Citation in BERT / Original paper title / Other
Brown et al., 1992 Class-based n-gram models of natural language Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992.
Ando and Zhang, 2005 A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data Rie Kubota Ando and Tong Zhang. 2005.
Blitzer et al., 2006 Domain adaptation with structural correspondence learning John Blitzer, Ryan McDonald, and Fernando Pereira. 2006.
Collobert and Weston, 2008 A Unified Architecture for Natural Language Processing Ronan Collobert and Jason Weston. 2008.
Mikolov et al., 2013 Distributed Representations of Words and Phrases and their Compositionality CBOW Model; Skip-gram Model
Pennington et al., 2014 GloVe: Global Vectors for Word Representation GloVe
Turian et al., 2010 Word Representations: A Simple and General Method for Semi-Supervised Learning Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.
Kiros et al., 2015 Skip-Thought Vectors Skip-Thought Vectors
Logeswaran and Lee, 2018 An efficient framework for learning sentence representations Lajanugen Logeswaran and Honglak Lee. 2018.
Le and Mikolov, 2014 Distributed Representations of Sentences and Documents Quoc Le and Tomas Mikolov. 2014.
Peters et al., 2017 Semi-supervised sequence tagging with bidirectional language models Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017.
Peters et al., 2018 Deep contextualized word representations ELMo
Rajpurkar et al., 2016 SQuAD: 100,000+ Questions for Machine Comprehension of Text SQuAD
Socher et al., 2013 Deeply Moving: Deep Learning for Sentiment Analysis SST-2
Tjong Kim Sang and De Meulder, 2003 Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition Erik F Tjong Kim Sang and Fien De Meulder. 2003.
Dai and Le, 2015 Semi-supervised sequence learning Andrew M Dai and Quoc V Le. 2015. In Neural Information Processing Systems, pages 3079-3087.
Howard and Ruder, 2018 Universal Language Model Fine-tuning for Text Classification ULMFiT; Jeremy Howard and Sebastian Ruder.
Radford et al., 2018 Improving Language Understanding with Unsupervised Learning OpenAI GPT
Wang et al.(2018) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding GLUE
Conneau et al., 2017 Supervised Learning of Universal Sentence Representations from Natural Language Inference Data Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017.
McCann et al., 2017 Learned in Translation: Contextualized Word Vectors Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
Deng et al., 2009 ImageNet: A large-scale hierarchical image database J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009.
Yosinski et al., 2014 How transferable are features in deep neural networks? Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014.

3. References in BERT

Citation in BERT / Original paper title / Other
Vaswani et al. (2017) Attention Is All You Need Transformer
Wu et al., 2016 Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation WordPiece
Taylor, 1953 “Cloze Procedure”: A New Tool For Measuring Readability Wilson L Taylor. 1953.
Vincent et al., 2008 Extracting and composing robust features with denoising autoencoders denoising auto-encoders
Zhu et al., 2015 Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books BooksCorpus (800M words)
Chelba et al., 2013 One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling Billion Word Benchmark corpus
Hendrycks and Gimpel, 2016 Gaussian Error Linear Units (GELUs) GELUs

4. References from Experiments

Citation in BERT / Original paper title / Other
Wang et al.(2018) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding GLUE
Williams et al., 2018 A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference MNLI
Chen et al., 2018 First Quora Dataset Release: Question Pairs QQP
Rajpurkar et al., 2016 SQuAD: 100,000+ Questions for Machine Comprehension of Text QNLI
Socher et al., 2013 Deeply Moving: Deep Learning for Sentiment Analysis SST-2
Warstadt et al., 2018 The Corpus of Linguistic Acceptability CoLA
Cer et al., 2017 SemEval-2017 Task 1: Semantic Textual Similarity – Multilingual and Cross-lingual Focused Evaluation STS-B
Dolan and Brockett, 2005 Automatically constructing a corpus of sentential paraphrases MRPC
Bentivogli et al., 2009 The fifth pascal recognizing textual entailment challenge RTE
Levesque et al., 2011 The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47. WNLI
Rajpurkar et al., 2016 SQuAD: 100,000+ Questions for Machine Comprehension of Text SQuAD
Joshi et al., 2017 TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension TriviaQA
Clark et al., 2018 Semi-Supervised Sequence Modeling with Cross-View Training
Zellers et al., 2018 SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference SWAG

5. References to Ablation Studies

Citation in BERT / Original paper title / Other
Vaswani et al. (2017) Attention Is All You Need Transformer
Al-Rfou et al., 2018 Character-Level Language Modeling with Deeper Self-Attention