0. Introduction

BERT (Bidirectional Encoder Representations from Transformers) has been a hot topic in the NLP field since Google announced BERT's excellent performance on 11 NLP tasks at the end of October 2018. In this article we will study the BERT model and understand how it works.

I've been writing the AI Basics series, and NLP (Natural Language Processing) is a very important part of it. (Huang Haiguang)

Released so far:

AI: Python development environment setup and tips

AI Basics: Easy to get started with Python

AI Basics: Numpy easy to get started

AI: Pandas

AI: Scipy(Scientific Computing Library) easy to get started

AI Basics: An easy introduction to Data Visualization (Matplotlib and Seaborn)

AI Fundamentals: Feature Engineering – Category Features

AI: Feature engineering – Numeric feature processing

AI: Feature engineering – Text feature processing

AI: Word embedding basics and Word2Vec

AI: The illustration Transformer

More updates to follow.

Author: jinjiajia95

Reference: blog.csdn.net/weixin_4074…

Original author: Jay Alammar

Original link: jalammar.github.io/illustrated…

The main text begins here.

Preface

2018 was a breakthrough year for natural language processing (NLP), with great advances in how we can help computers understand sentences in ways that best capture the underlying semantic relationships. In addition, open-source communities in the NLP space have released many powerful components that can be downloaded and used for free in our own model training. (You could say this is NLP's ImageNet moment, referring to a similar development in computer vision a few years ago.)

The newly released BERT is a milestone model for NLP, and its release is bound to usher in a new era of NLP. BERT is a model that has broken records on a large number of natural language processing tasks. Shortly after the BERT paper was published, the Google team also open-sourced the model's code and provided downloadable versions of the model pre-trained on large datasets. Making the model open source and providing pre-trained weights lets anyone build an NLP system on top of it, saving the time, effort, knowledge, and resources required to train a language model from scratch.

BERT integrates some of the best recent ideas in the NLP field, including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI Transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al.).

There are a few concepts you need to grasp to understand BERT properly, but before introducing them, let's first look at how BERT is used.

Example: Sentence classification

The simplest way to use BERT is to build a text classification model, whose structure is shown in the figure below:

To train such a model (mainly the classifier), the BERT model itself changes very little during the training phase. This training process is called fine-tuning and has its roots in Semi-supervised Sequence Learning and ULMFiT.

To make things easier to understand, let's use the example of a classifier. Classifiers belong to supervised learning, which means you need labeled data to train the model. For a spam classifier, the labeled dataset consists of two parts: the content of each message and its category ("spam" or "not spam").
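To make this concrete, here is a minimal fine-tuning sketch for such a spam classifier. It assumes the Hugging Face PyTorch implementation mentioned at the end of this article; the model name, toy messages, labels, and training settings are illustrative placeholders, not the paper's exact recipe.

```python
# Minimal fine-tuning sketch (assumes the Hugging Face `transformers` package;
# the toy data and settings are placeholders for illustration only).
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 labels: spam / not spam

texts = ["Win a free prize now!!!", "Are we still meeting at 3pm?"]
labels = torch.tensor([1, 0])  # 1 = spam, 0 = not spam

# Tokenize into fixed-length sequences; [CLS] and [SEP] are added automatically.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # loss is computed internally
outputs.loss.backward()
optimizer.step()
```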

Other examples of such use cases include:

  • Sentiment analysis. Input: a movie/product review. Output: is the review positive or negative? Example dataset: SST
  • Fact-checking. Input: a sentence. Output: "claim" or "not claim". A more ambitious/futuristic version takes a claim sentence as input and outputs "true" or "false".

Model architecture

Now that you've seen an example of how BERT can be used, let's take a closer look at how it works.

The BERT paper introduces two model sizes:

  • BERT BASE: comparable in size to the OpenAI Transformer, for performance comparison
  • BERT LARGE: a very large model that achieves the state-of-the-art results reported in the paper.

BERT's basic building block is the Transformer encoder. For an introduction to the Transformer, read the author's previous article, The Illustrated Transformer, which explains the basic concepts of the Transformer model that BERT builds on and that we will discuss next.

Both BERT models have a large number of encoder layers (referred to as Transformer Blocks in the paper): 12 for the base version and 24 for the large version. They also have larger feedforward networks (768 and 1024 hidden units respectively) and more attention heads (12 and 16 respectively). This exceeds the reference configuration in the Transformer paper (6 encoder layers, 512 hidden units, and 8 attention heads).
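For reference, the two configurations can be written down with the BertConfig class from the Hugging Face implementation mentioned at the end of this article (a sketch; the numbers simply restate the paragraph above, and the parameter count is approximate):

```python
# Sketch: the two published BERT sizes expressed as configs
# (assumes the Hugging Face `transformers` package).
from transformers import BertConfig, BertModel

base_config = BertConfig(hidden_size=768, num_hidden_layers=12,
                         num_attention_heads=12, intermediate_size=3072)
large_config = BertConfig(hidden_size=1024, num_hidden_layers=24,
                          num_attention_heads=16, intermediate_size=4096)

model = BertModel(base_config)  # randomly initialized, not pre-trained
print(sum(p.numel() for p in model.parameters()))  # roughly 110M parameters for base
```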

Model input



The first token of the input is [CLS]; its meaning is simple: Classification.

BERT encodes its input in the same way as the Transformer. A fixed-length sequence of tokens is taken as input, and the data flows upward through the stack: each layer applies self-attention, passes the result through a feedforward network, and hands it to the next encoder.

This architecture seems to follow the Transformer's architecture (apart from the number of layers, which is a parameter we can set). So what makes BERT different from the Transformer? Perhaps some clues can be found in the model's output.

Model output

Each position outputs a vector of hidden-layer size (768 for base BERT). For text classification, we focus on the output at the first position (the position holding the [CLS] token), as shown in the figure below.

This vector can now be used as the input to a classifier of our choice. The paper points out that very good results can be achieved using just a single-layer neural network as the classifier. Here's how it works:

This example only has two classes, spam and non-spam. If you have more labels, you just need to increase the number of output neurons and change the final activation function to softmax.
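As a minimal sketch of that classification head in plain PyTorch (the 768 dimension follows base BERT; the [CLS] vector here is a random stand-in):

```python
import torch
import torch.nn as nn

# Single-layer classifier over the [CLS] vector described above.
classifier = nn.Linear(768, 2)          # 2 outputs: spam / not spam

cls_vector = torch.randn(1, 768)        # stand-in for BERT's [CLS] output
logits = classifier(cls_vector)
probs = torch.softmax(logits, dim=-1)   # with more labels, widen the layer and keep softmax
print(probs)
```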

Parallels with Convolutional Nets (BERT vs. convolutional neural networks)

For those with a background in computer vision, this vector hand-off should be reminiscent of what happens between the convolutional part of a network such as VGGNet and the fully connected classification part at the end of the network. You can think of it this way, and in practice it is a convenient way to think about it.

The new era of word embedding

BERT's open-source release was accompanied by an update to word embeddings. So far, word embeddings have been a major component of NLP models that deal with natural language. Methods such as Word2vec and GloVe have been widely used for these problems, and it is worth reviewing their development before we use the new embeddings.

Word Embedding Recap

In order for a machine to learn from text, we need some way to represent the text numerically. The Word2vec algorithm uses fixed-dimensional vectors to represent words in a way that captures the semantics of words and the relationships between them. The vectorized representations from Word2vec can be used to determine whether words are similar or opposite, or whether "man" and "woman" are related to each other the way "king" and "queen" are. It also captures some grammatical relationships, which are useful in English; for example, the relationship between "had" and "has" is the same as that between "was" and "is".
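As a quick illustration of these relationships, here is a sketch using the gensim library and one of its downloadable pre-trained vector sets (an assumption on my part; any Word2vec or GloVe vectors exposed as gensim KeyedVectors work the same way):

```python
import gensim.downloader as api

# Download a small set of pre-trained vectors (GloVe here; Word2vec
# vectors expose the same KeyedVectors interface).
vectors = api.load("glove-wiki-gigaword-100")

# "man" is to "king" as "woman" is to ... ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Grammatical analogy: "had" is to "has" as "was" is to ... ?
print(vectors.most_similar(positive=["has", "was"], negative=["had"], topn=1))
```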

With this approach we can train a word-embedding model on a large amount of text data, and the resulting embeddings can be widely reused in other NLP tasks. This is a great idea: it means that startups or teams without large computing resources can simply download an open-source pre-trained embedding model and use it for their NLP tasks.

ELMo: Context

One obvious problem with the word-embedding approach described above is that with a pre-trained word-vector model, each word gets a single fixed vector regardless of context. "Wait a minute," said Peters et al., 2017, McCann et al., 2017, and again Peters et al., 2018 in the ELMo paper.

This is similar to polyphonic characters in Chinese. Take the character 长: in 长度 (length) it is pronounced cháng and means "long", while in 长高 (grow taller) it is pronounced zhǎng and means "to grow". Why not let the surrounding context, "degree" or "taller", decide which reading and meaning of 长 applies? This is exactly the idea behind contextualized word embeddings.

ELMo changes the Word2vec-style approach of assigning each word a fixed vector: it looks at the entire sentence before assigning a vector to each word, using a bi-LSTM to compute the corresponding word vectors.

ELMo has made an important contribution to solving the context problem in NLP. Its LSTM can be trained on a large amount of text data relevant to our task, and the trained model can then be used as a source of word vectors for other NLP tasks.

What’s ELMo’s secret?

ELMo trains a model that takes a sequence of words as input and outputs the most likely next word. Think of the predictive text in a phone's input method; yes, that's how it works. This task is called language modeling in NLP. Such a model is easy to train because we have vast amounts of text data and can learn from it without labels.

The figure above shows part of ELMo's pre-training process. We need to complete the following task: given the input "Let's stick to", predict the most likely next word. If the model is trained on a large dataset, then at prediction time it can accurately predict the word we expect. For example, given the input "machine", the most likely continuation is "learning" rather than "grocery shopping".
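A toy version of such a next-word language model can be sketched in PyTorch as follows (forward-only and word-level for clarity; the real ELMo model is character-aware and bidirectional):

```python
import torch
import torch.nn as nn

class NextWordLSTM(nn.Module):
    """Toy forward language model: embed -> LSTM -> project to vocabulary."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden_states)  # next-word logits at every position

model = NextWordLSTM()
tokens = torch.randint(0, 10000, (1, 3))      # stand-in for "Let's stick to"
logits = model(tokens)
next_word_id = logits[0, -1].argmax().item()  # most likely next word
```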

As can be seen from the figure above, each unrolled LSTM step completes its prediction at the final step.

Oh, and the real ELMo goes one step further: it predicts not just the next word but also the previous one (a bi-directional LSTM).

ELMo produces contextualized word embeddings by combining the hidden states (and the initial word embedding) in a particular way: concatenation followed by a weighted sum.
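A sketch of that combination step with toy tensors (real ELMo learns the per-layer weights and a scale factor on the downstream task; the sizes here just mirror ELMo's 1024-dimensional states):

```python
import torch

# Toy stand-ins for ELMo's layers: the initial embedding plus two
# bi-LSTM hidden states, each of shape (batch=1, seq_len=6, dim=1024).
layers = torch.stack([torch.randn(1, 6, 1024) for _ in range(3)])  # (3, 1, 6, 1024)

# Task-specific softmax-normalized weights and a scale factor, both
# learned in the downstream task (fixed placeholders here).
weights = torch.softmax(torch.tensor([0.2, 0.5, 0.3]), dim=0)
gamma = 1.0

# Weighted sum over layers gives one contextualized embedding per token.
elmo_embedding = gamma * (weights[:, None, None, None] * layers).sum(dim=0)
print(elmo_embedding.shape)  # torch.Size([1, 6, 1024])
```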

ULM-FiT: Applying transfer learning to NLP

The ULM-FiT mechanism makes better use of the model's pre-trained parameters. Going beyond embeddings and contextual embeddings, ULM-FiT introduces a language model and a process for effectively fine-tuning that language model for various NLP tasks. This gives NLP a transfer-learning workflow as convenient as the one computer vision enjoys.

The Transformer: Structures beyond LSTM

The release of the Transformer paper and code, and its excellent results on machine translation and other tasks, led some researchers to believe that the Transformer is a replacement for the LSTM. The Transformer does handle long-term dependencies better than the LSTM. Its encoder-decoder structure is very well suited to machine translation, but how do you use it for text classification? The answer: use it to pre-train a language model that can then be fine-tuned for other tasks.

OpenAI Transformer: Pre-training of Transformer decoder for language models

It turns out we don't need a complete Transformer to apply transfer learning and obtain a good, fine-tunable language model for NLP tasks. The Transformer decoder alone is enough. The decoder is a natural choice for language modeling (predicting the next word) because it is built to mask future tokens, a valuable feature when generating a translation word by word.

The model stacks twelve decoder layers. Since there is no encoder in this setup, these decoders do not have the encoder-decoder attention sublayer that Transformer decoder layers normally have. What remains is the self-attention layer (masked so it doesn't peek at future tokens).
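The "masked so it doesn't peek at future tokens" part boils down to a triangular attention mask. A minimal sketch in PyTorch (the scores are random stand-ins for query-key dot products):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # stand-in attention scores (query x key)

# Causal mask: position i may only attend to positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))

attention = torch.softmax(scores, dim=-1)
print(attention)  # upper triangle is zero: no peeking at future tokens
```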

With this structural adjustment, we can continue to train the model on a language-modeling task: predicting the next word on large unlabeled datasets. For example, feed the model 7,000 books (books are excellent training samples, much better than blogs and tweets). The training setup looks like this:

Transfer Learning to Downstream Tasks

After pre-training the OpenAI Transformer and doing some fine-tuning, we can use the trained model for other downstream NLP tasks (such as training a language model and then classifying with its hidden states). The following shows how this neat trick works (again with the example above: spam vs. non-spam).

The OpenAI paper outlines many examples of using the Transformer with transfer learning to handle different types of NLP tasks, as shown in the following figure:

BERT: From Decoders to Encoders

The OpenAI Transformer provides a refined, Transformer-based pre-trained model. But in the transition from the LSTM to the Transformer, something went missing: ELMo's language model is bidirectional, while the OpenAI Transformer trains only a forward language model. Can we build a Transformer model with bi-LSTM-style bidirectionality?

BERT: "Hold my beer."

The masked language model BERT says: "I'm going to use Transformer encoders."

Ernie scoffed: "Hmph, but then you can't take in the whole text the way a bi-LSTM does."

BERT confidently replied: "We'll use masks."

Explaining the mask:

A language model predicts the next word from the preceding words, but with self-attention each position can see itself, so "predicting" a word the model can already read is meaningless (it would trivially be 100% correct). A mask is therefore used to hide the words that need to be predicted.

The diagram below:
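To see masked prediction in action with a released model, here is a small sketch that assumes the Hugging Face transformers package mentioned at the end of this article (the example sentence and the predicted word are illustrative):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hide one word with [MASK]; BERT must predict it from both directions.
# (During pre-training, about 15% of tokens are masked this way.)
inputs = tokenizer("The man went to the [MASK] to buy milk.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_position].argmax().item()
print(tokenizer.decode([predicted_id]))  # e.g. "store"
```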



Two-sentence Tasks

If we look back at how the OpenAI Transformer converts its inputs for different tasks, we see that some tasks require two sentences as input and ask the model to make a more sophisticated judgment, for example whether the two sentences are similar, or, given a Wikipedia entry and a question about it, whether the entry answers the question. Can our model handle inputs like this?

To make BERT better at modeling the relationship between two sentences, pre-training includes an additional task: given two sentences (A and B), is B the sentence that actually follows A? The label is 0 or 1.
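In practice the two sentences are packed into a single input, separated by [SEP] and distinguished by segment (token type) ids. A sketch using the Hugging Face tokenizer (an assumed interface; the sentences are arbitrary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Two sentences are packed into one input: [CLS] A [SEP] B [SEP].
encoded = tokenizer("The man went to the store.",
                    "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])  # 0s for sentence A, 1s for sentence B
```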

Special NLP tasks

The BERT paper introduces several types of NLP tasks that BERT can handle:

  1. Short-text similarity
  2. Text classification
  3. Question answering
  4. Sequence tagging (semantic annotation)

Using BERT for feature extraction

Fine-tuning is not the only way to use BERT. Just like ELMo, you can use pre-trained BERT to create contextualized word embeddings and then feed those embeddings into your existing model.

Which vector works best as a contextualized embedding? I think it depends on the task. The paper examines six options (compared against the fine-tuned model, which scores 96.4):
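A sketch of this feature-extraction mode (again assuming the Hugging Face transformers package): it exposes all hidden layers so you can try the combinations compared in the paper, such as concatenating or summing the last four layers.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("Help ! My dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: the embedding layer plus 12 encoder layers, each (batch, seq, 768)
hidden_states = outputs.hidden_states

last_four = hidden_states[-4:]
concat_last_four = torch.cat(last_four, dim=-1)    # (batch, seq, 3072)
sum_last_four = torch.stack(last_four).sum(dim=0)  # (batch, seq, 768)

# These per-token vectors can now feed an existing model, e.g. a NER tagger.
print(concat_last_four.shape, sum_last_four.shape)
```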

How to use BERT

The best way to try BERT is through the BERT FineTuning with Cloud TPUs notebook hosted on Google Colab:

(colab.research.google.com/github/tens…

If you haven't used Google Cloud TPUs before, this is a good opportunity to try them. BERT also runs on TPUs, CPUs, and GPUs.

The next step is to look at the code in BERT’s repository:

  1. The model is built in modeling.py (the BertModel class) and is essentially identical to a vanilla Transformer encoder.

    (github.com/google-rese…

  2. run_classifier.py is an example of the fine-tuning process.

    It also builds the classification layer for the supervised model.

    (github.com/google-rese…

    If you want to build your own classifier, look at the create_model() method in that file.

  3. Several pre-trained models are available for download, including a multilingual model covering 102 languages, all trained on Wikipedia data.

    BERT doesn't treat whole words as tokens; instead it works with WordPieces.

    tokenization.py is the tokenizer that converts your words into WordPieces suitable for BERT (see the sketch after this list).

    (github.com/google-rese…
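As promised above, a quick look at WordPiece tokenization. This sketch uses the Hugging Face tokenizer rather than the repo's tokenization.py, which is an assumption on my part, but it illustrates the same idea of splitting words into sub-word pieces:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("BERT uses WordPiece tokenization")
print(tokens)  # rare words are split into pieces marked with a leading "##"
```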

You can also check out BERT’s PyTorch implementation.

(github.com/huggingface…

The AllenNLP library uses this implementation to allow BERT embeddings to be used with any model.

(github.com/allenai/all…

(github.com/allenai/all…
