Directory:

  1. A brief review of ELMo and Transformer
  2. DAE and Masked Language Model
  3. BERT model in detail
  4. Different training methods of BERT model
  5. How to apply BERT model to actual projects
  6. How to slim down BERT (model compression)
  7. BERT’s problems

1. A brief review of ELMo and Transformer

1.1 Polysemy

1.2 ELMo

ELMo is essentially a language model: given the context of a text, it predicts the next word. One of ELMo's most important features is that it partially, though not completely, solves the problem of polysemy, providing a decent solution to it. Embeddings such as Word2Vec and GloVe are static: once trained, a word's embedding never changes. ELMo, in contrast, takes contextual information into account and produces three embedding features for each word, coming from its different layers. For different downstream tasks, different weights are assigned to these three embeddings, and the weighted combination (a weighted average of the vectors) is used as the word's final embedding.

ELMo uses long-range context information, rather than the fixed window-size context used by other models. ELMo is built on bi-directional LSTMs (Bi-LSTM); if its LSTMs were replaced with Transformers, the structure would be essentially the same as BERT's.

1.3 Transformer

Differences between LSTM and Transformer:

  • LSTM training, being RNN-based, is iterative: the next word can only be fed in after the current word has passed through the LSTM unit, so it is a serial process.
  • Transformer training is parallel: all words can be processed at the same time, which greatly speeds up computation. The Transformer also uses positional embeddings to help the model understand word order. Self-attention plus fully connected layers form the basic building block of the Transformer.

The most important thing in the Transformer is multi-head Attention.
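To make the self-attention computation above concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. It is an illustration only, not code from the lecture; the shapes and names (d_k, the toy tensors) are assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    # Score every token against every other token, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # attention distribution per token
    return weights @ v                    # weighted sum of the value vectors

# Toy example: 1 sentence, 2 heads, 4 tokens, head dimension 8
q = k = v = torch.randn(1, 2, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 2, 4, 8])
```

Multi-head attention simply runs this computation in parallel over several heads (the second tensor dimension here) and concatenates the results.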

The five core parts of the Transformer Encoder are shown below:

Skip connections: prevent gradients from vanishing during back-propagation.

2. DAE and Masked Language Model

2.1 What are DAE and Masked Language Models

Traditional statistical machine-learning approaches face unprecedented challenges from unfamiliar, high-dimensional data such as images and speech. Traditional feature engineering struggles to be effective because the data are high-dimensional, the raw features are monotonous, and the noise is widely distributed.

To address the problem of high dimensionality, linear dimensionality-reduction methods such as PCA appeared. PCA's mathematical theory is indeed impeccable, but it only works well for linear data. Finding simple, automatic, and intelligent feature-extraction methods therefore remains a focus of machine-learning research.

CNNs found a new way forward for signal-like data, extracting features through convolution and down-sampling. But what about general, non-signal data?

Researchers proposed the AutoEncoder, whose idea is that the original input x is weighted (Wx + b), mapped through a nonlinearity (Sigmoid) to a hidden code, and then mapped back through a reverse weighting to produce a reconstruction x'. The network structure is shown as follows:

The AutoEncoder training process is interesting. First, it does not use data labels to compute the error and update the parameters, so it is unsupervised learning. Second, it extracts the sample's features in a simple, rough way through a neural-network-like structure with two hidden mappings (encode and decode).
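As a minimal sketch of this encode-then-reconstruct idea (a PyTorch toy with an assumed 784-dimensional input, not the original's network):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encode the input to a low-dimensional code, then reconstruct it."""
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(32, 784)                      # a batch of unlabeled samples
loss = nn.functional.mse_loss(model(x), x)   # reconstruct the input itself
loss.backward()
```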

To alleviate the classical AutoEncoder's tendency to overfit, one method is to add random noise to the input. In his 2008 paper, “Extracting and Composing Robust Features with Denoising Autoencoders,” Vincent proposed a modified version of the AutoEncoder: the Denoising AutoEncoder (DAE).

How do you make the features robust? The idea is to corrupt the original input matrix according to a probability distribution (usually a binomial distribution), randomly zeroing out individual values so that some features of some samples appear to be missing. The network then computes its output from this corrupted data, and the reconstruction error is measured against the original, uncorrupted input, so the network learns to recover the original data from the corrupted version (a minimal sketch of this corruption step appears after the list below). The network structure is shown as follows:

Corrupting the data is useful for two reasons:

  • Compared with training on uncorrupted data, training on corrupted data produces weights with less noise, hence the name “denoising.” The reason is not hard to understand: some of the input noise happens to be erased along with the erased values.
  • Corrupted data also reduces, to some extent, the gap between training data and test data. With parts of it erased, the corrupted training data becomes somewhat more like unseen test data (training and test data inevitably differ; we want the model to keep what they share and discard what they don't). This improves the robustness of the trained weights.
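Here is a hedged sketch of the corruption step described above: values are zeroed out according to a Bernoulli (binomial) draw, and the reconstruction error is still measured against the original input. The drop probability of 0.3 is an arbitrary choice for illustration.

```python
import torch

def corrupt(x, drop_prob=0.3):
    """Randomly zero out entries of x (Bernoulli/binomial corruption)."""
    mask = torch.bernoulli(torch.full_like(x, 1.0 - drop_prob))
    return x * mask

x = torch.rand(32, 784)
x_tilde = corrupt(x)                  # corrupted input fed to the network
# reconstruction = model(x_tilde)     # using the AutoEncoder sketched earlier
# loss = mse_loss(reconstruction, x)  # error is measured against the ORIGINAL x
```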

2.2 Relationship between BERT and DAE and Masked Language Model

BERT is essentially a Denoising AutoEncoder (DAE) model built on the Transformer Encoder; its whole training setup follows the DAE idea. In the BERT paper this part is called the Masked Language Model (MLM). MLM is not strictly a language model, because the training process does not follow the standard language-model objective. Instead, BERT randomly replaces some words with a MASK token and then predicts the masked words, which is exactly the DAE process. BERT has two main released models, BERT-Base and BERT-Large; BERT-Base uses a 12-layer Encoder and BERT-Large a 24-layer one, so the model has a very large number of parameters.

Although BERT performs well, it also has some problems. For example, BERT cannot be used to generate text. Since BERT is trained with a DAE-style objective, it lacks the generation ability of models trained as language models. Earlier methods such as NNLM and ELMo are trained as language models, so sentences and text can be generated from the trained models. But such generative approaches have their own problem: when it comes to understanding the meaning of a word in context, a language model only considers the words before the target word, not the words after it!

When BERT was proposed in 2018, it caused an explosive reaction, because it set new records on so many benchmarks, and it essentially kicked off a period of rapid development in the field.

3. BERT model in detail

3.1 BERT profile

Bidirectional: BERT's overall model structure is similar to ELMo's; both are bidirectional.

Encoder: BERT uses only the Encoder part of the Transformer.

Representation: a Representation of words.

Transformer: Transformer is BERT’s core internal element.

The basic idea of BERT is the same as Word2Vec's CBOW: given the surrounding context, predict the word in the middle. BERT's structure is similar to ELMo's in that both are bidirectional. Note that BERT was not the first model to use the Transformer; GPT was.

3.2 BERT’s model structure

The structure of BERT's model is sequence-to-sequence in form, and its core is the Transformer Encoder, which contains the five important parts introduced above.

3.3 BERT’s input

Next, look at BERT’s input, which has three parts: Token Embeddings, Segment Embeddings, and Position Embeddings. These three parts are learnable throughout the process.

Special characters:

  • CLS, the Classification Token, is used for classification tasks. Why is the CLS token placed first? Since BERT is a parallel structure, CLS could equally well be placed at the end or in the middle; putting it first is simply convenient.
  • SEP, the Separator Token, is used to distinguish between two sentences, since BERT is usually trained on pairs of sentences. In the picture below, you can see that SEP is the token separating the two sentences.

  • Token Embedding: the embedding of each input token.
  • Segment Embedding: indicates whether the input token belongs to sentence A or sentence B.
  • Position Embedding: encodes the position of each token.

Finally, the three embeddings are added together position-wise to form BERT's final input embedding.
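A minimal sketch of how the three embeddings combine (dimensions follow BERT-BASE; the token ids are toy values, with 101 and 102 standing for CLS and SEP in BERT's vocabulary):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)
segment_emb = nn.Embedding(2, hidden)         # sentence A (0) or sentence B (1)
position_emb = nn.Embedding(max_len, hidden)  # learned positions, not sinusoids

input_ids   = torch.tensor([[101, 7592, 102, 2088, 102]])  # CLS ... SEP ... SEP
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
positions   = torch.arange(input_ids.size(1)).unsqueeze(0)

# BERT's input embedding: position-wise sum of the three embeddings
x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
print(x.shape)  # torch.Size([1, 5, 768])
```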

4. Different training methods of BERT model

4.1 BERT’s pre-training

How does BERT do pre-training? There are two tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). When training BERT, the two tasks are trained simultaneously, so BERT's loss function is the sum of the loss functions of the two tasks; this is multi-task training.
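A tiny sketch of the multi-task objective, assuming hypothetical logits from the two heads (not BERT's actual training code):

```python
import torch
import torch.nn.functional as F

# Hypothetical outputs of the two pre-training heads for one batch
mlm_logits = torch.randn(8, 30522)       # predictions for 8 masked positions
mlm_labels = torch.randint(0, 30522, (8,))
nsp_logits = torch.randn(4, 2)           # IsNext / NotNext for 4 sentence pairs
nsp_labels = torch.randint(0, 2, (4,))

# Multi-task objective: the two task losses are simply added
loss = F.cross_entropy(mlm_logits, mlm_labels) + F.cross_entropy(nsp_logits, nsp_labels)
```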

BERT officially provides two versions of the model: the BASE version and the LARGE version. The BASE version has 12 Transformer layers, a hidden embedding dimension of 768, 12 attention heads, and roughly 110 million parameters in total. The LARGE version has 24 Transformer layers, a hidden dimension of 1024, 16 heads, and roughly 340 million parameters.

4.2 BERT – Masked Language Model

What is the Masked Language Model? It was inspired by cloze tests. Specifically, BERT selects 15% of the tokens for masking, and these selected tokens are handled in three ways: 80% of them are replaced with the MASK token; 10% are replaced with random other tokens; and 10% are left unchanged. Finally, the loss is computed only over the selected tokens, i.e. the 15% that were chosen for masking.
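A hedged sketch of this 15% / 80-10-10 masking recipe on a toy token list (the tiny vocabulary and helper name are made up for illustration):

```python
import random

MASK, VOCAB = "[MASK]", ["the", "cat", "sat", "on", "mat", "dog"]

def mask_tokens(tokens, mask_rate=0.15):
    """Return (corrupted tokens, labels); labels are None except at chosen positions."""
    out, labels = list(tokens), [None] * len(tokens)
    for i in range(len(tokens)):
        if random.random() < mask_rate:          # select 15% of tokens
            labels[i] = tokens[i]                # loss is computed only here
            r = random.random()
            if r < 0.8:
                out[i] = MASK                    # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = random.choice(VOCAB)    # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return out, labels

print(mask_tokens("the cat sat on the mat".split()))
```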

4.3 BERT – Next Sentence Prediction

Next Sentence Prediction focuses more on the relationship between two sentences. Next Sentence Prediction is simpler than the Masked Language Model task.
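In the original setup, roughly half of the sentence pairs are real consecutive sentences (labeled IsNext) and half pair a sentence with a random one (NotNext). A toy sketch of constructing such pairs (the helper and sample sentences are invented for illustration):

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one (sentence_a, sentence_b, label) pair for Next Sentence Prediction."""
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        return sentence_a, doc_sentences[i + 1], "IsNext"       # true next sentence
    return sentence_a, random.choice(all_sentences), "NotNext"  # random sentence

doc = ["He went to the store.", "He bought milk.", "Then he went home."]
corpus = doc + ["The weather was nice.", "Dogs bark loudly."]
print(make_nsp_example(doc, corpus))
```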

4.4 BERT – Training Tips

4.5 BERT – What does it look like?

After BERT is trained, we can look at its internal mechanisms. BERT's BASE version has 12 heads per layer. Does each head have the same function? As shown in the picture below, in the first head shown the tokens attend broadly to one another; for head 3-1, more attention is paid to the next word; for head 8-7, more attention is paid to the sentence separator (SEP); and for head 11-6, more attention is paid to the sentence-final period.

Each head therefore plays a different role, and this is part of why BERT is so strong: its multi-head structure can capture different kinds of features, both global and local.

BERT's BASE version has 12 Transformer layers. In the figure below, each color represents one Transformer layer, and points of the same color cluster close together: heads within the same layer are very similar!

Putting the two figures together: for the 12-layer, 12-head BERT model, heads within the same layer serve similar functions, while the attention in different heads across the model focuses on completely different things.

4.6 BERT – What does it learn?

The paper “Tenney I, Das D, Pavlick E. BERT Rediscovers the Classical NLP Pipeline [J]. arXiv preprint arXiv:1905.05950, 2019.” examines how much each Transformer layer of BERT's LARGE version contributes to different NLP tasks.

As you can see from the figure above, the longer the blue bar, the more that layer contributes to the NLP task. For the coreference (coref.) task, you can see that layers 18, 19, and 20 play a larger role.

4.7 BERT – Multilingual Version

Related GitHub address: github.com/google-rese…

5. How to apply BERT model to actual projects

We now have the pre-trained BERT model, so what NLP tasks can we use it for?

  • Classification
  • Question Answering
  • Named Entity Recognition (NER)
  • Chat Bot (Intent Classification & Slot Filling)
  • Reading Comprehension
  • Sentiment Analysis
  • Reference Resolution
  • Fact Checking
  • etc.

5.1 Classification

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [J]. arXiv preprint arXiv:1810.04805, 2018.
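As a hedged illustration of fine-tuning BERT for classification, here is a sketch using the Hugging Face transformers library (an assumption of this write-up, not a tool named in the lecture):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The CLS representation is pooled and fed to a classification head
inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])                 # e.g. 1 = positive sentiment
outputs = model(**inputs, labels=labels)   # returns loss and logits
outputs.loss.backward()                    # fine-tune end to end
```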

5.2 Question Answering

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [J]. arXiv preprint arXiv:1810.04805, 2018.

Let’s look at how to use BERT in QA systems:
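As a hedged sketch of the QA setup (the question and the passage are packed into one sequence, and the model predicts the start and end positions of the answer span), again using Hugging Face transformers classes as an assumed implementation; a plain bert-base-uncased checkpoint has an untrained QA head, so in practice you would load one fine-tuned on SQuAD:

```python
from transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Where do pandas live?"
context = "Pandas live in the mountains of central China."
inputs = tokenizer(question, context, return_tensors="pt")  # "[CLS] question [SEP] context [SEP]"
outputs = model(**inputs)

# The two heads score each position as a possible answer start / end
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```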

For more details, see Zhang Junlin's Zhihu article “Innovations in the BERT Era (Applications): BERT's Application Progress Across NLP Fields”: zhuanlan.zhihu.com/p/68446772

5.3 Named Entity Recognition (NER)

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [J]. arXiv preprint arXiv:1810.04805, 2018.

5.4 Chat Bot (Intent Classification & Slot Filling)

Related papers: [1] Zhuo Z, Chen Q, Wang W. BERT for Joint Intent Classification and Slot Filling [J]. arXiv preprint arXiv:1902.10909, 2019.

5.5 Reading Comprehension

6. How to slim down BERT (model compression)

BERT performs very well, but it has too many parameters. Can we compress the BERT model to make it easier to use? Common model-compression methods are as follows:

  • Pruning – remove parts of the model
  • Quantization – convert high-precision floating-point weights to lower-precision representations
  • Distillation – teach a small model to mimic the big one (a minimal sketch follows below)
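As a hedged sketch of the distillation idea (soft targets from the big teacher blended with the hard labels; the temperature T and weight alpha are illustrative choices, not values from the lecture):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the hard-label loss with a soft-target loss from the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return alpha * hard + (1 - alpha) * soft

student_logits = torch.randn(4, 2, requires_grad=True)
teacher_logits = torch.randn(4, 2)            # from the big, frozen BERT teacher
labels = torch.randint(0, 2, (4,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```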

6.1 Knowledge Distillation

7. BERT’s problems

Related papers: [1] Niven T, Kao H Y. Probing Neural Network Comprehension of Natural Language Arguments [J]. arXiv preprint arXiv:1907.07355, 2019. This paper points out that existing data sets do not adequately assess the performance of the BERT model. [2] Si C, Wang S, Kan M Y, et al. What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? [J]. arXiv preprint arXiv:1910.12391, 2019. In this paper, distracting text was added to the data sets, and the results showed that BERT performed very poorly.

8. Reference

[1] This article is Microstrong's notes on the live course “From Transformer to BERT Model” taught by Ge Hancheng on Bilibili. Live broadcast address: live.bilibili.com/11869202
[2] From BERT and RoBERTa to ALBERT – Li Wenzhe's article on Zhihu: https://zhuanlan.zhihu.com/p/84559048
[3] Denoising Autoencoder, address: www.cnblogs.com/neopenx/p/4…
[4] Natural Language Processing (NLP) free live course (Greedy Academy), address: https://www.bilibili.com/video/av89296151?p=3
