This post is the sixth in my series of BERT paper interpretations:

TinyBERT: an ultra-detailed application of model distillation, enough to answer your questions about distillation

Quantization techniques and ALBERT dynamic quantization

DistilBERT: BERT too expensive? I'm cheap and easy to use

[Paper share] RoBERTa: hello XLNet, you've just been beaten

XLNet paper introduction – the next wave that goes beyond BERT

Takeaway

Since its release, BERT has achieved great results in text understanding. However, much of the subsequent work has focused on the model itself and its outputs, and we still know little about why pre-trained models work so well. Understanding this not only helps us interpret the model better, but also helps us find its flaws and design better models. This post introduces two papers that run extensive experiments, from the perspectives of syntactic analysis and attention maps respectively, to analyze what makes BERT so strong.

What does BERT learn about the structure of language

  • Hal.inria.fr/HAL-0213163…

This paper mainly studies the grammatical and syntactic information learned at different BERT layers. **The BERT-base-uncased model is used, with 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters.** The authors design four experiments to analyze BERT's understanding of the grammar and syntax of text.

Experiment 1: Phrasal Syntax

First, the authors study BERT's understanding of phrases. For a given phrase span in the text, they build a span representation from each layer's output by combining the hidden vectors of the span's first and last tokens, and then visualize the span representations with t-SNE.
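As a rough illustration of this span-representation idea, here is a minimal sketch assuming the HuggingFace `transformers` library; the model name, the example spans, and the exact pooling recipe (first/last token vectors plus their product and difference) are my assumptions, not necessarily the paper's exact setup.

```python
# Sketch: span representations from every BERT layer, projected to 2-D with t-SNE.
import numpy as np
import torch
from sklearn.manifold import TSNE
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

inputs = tokenizer("the quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # embedding layer + 12 transformer layers

def span_rep(layer_hidden, start, end):
    """One simple pooling choice: combine the span's first and last token vectors."""
    h_s, h_e = layer_hidden[0, start], layer_hidden[0, end]
    return torch.cat([h_s, h_e, h_s * h_e, h_s - h_e])

spans = [(1, 4), (5, 9)]   # hypothetical (start, end) word-piece positions of two phrases
reps, layer_ids = [], []
for layer, layer_hidden in enumerate(hidden_states[1:], start=1):
    for s, e in spans:
        reps.append(span_rep(layer_hidden, s, e).numpy())
        layer_ids.append(layer)

# Project to 2-D; points can then be plotted and coloured by `layer_ids` to compare layers.
coords = TSNE(n_components=2, perplexity=5).fit_transform(np.stack(reps))
```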

Conclusion: phrase-level information is captured mainly in the lower layers and is gradually diluted as the number of layers increases.

Experiment 2: Probing Tasks

A probing task is a specific classification task that carries particular linguistic information. Auxiliary classifiers for these tasks are attached to different BERT output layers to judge how well each layer has learned that linguistic information.

Ten probing tasks are used in this paper: SentLen (sentence length), WC (word content), BShift (bigram shift), TreeDepth (syntax tree depth), TopConst (top constituents), Tense, SubjNum (subject number), ObjNum (object number), SOMO (sensitivity to noun/verb substitution), and CoordInv (random swapping of coordinated clauses).
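A minimal sketch of the probing idea, assuming HuggingFace `transformers` and scikit-learn; the mean-pooled sentence vector, the toy data, and the length-based label (a stand-in for SentLen) are illustrative assumptions, not the paper's protocol.

```python
# Sketch: freeze BERT, take a layer-l sentence vector, fit a simple probing classifier.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

def layer_sentence_vector(sentence, layer):
    """Mean-pool the hidden states of one layer as a sentence vector (a common, simple choice)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Toy probing data: label 0 = "short" sentence, 1 = "long" sentence (stand-in for SentLen buckets).
sentences = ["I ate .", "He runs fast .",
             "The committee that reviewed the proposals met again yesterday afternoon ."]
labels = [0, 0, 1]

for layer in (1, 6, 12):                      # a bottom, a middle, and a top layer
    X = [layer_sentence_vector(s, layer) for s in sentences]
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer}: train accuracy = {clf.score(X, labels):.2f}")
```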

The experimental results are as follows:

Conclusion: The results show that BERT encodes rich hierarchical linguistic information: surface features in the lower layers, syntactic features in the middle layers, and semantic features in the upper layers.

Experiment 3: Subject-verb Agreement

The goal of the subject-verb agreement task is to detect whether a neural network correctly encodes syntactic structure. The study found that predicting the number of the verb becomes harder as more "attractor" nouns, nouns whose number differs from the subject's, are inserted between the subject and the verb. For example, in "The keys to the cabinet are on the table", the singular noun "cabinet" can mislead a model into predicting a singular verb. This part again attaches an auxiliary classifier to each layer; in the results table, the columns correspond to the distracting nouns inserted between subject and verb (and their average distance).
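This is not the paper's auxiliary-classifier setup (that is closer to the probing sketch above), but a quick masked-language-model check in the spirit of Goldberg's 2019 agreement assessment can illustrate the phenomenon: ask BERT whether it prefers the correctly numbered verb as attractor nouns are inserted. The sentences below are my own toy examples.

```python
# Swapped-in illustration (not the paper's method): does BERT's masked-LM head still prefer
# the plural verb "are" for the plural subject "keys" as singular attractors are added?
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

sentences = [
    "the keys [MASK] on the table .",                                 # no attractor
    "the keys to the cabinet [MASK] on the table .",                  # one singular attractor
    "the keys to the cabinet in the hallway [MASK] on the table .",   # two attractors
]
are_id = tokenizer.convert_tokens_to_ids("are")
is_id = tokenizer.convert_tokens_to_ids("is")

for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    print(s, "-> prefers 'are' over 'is':", bool(logits[are_id] > logits[is_id]))
```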

Conclusion: The results show that the middle layers perform best in most cases, which supports the earlier hypothesis that syntactic features are mainly encoded in BERT's middle layers. In addition, as the number of inserted nouns increases, the higher layers handle the resulting long-range dependencies better than the lower layers, suggesting that BERT needs its depth to be competitive on most NLP tasks.

Experiment 4: Compositional Structure

The authors use TPDNs (Tensor Product Decomposition Networks) to explore whether BERT learns the compositional structure of text. A TPDN builds a representation by binding input symbols to roles from a pre-selected role scheme using tensor products and sums; one example of a role scheme is the path from the root of the syntax tree down to the word itself. For a given role scheme designed by the authors, if a TPDN can be trained to closely approximate the representations learned by a neural network, then that role scheme is likely to characterize the compositional structure the network has learned.

Five different role schemes are designed: left-to-right, right-to-left, bag-of-words, bidirectional, and tree.
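A rough TPDN-style sketch in PyTorch, under loud assumptions: the dimensions, the left-to-right role scheme, and the random target vector are placeholders; a real experiment would use tree-path roles and fit against actual BERT layer outputs.

```python
# TPDN-style sketch: bind each token ("filler") to a role embedding with an outer product,
# sum the bindings, and train a linear map to approximate a target representation.
import torch
import torch.nn as nn

vocab_size, n_roles = 1000, 32
d_filler, d_role, d_target = 32, 16, 768

filler_emb = nn.Embedding(vocab_size, d_filler)
role_emb = nn.Embedding(n_roles, d_role)
to_target = nn.Linear(d_filler * d_role, d_target)   # map flattened tensor product to BERT size

def tpdn_encode(token_ids, role_ids):
    f = filler_emb(token_ids)                  # (seq, d_filler)
    r = role_emb(role_ids)                     # (seq, d_role)
    bound = torch.einsum("sf,sr->sfr", f, r)   # outer product per position
    return to_target(bound.sum(dim=0).flatten())

# Toy training step against a fake "BERT representation" target.
token_ids = torch.tensor([5, 17, 42, 7])
roles = torch.arange(len(token_ids))           # simplest left-to-right role scheme
target = torch.randn(d_target)                 # stand-in for a real BERT layer vector

params = list(filler_emb.parameters()) + list(role_emb.parameters()) + list(to_target.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss = nn.functional.mse_loss(tpdn_encode(token_ids, roles), target)
loss.backward()
opt.step()
```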

Conclusion: BERT's higher layers learn a syntax-tree-like compositional structure.

What Does BERT Look At? An Analysis of BERT’s Attention

  • Link to paper: arxiv.org/abs/1906.04…

This paper explores BERT's interpretability from the perspective of attention maps, since attention weights indicate how important the other words are when computing the representation of the current word. The paper analyzes BERT's 144 attention heads (12 layers × 12 heads) to understand why BERT works so well.

Surface-Level Patterns in Attention

First, the authors visualize the attention weights and identify several distinct attention patterns. As shown in the figure below, some heads attend broadly to all words, some attend to the next token, some attend to the [SEP] token, and some attend to punctuation marks.
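For readers who want to poke at these patterns themselves, here is a hedged sketch assuming HuggingFace `transformers`; the example sentence and the two summary statistics (attention mass on [SEP], and attention entropy as a proxy for "broad" attention) are my choices, not the paper's exact measurements.

```python
# Sketch: extract BERT attention maps and summarize a couple of surface-level patterns.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("The cat sat on the mat .", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions    # tuple of 12 tensors, each (1, 12, seq, seq)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
sep_index = tokens.index("[SEP]")

for layer, att in enumerate(attentions, start=1):
    att = att[0]                               # (heads, seq, seq)
    # Fraction of attention mass each head puts on [SEP] (one of the observed patterns).
    sep_mass = att[:, :, sep_index].mean(dim=1)
    # Average attention entropy per head: high entropy ~ "broad" attention over all words.
    entropy = -(att * att.clamp_min(1e-9).log()).sum(dim=-1).mean(dim=-1)
    print(f"layer {layer}: max [SEP] mass = {sep_mass.max():.2f}, "
          f"mean entropy = {entropy.mean():.2f}")
```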

Probing Individual Attention Heads

The authors also analyze how well BERT recognizes dependency relations between words. What is a dependency relation? Simply put, it is the relation between a word and the word it depends on; dependency grammar treats the predicate (verb) as the center of the sentence, with the other components directly or indirectly related to it.

By analyzing the dependency relations between words, the authors find that no single head handles all dependency relations well, but particular heads identify particular dependency relations quite accurately.
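A toy sketch of head-level probing for a single dependency relation follows; the hand-labeled (dependent, head) pairs stand in for a parsed corpus and are assumptions for illustration only.

```python
# Sketch: for each attention head, check how often the most-attended token is the gold
# syntactic head of the dependent word (here, a tiny "det" example: "the" -> "dog"/"ball").
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

sentence = "the dog chased the ball"
gold_pairs = [(1, 2), (4, 5)]   # (dependent, head) word-piece positions after [CLS]; assumed single-piece tokens

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions

best = (0.0, None)
for layer, att in enumerate(attentions, start=1):
    for head in range(att.shape[1]):
        weights = att[0, head]                         # (seq, seq)
        predictions = weights.argmax(dim=-1)           # most-attended token for each position
        correct = sum(int(predictions[dep] == head_pos) for dep, head_pos in gold_pairs)
        acc = correct / len(gold_pairs)
        if acc > best[0]:
            best = (acc, (layer, head))
print("best head for this toy relation:", best)
```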

Probing Attention Head Combinations

The authors also design a probing classifier that combines attention heads to explore how much of the dependency structure between words BERT's attention captures. The results suggest that BERT's attention encodes a substantial amount of English syntax and word-level dependency information.
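A hedged sketch of the head-combination idea: use the attention weights from all 144 heads between a word pair as the feature vector of a simple classifier. The toy labels below are hand-made for illustration and are not the paper's data or its exact probing architecture.

```python
# Sketch: attention weights from every head between a word pair as features for head prediction.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

def pair_features(sentence, i, j):
    """Attention from token i to token j in every head, stacked into one feature vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**inputs).attentions
    return np.array([att[0, h, i, j].item() for att in attentions for h in range(att.shape[1])])

# Toy labels: 1 if token j is the syntactic head of token i, else 0 (hand-labeled here).
X = [pair_features("the dog barked", 1, 2), pair_features("the dog barked", 1, 3)]
y = [1, 0]
clf = LogisticRegression(max_iter=1000).fit(X, y)
```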

Clustering Attention Heads

Do different attention heads in the same layer learn the same behavior? To explore this question, the authors cluster the attention heads by the similarity of their outputs; the clustering results are shown below. It can be seen that attention heads in the same layer tend to have similar behavioral characteristics.
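A sketch of one way to cluster heads, assuming a single example sentence stands in for a corpus and using the average Jensen-Shannon distance between per-token attention distributions as the head-to-head distance, followed by a 2-D embedding; this matches the spirit, not necessarily the exact details, of the paper's analysis.

```python
# Sketch: pairwise distances between all 144 heads, then a 2-D embedding for inspection.
import numpy as np
import torch
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("A short example sentence for comparing attention heads .", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions

# Stack all 144 heads: (n_heads_total, seq, seq)
heads = torch.cat([att[0] for att in attentions], dim=0).numpy()
n = heads.shape[0]

# Pairwise distance: mean Jensen-Shannon distance between the two heads' attention rows.
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = np.mean([jensenshannon(heads[i, t], heads[j, t]) for t in range(heads.shape[1])])
        dist[i, j] = dist[j, i] = d

# 2-D embedding of heads; nearby points correspond to heads with similar behaviour.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
```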

Conclusions and Reflections

As can be seen from the two papers above, which analyze BERT from the perspectives of syntax and attention respectively:

  1. BERT does learn linguistic information, and deeper layers of BERT capture more information than shallow ones
  2. However, no single layer learns all the linguistic information in a text comprehensively; specific kinds of linguistic information tend to be learned at specific layers

In addition, I also learned about common probing experiments in linguistics and how to use common linguistic analysis methods such as syntactic analysis and dependency parsing. It's a lot to wrap one's head around.

Of course, this post only briefly introduces some of the conclusions of the two papers. The papers themselves describe the experimental methods and details at length; interested readers should read the originals.