
StructBERT

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

StructBERT is an improvement on BERT proposed by Alibaba. It has achieved good results and currently ranks second on the GLUE leaderboard

First of all, let’s look at the following two sentences in English and Chinese

i tinhk yuo undresatnd this sentneces.

Research shows that the order of Chinese characters does not necessarily affect reading (in the Chinese original, the characters of this very sentence are scrambled). For example, it is often only after you finish reading the sentence that you realize every word in it is jumbled

In fact, the above two sentences are out of order

This is where StructBERT's idea for improvement comes from: if scrambled word or character order does not prevent a person from reading, it should not prevent a model either. A good LM needs to know how to correct such errors on its own

StructBERT's model architecture is the same as BERT's. The improvement comes from adding two new pre-training objectives, the Word Structural Objective and the Sentence Structural Objective, on top of the existing MLM and NSP tasks
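To show how the pieces fit together, here is a minimal sketch of the joint pre-training loss. The equal weighting of the three terms is my own illustrative assumption, not a detail from the paper, which simply says the objectives are optimized jointly:

```python
def structbert_pretraining_loss(mlm_loss, word_struct_loss, sent_struct_loss):
    """Combine BERT's masked-LM loss with the two new structural objectives.
    Equal weighting is an illustrative assumption, not taken from the paper."""
    return mlm_loss + word_struct_loss + sent_struct_loss
```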

Word Structural Objective

Subsequences are randomly selected from the unmasked tokens (a hyperparameter $K$ determines the subsequence length), the word order within each subsequence is shuffled, and the model is trained to reconstruct the original word order


$$\mathop{\arg\max}\limits_{\theta}\sum \log P(pos_1=t_1, pos_2=t_2, \ldots, pos_K=t_K \mid t_1, t_2, \ldots, t_K, \theta)$$

where $\theta$ denotes the model parameters; the objective maximizes the likelihood of restoring the shuffled subsequence to its correct order

  • With a larger $K$, the model must learn to reconstruct more heavily disturbed data, which makes the task harder
  • With a smaller $K$, the model only needs to reconstruct less disturbed data, which makes the task easier

The paper sets $K=3$ (i.e., shuffled trigrams), which works better for single-sentence tasks
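To make the objective concrete, here is a minimal sketch of trigram shuffling, assuming plain Python lists of tokens; the function name and the way the reordering targets are encoded are my own illustrative choices, not the paper's implementation:

```python
import random

def shuffle_subsequence(tokens, k=3):
    """Shuffle one random length-k span of unmasked tokens (K=3 in the paper)
    and return the corrupted sequence plus the reordering targets."""
    if len(tokens) < k:
        return tokens, None
    start = random.randrange(len(tokens) - k + 1)
    perm = list(range(k))
    random.shuffle(perm)                       # perm[j]: original offset of slot j
    shuffled = [tokens[start + p] for p in perm]
    corrupted = tokens[:start] + shuffled + tokens[start + k:]
    # For each slot j of the shuffled span, the model must predict perm[j],
    # i.e. maximize sum_j log P(pos_j = original position | shuffled tokens).
    return corrupted, {"span_start": start, "targets": perm}

corrupted, info = shuffle_subsequence("i think you understand this sentence".split())
print(corrupted, info)
```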

Sentence Structural Objective

Given a sentence pair (S1, S2), the model judges whether S2 is the next sentence of S1, the previous sentence of S1, or a sentence from an unrelated document (a three-way classification problem)

When sampling, for a given sentence S, with probability $\frac{1}{3}$ the next sentence of S is sampled as the pair, with probability $\frac{1}{3}$ the previous sentence of S is sampled, and with probability $\frac{1}{3}$ a sentence is randomly sampled from another document
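Below is a minimal sketch of this sampling, assuming a corpus represented as a list of documents, each a list of sentences; the label encoding (0 = next, 1 = previous, 2 = other document) and the handling of document boundaries are assumptions for illustration:

```python
import random

def sample_sentence_pair(corpus, doc, idx):
    """Build one (S1, S2, label) example for the sentence structural objective."""
    s1 = corpus[doc][idx]
    r = random.random()
    if r < 1 / 3 and idx + 1 < len(corpus[doc]):
        return s1, corpus[doc][idx + 1], 0          # S2 is the next sentence
    if r < 2 / 3 and idx > 0:
        return s1, corpus[doc][idx - 1], 1          # S2 is the previous sentence
    other = random.choice([d for d in range(len(corpus)) if d != doc])
    return s1, random.choice(corpus[other]), 2      # S2 comes from another document

corpus = [["sentence a1.", "sentence a2.", "sentence a3."],
          ["sentence b1.", "sentence b2."]]
print(sample_sentence_pair(corpus, doc=0, idx=1))
```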

Ablation Studies

Ablation studies were performed on the two proposed pre-training tasks to verify the effectiveness of each

The ablation results show that both tasks have a significant impact on performance for most downstream tasks (except SNLI)

  • The first three are single-sentence tasks, and the Word Structural Objective has a large influence on them
  • The last three are sentence-pair tasks, and the Sentence Structural Objective has a large influence on them

Afterword

Unfortunately, I could not find a pre-trained StructBERT model on GitHub
