Introduction

Encoders have become a basic building block of many NLP models. Whether you are doing machine translation or syntactic analysis, whether you need a contextual representation of a word or a representation of a whole sentence, you need a powerful encoder. Given a sentence, the encoder outputs a representation of each word or of the entire sentence.

In recent years, CNNs, RvNNs, RNNs (especially LSTMs), and Transformers have been widely used in NLP. Today we focus mainly on the last two. Adding structural information to an encoder has two main uses. First, structural information can enrich the encoder's representations and improve performance on downstream tasks. Second, if syntactic structure is integrated, the syntax tree of a sentence can be induced without supervision.

Here are a few papers on encoders that incorporate structural information.

01

Neural Language Modeling by Jointly Learning Syntax and Lexicon

Code: github.com/yikangshen/…

Paper notes: godweiyang.com/2019/03/31/…

In this paper, a new language model named PRPN is proposed to implicitly model syntax-tree information. Specifically, the model consists of three parts: a Parsing module, a Reading module, and a Predict module. The Parsing module uses a CNN to predict the syntactic distance between adjacent words (for the concept, see Straight to the Tree: Constituency Parsing with Neural Syntactic Distance); the sentence's syntax tree can then be decoded from these syntactic distances. The Reading module models the context and also incorporates the syntactic distances predicted at previous time steps. The Predict module predicts the next word.
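
For intuition, here is a minimal sketch (not the authors' code) of how a binary tree can be greedily decoded from syntactic distances, assuming `distances[i]` is the predicted distance between words i and i+1:

```python
def distances_to_tree(words, distances):
    """Greedily split the sentence at the largest syntactic distance."""
    if len(words) == 1:
        return words[0]
    # The gap with the maximum distance separates the two subtrees.
    split = max(range(len(distances)), key=lambda i: distances[i])
    left = distances_to_tree(words[:split + 1], distances[:split])
    right = distances_to_tree(words[split + 1:], distances[split + 1:])
    return (left, right)

# Toy example with hypothetical distances.
print(distances_to_tree(["the", "cat", "sat", "down"], [1.0, 3.0, 2.0]))
# -> (('the', 'cat'), ('sat', 'down'))
```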

02

Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders

Code: github.com/iesl/diora

Paper notes: godweiyang.com/2019/07/25/…

This paper proposes the DIORA model, which mainly uses the inside-outside algorithm to compute a representation and a score for each span. The inside pass computes the scores and representations of all spans bottom-up, while the outside pass computes span representations top-down. The objective also differs from other models: instead of a language model or a downstream task, the representations and scores of all words obtained from the outside pass are used to compute the loss, i.e., for each word, the total score of all possible syntactic trees is maximized.
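
As a rough illustration of the inside pass (with a stand-in composition function and scorer, not the paper's architecture), a CKY-style chart might look like this:

```python
import numpy as np

def inside_pass(leaf_vecs):
    n, d = leaf_vecs.shape
    reps = {}    # (i, j) -> representation of the span covering words i..j
    scores = {}  # (i, j) -> score of that span
    for i in range(n):
        reps[(i, i)] = leaf_vecs[i]
        scores[(i, i)] = 0.0
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cand_reps, cand_scores = [], []
            for k in range(i, j):  # split into (i..k) and (k+1..j)
                rep = 0.5 * (reps[(i, k)] + reps[(k + 1, j)])       # stand-in composition
                score = scores[(i, k)] + scores[(k + 1, j)] \
                        + float(reps[(i, k)] @ reps[(k + 1, j)])    # stand-in compatibility
                cand_reps.append(rep)
                cand_scores.append(score)
            cand_scores = np.array(cand_scores)
            w = np.exp(cand_scores - cand_scores.max())
            w /= w.sum()                                            # soft weights over splits
            reps[(i, j)] = sum(wi * ri for wi, ri in zip(w, cand_reps))
            scores[(i, j)] = float(cand_scores.max()
                                   + np.log(np.exp(cand_scores - cand_scores.max()).sum()))
    return reps, scores

reps, scores = inside_pass(np.random.randn(4, 8))
print(scores[(0, 3)])  # inside score of the full-sentence span
```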

03

Unsupervised Recurrent Neural Network Grammars

Code: github.com/harvardnlp/…

Paper notes: godweiyang.com/2019/04/20/…

In this paper, the URNNG model is proposed to perform unsupervised syntactic parsing with a variational method and RNNG. An inference network infers the conditional probability of the latent variable (the syntax tree) given the sentence. A generative network, an RNNG, then models the joint probability of the sentence and the latent variable. Finally, the joint probability is marginalized over all trees to obtain the sentence probability; in other words, language modeling is the target task.
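
For reference, the variational objective this setup optimizes can be written in the standard ELBO form, with $q_\phi$ the inference network over trees $z$ and $p_\theta$ the generative RNNG:

```latex
% Marginal sentence probability and its evidence lower bound:
\log p_\theta(x) = \log \sum_{z} p_\theta(x, z)
\;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x, z)\big]
\;+\; \mathbb{H}\!\big[q_\phi(z \mid x)\big]
```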

04

Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks

Code: github.com/yikangshen/…

Paper notes: kexue.fm/archives/66…

This paper won one of the best-paper awards at ICLR 2019. The main idea is to give LSTM neurons a hierarchical ordering (ordered neurons) and to introduce two new gates (a master forget gate and a master input gate) to model the hierarchical structure of a sentence. The guiding intuition is that the higher the level, the coarser the granularity and the larger the span of the sentence it covers. After reading a word, the model compares the word's level with the level of the history and updates different dimensions accordingly: the dimensions corresponding to higher levels retain the history, the dimensions corresponding to lower levels are overwritten by the new input, and the overlapping middle part is updated as in an ordinary LSTM.
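
A small sketch of the master-gate update built on the paper's cumax idea; the projections that produce the gate logits are left out, so this is illustrative rather than the released code:

```python
import torch
import torch.nn.functional as F

def cumax(x, dim=-1):
    """Cumulative softmax: a monotone gate in [0, 1] along `dim`."""
    return torch.cumsum(F.softmax(x, dim=dim), dim=dim)

def master_gate_update(c_prev, c_hat, f, i, mf_logits, mi_logits):
    """Combine ordinary LSTM gates (f, i) with the master gates."""
    f_master = cumax(mf_logits)            # rises 0 -> 1: high-level dims keep history
    i_master = 1.0 - cumax(mi_logits)      # falls 1 -> 0: low-level dims take new input
    omega = f_master * i_master            # overlap: updated like a normal LSTM
    f_hat = f * omega + (f_master - omega)
    i_hat = i * omega + (i_master - omega)
    return f_hat * c_prev + i_hat * c_hat  # new cell state

# Toy shapes: batch of 2, hidden size 8.
B, H = 2, 8
c = master_gate_update(torch.randn(B, H), torch.randn(B, H),
                       torch.sigmoid(torch.randn(B, H)), torch.sigmoid(torch.randn(B, H)),
                       torch.randn(B, H), torch.randn(B, H))
print(c.shape)  # torch.Size([2, 8])
```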

05

PaLM: A Hybrid Parser and Language Model

Code: github.com/Noahs-ARK/P…

Paper notes: godweiyang.com/2020/01/09/…

In this paper, attention is integrated into the LSTM. For each word, attention over all words to its left is computed, and this attention is used to aggregate the history and enhance the context representation at the current time step. When decoding the syntax tree top-down, for each span one simply greedily picks the split point that maximizes the score of the right child span. The attention can be trained with or without syntax-tree supervision; in fact, the language model performs better without it.
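
A minimal sketch of the greedy top-down decoding just described, with a hypothetical `span_score` standing in for the attention-derived span scores:

```python
def decode(i, j, span_score):
    """Return a binary tree over word positions i..j (inclusive)."""
    if i == j:
        return i
    # Candidate splits put the right child at (k, j); pick the best-scoring one.
    k = max(range(i + 1, j + 1), key=lambda k: span_score(k, j))
    return (decode(i, k - 1, span_score), decode(k, j, span_score))

# Toy example with a hypothetical scorer that prefers short right children.
print(decode(0, 3, lambda a, b: -(b - a)))
# -> (((0, 1), 2), 3)
```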

06

Tree Transformer: Integrating Tree Structures into Self-Attention

Code: github.com/yaushian/Tr…

Paper notes: godweiyang.com/2020/01/06/…

The main difference between this paper and the Transformer is that a constituent attention is added to each layer alongside the ordinary attention; it represents the probability that two words belong to the same phrase. The total attention is the element-wise product of the original attention and the constituent attention, so attention within the same phrase is large while attention across phrases is small. Finally, if you want to decode a syntax tree, the syntactic distance algorithm is still used, decoding the tree top-down.
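
To make the element-wise combination concrete, here is a toy sketch with a hand-built block-diagonal "constituent prior" in place of the learned one:

```python
import torch
import torch.nn.functional as F

n, d = 6, 16
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

attn = F.softmax(q @ k.t() / d ** 0.5, dim=-1)          # ordinary self-attention
prior = torch.zeros(n, n)
prior[:3, :3] = 1.0                                      # toy phrase: words 0-2
prior[3:, 3:] = 1.0                                      # toy phrase: words 3-5

weights = attn * prior                                   # element-wise product
weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize rows
out = weights @ v
print(out.shape)  # torch.Size([6, 16])
```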

07

Multi-Granularity Self-Attention for Neural Machine Translation

This paper proposes a multi-granularity self-attention network: different heads of the original Transformer are assigned different granularities. A sentence is divided into non-overlapping phrases, a network such as a CNN produces a representation for each phrase, and self-attention then yields a coarse-grained context representation for each word, with the word as the query and the phrases as keys. Different phrase segmentations correspond to different granularities; phrases can be segmented by N-grams or by different levels of the syntax tree. Finally, the representations at different granularities are combined.
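
A sketch of the coarse-grained attention under simplifying assumptions (mean-pooled non-overlapping n-grams stand in for the phrase encoder):

```python
import torch
import torch.nn.functional as F

def phrase_attention(x, ngram=2):
    """x: (seq_len, d). Returns each word's phrase-level context vector."""
    n, d = x.shape
    # Pool non-overlapping n-grams into phrase representations (pad if needed).
    pad = (-n) % ngram
    xp = torch.cat([x, x.new_zeros(pad, d)]) if pad else x
    phrases = xp.view(-1, ngram, d).mean(dim=1)           # (num_phrases, d)
    attn = F.softmax(x @ phrases.t() / d ** 0.5, dim=-1)  # words attend to phrases
    return attn @ phrases

ctx = phrase_attention(torch.randn(5, 8), ngram=2)
print(ctx.shape)  # torch.Size([5, 8])
```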

08

You Only Need Attention to Traverse Trees

The idea of this paper is not complicated. The goal is to design a network that encodes a syntax tree and finally produces a vector representation of the sentence for downstream tasks. For a constituency tree, a node's representation is obtained by self-attention over all of its children, followed by a series of transformations. For a dependency tree, a word's representation is obtained by self-attention over its head word and all of its dependents, again followed by a series of transformations. The overall structure is very similar to a recursive neural network, except that the composition function at each node borrows self-attention from the Transformer; the authors also call their model Tree-Transformer.
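
A rough sketch of the constituency-tree case, with a simplified composition (plain self-attention over the children plus mean pooling, omitting the paper's extra transformations):

```python
import torch
import torch.nn.functional as F

def encode(node, embed):
    """node: a word (str) or a tuple of child nodes. Returns a vector."""
    if isinstance(node, str):
        return embed[node]
    children = torch.stack([encode(c, embed) for c in node])   # (k, d)
    d = children.size(-1)
    attn = F.softmax(children @ children.t() / d ** 0.5, dim=-1)
    return (attn @ children).mean(dim=0)                       # pool to one vector

embed = {w: torch.randn(8) for w in ["the", "cat", "sat"]}
tree = (("the", "cat"), "sat")
print(encode(tree, embed).shape)  # torch.Size([8])
```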

09

Tree-Transformer: A Transformer-Based Method for Correction of Tree-Structured Data

This paper has not been published anywhere; it was only posted to arXiv, so it contains quite a few mistakes. It mainly proposes a tree-to-tree model (analogous to a seq-to-seq model): the encoder reads a syntax tree (e.g., the syntax tree of source code) in top-down order, and the decoder then generates a syntax tree in top-down order. The difference from a normal Transformer is that the feed-forward network in the middle is replaced with a tree conv block, which combines the representations of a node, its parent, and all of its siblings, substituting a zero vector for any of them that does not exist.
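
A sketch of what such a tree conv block could look like; the linear mixing and the toy tree here are my own simplifications, not the paper's exact block:

```python
import torch

def tree_conv(node_vecs, parent_of, proj):
    """node_vecs: (num_nodes, d); parent_of[i] is i's parent index, or -1 for the root."""
    n, d = node_vecs.shape
    zero = node_vecs.new_zeros(d)
    out = []
    for i in range(n):
        parent = node_vecs[parent_of[i]] if parent_of[i] >= 0 else zero
        sibs = [node_vecs[j] for j in range(n)
                if j != i and parent_of[i] >= 0 and parent_of[j] == parent_of[i]]
        sib = torch.stack(sibs).mean(dim=0) if sibs else zero
        out.append(proj(torch.cat([node_vecs[i], parent, sib])))  # combine node/parent/siblings
    return torch.stack(out)

d = 8
proj = torch.nn.Linear(3 * d, d)
vecs = torch.randn(5, d)
parents = [-1, 0, 0, 1, 1]     # a small toy tree
print(tree_conv(vecs, parents, proj).shape)  # torch.Size([5, 8])
```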

10

Tree-Structured Attention with Hierarchical Accumulation

Code: github.com/nxphi47/tre…

As Reviewer #1 of this paper pointed out, the notation is somewhat obscure and not very clear, and the structure is complex and would be hard to implement without open-source code. Honestly, I was too confused to get much out of it. The idea is roughly this: build a matrix whose number of columns equals the sentence length, with one row per node of the syntax tree plus a row for the leaves. In a given row, a column holds that node's feature vector if the node's subtree contains the corresponding word, and the zero vector otherwise. The matrix is then accumulated over rows and weighted over columns, finally giving a vector representation for each node. As for how this is incorporated into the Transformer, the writing is really hard to follow; if you are interested, read the original paper.
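
Purely as a literal rendering of the description above (my reading, which may well not match the paper), the accumulation could look like this:

```python
import torch

def hierarchical_accumulation(node_vecs, spans, seq_len, col_weights):
    """node_vecs: (num_nodes, d); spans[i] = (start, end) of the words node i covers."""
    num_nodes, d = node_vecs.shape
    table = torch.zeros(num_nodes, seq_len, d)
    for i, (s, e) in enumerate(spans):
        table[i, s:e + 1] = node_vecs[i]          # fill the covered columns
    accumulated = table.cumsum(dim=0)             # accumulate over rows (nodes)
    weighted = accumulated * col_weights.view(1, seq_len, 1)
    return weighted.sum(dim=1)                    # one vector per node

d, seq_len = 8, 4
node_vecs = torch.randn(3, d)
spans = [(0, 3), (0, 1), (2, 3)]                  # toy tree: root plus two children
out = hierarchical_accumulation(node_vecs, spans, seq_len,
                                torch.softmax(torch.randn(seq_len), dim=0))
print(out.shape)  # torch.Size([3, 8])
```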

11

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

Paper notes: zhuanlan.zhihu.com/p/103207343

This is an improved BERT model from Luo Si's team at Alibaba, called StructBERT. Essentially, two new pre-training tasks are added on top of the original BERT. One is at the word level: maximize the probability of recovering the correct order of a shuffled span of K words. The other is at the sentence level: a three-way classification that predicts whether the second sentence is the next sentence, the previous sentence, or a random sentence from another document.
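
A minimal sketch of how training examples for the word-level objective could be constructed; the exact value of K, the sampling strategy, and the interaction with masking follow the paper only loosely:

```python
import random

def make_word_order_example(tokens, k=3, seed=0):
    """Shuffle a span of k tokens; the model must recover the original order."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - k + 1)
    span = tokens[start:start + k]
    shuffled = span[:]
    rng.shuffle(shuffled)
    corrupted = tokens[:start] + shuffled + tokens[start + k:]
    return corrupted, span  # model input, gold order to predict

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, gold = make_word_order_example(tokens, k=3)
print(corrupted, gold)
```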