
Introduction:

This post is about understanding and using Coreference Resolution. The learning resource is the classic Stanford CS224n 2021 course; the video link is: www.bilibili.com/video/BV18Y…

This post continues the ideas of the previous article and is divided into the following parts:

  • Mention Ranking model
  • Convolutional neural networks
  • End-to-end neural network Coreference model
  • Evaluation metrics for Coreference Resolution
  • Conclusion

Mention Ranking Coreference model

Given the shortcomings of the Mention Pair model, we can make the following improvements:

  • Based on the model's predictions, assign the highest-scoring antecedent to each mention
  • Add a dummy NA mention so the model can refuse to link the current mention to anything (for "singleton" or "first" mentions)
    • First mention: it can only choose NA as its antecedent

Here, the probability of each candidate link is obtained by applying a softmax over the candidate antecedents.

Only add an edge for the highest-probability antecedent, rather than adding edges for everything.
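To make the decision rule concrete, here is a minimal PyTorch sketch, assuming we already have one score per candidate antecedent (the scores and names are made up for illustration):

import torch
import torch.nn.functional as F

# Hypothetical scores for mention i over its candidate antecedents.
# Index 0 is the dummy NA antecedent ("no antecedent"); the rest are
# the mentions that appear before mention i in the document.
scores = torch.tensor([0.1, 2.3, -0.7, 1.5])  # [NA, m1, m2, m3]

probs = F.softmax(scores, dim=0)    # probability of each candidate link
best = torch.argmax(probs).item()   # only the highest-probability link is added

if best == 0:
    print("no antecedent (singleton or first mention)")
else:
    print(f"link mention i to antecedent m{best}")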

So how do we calculate these probabilities? There are three ways to do this:

  1. Use statistical classifiers that are not neural networks
  2. Simple neural network
  3. More advanced neural network structures, such as LSTMs, attention, Transformers, etc.

A non-neural network approach can use many statistical features, such as:

  • Person, number, and gender agreement
  • Semantic compatibility
  • Syntactic constraints
  • Recency: more recently mentioned entities are preferred antecedents
  • Grammatical role: prefer entities in subject position
  • Parallelism

In the standard feedforward neural network approach, the input layer is the word embeddings of the candidate antecedent and the current mention, plus some additional features, the same kind used in the statistical approach above. The middle is an FFNN, i.e. a fully connected network, and the output is the probability that the two mentions are coreferent. A minimal sketch of such a scorer follows the feature list below.

  • The embeddings
    • For each mention: its first two words, first word, last word, head word, …
      • The head word is the "most important" word in the mention; it can be found with a parser
      • For example: in "The fluffy cat stuck in the tree", the head word is "cat"
  • Some other features are still needed
    • Distance
    • Document type
    • Speaker information
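Putting these pieces together, here is a minimal sketch of such a feedforward pair scorer, assuming fixed-size vectors for the mention, the candidate antecedent, and the additional features (all names and dimensions are illustrative, not the exact architecture from the lecture):

import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Candidate antecedent + current mention + extra features -> one score."""
    def __init__(self, embed_dim=300, feat_dim=20, hidden_dim=128):
        super().__init__()
        self.ffnn = nn.Sequential(
            nn.Linear(2 * embed_dim + feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # a single coreference score
        )

    def forward(self, antecedent_vec, mention_vec, extra_feats):
        x = torch.cat([antecedent_vec, mention_vec, extra_feats], dim=-1)
        return self.ffnn(x)

scorer = PairScorer()
score = scorer(torch.randn(300), torch.randn(300), torch.randn(20))
prob = torch.sigmoid(score)  # or a softmax over all candidate antecedents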

Convolutional neural networks

After that, Manning gave a quick overview of applying CNNs in NLP. The idea: what if we computed a vector for every possible subsequence of words of a given length?

The overall idea is the same as applying CNNs to images: define several convolution kernels, then slide a window over the sequence, multiplying the elements at corresponding positions with the kernel and summing them. You can also use padding so the Conv1d output has the same length as the input.

Finally, average pooling or max pooling is applied to obtain the final output.

The implementation in PyTorch is as follows:

import torch
import torch.nn as nn

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)
conv1 = nn.Conv1d(in_channels=word_embed_size, out_channels=3,
                  kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                       # (batch, 3, seq_len - 2)
hidden2 = torch.max(hidden1, dim=2).values   # max pool over the time dimension

Another common way to apply CNNs in NLP is to compose character-level embeddings into word embeddings. This lets the model handle arbitrary words without the out-of-vocabulary problem. Many systems concatenate the original word embedding with the character-derived embedding to form the complete word representation.
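A minimal sketch of this idea, assuming each word is padded or truncated to a fixed number of characters (all sizes are illustrative):

import torch
import torch.nn as nn

num_chars, char_embed_size, num_filters = 100, 16, 32
word_len = 9                          # characters per word (padded/truncated)

char_embed = nn.Embedding(num_chars, char_embed_size)
char_cnn = nn.Conv1d(char_embed_size, num_filters, kernel_size=3, padding=1)

chars = torch.randint(0, num_chars, (1, word_len))  # char ids of one word
x = char_embed(chars).transpose(1, 2)               # (1, char_embed_size, word_len)
char_vec = torch.max(char_cnn(x), dim=2).values     # max pool over characters

word_vec = torch.randn(1, 300)                      # the usual word embedding
full_vec = torch.cat([word_vec, char_vec], dim=1)   # (1, 300 + num_filters)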

End-to-end neural network Coreference model

The model of Kenton Lee et al. from UW (EMNLP 2017) was the earliest purely end-to-end coreference model, completely free of any pipeline. However, since it has no separate mention detection step, the model treats every text span whose length does not exceed a threshold as a candidate mention.

The input embedding of the neural network is the concatenation of a word embedding and a character-level CNN embedding.

The input then goes through a bi-LSTM to obtain the hidden-state outputs.
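A sketch of this encoding step (dimensions are illustrative and continue the character-CNN example above):

import torch
import torch.nn as nn

embed_size, hidden_size, seq_len = 332, 200, 10
bilstm = nn.LSTM(embed_size, hidden_size, bidirectional=True, batch_first=True)

tokens = torch.randn(1, seq_len, embed_size)  # word + char-CNN embeddings
hidden, _ = bilstm(tokens)                    # (1, seq_len, 2 * hidden_size)
# hidden[:, t] is x_t^*, the contextual representation of token t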

Next, a representation vector is computed for each span $i$: "General", "General Electric", "General Electric said", …, "Electric", "Electric said" will each get a representation vector.

The representation vector of any span consists of four parts. Here the span "the Postal Service" is used to illustrate the calculation:

  • $x_{\text{START}}^{*}$: the bi-LSTM hidden state at the START position of the span, i.e. the bi-LSTM output for "the"
  • $x_{\text{END}}^{*}$: the bi-LSTM hidden state at the END position of the span, i.e. the bi-LSTM output for "Service"
  • $\hat{x}_{i}$: an attention-based representation of the words in the span
  • $\phi(i)$: additional features, such as the span length

The attention-based representation $\hat{x}_{i}$ is a weighted sum obtained by computing attention scores over all the words in the span. The basic idea is attention; if you are not familiar with it, see my earlier post: juejin.cn/post/708180… The difference is that there the score comes directly from the hidden state $h$, while here an extra neural network layer is applied first, i.e.


$$\alpha_{t} = \boldsymbol{w}_{\alpha} \cdot \mathrm{FFNN}_{\alpha}\left(\boldsymbol{x}_{t}^{*}\right)$$

Once you have the scores (the $\alpha_t$ on the slide), apply a softmax to get the weights and take the weighted sum to obtain the final attention output $\hat{x}_{i}$.
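A minimal sketch of this attention computation, assuming the span's bi-LSTM outputs are available (sizes are illustrative; as a simplification the same vectors are used both for scoring and for the weighted sum):

import torch
import torch.nn as nn

hidden_dim = 400  # 2 * hidden_size from the bi-LSTM
# the final Linear(150, 1) plays the role of w_alpha in the formula above
ffnn_alpha = nn.Sequential(nn.Linear(hidden_dim, 150), nn.ReLU(),
                           nn.Linear(150, 1))

span = torch.randn(5, hidden_dim)      # x_t^* for the 5 tokens of one span
alpha = ffnn_alpha(span).squeeze(-1)   # one score per token, shape (5,)
weights = torch.softmax(alpha, dim=0)  # softmax over the span's tokens
x_hat = (weights.unsqueeze(-1) * span).sum(dim=0)  # attention-weighted sum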

After obtaining the representation vectors $g_{i}$ of all spans, a neural network is used to complete two tasks:

  1. Determine whether a span is a mention (formula 1 below)
  2. Determine whether two spans are coreferent (formula 2 below)

$$\begin{aligned} s_{\mathrm{m}}(i) &= \boldsymbol{w}_{\mathrm{m}} \cdot \operatorname{FFNN}_{\mathrm{m}}\left(\boldsymbol{g}_{i}\right) \\ s_{\mathrm{a}}(i, j) &= \boldsymbol{w}_{\mathrm{a}} \cdot \operatorname{FFNN}_{\mathrm{a}}\left(\left[\boldsymbol{g}_{i}, \boldsymbol{g}_{j}, \boldsymbol{g}_{i} \circ \boldsymbol{g}_{j}, \phi(i, j)\right]\right) \end{aligned}$$

These two tasks are also implemented by two neural network layers respectively.

Finally, for any two spans $i$ and $j$, the overall score is $s(i, j) = s_{\mathrm{m}}(i) + s_{\mathrm{m}}(j) + s_{\mathrm{a}}(i, j)$, which contains three parts with three meanings: the score that span $i$ is a mention, the score that span $j$ is a mention, and the score that spans $i$ and $j$ are coreferent.
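A sketch of these scoring functions, assuming span representations g_i are available (dimensions are illustrative; following the formula above, the ∘ operation is implemented as an element-wise product):

import torch
import torch.nn as nn

g_dim, feat_dim, hid = 1220, 20, 150  # illustrative sizes

ffnn_m = nn.Sequential(nn.Linear(g_dim, hid), nn.ReLU(), nn.Linear(hid, 1))
ffnn_a = nn.Sequential(nn.Linear(3 * g_dim + feat_dim, hid), nn.ReLU(),
                       nn.Linear(hid, 1))

def s_m(g_i):                # mention score s_m(i)
    return ffnn_m(g_i)

def s_a(g_i, g_j, phi_ij):   # pairwise antecedent score s_a(i, j)
    pair = torch.cat([g_i, g_j, g_i * g_j, phi_ij], dim=-1)
    return ffnn_a(pair)

g_i, g_j, phi = torch.randn(g_dim), torch.randn(g_dim), torch.randn(feat_dim)
s_ij = s_m(g_i) + s_m(g_j) + s_a(g_i, g_j, phi)  # overall score s(i, j)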

Although this model is an end-to-end method that performs both mention detection and coreference resolution, it is very expensive. If there are T words, enumerating all spans requires O(T^2) candidates, and scoring every pair of spans then requires O(T^4) work.

Due to this high complexity, the model needs a pruning step, such as capping the maximum span length and limiting the distance between spans that get compared; a sketch of the span enumeration follows.
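For example, capping the span width already reduces the number of candidate spans from O(T^2) to O(T · max_width), as in this sketch:

def candidate_spans(tokens, max_width=10):
    """Enumerate all spans of up to max_width tokens."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_width, len(tokens))):
            spans.append((start, end))
    return spans

tokens = "General Electric said the Postal Service contacted the company".split()
print(len(candidate_spans(tokens, max_width=4)))  # far fewer than T^2 spans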

At present, SOTA models that build on large-scale pre-trained models such as BERT and SpanBERT achieve better results than previous methods.

Evaluation metrics for Coreference Resolution

There are many evaluation metrics for Coreference Resolution, including MUC, $B^3$, BLANC, etc. The details of these metrics are not covered here; those interested can refer to web.stanford.edu/~jurafsky/s… In practice, the average of several of these scores is often reported.
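As an illustration, here is a toy sketch of the MUC link-based metric over gold and predicted clusters of mention strings (a simplified implementation, not an official scorer):

def muc(gold_clusters, pred_clusters):
    """MUC: precision, recall, and F1 over coreference links."""
    def score(key, response):
        num, den = 0, 0
        for cluster in key:
            # how many pieces the response partitions this key cluster into;
            # mentions missing from the response count as their own piece
            parts = set()
            for mention in cluster:
                owner = next((i for i, c in enumerate(response) if mention in c),
                             ("singleton", mention))
                parts.add(owner)
            num += len(cluster) - len(parts)
            den += len(cluster) - 1
        return num / den if den else 0.0

    recall = score(gold_clusters, pred_clusters)
    precision = score(pred_clusters, gold_clusters)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"Obama", "he", "the president"}, {"the bill", "it"}]
pred = [{"Obama", "he"}, {"the president", "the bill", "it"}]
print(muc(gold, pred))  # roughly (0.667, 0.667, 0.667) on this toy example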

As the reported results show, newer models achieve ever higher scores. But Manning also noted that the datasets are news datasets, whose mentions are mostly things like leaders of countries and regions, so they are relatively easy, which is partly why the scores look so good. In practice, Coreference Resolution systems still have a lot of room for improvement.

Conclusion

  • Coreference Resolution is a meaningful, challenging and linguistically interesting task
    • Many different types of Coreference Resolution systems have been proposed
    • With the application of large-scale pre-trained models, the performance of these systems is improving rapidly
    • However, most of the models performed well only on the OntoNotes dataset, and there is still a lot of room for improvement in practical applications
  • Try designing a Coreference system yourself
    • corenlp.run/ (ask for coref in Annotations)
    • huggingface.co/coref/