This article introduces Soft-Masked BERT, a Chinese spelling error correction model from ByteDance AI Lab, published at ACL 2020. Paper: arxiv.org/pdf/2005.07…

Chinese Spelling Correction (CSC)

Given a sequence of n characters X = (x1, x2, ..., xn), the goal is to convert it into a character sequence of the same length Y = (y1, y2, ..., yn), where the incorrect characters in X are replaced with correct ones to obtain Y. The task can be viewed as a sequence labeling problem, with the model acting as a mapping function F: X → Y. The task is relatively easier than general text correction, however, because usually only a few characters need to be replaced.
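As a toy illustration (a hypothetical example, not taken from the paper), an input sentence containing a spelling error and its correction of the same length might look like this:

```python
# Hypothetical CSC example: X and Y have the same length; only the
# erroneous character is replaced ("心" -> "兴" in "高兴", "happy").
X = "我今天很高心"   # input containing a spelling error
Y = "我今天很高兴"   # corrected output of the same length
assert len(X) == len(Y)
```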

Abstract

  • A new neural architecture called Soft-Masked BERT is proposed for Chinese spelling error correction (CSC). It consists of an error detection network based on a Bi-GRU and an error correction network based on BERT. The detection network predicts the probability that the character at each position is erroneous; this probability is then used to softly mask the character embedding at that position, and the soft-masked embedding at each position is fed into the correction network
  • During end-to-end joint training, the detection network helps the model learn the right context for error correction
  • Soft masking is an extension of conventional “hard masking”: the former degenerates into the latter when the error probability equals 1
  • The performance of the proposed method is better than that of other BERT-based models

Contributions

  • A new neural architecture, Soft-Masked BERT, is proposed for the CSC problem
  • The effectiveness of Soft-Masked BERT is empirically verified
  • On the SIGHAN and News Title datasets, Soft-Masked BERT significantly outperforms the compared models in terms of accuracy

Model framework

Problem and Motivation

The state-of-the-art approach to CSC performs the task with BERT. The authors observe that the performance could be further improved if the model were better at detecting erroneous characters. In general, BERT-based approaches tend to make no correction at all, i.e., they simply copy the original characters. A possible explanation is that only 15% of characters are masked for prediction during BERT's pre-training, so the model does not learn a sufficient error detection ability. This motivates the design of a new model.

Model

This paper proposes a new neural network model called Soft-Masked BERT, as shown in the figure. Soft-Masked BERT consists of a detection network based on a Bi-GRU and a correction network based on BERT. The detection network predicts error probabilities and the correction network predicts correction probabilities, with the former passing its predictions to the latter through soft masking. More specifically (a code sketch follows the list below):

  • The approach first creates an embedding, called the input embedding, for each character in the input sentence.
  • The detection network takes the embedding sequence as input and outputs the error probability of each character in the sequence.
  • A weighted sum of the input embedding and the [MASK] embedding is computed, with the error probability as the weight; the resulting embeddings mask the likely errors in the sequence in a soft way.
  • The sequence of soft-masked embeddings is then fed into the correction network, which is a BERT model, and the correction probabilities are output.
  • The last layer consists of a softmax function over all characters, and there is a residual connection between the input embedding and the embedding of the last layer.
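The following is a minimal PyTorch-style sketch of this pipeline, assuming the Hugging Face transformers library; the class and variable names (SoftMaskedBert, detector_gru, etc.) are illustrative and not taken from the paper or its released code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SoftMaskedBert(nn.Module):
    """Illustrative sketch of the detection + soft-masking + correction pipeline."""

    def __init__(self, bert_name="bert-base-chinese", gru_hidden=256, mask_token_id=103):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        # Detection network: Bi-GRU over the input embeddings + sigmoid classifier.
        self.detector_gru = nn.GRU(hidden, gru_hidden, batch_first=True, bidirectional=True)
        self.detector_out = nn.Linear(2 * gru_hidden, 1)
        # Correction output layer: projects the final hidden states to the vocabulary.
        self.correction_out = nn.Linear(hidden, self.bert.config.vocab_size)
        self.mask_token_id = mask_token_id  # [MASK] id; 103 for this checkpoint (assumption)

    def forward(self, input_ids, attention_mask=None):
        # 1. Input embeddings (word + position + segment), as in BERT.
        e = self.bert.embeddings(input_ids)
        # 2. Detection network predicts an error probability p_i for each character.
        d_states, _ = self.detector_gru(e)
        p = torch.sigmoid(self.detector_out(d_states))            # (batch, seq, 1)
        # 3. Soft masking: weighted sum of the [MASK] embedding and the input embedding.
        e_mask = self.bert.embeddings(torch.full_like(input_ids, self.mask_token_id))
        e_soft = p * e_mask + (1.0 - p) * e
        # 4. Correction network: the BERT encoder applied to the soft-masked embeddings.
        ext_mask = None
        if attention_mask is not None:
            ext_mask = self.bert.get_extended_attention_mask(attention_mask, input_ids.shape)
        h = self.bert.encoder(e_soft, attention_mask=ext_mask).last_hidden_state
        # 5. Residual connection with the input embedding, then logits over all characters.
        logits = self.correction_out(h + e)
        return p.squeeze(-1), logits
```

At training time, p would be supervised with the detection labels and logits with the corrected characters, combining the two losses as described in the Learning section below.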

Detection Network

The detection network is a binary sequence labeling model, implemented with a classical Bi-GRU. Its formula is given below.

The input is the embedding sequence E = (e1, e2, ..., en), where ei is the embedding of character xi, i.e., the sum of its word embedding, position embedding, and segment embedding. The output is a label sequence G = (g1, g2, ..., gn), where gi is the label of the i-th character: 1 means the character is incorrect and 0 means it is correct. For each character there is a probability pi of its label being 1; the higher pi is, the more likely the character is wrong.
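A plausible reconstruction of the detection formula, assuming a standard bidirectional GRU whose concatenated hidden states feed a sigmoid output layer with parameters Wd and bd (symbols not given in the text above):

```latex
% h_i^d concatenates the forward and backward GRU states at position i;
% W_d, b_d are the assumed parameters of the sigmoid classifier.
\begin{aligned}
h_i^{d} &= [\overrightarrow{h}_i^{d};\ \overleftarrow{h}_i^{d}] \\
p_i &= P_d(g_i = 1 \mid X) = \sigma\!\left(W_d\, h_i^{d} + b_d\right)
\end{aligned}
```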

Soft masking amounts to a weighted sum of the input embedding and the mask embedding, with the error probability as the weight. If the error probability is high, the soft-masked embedding e'i is close to the mask embedding emask; otherwise it is close to the input embedding ei. The soft-masked embedding e'i is defined as

e'i = pi · emask + (1 − pi) · ei,

where ei is the input embedding and emask is the mask embedding.

Correction Network

The correction network is a sequential multi-class labeling model based on BERT. The input is the soft-masked embedding sequence E' = (e'1, e'2, ..., e'n), and the output is the character sequence Y = (y1, y2, ..., yn).

BERT consists of 12 identical blocks, taking the entire sequence as input. Each block contains a multi-headed self-attention operation followed by a feedforward neural network.

The hidden state sequence of the last layer of BERT is denoted Hc = (hc1, hc2, ..., hcn), and the error correction probability of each character in the sequence is defined on top of it.

Here Pc(yi = j | X) denotes the conditional probability that character xi is corrected to character j in the candidate list, where softmax is the softmax function; the character with the highest probability is selected from the candidate list as the output for xi. The probability is computed from a hidden state h'i, which is given by

h'i = hci + ei,

where hci is the hidden state of the last layer and ei is the input embedding of character xi.
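For completeness, a plausible reconstruction of the correction probability, assuming a linear output layer with parameters W and b (not named in the text above) followed by a softmax over the candidate list:

```latex
% [j] selects the entry of the softmax output corresponding to candidate character j.
P_c(y_i = j \mid X) = \mathrm{softmax}\!\left(W\, h_i' + b\right)[j], \qquad h_i' = h_i^{c} + e_i
```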

Learning

The learning of Soft-Masked BERT is end-to-end, under the premise that BERT is pre-trained and that training data consisting of pairs of original and corrected sequences is given, denoted as D = {(X1, Y1), (X2, Y2), ..., (XN, YN)}. One way to create the training data is to repeatedly generate sequences Xi containing errors from error-free sequences Yi using a confusion table.

The learning process is driven by the optimization of two objectives, corresponding to error detection and error correction respectively.

Ld is the training objective of the detection network and Lc is the training objective of the correction network. A linear combination of the two is the overall objective of learning.
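A plausible reconstruction of the two objectives and their combination, assuming both are standard negative log-likelihood losses and that λ is the mixing coefficient tuned in the experiments below:

```latex
% L_d: detection loss, L_c: correction loss, combined with coefficient lambda.
\begin{aligned}
\mathcal{L}_d &= -\sum_{i=1}^{n} \log P_d(g_i \mid X) \\
\mathcal{L}_c &= -\sum_{i=1}^{n} \log P_c(y_i \mid X) \\
\mathcal{L} &= \lambda \cdot \mathcal{L}_c + (1 - \lambda) \cdot \mathcal{L}_d
\end{aligned}
```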

Experiments

Datasets

This paper uses the SIGHAN dataset, which contains 1,100 texts and 461 error types, split into training, development, and test sets following the standard split.

A larger dataset called News Title was also created for testing and development. It contains 15,730 texts, of which 5,423 texts contain errors of 3,441 types. The data was split into a test set and a development set, each containing 7,865 texts.

In addition, following the common practice in CSC of automatically generating data for training, about 5 million news titles were crawled, and a confusion table was created in which each character is associated with several homophones as potential errors. Errors were then generated artificially by randomly replacing 15% of the characters in the texts with other characters, 80% of them with homophones from the confusion table and 20% with random characters. This is because in practice people use pinyin-based input methods, and about 80% of Chinese spelling errors are homophones.
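A sketch of this data-generation procedure in Python; the confusion table below is a toy example and the function generate_errors is illustrative, not the authors' code:

```python
import random

def generate_errors(text, confusion_table, error_rate=0.15, homophone_ratio=0.8):
    """Corrupt ~15% of characters: 80% with homophones from the table, 20% at random."""
    chars = list(text)
    # For the random-replacement branch, draw from the characters seen in the table
    # (a simplification; any character vocabulary could be used instead).
    vocab = list(confusion_table.keys())
    for i, ch in enumerate(chars):
        if random.random() >= error_rate:
            continue
        if ch in confusion_table and random.random() < homophone_ratio:
            chars[i] = random.choice(confusion_table[ch])    # homophone error
        else:
            chars[i] = random.choice(vocab)                  # random-character error
    return "".join(chars)

# Each corrupted title X, paired with its original Y, forms one training example.
confusion_table = {"兴": ["心", "星"], "天": ["添", "田"]}   # toy table
Y = "今天天气很好"
X = generate_errors(Y, confusion_table)
```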

Experiment Setting

Sentence-level accuracy, precision, recall, and F1 score were used as the evaluation metrics.
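As a rough sketch of one common sentence-level interpretation of these metrics (not necessarily the paper's exact scoring script): a prediction counts as a true positive only if an erroneous sentence is fully corrected.

```python
def sentence_level_metrics(sources, predictions, golds):
    """Sentence-level accuracy / precision / recall / F1 under one common convention."""
    total = len(sources)
    # True positive: the sentence had errors and the prediction matches the gold exactly.
    tp = sum(1 for s, p, g in zip(sources, predictions, golds) if g != s and p == g)
    changed = sum(1 for s, p in zip(sources, predictions) if p != s)    # sentences the model edited
    erroneous = sum(1 for s, g in zip(sources, golds) if g != s)        # sentences with gold errors
    accuracy = sum(1 for p, g in zip(predictions, golds) if p == g) / total
    precision = tp / changed if changed else 0.0
    recall = tp / erroneous if erroneous else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```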

The pre-trained BERT model used in the experiments is the one provided at github.com/huggingface… In fine-tuning BERT, the default hyperparameters were kept and the parameters were fine-tuned only with Adam. To reduce the influence of training tricks, no dynamic learning rate strategy was used and the learning rate was kept at 2e-5 during fine-tuning. The hidden unit size of the Bi-GRU is 256, and all models use a batch size of 320.
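These settings can be summarized as a small configuration dictionary; the field names and the exact pre-trained checkpoint are assumptions for illustration, not taken from the authors' code.

```python
# Hypothetical summary of the fine-tuning setup described above.
training_config = {
    "pretrained_model": "bert-base-chinese",  # assumed Hugging Face Chinese BERT checkpoint
    "optimizer": "Adam",
    "learning_rate": 2e-5,                    # kept fixed, no dynamic schedule
    "gru_hidden_size": 256,                   # detection Bi-GRU hidden units
    "batch_size": 320,
}
```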

In the experiments on SIGHAN, for all BERT-based models, the model was first fine-tuned with the 5 million training examples and then fine-tuned further on the SIGHAN training examples. Unchanged texts were removed from the training data to improve efficiency.

In the News Title experiment, the model was fine-tuned with just 5 million training examples.

The SIGHAN and News Title development sets were used for hyperparameter tuning, and the optimal value of the hyperparameter λ was selected for each dataset.

Main Results

The table shows the experimental results of all methods on the two test sets. It can be seen that the proposed Soft-Masked BERT model significantly outperforms the baseline methods on both datasets.

Soft-Masked BERT, BERT-Finetune, and FASPell perform better than the other baselines, while BERT-Pretrain performs quite poorly. The results show that BERT without fine-tuning (i.e., BERT-Pretrain) does not work, whereas BERT with fine-tuning (e.g., BERT-Finetune) can improve performance significantly. Here we see another successful application of BERT, which acquires a certain amount of knowledge for language understanding. Moreover, Soft-Masked BERT significantly beats BERT-Finetune on both datasets. The results indicate that error detection is important for using BERT in CSC and that soft masking is indeed an effective means.

Effect of Hyper Parameter

The results of Soft-Masked BERT on the News Title test data are shown in the table to illustrate the effect of the hyperparameter and the amount of training data.

The results of Soft-Masked BERT and BERT-Finetune trained with different amounts of training data are shown. It can be seen that, for Soft-Masked BERT, the more training data is used, the higher the performance achieved. It is also observed that Soft-Masked BERT consistently outperforms BERT-Finetune.

A larger value of λ means a higher weight on error correction. Error detection is easier than error correction, because the former is essentially a binary classification problem while the latter is a multi-class classification problem. The highest F1 score is obtained when λ is 0.8, which means a good trade-off between detection and correction is achieved.

Ablation Study

An ablation study of Soft-Masked BERT was conducted on both datasets. Only the results on the News Title test set are given in this paper, since the results on SIGHAN are similar.

In Soft-Masked BERT-R, the residual connection in the model is removed. In Hard-Masked BERT, if the error probability given by the detection network exceeds a threshold, the embedding of the current character is set to the embedding of the [MASK] token; otherwise the embedding remains unchanged. In Rand-Masked BERT, the error probability is set to a random value between 0 and 1. The results show that all the major components of Soft-Masked BERT are necessary for achieving high performance.
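The three masking variants compared in the ablation can be contrasted in a few lines; this is a sketch, and the threshold value for hard masking is an assumption not given in the text.

```python
import torch

def soft_mask(e, e_mask, p):
    # Soft-Masked BERT: weighted sum, with the error probability as the weight.
    return p * e_mask + (1 - p) * e

def hard_mask(e, e_mask, p, threshold=0.5):
    # Hard-Masked BERT: fully replace the embedding when p exceeds a threshold
    # (the value 0.5 here is an assumption).
    return torch.where(p > threshold, e_mask, e)

def rand_mask(e, e_mask):
    # Rand-Masked BERT: the "error probability" is drawn at random from [0, 1).
    p = torch.rand_like(e[..., :1])
    return p * e_mask + (1 - p) * e
```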

This paper also considers the performance of BERT-Finetune+Force as an upper bound. In this method, BERT-Finetune is made to predict only at the positions where there are errors and to select a character from the rest of the candidate list. The results show that the performance of Soft-Masked BERT is still lower than that of BERT-Finetune+Force, meaning there is still considerable room for improvement.