This article is from the Huawei Cloud community: "Comparison of Transformer and LSTM Language Models in ESPnet — A Case Study of Aishell", author: lovely and positive.

Introduction to NLP Feature Extractors – RNN and Transformer

In recent years, deep learning has achieved SOTA results in various NLP tasks. Let’s take a look at the most commonly used feature extraction structures in the field of natural language processing.

  • Long Short-Term Memory network (LSTM)

A traditional RNN extracts all of its knowledge and passes it, unprocessed, to the next time step for further iteration. It is like preparing for an exam by trying to memorize the whole book in advance: by the time of the exam, the early knowledge may have been completely overwritten by the recent knowledge, and being unable to recall information from distant time steps is entirely normal. Is that how humans do it? Obviously not. What we usually do is judge the knowledge rationally, give more weight to the important parts and memorize them deliberately, and let the less important parts fade quickly. That way we perform better in the exam. In my opinion, the structure of LSTM is much closer to the way humans memorize knowledge.

The key to understanding LSTM is to understand its two states, c_t and a_t, and its three internal gate mechanisms. In the cell diagram we can see that, at each time step, the LSTM cell receives two states from the previous time step and passes two states on to the next one. Generally, c_t is regarded as the global (cell) state, and a_t as the hidden state that carries the influence of that global information to the next cell.

The forget gate, input gate (labelled "update gate" in some diagrams) and output gate are each small single-layer neural networks with a sigmoid activation. Because the sigmoid outputs values in the range (0, 1), it works well as a keep-or-"forget" switch (multiplying by a value close to 1 keeps the information, multiplying by a value close to 0 discards it), which gives the network the ability to pass information on selectively.
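To make the gate mechanism concrete, here is a minimal sketch of a single LSTM cell step in PyTorch. It is illustrative only: the class name, variable names and layer sizes are my own, and the hidden state a_t is written h_t in the more common PyTorch style.

```python
import torch
import torch.nn as nn

class SimpleLSTMCell(nn.Module):
    """Minimal LSTM cell: three sigmoid gates plus a tanh candidate state."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Each gate is a small single-layer network over [x_t, h_{t-1}]
        self.forget_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.output_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([x_t, h_prev], dim=-1)
        f = torch.sigmoid(self.forget_gate(z))   # how much of c_{t-1} to keep
        i = torch.sigmoid(self.input_gate(z))    # how much new information to write
        o = torch.sigmoid(self.output_gate(z))   # how much of the cell state to expose
        c_tilde = torch.tanh(self.candidate(z))  # candidate new content
        c_t = f * c_prev + i * c_tilde           # global (cell) state
        h_t = o * torch.tanh(c_t)                # hidden state passed to the next step
        return h_t, c_t

cell = SimpleLSTMCell(input_size=10, hidden_size=20)
h, c = cell(torch.randn(3, 10), torch.zeros(3, 20), torch.zeros(3, 20))
```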

Does this make LSTM look "smart"? In practice, however, LSTM has its limitations. On the one hand, its sequential structure makes efficient parallel computation difficult, because the state at the current step depends not only on the current input but also on the output of the previous step. On the other hand, it makes LSTM (and other RNN variants such as GRU) behave, on the whole, somewhat like a Markov process, so global information is still hard to extract.

GRU can be regarded as a simplified version of LSTM: it merges the two states a_t and c_t into one, combines the forget gate and input gate into a single update gate, and replaces the output gate with a reset gate. The overall idea does not change much. The performance difference between the two tends to be small, but GRU has relatively fewer parameters and converges faster. For smaller datasets I suggest GRU is sufficient; for larger datasets, try LSTM with its larger parameter count and you may be pleasantly surprised.
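A quick way to see the parameter difference is to compare PyTorch's built-in layers directly. This is a rough sketch; the sizes below are arbitrary and chosen only for illustration.

```python
import torch.nn as nn

hidden = 650
lstm = nn.LSTM(input_size=hidden, hidden_size=hidden, num_layers=2)
gru = nn.GRU(input_size=hidden, hidden_size=hidden, num_layers=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"LSTM parameters: {count(lstm):,}")  # 4 gate blocks per layer
print(f"GRU parameters:  {count(gru):,}")   # 3 gate blocks per layer, roughly 25% fewer
```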

  • In the Transformer architecture diagram, the red box is the encoder and the yellow box is the decoder, each stacked from multiple Transformer blocks. The Transformer block replaces the LSTM and CNN structures as our feature extractor, and it is also the most critical part.

    The reason the Transformer's authors adopt the attention mechanism is that RNN computation (likewise LSTM, GRU, etc.) is strictly sequential: an RNN can only be computed from left to right or from right to left. This brings two problems:

  1. The computation at time step t depends on the results at time step t-1, which limits the model's ability to parallelize.
  2. Information is lost during sequential computation. Although gate mechanisms such as LSTM's alleviate the long-range dependence problem to some extent, LSTM is still powerless against especially long-range dependencies.

The Transformer solves both of these problems. First, it uses the attention mechanism to reduce the distance between any two positions in a sequence to a constant. Second, it is not a sequential structure like an RNN, so it parallelizes better and fits existing GPU hardware well.
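The following sketch of scaled dot-product self-attention illustrates both points (it is my own minimal implementation, not ESPnet's code; the projection matrices w_q, w_k, w_v and the sizes are placeholders). Every position attends to every other position through a single matrix product, so the "distance" between any two tokens is one step, and the whole computation is batched matrix algebra that parallelizes well on a GPU.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # scores[b, i, j] relates position i to position j directly, regardless of distance
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # every output position is computed in parallel

d_model, d_k = 256, 64
x = torch.randn(2, 10, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # shape (2, 10, 64)
```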

Ability to extract semantic features: Transformer is significantly better than RNN and CNN, and RNN and CNN are not much different.

Long-distance feature capture: CNN is significantly weaker than RNN and Transformer, and Transformer is slightly better than RNN. However, at relatively long distances (subject-predicate distance greater than 13), RNN is slightly better than Transformer, so on balance Transformer and RNN can be considered comparable in this respect, while CNN is clearly weaker than both. As mentioned before, CNN's ability to capture long-distance features is limited by the receptive field of its convolution kernels; experiments show that enlarging the kernels and deepening the network can improve it. For the Transformer, long-distance feature capture is mainly affected by the number of attention heads: the more heads there are, the stronger the Transformer's long-distance feature capture becomes.
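In PyTorch the number of heads is just a constructor argument, so it is easy to experiment with different head counts (a small sketch with arbitrary sizes; embed_dim must be divisible by num_heads):

```python
import torch
import torch.nn as nn

x = torch.randn(10, 2, 256)  # (seq_len, batch, embed_dim), default layout of nn.MultiheadAttention

for num_heads in (1, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=256, num_heads=num_heads)
    out, attn_weights = mha(x, x, x)  # self-attention: query = key = value
    print(num_heads, out.shape)       # output shape stays the same; the heads split the 256 dims
```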

Comprehensive feature extraction for a whole task: machine translation is generally one of the NLP tasks with the most demanding overall requirements; to obtain high-quality translations, it needs strong morphology, syntax, semantics, context handling and long-distance feature capture. Judged by this comprehensive feature extraction ability, Transformer is significantly stronger than RNN and CNN, while RNN and CNN are not far apart from each other.

Parallel computation: as noted several times above, parallel computation is a serious weakness of RNN, while Transformer and CNN parallelize similarly well.

Comparison Experiment of Transformer and LSTM Language Models in ESPnet

The default language model in all ESPnet examples is an LSTM. Here I use Aishell as the example, with the number of epochs set to 20 and batch size 64.
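Since the LSTM language model itself appears only as a screenshot here, the sketch below is a rough, generic PyTorch picture of what such an LSTM LM looks like. The class name and layer sizes are my own placeholders, not ESPnet's actual configuration; only the batch size of 64 (and the 20 epochs) come from the text above.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Next-token prediction with an LSTM; layer sizes are placeholders, not ESPnet's defaults."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=650, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len) integer IDs
        hidden, _ = self.lstm(self.embed(tokens))
        return self.proj(hidden)                    # logits over the next token at every position

model = LSTMLanguageModel(vocab_size=5000)          # vocabulary size is a placeholder
logits = model(torch.randint(0, 5000, (64, 30)))    # batch of 64, as in the experiment above
```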

LSTM results:

Now change the language model to a Transformer. The Transformer structure configuration is as follows:
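For comparison, a Transformer language model replaces the recurrence with stacked self-attention blocks, positional embeddings and a causal mask. The sketch below is illustrative only; the layer sizes, head count and maximum length are my own placeholders and not the configuration shown in the screenshot.

```python
import torch
import torch.nn as nn

class TransformerLanguageModel(nn.Module):
    """Causal Transformer LM sketch; sizes are placeholders, not the article's configuration."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        return self.proj(self.encoder(x, mask=mask))

lm = TransformerLanguageModel(vocab_size=5000)      # vocabulary size is a placeholder
logits = lm(torch.randint(0, 5000, (64, 30)))
```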

The Transformer results:

Experimental conclusions:

The Transformer language model does achieve a smaller loss than the LSTM, but order information in the sequence is crucial for language modelling, and the Transformer only obtains position information in an indirect, fuzzier form (through its positional encodings), so its perplexity turns out to be higher than the LSTM's!
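For reference, the perplexity reported for a language model is simply the exponential of its average per-token cross-entropy (this is the standard definition, not something specific to this experiment), where N is the number of tokens in the evaluation text:

$$\text{PPL} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\Big)$$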

This aspect should be improved in the future.
