Editor | Vincent
AI Front Line introduction: We used to think that sequence problems (language, speech, and the like) were inherently the domain of RNNs. That view may now be outdated. As a powerful member of the CNN family, Temporal Convolutional Nets (TCNs) bring many new features and have already beaten RNNs in several major application areas. RNNs themselves may be becoming a thing of the past.

Starting in 2014 and 2015, deep neural network-based applications achieved 95% accuracy in text and speech recognition and can be used to develop next-generation chatbots, personal assistants, and instant translation systems.

Convolutional Neural Nets (CNNs) are recognized as the main force in the field of image and video recognition, while Recurrent Neural Nets (RNNs) play a similar role in the field of natural language processing.

However, one major difference between the two is that a CNN recognizes features in still images (or in video broken into frames), while an RNN does well on text and speech, which are sequence or time-dependent problems: the next character or word to be predicted depends on the preceding (left-to-right) characters or words, which introduces the notion of time and, with it, of sequence.

In fact, RNNs, in their various configurations, perform well on all sequence problems, including speech and text recognition, machine translation, handwriting recognition, sequence data analysis (forecasting), and even automatic code generation.

For a while, improved versions of the RNN prevailed, chiefly LSTMs (Long Short-Term Memory networks) and GRUs (Gated Recurrent Units). Both extend the RNN’s memory span so that the model can make use of text information from much farther away.

Solving the “weird” problem

Context becomes an important issue when an RNN reads characters sequentially from left to right. For example, in sentiment analysis of a review, the first few sentences may be positive (e.g., good food, nice atmosphere) while the review ends with negative comments (e.g., terrible service, high prices), so the review as a whole may actually be negative. This is the logical equivalent of the “Not!” joke: “That’s a nice tie… Not!”

The solution to this problem is to use two LSTM encoders that read the text from both directions at the same time (i.e., a bidirectional encoder). This amounts to having future information available in the present, and it solved the problem to a large extent: accuracy really did improve.
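
As a rough sketch of the idea (not the actual Facebook or Google system), a bidirectional LSTM encoder in PyTorch simply reads the token sequence in both directions and concatenates the two sets of hidden states; all sizes below are arbitrary.

```python
import torch
import torch.nn as nn

# Minimal sketch of a bidirectional LSTM encoder (sizes are arbitrary).
# Reading the text in both directions gives each position access to
# "future" context as well as past context.
embedding_dim, hidden_dim, vocab_size = 128, 256, 10000

embed = nn.Embedding(vocab_size, embedding_dim)
encoder = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (1, 12))      # one sentence of 12 token ids
outputs, (h_n, c_n) = encoder(embed(tokens))        # outputs: (1, 12, 2 * hidden_dim)
print(outputs.shape)                                # forward and backward states concatenated
```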

A problem for Facebook and Google

A few years back, when Facebook and Google released their automated language translation systems, they realized that translation simply took too long.

This is actually a problem with the internal design of RNN. Since the network only reads and parses one word (or character) in the input text at a time, the deep neural network must wait for the previous word to be processed before proceeding to the next word.

This means that RNN cannot do massive parallel processing (MPP) like CNN, especially when RNN/LSTM processes text bidirectionally.

This also means that RNNs are extremely computationally intensive, because all intermediate results must be kept until the entire task is complete.

In early 2017, Google and Facebook proposed a similar solution to the problem — using CNN in machine translation systems to take advantage of massively parallel processing. In CNN, calculations do not depend on previous time information, so each calculation is independent and can be run in parallel.
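
To make the contrast concrete, here is a minimal sketch in PyTorch with toy sizes: the recurrent network must step through the sequence one position at a time, while a 1-D convolution covers every position in one call that parallelizes trivially.

```python
import torch
import torch.nn as nn

seq_len, channels = 100, 64
x = torch.randn(1, seq_len, channels)        # (batch, time, features)

# RNN: an explicit loop over time steps; step t cannot start before step t-1 has finished.
rnn_cell = nn.GRUCell(channels, channels)
h = torch.zeros(1, channels)
for t in range(seq_len):
    h = rnn_cell(x[:, t, :], h)

# CNN: one call processes all positions at once, so the work can run in parallel.
conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
y = conv(x.transpose(1, 2))                  # Conv1d expects (batch, channels, time)
```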

Google’s solution is called ByteNet, while Facebook’s is called FairSeq (named after FAIR, Facebook’s in-house AI research team). The code for FairSeq has been posted to GitHub.

Facebook claims their FairSeq network is up to nine times faster than the basic RNN.

Basic Working Principle

A CNN treats an image as a two-dimensional “block” (height by width). Moving to text processing, you can think of the text as a one-dimensional object (1 unit high, n units long).

However, while an RNN does not need the sequence length defined in advance, a CNN does. So to use a CNN, we must keep adding layers until the receptive field covers the entire input sequence. This makes the network very deep, but thanks to large-scale parallelism, no matter how deep the network gets, the computation can still run in parallel and save a great deal of time.
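
Here is a minimal sketch of that idea (not the exact ByteNet or FairSeq architecture): doubling the dilation factor at each layer lets a stack of 1-D convolutions cover a long receptive field with only a few layers, and the left-padding keeps each layer causal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvStack(nn.Module):
    """Stack of dilated, causal 1-D convolutions (a TCN-style building block)."""
    def __init__(self, channels, kernel_size=3, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(n_layers):
            dilation = 2 ** i                                # 1, 2, 4, 8: receptive field grows exponentially
            self.pads.append((kernel_size - 1) * dilation)   # left-pad so no position sees the future
            self.layers.append(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))

    def forward(self, x):                                    # x: (batch, channels, time)
        for conv, pad in zip(self.layers, self.pads):
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

x = torch.randn(1, 64, 100)
print(CausalConvStack(64)(x).shape)                          # torch.Size([1, 64, 100])
```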

Special structure: gating + hops = attention

Of course, the actual solutions are not as simple as what has been described so far. Google and Facebook also added a special structure to their networks: an “attention” function.

The original attention function was proposed last year by researchers at Google Brain and the University of Toronto, in the model they named the Transformer.

Link to original paper:

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

At the time, Facebook and Google were using almost exactly the same function, which is why it drew so much attention; it is known as the “attention” function. This function has two unique characteristics.

The first feature is what Facebook calls “multi-hop” attention. Instead of “looking at” a sentence only once, as the traditional RNN approach does, multi-hop attention lets the system “look at” the sentence many times. This behavior is closer to how a human translator works.

Each “glance” may focus on a noun or a verb, and these words are not necessarily in sequence, so the meaning can be understood more deeply with each pass. A glance may be independent of the others, or it may depend on the previous glance and then focus on the related adjectives, adverbs, auxiliary verbs, and so on.
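
As a very rough sketch of the mechanism (plain dot-product attention, not Facebook’s exact formulation), each “hop” re-attends over the encoded source sentence, conditioned on what the previous hop found:

```python
import torch

def attend(query, encoder_states):
    # query: (d,), encoder_states: (n, d); softmax over dot-product scores.
    weights = torch.softmax(encoder_states @ query, dim=0)
    return weights @ encoder_states                  # weighted sum of the encoder states

n_words, d = 8, 32
encoder_states = torch.randn(n_words, d)             # toy encoding of a French sentence
query = torch.randn(d)                               # decoder state for the word being produced

# Multi-hop: each "glance" at the sentence is conditioned on the previous one.
for hop in range(3):
    context = attend(query, encoder_states)
    query = query + context                          # the next hop "looks again" with an updated focus
```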

Here is a French-to-English translation example from Facebook showing the first iteration, which encodes each French word and then uses the “multi-hop” method to select the most appropriate English translation.

The second feature is gating, which controls the flow of information between hidden layers. During context understanding, the gates decide, by scaling the CNN’s outputs, which information best predicts the next word.
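
The convolutional sequence-to-sequence models use gated linear units (GLUs) for this; below is a minimal sketch of the mechanism with arbitrary sizes, not the production FairSeq code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

channels, seq_len = 64, 20
x = torch.randn(1, channels, seq_len)

# The convolution outputs twice the channels; GLU splits them into a "content" half
# and a "gate" half, and the sigmoid of the gate scales the content, deciding how
# much of each feature is allowed to flow on to the next layer.
conv = nn.Conv1d(channels, 2 * channels, kernel_size=3, padding=1)
gated = F.glu(conv(x), dim=1)                        # shape: (1, channels, seq_len)
print(gated.shape)
```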

More than Machine Translation: Temporal Convolutional Networks (TCNs)

By mid-2017, Facebook and Google had completely solved the time-efficiency problem of machine translation by combining CNNs with the attention function. More importantly, it would be a waste to bury such a technique in the small task of speeding up machine translation. Can it be generalized to all the problems where RNNs have been used? The answer is, of course, yes.

In 2017, a number of studies were published, some around the same time as Facebook’s and Google’s work. One of the more comprehensive papers is “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling” by Shaojie Bai, J. Zico Kolter, and Vladlen Koltun.

The original link: https://arxiv.org/pdf/1803.01271.pdf

Some of my colleagues have named this new architecture Temporal Convolutional Networks (TCNs), though the name may change as it is adopted in industry.

The paper directly compares TCNs with RNNs, LSTMs, and GRUs on 11 different industry-standard RNN problems, none of which involves language translation.

The results show that TCNs were not only faster but also more accurate on 9 of the problems, and tied with the GRU on one (in the results table of the original paper, bold text marks the highest-accuracy entries).

TCN pros and cons

Shaojie Bai, J. Zico Kolter and Vladlen Koltun also gave the following practical list of advantages and disadvantages of TCN.

  • Speed is important. Faster networks shorten the feedback loop. Because TCNs allow large-scale parallel processing, training and validation take less time.

  • TCNs provide more flexibility in changing the size of the receptive field, mainly by stacking more convolutional layers, using larger dilation factors, and increasing the filter size (see the small worked example after this list). These options give better control over the model’s memory length.

  • The backpropagation path of a TCN is different from the temporal direction of the sequence, which avoids the exploding- and vanishing-gradient problems that often occur in RNNs.

  • Less memory is required for training, especially for long input sequences.
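
As a rough illustration of the receptive-field point above (using the simple one-convolution-per-layer scheme from the earlier sketch, not the exact architecture from the paper), a few extra layers or a slightly larger kernel expands the model’s memory dramatically:

```python
# Receptive field of a dilated causal stack with one convolution per layer
# and dilations 1, 2, 4, ...: rf = 1 + (kernel_size - 1) * sum(dilations).
def receptive_field(kernel_size, n_layers):
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(n_layers))

for kernel_size, n_layers in [(3, 4), (3, 8), (7, 8)]:
    print(kernel_size, n_layers, receptive_field(kernel_size, n_layers))   # 31, 511, 1531
```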

However, the authors point out that TCNs may not be as adaptable as CNNs when it comes to transfer learning, because the amount of history a model needs for its predictions can differ from domain to domain. When a model is moved from a problem that needs only a little memory to one that needs much longer memory, a TCN may perform poorly because its receptive field is not large enough.

Consider, further, that TCNs have already been applied with great success in many important fields and can solve almost any sequence problem. So we need to reconsider our earlier view: sequence problems are no longer the exclusive domain of RNNs, and TCNs should be a leading candidate for our future projects.

About the author: Bill Vorhies is Editorial Director of Data Science Central and a data scientist who has been working in the field since 2001. Email: [email protected].

This article appears in Data Science Central.

https://www.datasciencecentral.com/profiles/blogs/temporal-convolutional-nets-tcns-take-over-from-rnns-for-nlp-pred