From The Gradient, by Sebastian Ruder, compiled by Machine Heart.

Pre-trained ImageNet models are widely used in computer vision and can be adapted to a range of CV tasks such as object detection and semantic segmentation. In natural language processing, however, pre-training has usually been limited to word embeddings, and there has been no pre-training method that covers the whole model. Sebastian Ruder argues that language models have the potential to serve as such whole-model pre-training, extracting language features from shallow to deep and supporting a wide range of NLP tasks such as machine translation, question answering, and automatic summarization. Ruder also shows the effect of using a language model as a pre-trained model and argues that the “ImageNet” of NLP has finally arrived.

The world of natural language processing (NLP) is changing dramatically.

For a long time, word vectors have been the core representation technique in natural language processing. Their dominance, however, is being shaken by a number of exciting new challengers such as ELMo, ULMFiT, and the OpenAI Transformer. These approaches have attracted a great deal of attention by demonstrating that pretrained language models can achieve state-of-the-art results on a wide range of NLP tasks. They herald a watershed moment: they may have as broad an impact on NLP as pretrained ImageNet models have had on computer vision.


From shallow to deep pre-training

Pretrained word vectors have brought great improvements to NLP. word2vec, proposed in 2013 as an approximation to language modeling, was adopted for its efficiency and ease of use at a time when hardware was much slower and deep learning models lacked broad support. Since then, the standard way of running an NLP project has remained largely unchanged: word embeddings pretrained on large amounts of unlabeled data with algorithms such as word2vec and GloVe are used to initialize the first layer of a neural network, and the remaining layers are then trained on task-specific data. On most tasks with limited training data, this approach yields gains of two to three percentage points. Influential as these pretrained word embeddings are, they have a major limitation: they carry prior knowledge only into the first layer of the model, and the rest of the network still has to be trained from scratch.
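
As a concrete sketch of this recipe, the snippet below shows how pretrained vectors might initialize only the first layer of a PyTorch model; the array name, dimensions, and task head are illustrative placeholders, not anything from the article:

```python
# A minimal sketch of the standard recipe described above: only the embedding
# layer receives prior knowledge; everything above it is trained from scratch.
# `pretrained_vectors` stands in for word2vec/GloVe vectors loaded elsewhere.
import numpy as np
import torch
import torch.nn as nn

vocab_size, emb_dim = 20_000, 300
pretrained_vectors = np.random.randn(vocab_size, emb_dim).astype("float32")  # placeholder

class TextClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # First layer: initialized from pretrained word embeddings.
        self.embed = nn.Embedding.from_pretrained(
            torch.tensor(pretrained_vectors), freeze=False
        )
        # Remaining layers: randomly initialized, trained on task-specific data.
        self.encoder = nn.LSTM(emb_dim, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, tokens):
        hidden, _ = self.encoder(self.embed(tokens))
        return self.classifier(hidden[:, -1])
```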

Relationships captured by Word2vec. (Source: TensorFlow Tutorial)

word2vec and related methods are shallow approaches that trade expressiveness for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that encode only the edges of an image: they help on many tasks, but they fail to capture the higher-level information that might be even more useful. A model initialized with word vectors must learn from scratch not only how to disambiguate words, but also how to extract meaning from sequences of words. This is the core of language understanding, and it requires modeling complex linguistic phenomena such as semantic composition, polysemy, anaphora, long-term dependencies, agreement, and negation. It should therefore come as no surprise that NLP models initialized with these shallow representations still need huge numbers of examples to reach good performance.
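
The kinds of relationships these vectors do capture (as in the figure above) can be inspected directly. Here is a small sketch using the gensim downloader, assuming internet access and the publicly hosted "glove-wiki-gigaword-100" vectors; any pretrained embedding would do:

```python
# Probe a set of off-the-shelf word vectors for analogy-style relationships.
# The specific vector set is an assumption, not one used in the article.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# Classic analogy: king - man + woman is close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Similarity scores reflect shallow distributional relatedness only.
print(vectors.similarity("coffee", "tea"))
```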

At the heart of the latest advances of ULMFiT, ELMo, and the OpenAI Transformer is one key paradigm shift: going from just initializing the first layer of a model to pre-training the entire model with hierarchical representations. If learning word vectors is like learning only the edges of an image, then these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts.

Interestingly, pre-training entire models to learn both low-level and high-level features has been practiced in the computer vision community for years. In most cases, the pretrained model is obtained by learning to classify images on the large ImageNet dataset. ULMFiT, ELMo, and the OpenAI Transformer have now brought the NLP community its own “ImageNet” in natural language: a task that allows models to learn the higher-level nuances of language, just as ImageNet allows pretrained CV models to learn general-purpose image features. Later in this article, we draw the analogy between language modeling and ImageNet pre-training in computer vision and show why this approach seems so promising.


ImageNet

The ImageNet Large Scale Visual Recognition Challenge. (Credit: Xavier Giro-i-Nieto)

ImageNet has had an important influence on machine learning research. The dataset was originally published in 2009 and quickly evolved into the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In 2012, the deep neural network submitted by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton performed 41% better than the second-place entry, showing that deep learning was a viable machine learning strategy and arguably triggering the explosion of deep learning in machine learning research.

ImageNet’s success highlighted that in the era of deep learning, data is at least as important as algorithms. The ImageNet dataset not only enabled that very important 2012 demonstration of the capabilities of deep learning; it also enabled an equally important breakthrough in transfer learning: researchers quickly realized that the weights learned on ImageNet could be used to initialize models for entirely different datasets and significantly improve performance. This “fine-tuning” approach can achieve good performance with as little as one positive example per category (Donahue et al., 2014).
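
For readers less familiar with the CV side, the following is a hedged sketch of that fine-tuning recipe using torchvision; the target class count and freezing strategy are illustrative choices, not taken from the cited papers:

```python
# Fine-tuning an ImageNet-pretrained model on a new task: keep the pretrained
# backbone, replace the classification head, then train on the target data.
import torch
import torch.nn as nn
from torchvision import models

# Newer torchvision API; older versions use `pretrained=True` instead.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the ImageNet-pretrained weights (optionally unfreeze top layers later).
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for a new task with, say, 5 classes.
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# From here, train as usual on the (possibly very small) target dataset.
```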

Features trained on ILSVRC-2012 generalize to the SUN-397 dataset. (Source: Donahue et al., 2014)

Pretrained ImageNet models have been used to achieve state-of-the-art results on tasks such as object detection, semantic segmentation, human pose estimation, and video recognition. At the same time, they have enabled the application of CV to domains where training examples are few and annotation is expensive. Transfer learning via pre-training on ImageNet is in fact so effective in CV that not using it would now be considered foolhardy (Mahajan et al., 2018).


What’s in ImageNet?

To determine what an ImageNet for language might look like, we first have to identify what makes ImageNet so well suited to transfer learning. Previous studies reveal only part of the answer: reducing the number of examples per class or the number of classes results in only a small performance drop, and fine-grained classes and more data do not always lead to better results.

Rather than looking at the data directly, it is more prudent to examine what models trained on the data actually learn. It is well known that the features of deep neural networks trained on ImageNet transition from general to task-specific from the first layer to the last: lower layers learn to model low-level features such as edges, while higher layers learn to model high-level concepts such as patterns and entire parts or objects, as shown in the figure below. Importantly, knowledge of object edges, structures, and visual composition is relevant to many CV tasks, which explains why these layers transfer well. A key property of an ImageNet-like dataset is thus that it encourages a model to learn features that generalize to new tasks in the problem domain.

Visualization of information captured by features of different layers in GoogLeNet trained on ImageNet. (Credit: Distill)

Beyond that, it is hard to generalize further about why transfer from ImageNet works so well. For example, another possible advantage of the ImageNet dataset is the quality of its data: its creators went to great lengths to ensure that the annotations are reliable and consistent. However, work on distant supervision offers a counterpoint, indicating that large amounts of weakly labeled data are often sufficient. In fact, Facebook researchers recently showed that they could pre-train a model by predicting hashtags on billions of social media images and reach state-of-the-art accuracy on ImageNet.

In the absence of more specific insights, we are left with two key requirements:

  1. An ImageNet-like dataset should be sufficiently large, on the order of millions of training examples.
  2. It should be representative of the problem space of the discipline.


ImageNet for language tasks

Compared to CV, models in NLP are typically much shallower. Most analysis of features has therefore focused on the first embedding layer, and little work has investigated the higher-level properties of transfer learning. Let us consider which datasets are large enough. Given the current state of NLP, there are several candidate tasks that could be used to pre-train NLP models.

Reading comprehension is the task of answering natural-language questions about a paragraph. The most popular dataset for this task is the Stanford Question Answering Dataset (SQuAD), which contains more than 100,000 question-answer pairs and asks the model to answer a question by highlighting a span of words in the paragraph, as shown below:

An example question-answer pair from SQuAD. (Rajpurkar et al., 2016, SQuAD: 100,000+ Questions for Machine Comprehension of Text)

Natural language inference is the task of identifying the relation (entailment, contradiction, or neutrality) between a piece of text and a hypothesis. The most popular dataset for this task is the Stanford Natural Language Inference (SNLI) Corpus, which contains 570,000 human-written English sentence pairs. An example from this dataset is shown in the figure below.

SNLI:nlp.stanford.edu/projects/sn…

An example from the SNLI dataset. (Bowman et al., 2015, A Large Annotated Corpus for Learning Natural Language Inference)

Machine translation, the conversion of text in one language into another, is one of the most studied tasks in NLP, and over the years large amounts of training data have been accumulated for common language pairs, such as the 40 million English-French sentence pairs of WMT 2014. Below are two example translation pairs.

French to English Translation from Newstest2014 (Artetxe et al., 2018, Unsupervised Neural Machine Translation)

Constituency parsing extracts the syntactic structure of a sentence in the form of a (linearized) parse tree, as shown in the figure below. In the past, millions of weakly labeled parses have been used to train sequence-to-sequence models for this task (see Grammar as a Foreign Language).

Parse trees and their linearization (Vinyals et al., 2015, Grammar as a Foreign Language)
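
To make "linearized" concrete, the toy function below flattens a bracketed tree into the token sequence a sequence-to-sequence model would emit. The output format loosely follows Vinyals et al. (2015), though their exact vocabulary and preprocessing differ:

```python
# Turn a nested (label, children) constituency tree into a flat bracketed
# string, so parsing can be treated as sequence-to-sequence prediction.
def linearize(tree):
    label, children = tree
    if not children:               # leaf: a part-of-speech tag
        return label
    inside = " ".join(linearize(child) for child in children)
    return f"({label} {inside} ){label}"

# Toy parse of "John has a dog ." using Penn Treebank-style tags.
parse = ("S", [
    ("NP", [("NNP", [])]),
    ("VP", [("VBZ", []), ("NP", [("DT", []), ("NN", [])])]),
    (".", []),
])

print(linearize(parse))
# -> (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S
```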

Language modeling (LM) tries to predict the next word given the previous words. An established benchmark dataset consists of about one billion words, but since the task is unsupervised, any amount of text can be used for training. Below is an example from the popular WikiText-2 dataset, which is composed of Wikipedia articles.

An example from the WikiText-2 language modeling dataset. (Credit: Salesforce)

WikiText-2: einstein.ai/research/th…
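
Since language modeling is the central task of this article, here is a minimal, self-contained sketch of the training objective: predict each next token and minimize cross-entropy. The architecture, vocabulary size, and random batch are placeholders, not the setup behind WikiText-2 or the billion-word benchmark:

```python
# A tiny next-word language model: at every position, predict the following
# token from the previous ones and train with cross-entropy.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)            # logits over the vocabulary

vocab_size = 10_000
model = TinyLM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A fake batch of token ids standing in for running text from any corpus.
batch = torch.randint(0, vocab_size, (32, 35))
inputs, targets = batch[:, :-1], batch[:, 1:]   # shift by one: predict the next word

optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```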

All of these tasks provide, or allow the collection of, a sufficient number of examples for training. Indeed, these tasks (and many others, such as sentiment analysis, skip-thoughts, and autoencoding) have all been used in recent months to pre-train representations.

Although any data contains some bias, human annotation can inadvertently introduce additional signals that a model will exploit. Recent research has shown that the current state-of-the-art models on tasks such as reading comprehension and natural language inference do not in fact exhibit deep natural language understanding, but instead latch onto certain cues to perform superficial pattern matching. For example, Gururangan et al. (2018) showed in Annotation Artifacts in Natural Language Inference Data that annotators tend to produce entailment examples by removing gender or number information, and contradictions by introducing negation words. Using only these cues, a model can classify the hypotheses of the SNLI dataset with 67% accuracy without ever looking at the premises.

So the harder question is: which task is most representative of NLP? Put another way, which task would allow a model to learn the most about natural language understanding and the relationships within language?


Modeling language

To predict the most likely next word in a sentence, a model needs not only to be able to express syntax (the grammatical form of the predicted word has to agree with its modifiers or its verb), it also needs to understand semantics; the most accurate models must even incorporate something like world knowledge or common sense. Consider an incomplete sentence such as “The service was poor, but the food was”: to predict a continuation like “yummy” or “delicious,” the model not only needs to remember the attributes used to describe food, it also needs to recognize that the conjunction “but” introduces a contrast, so that the new attribute carries the opposite sentiment of “poor.”
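
This behavior is easy to probe with any off-the-shelf causal language model. The sketch below, which assumes the Hugging Face transformers library and the public "gpt2" checkpoint (not a model discussed in the article), simply asks for the most likely continuations of the example sentence:

```python
# Ask a pretrained language model for its top next-word candidates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The service was poor, but the food was"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]          # distribution over the next token
top = torch.topk(next_token_logits, k=5)
for score, idx in zip(top.values, top.indices):
    print(tokenizer.decode([int(idx)]), float(score))
```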

Language modeling has been shown to capture many properties of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations, and sentiment. Compared with related unsupervised tasks such as autoencoding, language modeling performs very well on syntactic tasks even with only a small amount of training data.

The biggest advantage of language modeling is that training data comes free with any text corpus, so there is a practically unlimited amount of it. This is important because NLP is not limited to English: roughly 4,500 languages worldwide are spoken by more than 1,000 people each. Language modeling as a pre-training task opens the door to languages that were previously underserved. We can train language models on raw text without supervision and then apply them to tasks such as translation and information extraction. For rare languages where even unlabeled data is scarce, a multilingual language model can first be trained jointly on several related languages, analogous to cross-lingual word embeddings.

Different phases of ULMFiT (Howard and Ruder, 2018)

So far, our argument for language modeling as a pre-training task has been purely conceptual. In recent months, however, we have also seen experimental evidence: Embeddings from Language Models (ELMo), Universal Language Model Fine-tuning (ULMFiT), and the OpenAI Transformer have empirically demonstrated how language models can be used for pre-training, as in the ULMFiT pipeline shown above. All three methods use pretrained language models to achieve state-of-the-art results on a wide range of natural language processing tasks, including text classification, question answering, natural language inference, coreference resolution, and sequence labeling.

In many cases, such as ELMo as shown below, algorithms built around a pretrained language model beat the previous state of the art by 10 to 20 percent on widely studied benchmarks. ELMo also won the best paper award at NAACL-HLT 2018. Finally, these models are extremely sample-efficient, requiring only a few hundred examples to reach good performance, and even enabling zero-shot learning.

ELMo’s improvements on a range of NLP tasks. (Credit: Matthew Peters)

Given the advances achieved at this step, it seems likely that within a year NLP practitioners will download pretrained language models, rather than pretrained word embeddings, for use in their own models, just as pretrained ImageNet models are the starting point for most CV projects today.

However, like word2vec, the task of language modeling has its own natural limitations: it is only a proxy for true language understanding, and a single monolithic model may not capture all the information needed for a specific downstream task. For example, to answer questions about, or follow the trajectory of, characters in a story, a model needs to learn to perform anaphora or coreference resolution. Moreover, language models can only capture what they have seen. Certain kinds of information, such as most common-sense knowledge, are hard to learn from text alone and require incorporating external information.

One outstanding question is how to transfer the information from a pretrained language model to a downstream task. There are two main paradigms: using the pretrained language model as a fixed feature extractor and integrating its representations as features into a randomly initialized model (as ELMo does), or fine-tuning the full language model (as ULMFiT does). The latter is common in computer vision, where either the top layers or the whole model are adjusted during training. Although NLP models are generally shallower and therefore require different fine-tuning techniques than their vision counterparts, recent pretrained models are becoming deeper. Next month I will discuss the role of each core component of transfer learning for NLP: an expressive language model encoder (such as a deep BiLSTM or the Transformer), the amount and nature of the data used for pre-training, and how the pretrained model is fine-tuned.
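
Schematically, the two paradigms differ mainly in whether the pretrained weights are updated. The sketch below is purely illustrative: the LSTM stands in for a pretrained LM encoder and is not ELMo's or ULMFiT's actual code:

```python
# The two transfer paradigms, in schematic PyTorch.
import torch.nn as nn

pretrained_lm = nn.LSTM(300, 512, num_layers=2, batch_first=True)  # placeholder for a pretrained LM encoder
task_head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))

# Paradigm 1 (ELMo-style): frozen feature extractor.
for param in pretrained_lm.parameters():
    param.requires_grad = False   # LM representations are used as fixed features
                                  # inside a randomly initialized task model.

# Paradigm 2 (ULMFiT-style): fine-tune the whole language model.
for param in pretrained_lm.parameters():
    param.requires_grad = True    # LM weights are updated on the target task,
                                  # often with smaller learning rates for lower layers.
```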


But what is the theory?

So far, our analysis has been largely conceptual and empirical; it is still hard to understand exactly why models pretrained on ImageNet, or on language modeling, transfer so well. A more formal way to think about the generalization ability of pretrained models is the framework of “bias learning” (Baxter, 2000). Suppose our problem domain covers all permutations of tasks in a particular discipline, such as computer vision; this constitutes the environment. We are given a number of datasets that allow us to induce a family of hypothesis spaces H = {H'}. The goal of bias learning is to find the bias, i.e. a hypothesis space H' ∈ H, that maximizes performance across the entire (possibly infinite) environment.
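
To make this objective concrete, one standard way of writing Baxter-style bias learning (the notation below is ours, not the article's) is:

$$ H^{*} \;=\; \arg\min_{H' \in H} \; \mathbb{E}_{P \sim Q}\Big[\, \inf_{h \in H'} \; \mathbb{E}_{(x,y) \sim P}\big[\ell(h(x), y)\big] \Big] $$

where Q is a distribution over tasks P in the environment and ℓ is the per-task loss: the chosen hypothesis space should contain a good hypothesis for tasks drawn from the environment on average, which is the sense in which ImageNet classification or language modeling can serve as a useful bias.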

Empirical and theoretical results in multi-task learning (Caruana, 1997; Baxter, 2000) suggest that a bias learned over a sufficient number of tasks generalizes to unseen tasks in the same environment. Viewed through the lens of multi-task learning, a model trained on ImageNet learns a large number of binary classification tasks (one per class). These tasks come from the space of natural, real-world images and are probably representative of many other CV tasks as well. Similarly, a language model, by learning a large number of classification tasks (one per word), may induce representations that help many other tasks in the natural language domain. Still, more research is needed to understand theoretically why language modeling seems to work so well for transfer learning.


NLP’s ImageNet era

NLP is ripe for a real shift to transfer learning. Given the impressive experimental results of ELMo, ULMFiT, and the OpenAI Transformer, it seems only a matter of time before pretrained word embeddings fall out of fashion and are replaced by pretrained language models in every NLP practitioner’s toolbox. This will likely open up many new possibilities for NLP in settings where annotated data is scarce. The king is dead, long live the king!

thegradient.pub/nlp-imagene…