Many readers have recently asked how deep learning for text classification has developed. The author, He Congqing, has therefore compiled the classic deep text classification methods of recent years, hoping to help readers understand how deep learning is applied to text classification.

Convolutional Neural Networks for Sentence Classification (EMNLP 2014)

The TextCNN method proposed by Kim at EMNLP 2014 achieved good results on multiple datasets. Because it is fast to compute and easy to parallelize, it has been widely used in industry. The TextCNN model is shown in the figure below.

The TextCNN model first maps the text to vectors, then uses multiple filters to capture local semantic information, and then applies max pooling to keep the most important features. Finally, these features are fed into a fully connected layer to obtain the probability distribution over labels.
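
As a rough illustration of the pipeline described above, the following NumPy sketch runs one forward pass for a single sentence. The filter widths (3, 4, 5), the ReLU activation, the dimensions, and names such as `textcnn_forward` are assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def textcnn_forward(token_ids, emb, filters, W_out, b_out):
    """One forward pass of a TextCNN-style classifier for a single sentence."""
    x = emb[token_ids]                                   # (seq_len, emb_dim)
    pooled = []
    for W in filters:                                    # W: (width, emb_dim, n_maps)
        width = W.shape[0]
        # Slide a window of `width` words over the sentence.
        windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
        conv = np.maximum(np.einsum("twd,wdm->tm", windows, W), 0.0)  # ReLU (assumed)
        pooled.append(conv.max(axis=0))                  # max-over-time pooling
    features = np.concatenate(pooled)                    # (n_widths * n_maps,)
    return softmax(features @ W_out + b_out)             # fully connected layer + softmax

# Toy setup with random, untrained weights (all sizes are illustrative).
rng = np.random.default_rng(0)
vocab, emb_dim, n_maps, n_classes = 50, 8, 4, 3
emb = rng.normal(size=(vocab, emb_dim))
filters = [rng.normal(size=(w, emb_dim, n_maps)) * 0.1 for w in (3, 4, 5)]
W_out = rng.normal(size=(3 * n_maps, n_classes)) * 0.1
b_out = np.zeros(n_classes)

probs = textcnn_forward(np.array([1, 5, 7, 2, 9, 4, 3]), emb, filters, W_out, b_out)
print(probs)
```

Each filter width yields one pooled feature vector, so the classifier sees a fixed-size input regardless of sentence length.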

Code reference:



Figure 1: TextCNN model architecture

Document Modeling with Gated Recurrent Neural Network for Sentiment Classification (EMNLP 2015)

Tang et al. proposed a sentiment classification model that uses a GRU to model documents. The model is shown below.

The model first maps the text to vectors, and then uses a CNN or LSTM (a CNN with three filters in the paper) to build sentence representations. To capture the global semantics of a sentence, the output of each filter is passed to an average pooling layer followed by a tanh activation. Finally, the representations produced by the convolution filters of different widths are averaged to obtain the sentence vector.

The resulting sentence representations are then fed into a GRU to obtain the document vector. Finally, the document vector is passed to a softmax layer to obtain the probability distribution over labels.
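
The two stages above (sentence composition, then document composition) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the filter widths (1, 2, 3), the use of the last GRU state as the document vector, the dimensions, and all function names are chosen for illustration, and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sentence_vector(x, conv_filters):
    """x: (n_words, emb_dim). Convolve, average-pool, apply tanh, then average over widths."""
    reps = []
    for W in conv_filters:                               # W: (width, emb_dim, hidden)
        width = W.shape[0]
        windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
        conv = np.einsum("twd,wdh->th", windows, W)      # linear convolution
        reps.append(np.tanh(conv.mean(axis=0)))          # average pooling, then tanh
    return np.mean(reps, axis=0)                         # average over filter widths

def gru_step(x, h, p):
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])               # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])               # reset gate
    h_tilde = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])   # candidate state
    return (1.0 - z) * h + z * h_tilde

def document_probs(sentences, conv_filters, gru, W_out, b_out):
    h = np.zeros(W_out.shape[0])
    for x in sentences:                                  # GRU over sentence vectors
        h = gru_step(sentence_vector(x, conv_filters), h, gru)
    return softmax(h @ W_out + b_out)                    # last state as document vector (assumed)

rng = np.random.default_rng(0)
emb_dim, hidden, n_classes = 6, 5, 2
conv_filters = [rng.normal(size=(w, emb_dim, hidden)) * 0.1 for w in (1, 2, 3)]
gru = {k: rng.normal(size=(hidden, hidden)) * 0.1
       for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
W_out, b_out = rng.normal(size=(hidden, n_classes)), np.zeros(n_classes)

# A toy document of three sentences with 4, 6 and 3 words (random embeddings).
sentences = [rng.normal(size=(n, emb_dim)) for n in (4, 6, 3)]
doc_probs = document_probs(sentences, conv_filters, gru, W_out, b_out)
print(doc_probs)
```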

Figure 2: Neural network model for document level sentiment classification

Recurrent Convolutional Neural Networks for Text Classification (AAAI 2015)

Lai et al. proposed a recurrent convolutional neural network for text classification that requires no hand-crafted features, referred to as RCNN.

RCNN first uses a bidirectional RNN to capture the left and right context of each word and concatenates them with the word embedding. It then applies a convolution layer with filter_size = 1, followed by max pooling, to obtain the most relevant vector representation of the document. Finally, this vector is fed into a softmax layer to obtain the probability distribution over labels.
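
The steps above can be sketched in NumPy as follows. This is a minimal sketch, not the authors' code: the zero initial contexts, the tanh activations, the dimensions, and the parameter names in the dictionary `p` are assumptions made for the example, and the weights are random.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rcnn_forward(x, p):
    """x: (T, d) word embeddings of one document."""
    T, _ = x.shape
    c = p["W_l"].shape[0]
    left = np.zeros((T, c))                  # left context of each word
    right = np.zeros((T, c))                 # right context of each word
    for i in range(1, T):                    # forward recurrence
        left[i] = np.tanh(left[i - 1] @ p["W_l"] + x[i - 1] @ p["W_sl"])
    for i in range(T - 2, -1, -1):           # backward recurrence
        right[i] = np.tanh(right[i + 1] @ p["W_r"] + x[i + 1] @ p["W_sr"])
    concat = np.concatenate([left, x, right], axis=1)    # (T, 2c + d)
    # A position-wise linear map is equivalent to a convolution with filter size 1.
    latent = np.tanh(concat @ p["W2"] + p["b2"])          # (T, hidden)
    doc = latent.max(axis=0)                              # max pooling over positions
    return softmax(doc @ p["W4"] + p["b4"])

rng = np.random.default_rng(0)
T, d, c, hidden, n_classes = 5, 6, 4, 5, 3
p = {
    "W_l": rng.normal(size=(c, c)) * 0.1, "W_sl": rng.normal(size=(d, c)) * 0.1,
    "W_r": rng.normal(size=(c, c)) * 0.1, "W_sr": rng.normal(size=(d, c)) * 0.1,
    "W2": rng.normal(size=(2 * c + d, hidden)) * 0.1, "b2": np.zeros(hidden),
    "W4": rng.normal(size=(hidden, n_classes)) * 0.1, "b4": np.zeros(n_classes),
}
rcnn_probs = rcnn_forward(rng.normal(size=(T, d)), p)
print(rcnn_probs)
```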

Code reference:



Figure 3: Schematic diagram of the RCNN model structure

Recurrent Neural Network for Text Classification with Multi-Task Learning (IJCAI 2016)

Liu et al. proposed three RNN-based information-sharing mechanisms that model text with task-specific and shared layers for multi-task text classification.

Model 1 (Uniform-Layer Architecture): All tasks share the same LSTM layer, and a randomly initialized, trainable task-specific vector is concatenated to the input for each task. The final hidden state of the LSTM is passed to the softmax layer.

Model 2 (Coupled-Layer Architecture): Each task has its own LSTM layer, but at each time step the hidden states of all tasks are fed in together with the next input word, and the hidden state at the last time step is used for classification.

Model 3 (Shared-Layer Architecture): In addition to a shared Bi-LSTM layer that captures shared information, each task has its own LSTM layer, whose input at each time step consists of the current word and the hidden state of the Bi-LSTM.
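
To make the first of these architectures more concrete, here is a minimal NumPy sketch of one possible reading of Model 1: each word is represented by a task-specific embedding concatenated with a shared embedding, a single LSTM is shared by all tasks, and each task has its own softmax layer. The task names, dimensions, and helper names are hypothetical, and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_last_state(xs, W, U, b):
    """Run an LSTM over the rows of xs and return the final hidden state."""
    hidden = U.shape[0]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in xs:
        i, f, o, g = np.split(x @ W + h @ U + b, 4)      # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def predict(task, token_ids, shared_emb, task_embs, lstm, heads):
    # Task-specific vectors concatenated with shared vectors (assumed input layout).
    x = np.concatenate([task_embs[task][token_ids], shared_emb[token_ids]], axis=1)
    h = lstm_last_state(x, *lstm)                        # the LSTM is shared by all tasks
    W_out, b_out = heads[task]                           # task-specific softmax layer
    return softmax(h @ W_out + b_out)

rng = np.random.default_rng(0)
vocab, d_task, d_shared, hidden = 20, 3, 4, 5
tasks = {"movie": 2, "product": 3}                       # hypothetical tasks -> class counts
shared_emb = rng.normal(size=(vocab, d_shared))
task_embs = {t: rng.normal(size=(vocab, d_task)) for t in tasks}
lstm = (rng.normal(size=(d_task + d_shared, 4 * hidden)) * 0.1,
        rng.normal(size=(hidden, 4 * hidden)) * 0.1,
        np.zeros(4 * hidden))
heads = {t: (rng.normal(size=(hidden, n)) * 0.1, np.zeros(n)) for t, n in tasks.items()}

ids = np.array([3, 8, 1, 15])
movie_probs = predict("movie", ids, shared_emb, task_embs, lstm, heads)
product_probs = predict("product", ids, shared_emb, task_embs, lstm, heads)
print(movie_probs, product_probs)
```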

Figure 4: Three architectures for modeling text with multi-task learning

Hierarchical Attention Networks for Document Classification (NAACL 2016)

Yang et al. proposed a hierarchical attention network (HAN) for document classification. Like Tang et al., this paper addresses document classification; however, it introduces attention mechanisms at both the word level and the sentence level, so that content of differing importance receives different weights when the document representation is built. Attention can also alleviate the vanishing gradient problem that RNNs face when capturing sequence information across a document. The schematic diagram of the HAN model is shown below.

The HAN model first uses a Bi-GRU to capture context information at the word level. Since the words in a sentence do not contribute equally to the sentence representation, the authors introduce an attention mechanism to pick out the words that matter most and aggregate the representations of these informative words into a sentence vector. For the principle of the attention mechanism, see:


The sentence vectors are then fed into a Bi-GRU, which captures sentence-level context information. Similarly, to reward the sentences that are clues for classifying the document correctly, the authors again use an attention mechanism to measure the importance of each sentence and obtain the document vector. Finally, the document vector is fed into a softmax layer to obtain the probability distribution over labels.
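
The attention pooling used at both levels can be sketched as follows. To keep the focus on attention, this minimal NumPy sketch omits both Bi-GRU encoders: random matrices stand in for the word-level hidden states, and the sentence vectors are attended to directly. The function name `attention_pool` and all dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, W, b, context):
    """H: (n, dim) hidden states. Returns their attention-weighted sum and the weights."""
    U = np.tanh(H @ W + b)               # hidden representation of each state
    alpha = softmax(U @ context)         # similarity to a learned context vector
    return alpha @ H, alpha

rng = np.random.default_rng(0)
dim, att = 6, 4
word_att = (rng.normal(size=(dim, att)), np.zeros(att), rng.normal(size=att))
sent_att = (rng.normal(size=(dim, att)), np.zeros(att), rng.normal(size=att))

# Stand-ins for word-level Bi-GRU states: three sentences with 5, 3 and 7 words.
doc = [rng.normal(size=(n, dim)) for n in (5, 3, 7)]

# Word-level attention gives one vector per sentence.
sent_vecs = np.stack([attention_pool(H, *word_att)[0] for H in doc])
# Sentence-level attention gives the document vector (sentence Bi-GRU omitted here).
doc_vec, sent_weights = attention_pool(sent_vecs, *sent_att)
print(sent_weights)
```

The returned weights are what makes the model easy to inspect: they indicate which words and sentences contributed most to the document vector.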

Code reference:



Figure 5: Schematic diagram of HAN model structure

Bag of Tricks for Efficient Text Classification (EACL 2017)

Joulin et al. proposed a simple and effective text classification model known as fastText.

The fastText model takes a sequence of words (a piece of text or a sentence) as input. The words in the sequence form a feature vector, which is mapped to a hidden layer by a linear transformation, and the hidden layer is then mapped to the labels. The output is the probability that the word sequence belongs to each category. fastText applies a nonlinear function (softmax) when predicting the label, but uses no nonlinear activation in the hidden layer.
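
A minimal NumPy sketch of this forward pass is shown below. It assumes plain word features and a full softmax; the n-gram features and hierarchical softmax that the paper uses for accuracy and speed are left out. The names `fasttext_probs`, `A`, and `B` and the sizes are chosen for illustration, with random weights.

```python
import numpy as np

def fasttext_probs(token_ids, A, B):
    """A: (vocab, hidden) word vectors; B: (hidden, n_classes) output weights."""
    hidden = A[token_ids].mean(axis=0)   # average the word vectors, no nonlinearity
    logits = hidden @ B                  # linear map to the labels
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # softmax over categories

rng = np.random.default_rng(0)
vocab, hidden_dim, n_classes = 30, 10, 4
A = rng.normal(size=(vocab, hidden_dim))
B = rng.normal(size=(hidden_dim, n_classes))
ids = np.array([2, 7, 7, 19, 4])

ft_probs = fasttext_probs(ids, A, B)
print(ft_probs)

# Averaging looked-up vectors equals a linear map of the normalized bag-of-words vector.
bow = np.bincount(ids, minlength=vocab) / len(ids)
print(np.allclose(bow @ A, A[ids].mean(axis=0)))
```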

Code reference:



Figure 6: fastText model structure diagram

Deep Pyramid Convolutional Neural Networks for Text Categorization (ACL 2017)

Johnson and Zhang proposed a deep word-level CNN, referred to as DPCNN, that captures the global semantic representation of text. The model reaches its best performance by increasing network depth without adding much computational overhead. The schematic diagram of the model structure is shown below.

The DPCNN model first uses “text region embedding”, which extends the commonly used word embedding to the embedding of text regions containing one or more words. This is similar to adding one convolutional layer.

Convolution blocks (two convolution layers plus a shortcut connection, which is similar to a residual connection) are then stacked, with max pooling layers of stride 2 in between for downsampling. Finally, a max pooling layer produces one document vector for each document.
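
The pyramid shape that results from repeated stride-2 pooling can be seen in the following NumPy sketch. It is a minimal sketch under stated assumptions: random vectors stand in for the text region embeddings, convolutions use width 3 with zero padding, ReLU is applied before each convolution, and pooling uses window 3 with stride 2. All names and sizes are illustrative.

```python
import numpy as np

def conv3(x, W, b):
    """Width-3 convolution over time with zero padding. x: (T, C), W: (3, C, C)."""
    padded = np.pad(x, ((1, 1), (0, 0)))
    windows = np.stack([padded[i:i + 3] for i in range(len(x))])
    return np.einsum("twc,wcd->td", windows, W) + b

def conv_block(x, layers):
    """Two convolution layers plus a shortcut connection."""
    h = x
    for W, b in layers:
        h = conv3(np.maximum(h, 0.0), W, b)   # ReLU before each convolution (assumed)
    return x + h                              # shortcut connection

def downsample(x):
    """Max pooling over time with window 3 and stride 2."""
    return np.stack([x[i:i + 3].max(axis=0) for i in range(0, len(x) - 2, 2)])

def dpcnn_document_vector(x, blocks):
    x = conv_block(x, blocks[0])
    lengths = [len(x)]
    for layers in blocks[1:]:
        if len(x) < 3:                        # too short to pool again
            break
        x = conv_block(downsample(x), layers)
        lengths.append(len(x))
    return x.max(axis=0), lengths             # final max pooling over time

rng = np.random.default_rng(0)
T, C = 16, 4
blocks = [[(rng.normal(size=(3, C, C)) * 0.1, np.zeros(C)) for _ in range(2)]
          for _ in range(4)]
region_emb = rng.normal(size=(T, C))          # stand-in for text region embeddings

doc_vec, lengths = dpcnn_document_vector(region_emb, blocks)
print(lengths)   # sequence length after each block
```

Because every pooling step roughly halves the sequence length while the channel count stays fixed, each deeper block costs about half as much as the one before it, which is why depth adds little overhead.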

Code reference:…

Figure 7: Schematic diagram of DPCNN model structure

Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm (EMNLP 2017)

Felbo et al. used millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. They proposed the DeepMoji model and achieved competitive results. The DeepMoji model can also achieve good results on text classification tasks.

The DeepMoji model first uses an embedding layer to map words to vectors, and squashes each embedding dimension into [-1, 1] with a hyperbolic tangent (tanh) function. The authors then use a two-layer Bi-LSTM to capture context features. They then propose a new attention mechanism that takes the embedding layer and the Bi-LSTM layers as inputs to obtain the vector representation of the document. Finally, this vector is fed into a softmax layer to obtain the probability distribution over labels.

Code reference:…

Figure 8: Schematic diagram of DeepMoji model structure

Investigating Capsule Networks with Dynamic Routing for Text Classification (EMNLP 2018)

Zhao et al. proposed a text classification model based on capsule networks, and improved the dynamic routing of Sabour et al. with three strategies for stabilizing the routing process. The model is as follows:

The model first uses a standard convolutional network to extract local semantic representations of the sentence through multiple convolution filters. The scalar outputs of the CNN are then replaced with vector-output capsules to build the primary capsule layer. These capsules are passed through the improved dynamic routing proposed by the authors (routing with a shared mechanism and routing with a non-shared mechanism) to obtain the convolutional capsule layer. Finally, the convolutional capsule layer is flattened and sent to the fully connected capsule layer, in which each capsule represents the probability of one category.
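
For readers unfamiliar with routing, the sketch below implements the basic routing-by-agreement procedure of Sabour et al. in NumPy. It is not the improved routing of Zhao et al.; it only shows the baseline that their variants build on. The prediction vectors `u_hat` are random stand-ins, and the sizes and iteration count are assumptions.

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def squash(s):
    """Shrink each vector so that its length lies in [0, 1), keeping its direction."""
    norm2 = (s ** 2).sum(axis=-1, keepdims=True)
    return s * norm2 / ((1.0 + norm2) * np.sqrt(norm2 + 1e-9))

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_child, n_parent, dim) predictions from child capsules for each parent."""
    n_child, n_parent, _ = u_hat.shape
    b = np.zeros((n_child, n_parent))                    # routing logits
    for _ in range(n_iter):
        c = softmax(b, axis=1)                           # each child distributes over parents
        s = (c[:, :, None] * u_hat).sum(axis=0)          # weighted sum per parent
        v = squash(s)                                    # parent capsule outputs
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)     # increase logits where they agree
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 3, 4))                       # 8 child capsules, 3 classes, dim 4
v, c = dynamic_routing(u_hat)
class_scores = np.linalg.norm(v, axis=-1)                # capsule length as class score
print(class_scores)
```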


Code reference:…


Figure 9: Capsule network architecture for text classification

Sentiment Analysis by Capsules (WWW 2018)

Wang et al. proposed an RNN-based capsule network for sentiment classification, referred to as RNN-Capsule. (This paper does a good job of visualization.) A schematic diagram of the model structure is shown below.

RNN-Capsule first uses an RNN to capture the context information of the text, and then feeds it into the capsule structure, which consists of three parts: a representation module, a probability module, and a reconstruction module. Specifically, an attention mechanism is used to compute the capsule representation. The capsule representation is then used to compute the capsule's state probability. Finally, the capsule representation and the state probability are used to reconstruct the representation of the input instance.
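
The three modules can be sketched for a single capsule as follows. This is only an illustration of the description above: the dot-product attention, the sigmoid probability, and the reconstruction as probability times representation are assumptions about the exact formulas, random values stand in for the RNN hidden states, and all names are hypothetical.

```python
import numpy as np

def capsule(H, w_att, w_prob, b_prob):
    """H: (T, dim) RNN hidden states. Returns representation, probability, reconstruction."""
    scores = H @ w_att
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                          # attention weights over time steps
    v = alpha @ H                                        # representation module
    p = 1.0 / (1.0 + np.exp(-(v @ w_prob + b_prob)))     # probability module
    r = p * v                                            # reconstruction module
    return v, p, r

rng = np.random.default_rng(0)
T, dim = 7, 5
H = rng.normal(size=(T, dim))                            # stand-in for RNN hidden states

# One capsule per sentiment category (here: two categories).
params = [(rng.normal(size=dim), rng.normal(size=dim), 0.0) for _ in range(2)]
outputs = [capsule(H, *prm) for prm in params]
state_probs = np.array([p for _, p, _ in outputs])
predicted = int(state_probs.argmax())                    # most probable capsule wins
print(state_probs, predicted)
```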

Figure 10: Structure diagram of RNN-capsule model

Graph Convolutional Networks for Text Classification (AAAI 2019)

Yao et al. proposed a text classification method based on graph convolutional networks (GCN). The authors build a large heterogeneous text graph containing word nodes and document nodes, explicitly model global word co-occurrence information, and then treat text classification as a node classification problem.
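
The idea of classifying document nodes in a word-document graph can be sketched with a two-layer GCN in NumPy. This is a minimal sketch under stated assumptions: the toy graph has two documents and three words, the edge weights are made-up numbers standing in for document-word and word-word statistics, node features are one-hot, and the weights are random.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^(-1/2) A D^(-1/2); A already contains self-loops."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A_hat, X, W0, W1):
    H = np.maximum(A_hat @ X @ W0, 0.0)                  # first layer with ReLU
    logits = A_hat @ H @ W1                              # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)              # softmax per node

# Nodes 0-1 are documents, nodes 2-4 are words (weights are illustrative only).
A = np.eye(5)                                            # self-loops
edges = {(0, 2): 0.5, (0, 3): 0.8, (1, 3): 0.3, (1, 4): 0.9, (2, 3): 0.2}
for (i, j), w in edges.items():
    A[i, j] = A[j, i] = w

rng = np.random.default_rng(0)
A_hat = normalize_adjacency(A)
X = np.eye(5)                                            # one-hot node features
W0 = rng.normal(size=(5, 4)) * 0.5
W1 = rng.normal(size=(4, 2)) * 0.5

node_probs = gcn_forward(A_hat, X, W0, W1)
doc_probs = node_probs[:2]                               # predictions for the document nodes
print(doc_probs)
```

With two layers, information can travel from one document to another through a shared word node, even though there are no direct document-document edges.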

Code reference:…

Figure 11: Model structure of Text GCN

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019)

The BERT model proposed by Google overcomes the inability of static word vectors to handle polysemy. BERT produces dynamic, context-dependent word vectors based on a language model, and achieved the best results on many natural language processing tasks. By fine-tuning the BERT model, the author has achieved very competitive performance on text classification in many domains, such as legal text and sentiment.

BERT’s model architecture is a multi-layer bidirectional Transformer encoder (see “Attention Is All You Need”). The authors use two sets of parameters to produce the BERT_BASE and BERT_LARGE models respectively (refer to the original paper for details). All downstream tasks can be fine-tuned on either of these two models.

Code reference:…

Figure 12: BERT’s pre-training structure and fine-tuning structure
