Understanding Convolutional Neural Networks for NLP

When we hear about Convolutional Neural Networks (CNNs), we typically think of computer vision. CNNs were responsible for major breakthroughs in image classification and are at the core of most computer vision systems today, from Facebook’s automated photo tagging to self-driving cars.

More recently, we have also started applying CNNs to problems in natural language processing and gotten some interesting results. In this article I will try to summarize what CNNs are and how they are used in NLP. The intuitions behind CNNs are somewhat easier to understand for the computer vision use case, so I’ll start there and then slowly work my way toward NLP.

 

What is convolution?

The easiest way to understand a convolution is to think of it as a sliding window function applied to a matrix. That’s a mouthful, but it becomes quite clear when you look at a visualization:

Convolution with a 3×3 filter. Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

Imagine that the matrix on the left represents a black and white image. Each entry corresponds to one pixel, 0 for black and 1 for white (typically values are between 0 and 255 for grayscale images). The sliding window is called a kernel, filter, or feature detector. Here we use a 3×3 filter, multiply its values element-wise with the original matrix, and then sum them up. To get the full convolution, we do this for each element by sliding the filter over the whole matrix.
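To make the sliding-window idea concrete, here is a minimal NumPy sketch (the function name and shapes are my own, not from the post) of that computation:

```python
import numpy as np

def convolve2d_narrow(matrix, kernel):
    """Slide `kernel` over `matrix`, multiply element-wise, and sum each patch."""
    kh, kw = kernel.shape
    mh, mw = matrix.shape
    out = np.zeros((mh - kh + 1, mw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = matrix[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out
```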

You may be wondering what you can actually do with this. Here are some intuitive examples.

Averaging each pixel with its neighboring values blurs the image.

Taking the difference between a pixel and its neighbors detects edges.

(To understand this one intuitively, think about what happens in parts of the image that are smooth, where a pixel color equals that of its neighbors: the additions cancel and the resulting value is 0, or black. If there is a sharp edge in intensity, a transition from white to black for example, you get a large difference and a resulting white value.)
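To try both filters out, here is a small sketch (assuming NumPy and SciPy are installed; the toy image and kernel values are illustrative, not taken from the GIMP examples):

```python
import numpy as np
from scipy.signal import convolve2d

# A toy 6x6 grayscale "image": a white square on a black background
image = np.zeros((6, 6))
image[2:4, 2:4] = 1.0

# Blur: average each pixel with its 8 neighbors
blur_kernel = np.ones((3, 3)) / 9.0

# Edge detection: weights sum to zero, so smooth regions become 0 (black),
# while sharp transitions produce large (white) values
edge_kernel = np.array([[-1.0, -1.0, -1.0],
                        [-1.0,  8.0, -1.0],
                        [-1.0, -1.0, -1.0]])

blurred = convolve2d(image, blur_kernel, mode="same")
edges = convolve2d(image, edge_kernel, mode="same")
```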


There are some other examples in the GIMP manual. To learn more about how convolution works, I also recommend checking out Chris Olah’s post on the subject.

What is a convolutional neural network?

Now you know what convolutions are. But what about CNNs? CNNs are basically just several layers of convolutions with nonlinear activation functions like ReLU or tanh applied to the results. In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. That is also called a fully connected, or affine, layer. In CNNs we don’t do that. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically hundreds or thousands as shown above, and combines their results. There are also so-called pooling (subsampling) layers, but I’ll get to that later.

During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform. For example, in image classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is then a classifier that uses these high-level features.
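Continuing the NumPy sketch from earlier (reusing the hypothetical convolve2d_narrow helper and its numpy import), a single convolutional layer is then just each filter’s convolution followed by a nonlinearity, ReLU here:

```python
def conv_layer(matrix, filters):
    """Apply each filter plus a ReLU nonlinearity, yielding one feature map per filter."""
    return [np.maximum(convolve2d_narrow(matrix, f), 0.0) for f in filters]
```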

There are two aspects of this computation worth paying attention to: location invariance and compositionality. Let’s say you want to classify whether or not there is an elephant in an image. Because you are sliding your filters over the whole image, you don’t really care where the elephant occurs. In practice, pooling also gives you invariance to translation, rotation, and scaling, but more on that later. The second key aspect is (local) compositionality. Each filter composes a local patch of lower-level features into a higher-level representation. That’s what makes CNNs so powerful in computer vision. It makes intuitive sense that you build edges from pixels, shapes from edges, and more complex objects from shapes.

So how does this apply to NLP?

Instead of image pixels, the input to most NLP tasks consists of sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could also be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding, we would have a 10×100 matrix as our input. That’s our “image”.
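As a sketch, with a made-up sentence and a random embedding table standing in for word2vec or GloVe lookups, the input matrix could be built like this:

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = "i like this movie very much it is really great".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

embedding_dim = 100
# Random embedding table as a stand-in for pre-trained word2vec/GloVe vectors
embeddings = rng.normal(size=(len(vocab), embedding_dim))

# Each row of the input matrix is the embedding of one token
sentence_matrix = np.stack([embeddings[vocab[w]] for w in sentence])
print(sentence_matrix.shape)  # (10, 100)
```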

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2-5 words at a time are typical. Putting all of the above together, a convolutional neural network for NLP may look like this (take a few minutes to try to understand this picture and how the dimensions are computed; you can ignore pooling for now, we’ll explain that later):

Example of a convolutional neural network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3, and 4, each of which has 2 filters. Each filter performs a convolution on the sentence matrix and generates a (variable-length) feature map. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus, a univariate feature vector is generated from all six maps, and these six features are concatenated to form the penultimate feature vector. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification, hence the two possible output states. Source: Zhang, Y., & Wallace, B. (2015).
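As a rough illustration (not the authors’ code), a minimal PyTorch sketch of this architecture, with region sizes 2, 3, and 4, 2 filters each, and 1-max pooling as in the figure, might look like this; the class and variable names are my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, embedding_dim=5, num_classes=2):
        super().__init__()
        # One Conv2d per filter region size; each produces 2 feature maps.
        # Kernel width equals the embedding dimension, so filters span full rows.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, 2, (region, embedding_dim)) for region in (2, 3, 4)
        ])
        self.fc = nn.Linear(6, num_classes)  # 3 region sizes x 2 filters = 6 features

    def forward(self, x):            # x: (batch, sent_len, embedding_dim)
        x = x.unsqueeze(1)           # add a channel dim: (batch, 1, sent_len, emb)
        pooled = []
        for conv in self.convs:
            fmap = F.relu(conv(x)).squeeze(3)   # (batch, 2, sent_len - region + 1)
            # 1-max pooling: keep only the largest value of each feature map
            pooled.append(F.max_pool1d(fmap, fmap.size(2)).squeeze(2))
        features = torch.cat(pooled, dim=1)     # (batch, 6)
        return self.fc(features)                # softmax is applied in the loss

model = SentenceCNN()
logits = model(torch.randn(1, 7, 5))  # a 7-word sentence with 5-dim embeddings
```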

What about the nice intuitions we had for computer vision? Location invariance and local compositionality made intuitive sense for images, but less so for NLP. You probably do care a lot about where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same isn’t always true for words. In many languages, parts of phrases can be separated by several other words. Compositionality isn’t obvious either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works, and what higher-level representations actually “mean”, isn’t as obvious as in the computer vision case.

Given all this, it seems like CNNs wouldn’t be a good fit for NLP tasks. Recurrent neural networks make more intuitive sense. They resemble how we process language (or at least how we think we process language): reading sequentially from left to right. Fortunately, this doesn’t mean that CNNs don’t work. All models are wrong, but some are useful. It turns out that CNNs applied to NLP problems perform quite well. The simple bag-of-words model is an obvious oversimplification with incorrect assumptions, but it was nonetheless the standard approach for years and led to pretty good results.

A big argument for CNNs is that they are fast. Very fast. Convolutions are a core part of computer graphics and are implemented at the hardware level on GPUs. Compared to something like n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything more than 3-grams can quickly become expensive. Even Google doesn’t provide anything beyond 5-grams. Convolutional filters learn good representations automatically, without needing to represent the whole vocabulary. It’s completely reasonable to have filters of size larger than 5. I like to think of many of the learned filters in the first layer as capturing features quite similar (but not limited) to n-grams, but representing them in a more compact way.

CNN hyperparameters

Before explaining how CNNs are applied to NLP tasks, let’s look at some of the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in the field.

Narrow vs. wide convolution

When I explained convolutions above, I neglected a little detail of how we apply the filter. Applying a 3×3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the first element of a matrix that doesn’t have any neighboring elements to the top and left? You can use zero-padding: all elements that would fall outside of the matrix are taken to be zero. By doing this you can apply the filter to every element of your input matrix and get a larger or equally sized output. Adding zero-padding is also called wide convolution, and not using zero-padding is a narrow convolution. Here is an example in 1D:

Narrow vs. wide convolution. Filter size 5, input size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)

You can see how wide convolution is useful, or even necessary, when you have a large filter relative to the input size. In the example above, narrow convolution yields an output of size (7 − 5) + 1 = 3, and wide convolution an output of size (7 + 2·4 − 5) + 1 = 11. More generally, the formula for the output size is n_out = (n_in + 2·n_padding − n_filter) + 1.
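In 1D this is easy to verify with NumPy’s convolve, whose “valid” and “full” modes correspond to narrow and wide convolution:

```python
import numpy as np

x = np.arange(7.0)  # input of size 7
w = np.ones(5)      # filter of size 5

print(np.convolve(x, w, mode="valid").shape)  # (3,)  narrow: (7 - 5) + 1
print(np.convolve(x, w, mode="full").shape)   # (11,) wide:   (7 + 2*4 - 5) + 1
```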

Stride size

Another hyperparameter for your convolutions is the stride size, defining by how much you want to shift your filter at each step. In all the examples above the stride size was 1, and consecutive applications of the filter overlapped. A larger stride size leads to fewer applications of the filter and a smaller output size. The following, from the Stanford CS231 website, shows stride sizes of 1 and 2 applied to a one-dimensional input:

Convolution stride size. Left: stride size 1. Right: stride size 2. Source: http://cs231n.github.io/convolutional-networks/

In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a model that behaves somewhat similarly to a recursive neural network, i.e. looks like a tree.
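A quick sketch using PyTorch’s F.conv1d shows how the stride shrinks the output:

```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 7)  # 1D input of size 7
w = torch.ones(1, 1, 3)  # filter of size 3

print(F.conv1d(x, w, stride=1).shape)  # torch.Size([1, 1, 5]) - overlapping applications
print(F.conv1d(x, w, stride=2).shape)  # torch.Size([1, 1, 3]) - fewer, smaller output
```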

Pooling layer

A key aspect of convolutional neural networks is the pooling layer, typically applied after the convolutional layers. Pooling layers subsample their input. The most common way to pool is to apply a max operation to the result of each filter. You don’t necessarily need to pool over the complete matrix; you could also pool over a window. For example, the following shows max pooling for a 2×2 window (in NLP we typically apply pooling over the complete output, yielding just a single number for each filter):

Max pooling in a CNN. Source: http://cs231n.github.io/convolutional-networks/#pool

Why pooling? There are a couple of reasons. One property of pooling is that it provides a fixed-size output matrix, which is typically required for classification. For example, if you have 1,000 filters and you apply max pooling to each, you will get a 1000-dimensional output, regardless of the size of your filters or the size of your input. This allows you to use variable-size sentences and variable-size filters, but always get the same output dimensions to feed into a classifier.

Pooling also reduces the output dimensionality while (hopefully) keeping the most salient information. You can think of each filter as detecting a specific feature, such as whether the sentence contains a negation like “not amazing”. If this phrase occurs somewhere in the sentence, the result of applying the filter to that region will yield a large value, and a small value in other regions. By performing the max operation you keep information about whether or not the feature appeared in the sentence, but you lose information about where exactly it appeared. But isn’t this information about locality really useful? Yes, it is, and it’s a bit similar to what a bag of n-grams model is doing: you lose global information about locality (where in the sentence something happens), but keep the local information captured by your filters.
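In code, NLP-style 1-max pooling just takes the maximum of each feature map (a sketch with made-up values):

```python
import numpy as np

# Suppose three filters each produced a feature map over a 6-word sentence
feature_maps = np.array([[0.1, 0.9, 0.3, 0.2, 0.1, 0.0],
                         [0.0, 0.2, 0.1, 0.8, 0.3, 0.1],
                         [0.4, 0.1, 0.0, 0.2, 0.7, 0.2]])

# 1-max pooling: one number per filter, regardless of sentence length
pooled = feature_maps.max(axis=1)  # array([0.9, 0.8, 0.7])
```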

In image recognition, pooling also provides basic invariance to translation (shifting) and rotation. When you are pooling over a region, the output will stay approximately the same even if you shift or rotate the image by a few pixels, because the max operation will pick out the same value regardless.

Channels

The last concept we need to understand is channels. Channels are different “views” of your input data. For example, in image recognition you typically have RGB (red, green, blue) channels. You can apply convolutions across channels, either with different or equal weights. In NLP you could imagine having various channels as well: you could have a separate channel for different word embeddings (word2vec and GloVe, for example), or you could have a channel for the same sentence represented in different languages, or phrased in different ways.
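As a sketch, two embedding channels for the same sentence can simply be stacked along a new leading axis, the same way RGB channels are stacked for an image (random matrices stand in for word2vec and GloVe lookups here):

```python
import numpy as np

rng = np.random.default_rng(1)
sent_len, emb_dim = 10, 100
word2vec_matrix = rng.normal(size=(sent_len, emb_dim))  # stand-in for word2vec rows
glove_matrix = rng.normal(size=(sent_len, emb_dim))     # stand-in for GloVe rows

# Shape (2, 10, 100): channels first, like an RGB image's color channels
sentence = np.stack([word2vec_matrix, glove_matrix])
```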

Applying convolutional neural networks to NLP

Let’s now look at some applications of CNNs to natural language processing. I’ll try to summarize some of the research results. Invariably, I’ll miss many interesting applications (do let me know in the comments), but I hope to cover at least some of the more popular results.

The most natural fit for CNNs seems to be classification tasks, such as sentiment analysis, spam detection, or topic categorization. Convolutions and pooling operations lose information about the local order of words, so sequence tagging, as in PoS tagging or entity extraction, is a bit harder to fit into a pure CNN architecture (though not impossible: you can add positional features to the input).

[1] evaluates a CNN architecture on various classification datasets, mostly comprised of sentiment analysis and topic categorization tasks. The CNN architecture achieves very good performance across datasets and sets a new state of the art on a few. Surprisingly, the network used in this paper is quite simple, and that’s what makes it powerful. The input layer is a sentence comprised of concatenated word2vec word embeddings. That’s followed by a convolutional layer with multiple filters, then a max-pooling layer, and finally a softmax classifier. The paper also experiments with two different channels in the form of static and dynamic word embeddings, where one channel is adjusted during training and the other isn’t. A similar, but somewhat more complex, architecture was previously proposed in [2]. [6] adds an additional layer that performs “semantic clustering” to this network architecture.

 

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification

[4] trains a CNN from scratch, without the need for pre-trained word vectors like word2vec or GloVe. It applies convolutions directly to one-hot vectors. The authors also propose a space-efficient bag-of-words-like representation for the input data, reducing the number of parameters the network needs to learn. In [5] the authors extend the model with an additional unsupervised “region embedding” that is learned by using a CNN to predict the context of text regions. The approach in these papers seems to work well for longer texts (like movie reviews), but it is not clear how they perform on shorter texts (like tweets). Intuitively, it makes sense that using pre-trained word embeddings for short texts would yield larger gains than using them for long texts.

Building a CNN architecture means there are many hyperparameters to choose from, some of which I presented above: input representations (word2vec, GloVe, one-hot), the number and sizes of convolution filters, pooling strategies (max, average), and activation functions (ReLU, tanh). [7] performs an empirical evaluation of the effect of varying hyperparameters in CNN architectures, investigating their impact on performance and the variance over multiple runs. If you are looking to implement your own CNN for text classification, using the results of this paper as a starting point would be an excellent idea. A few results that stand out are that max pooling always beat average pooling, that the ideal filter sizes are important but task-dependent, and that regularization doesn’t seem to make a big difference in the NLP tasks that were considered.

[8] explores CNNs for relation extraction and relation classification tasks. In addition to the word vectors, the authors use the relative positions of words to the entities of interest as an input to the convolutional layer. The model assumes that the positions of the entities are given, and that each example input contains one relation. [9] and [10] explore similar models.

Another interesting use case of CNNs in NLP can be found in [11] and [12], coming out of Microsoft Research. These papers describe how to learn semantically meaningful representations of sentences that can be used for information retrieval. The example given in the papers includes recommending potentially interesting documents to users based on what they are currently reading. The sentence representations are trained based on search engine log data.

Most CNN architectures learn embeddings (low-dimensional representations) for words and sentences in one way or another as part of their training procedure. Not all papers focus on this aspect of training or investigate how meaningful the learned embeddings are. [13] presents a CNN architecture to predict hashtags for Facebook posts, while at the same time generating meaningful embeddings for words and sentences. These learned embeddings are then successfully applied to another task: recommending potentially interesting documents to users, trained based on clickstream data.

Character-level CNNs

So far, all of the models presented were based on words. But there is also research applying CNNs directly to characters. [14] learns character-level embeddings, joins them with pre-trained word embeddings, and uses a CNN for part-of-speech tagging. [15] and [16] explore the use of CNNs to learn directly from characters, without the need for any pre-trained embeddings. Notably, the authors use a relatively deep network with a total of 9 layers and apply it to sentiment analysis and text categorization tasks. Results show that learning directly from character-level input works very well on large datasets (millions of examples), but underperforms on smaller datasets (hundreds of thousands of examples). [17] explores the application of character-level convolutions to language modeling, using the output of a character-level CNN as the input to an LSTM at each time step. The same model is applied to various languages.

Somewhat surprisingly, almost all of the papers above were published in the past 1-2 years. Obviously there has been excellent work with CNNs on NLP before, as in Natural Language Processing (Almost) from Scratch, but the pace of new results and state-of-the-art systems being published is clearly accelerating.

Questions or feedback? Let me know in the comments. Thanks for reading!

 

Paper references

 

  • [1] Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 1746–1751.
  • [2] Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. ACL 2014, 655–665.
  • [3] Santos, C. N. dos, & Gatti, M. (2014). Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. COLING 2014, 69–78.
  • [4] Johnson, R., & Zhang, T. (2015). Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. NAACL 2015.
  • [5] Johnson, R., & Zhang, T. (2015). Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding.
  • [6] Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings of ACL 2015, 352–357.
  • [7] Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
  • [8] Nguyen, T. H., & Grishman, R. (2015). Relation Extraction: Perspective from Convolutional Neural Networks. Workshop on Vector Modeling for NLP, 39–48.
  • [9] Sun, Y., Lin, L., Tang, D., Yang, N., Ji, Z., & Wang, X. (2015). Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation. IJCAI 2015, 1333–1339.
  • [10] Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014). Relation Classification via Convolutional Deep Neural Network. COLING 2014, 2335–2344.
  • [11] Gao, J., Pantel, P., Gamon, M., He, X., & Deng, L. (2014). Modeling Interestingness with Deep Neural Networks.
  • [12] Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM ’14), 101–110.
  • [13] Weston, J., Chopra, S., & Adams, K. (2014). #TagSpace: Semantic Embeddings from Hashtags. EMNLP 2014, 1822–1827.
  • [14] Santos, C. N. dos, & Zadrozny, B. (2014). Learning Character-level Representations for Part-of-Speech Tagging. Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1818–1826.
  • [15] Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification, 1–9.
  • [16] Zhang, X., & LeCun, Y. (2015). Text Understanding from Scratch. arXiv preprint arXiv:1502.01710.
  • [17] Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2015). Character-Aware Neural Language Models.