A CNN (convolutional neural network) is most commonly used in the image domain, where it achieves excellent results in image classification. In 2014, Yoon Kim applied the ideas behind CNNs to text processing in his paper "Convolutional Neural Networks for Sentence Classification". Much subsequent work using ConvNets for NLP tasks builds on this paper.

1. CNN text classification model

This article is mainly a walkthrough of the original paper "Convolutional Neural Networks for Sentence Classification". First we look at the overall structure of the CNN text classification model; the model diagram from the paper is shown in the figure below.

As can be seen, the model is relatively simple. It consists of four parts: an input layer, a convolution layer, a max-pooling layer, and a fully connected layer.

1.1 Input layer

Given a sentence, the input layer receives a word vector matrix X of shape (n × k), where n is the number of words in the sentence and k is the dimension of the word vectors. Each row of X is one word vector, and the rows are arranged in the order the words appear in the sentence.
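As a minimal sketch of how such an input matrix is assembled (the vocabulary and the random embedding values here are made up for illustration; real models would use learned or pre-trained embeddings):

```python
import numpy as np

# Toy embedding table: word -> k-dimensional vector (k = 4 here).
# Values are random placeholders standing in for real embeddings.
k = 4
vocab = ["the", "movie", "was", "great"]
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(k) for w in vocab}

sentence = ["the", "movie", "was", "great"]
# Stack the word vectors in sentence order: X has shape (n, k).
X = np.stack([embeddings[w] for w in sentence])
assert X.shape == (len(sentence), k)
```

Each row of `X` is one word's vector, so the matrix grows downward as the sentence gets longer while its width stays fixed at k.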

1.2 Convolution layer

After obtaining the input word vector matrix X, we convolve X with a kernel. The convolution kernel is an (h × k) matrix; note that its number of columns is fixed at k, the word vector dimension. The kernel keeps sliding down the sentence to produce the convolution values.

Because the kernel has the same number of columns as the word vector matrix X, the feature map produced by one kernel has only a single column. Multiple convolution kernels generate multiple such columns; the output of four convolution kernels is shown in the figure below.
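This narrow convolution can be sketched directly: an (h × k) kernel slides down an (n × k) matrix, each position contributing one scalar, so one kernel yields a column of length n − h + 1 (shapes and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, h = 7, 4, 3          # sentence length, embedding dim, kernel height
X = rng.standard_normal((n, k))

def convolve(X, W, b=0.0):
    # Slide an (h, k) kernel down the sentence; each position yields one
    # scalar, so the feature map is a single column of length n - h + 1.
    h = W.shape[0]
    return np.array([np.sum(X[i:i + h] * W) + b
                     for i in range(X.shape[0] - h + 1)])

# Four kernels -> four columns, one feature map per kernel.
kernels = [rng.standard_normal((h, k)) for _ in range(4)]
feature_maps = np.column_stack([convolve(X, W) for W in kernels])
assert feature_maps.shape == (n - h + 1, 4)
```

In the paper a nonlinearity (e.g. ReLU or tanh) is applied to these values; it is omitted here to keep the shape logic in focus.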

1.3 Max-pooling layer

In the convolution layer, multiple different convolution kernels generate multiple columns of feature maps. Max pooling takes the maximum value of each column, eventually forming a single one-dimensional vector.
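Max-over-time pooling reduces each feature-map column to its single largest value, so the pooled vector has one entry per kernel regardless of sentence length. A tiny hand-made example (values are arbitrary):

```python
import numpy as np

# Two feature maps (columns), e.g. from two convolution kernels.
feature_maps = np.array([[0.1, -2.0],
                         [1.5,  0.3],
                         [-0.4, 0.9]])

# Max pooling: keep the maximum of each column.
pooled = feature_maps.max(axis=0)
assert pooled.tolist() == [1.5, 0.9]
```

Because pooling collapses the time dimension, sentences of different lengths all produce fixed-size vectors, which is what lets the fully connected layer that follows have a fixed input size.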

1.4 Fully connected layer

The final layer is the fully connected layer, which applies Dropout to prevent overfitting and then uses Softmax for classification.
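A sketch of this last step, assuming a pooled feature vector of length 6 and 2 output classes (all weights and values are illustrative, and the inverted-dropout scaling shown is one common way to implement Dropout):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(6)          # pooled vector: one value per kernel
num_classes = 2
W = rng.standard_normal((num_classes, 6))
b = np.zeros(num_classes)

# Dropout (training only): randomly zero features, rescale so the
# expected activation is unchanged (inverted dropout).
p = 0.5
mask = rng.random(z.shape) >= p
z_drop = z * mask / (1 - p)

# Linear layer followed by a numerically stable softmax.
logits = W @ z_drop + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
assert abs(probs.sum() - 1.0) < 1e-9
```

At test time the dropout mask is not applied; the full vector `z` is fed to the linear layer directly.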

2. Other details of the model

2.1 Use of multiple convolution kernels of different sizes

We have just seen that the number of columns of the convolution kernel must be k, but its height (number of rows) can vary. Kernels of several different heights can be used together in the convolution operation, as shown in the figure below.
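The key point is that max pooling makes the differently sized outputs compatible: a kernel of height h yields a column of length n − h + 1, but pooling reduces every column to one scalar, so the pooled outputs can simply be concatenated. A sketch with kernel heights 3, 4 and 5 and two kernels per height (counts and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 4
X = rng.standard_normal((n, k))

def conv_column(X, W):
    # One (h, k) kernel sliding down X -> column of length n - h + 1.
    h = W.shape[0]
    return np.array([np.sum(X[i:i + h] * W)
                     for i in range(X.shape[0] - h + 1)])

pooled = []
for h in (3, 4, 5):
    for _ in range(2):                       # two kernels per height
        W = rng.standard_normal((h, k))
        pooled.append(conv_column(X, W).max())  # max pool each feature map

# Six kernels in total -> a fixed-length vector of six pooled values,
# even though the raw feature maps had lengths 8, 7 and 6.
assert len(pooled) == 6
```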

2.2 Variations of the model

In the paper, the author also proposed four variants of the model:

  • CNN-rand: the word vectors are randomly initialized and updated continuously during training.
  • CNN-static: uses pre-trained word vectors, such as Word2Vec, and keeps them unchanged during training.
  • CNN-non-static: like CNN-static, uses pre-trained word vectors, but continues to fine-tune them during training.
  • CNN-multichannel: uses two sets of word vectors, both initialized with Word2Vec. One set stays unchanged during training while the other is fine-tuned, so each input sentence gets two different word vector representations. This is analogous to the RGB channels of an image, with the two channels corresponding to the two sets of word vectors.

The following figure shows the results of the four models. CNN-static outperforms CNN-rand, indicating that pre-trained word vectors are more effective. In addition, CNN-non-static outperforms CNN-static, indicating that fine-tuning makes the word vectors more suitable for the current task. In the experiments, CNN-multichannel usually works better on small datasets.

3. Summary of CNN text classification

In NLP tasks, a CNN behaves much like an n-gram model: if we set the kernel height to 3, the kernel effectively performs a 3-gram operation over the sentence. At the same time, CNNs are highly efficient, running much faster than traditional n-gram methods.
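The n-gram analogy can be made concrete: the positions visited by a kernel of height 3 are exactly the sentence's 3-word windows (the example sentence is made up):

```python
# Each position of a height-3 kernel covers one 3-word window,
# i.e. one 3-gram of the sentence.
sentence = ["a", "convolution", "kernel", "of", "height", "3", "slides"]
h = 3
windows = [sentence[i:i + h] for i in range(len(sentence) - h + 1)]

assert len(windows) == len(sentence) - h + 1
assert windows[0] == ["a", "convolution", "kernel"]
```

The difference is that the CNN scores each window with a learned dense kernel instead of counting discrete n-grams, which is why it generalizes across similar word vectors.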

The width of the convolution kernel must equal the dimension of the word vectors, and the kernel moves only along the word direction of the sentence.

4. References

  • Convolutional Neural Networks for Sentence Classification
  • Understanding Convolutional Neural Networks for NLP