1. Getting to know TextCNN

In recent research on life prediction, the data obtained are one-dimensional. Traditional data preprocessing methods mainly include PCA, LDA, LLE, and so on; applying a CNN for feature extraction instead can improve prediction accuracy. However, the CNNs we have studied so far are mostly applied in image processing, where the input is two-dimensional or multidimensional data. We therefore need to learn about TextCNN, which applies CNNs to text classification. The next article will introduce several concrete applications of CNNs drawn from journal papers, focusing mainly on the network structure of each model.

The TextCNN model was proposed by Yoon Kim in Convolutional Neural Networks for Sentence Classification (2014); it applies a convolutional neural network (CNN) to the NLP problem of text classification. The algorithm uses several kernels of different sizes to extract key information from a sentence, allowing it to capture important features more effectively and achieve better classification results.

2. TextCNN structure

The structure of the model is shown below.

The detailed process of TextCNN is as follows (taking one sentence as an example):

(1) Input: the input is a single natural-language sentence, for example: Wait for the video and don't rent it.

(2) Data preprocessing: first, split the sentence into multiple words; for example, the sentence above splits into 9 tokens: wait, for, the, video, and, do, n't, rent, it. Then convert each word into a number, namely the word's index in the dictionary.
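
A minimal sketch of this step, using a toy input and the Keras Tokenizer that the preprocessing code in section 3 also relies on (the printed indices are illustrative):

from keras.preprocessing.text import Tokenizer

# Toy illustration: map each token to its positive-integer index in the dictionary
tokens = [["wait", "for", "the", "video", "and", "do", "n't", "rent", "it"]]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokens)               # build the word -> index dictionary
print(tokenizer.word_index)                  # e.g. {'wait': 1, 'for': 2, 'the': 3, ...}
print(tokenizer.texts_to_sequences(tokens))  # e.g. [[1, 2, 3, 4, 5, 6, 7, 8, 9]]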

(3) Embedding layer: each word is mapped into a low-dimensional space by an embedding such as word2vec or GloVe. The embedding is essentially a feature extractor that encodes semantic features in the specified number of dimensions. For example, if each word is represented by a vector of length 6, wait can be represented as [1,0,0,0,0,0], and so on; the whole sentence can then be represented by a 9*6 two-dimensional matrix.
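
A minimal sketch of the lookup, with a made-up random 6-dimensional embedding table standing in for real word2vec/GloVe vectors:

import numpy as np

# Hypothetical embedding table: 10 vocabulary entries, 6 dimensions each
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(10, 6))

sentence_ids = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])  # toy indices for the 9 words
sentence_matrix = embedding_matrix[sentence_ids]       # look up each word's vector
print(sentence_matrix.shape)  # (9, 6): the 9*6 sentence matrix described above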

(4) Convolution layer: unlike the convolution kernels used in image processing, text expressed as word vectors is one-dimensional data, so the convolution in TextCNN is a one-dimensional convolution. The width of a TextCNN convolution kernel is fixed to the dimension of the word vectors, while the height can be set freely. Taking convolution kernel sizes [2, 3] as an example, since two convolution kernels are set, two feature vectors are obtained, of sizes T1: 8*1 and T2: 7*1 respectively. The vector sizes are computed as (9 - 2 + 1) = 8 and (9 - 3 + 1) = 7 respectively, that is, (sentence length - convolution kernel size + 1).
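
The size formula can be checked with a short sketch (same Keras API as the code in section 3). Note padding='valid' here, since the (length - kernel + 1) formula assumes no padding, whereas the model below uses padding='same':

import numpy as np
from keras.layers import Input, Conv1D
from keras.models import Model

inp = Input(shape=(9, 6))                # 9 words, 6-dim word vectors
t1 = Conv1D(1, 2, padding='valid')(inp)  # kernel height 2 -> 9 - 2 + 1 = 8
t2 = Conv1D(1, 3, padding='valid')(inp)  # kernel height 3 -> 9 - 3 + 1 = 7
model = Model(inp, [t1, t2])
o1, o2 = model.predict(np.random.rand(1, 9, 6))
print(o1.shape, o2.shape)                # (1, 8, 1) (1, 7, 1)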

(5) Pooling layer: after convolution with kernels of different heights, the output vectors have different lengths. 1-max-pooling reduces each feature vector to a single value, that is, it takes the maximum of each feature vector to represent that feature. After pooling, these values are spliced together to form the final feature vector of the pooling layer. The pooling result for this sentence is therefore 2*1.
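
A hand-computed sketch of 1-max pooling over the two feature maps from step (4), with made-up feature values:

import numpy as np

# Hypothetical feature maps produced by the size-2 and size-3 kernels
t1 = np.array([0.2, 0.9, 0.1, 0.4, 0.3, 0.5, 0.8, 0.6])  # 8*1
t2 = np.array([0.7, 0.1, 0.95, 0.2, 0.4, 0.3, 0.6])      # 7*1

pooled = np.array([t1.max(), t2.max()])  # keep only the maximum of each map
print(pooled, pooled.shape)              # [0.9 0.95] (2,): the 2*1 result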

(6) Flatten layer and fully connected layer: as in a standard CNN model, the output of the pooling layer is flattened before being fed into the fully connected layer. To prevent overfitting, dropout is added before the output layer; the output is the predicted text category.

3. Model implementation

(1) Data preprocessing: TextCNN is used here for text classification, so the raw data consist of sentences and their corresponding labels. Preprocessing first segments each sentence into words, then converts each word into a positive integer that represents it, and finally pads or truncates every sentence to the same fixed length, before splitting the data into training and test sets.

import pandas as pd
import numpy as np
import jieba
import keras
from keras.layers.merge import concatenate
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dropout, Dense, Input
from keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn import metrics

def data_process(path, max_len=50):  # max_len is the fixed sentence length
    dataset = pd.read_csv(path, sep='\t', names=['text', 'label']).astype(str)
    cw = lambda x: list(jieba.cut(x))  # segment each sentence into words
    dataset['words'] = dataset['text'].apply(cw)
    tokenizer = Tokenizer()  # create a tokenizer object that converts each word into a positive integer
    tokenizer.fit_on_texts(dataset['words'])
    vocab = tokenizer.word_index  # word -> index dictionary
    x_train, x_test, y_train, y_test = train_test_split(dataset['words'], dataset['label'], test_size=0.1)
    x_train_word_ids = tokenizer.texts_to_sequences(x_train)  # convert each word in the training set to a number
    x_test_word_ids = tokenizer.texts_to_sequences(x_test)    # convert each word in the test set to a number
    x_train_padded_seqs = pad_sequences(x_train_word_ids, maxlen=max_len)  # set each sentence to equal length 50
    x_test_padded_seqs = pad_sequences(x_test_word_ids, maxlen=max_len)    # truncate beyond max_len, pad shorter sentences with 0
    return x_train_padded_seqs, y_train, x_test_padded_seqs, y_test, vocab

(2) Construction of the network structure: the TextCNN network mainly consists of an embedding layer, a convolution layer, a pooling layer, dropout, and a fully connected layer. The basic operations (loss function, model training, and accuracy evaluation) and the construction of the network structure are combined into one function. The code for this part (adapted from https://github.com/Asia-Lee) is as follows:

def TextCNN_model_1(x_train, y_train, x_test, y_test):
    main_input = Input(shape=(50,), dtype='float64')
    # Word embedding (using pre-trained word vectors)
    embedder = Embedding(len(vocab) + 1, 300, input_length=50, trainable=False)
    embed = embedder(main_input)
    # Convolution kernels of sizes 3, 4 and 5
    cnn1 = Conv1D(256, 3, padding='same', strides=1, activation='relu')(embed)
    cnn1 = MaxPooling1D(pool_size=48)(cnn1)
    cnn2 = Conv1D(256, 4, padding='same', strides=1, activation='relu')(embed)
    cnn2 = MaxPooling1D(pool_size=47)(cnn2)
    cnn3 = Conv1D(256, 5, padding='same', strides=1, activation='relu')(embed)
    cnn3 = MaxPooling1D(pool_size=46)(cnn3)
    # Merge the three output vectors
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    flat = Flatten()(cnn)
    # Dropout between the pooling layer and the fully connected layer to prevent overfitting
    drop = Dropout(0.2)(flat)
    main_output = Dense(3, activation='softmax')(drop)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    one_hot_labels = keras.utils.to_categorical(y_train, num_classes=3)  # convert labels to one-hot encoding
    model.fit(x_train, one_hot_labels, batch_size=800, epochs=2)
    result = model.predict(x_test)  # predicted probability of each class for the test set
    result_labels = np.argmax(result, axis=1)  # take the label with the highest probability
    y_predict = list(map(str, result_labels))
    print('Accuracy', metrics.accuracy_score(y_test, y_predict))

(3) Main function: check the test accuracy

if __name__ == '__main__':
    path = 'data_train.csv'
    x_train, y_train, x_test, y_test, vocab = data_process(path)
    TextCNN_model_1(x_train, y_train, x_test, y_test)

4. Model summary

  • TextCNN processes natural language, and its input is a whole sentence, so the width of the convolution kernel is fixed to the dimension of the word vectors. When convolving with such a kernel, not only the word meaning but also the word order and its context are taken into account.
  • There are two directions for optimizing the TextCNN structure: one is the construction of the word vectors, the other is the tuning of network parameters and hyperparameters.
  • The biggest difference between TextCNN and a standard CNN lies in the dimensionality of the input data. An image is two-dimensional data, and the image convolution kernel slides from left to right and top to bottom to extract features. Natural language is one-dimensional data; although word embedding produces a two-dimensional matrix, sliding a kernel from left to right across a word vector is meaningless, so the kernel only slides top to bottom over the words, as the sketch below demonstrates.
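
A shape-only sketch of that last point (assuming the same Keras API as the code above): a one-dimensional convolution of kernel height 3 over 50 words of 300-dimensional embeddings covers exactly the same inputs as a two-dimensional convolution whose kernel spans the full 300-column embedding width, so in both cases the kernel can only slide over the word dimension:

from keras.layers import Input, Conv1D, Conv2D
from keras.models import Model

# 1-D convolution over text: the kernel implicitly spans the whole embedding width
inp1d = Input(shape=(50, 300))
out1d = Conv1D(256, 3, padding='valid')(inp1d)
print(Model(inp1d, out1d).output_shape)    # (None, 48, 256)

# 2-D view: a 3x300 kernel has no room to slide left-to-right across a word vector
inp2d = Input(shape=(50, 300, 1))
out2d = Conv2D(256, (3, 300), padding='valid')(inp2d)
print(Model(inp2d, out2d).output_shape)    # (None, 48, 1, 256)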

References:

Convolutional Neural Networks for Sentence Classification

https://arxiv.org/abs/1408.5882

Keras Embedding layer documentation (Chinese version)

https://keras.io/zh/layers/embeddings/

Code reference:

https://github.com/Asia-Lee

Author: xianyang94 (zhihu.com/people/xianyang94)
