Written by Ahmed Besbes, compiled by Heart of the Machine.

This article walks through seven models for text classification: the traditional bag-of-words model, recurrent neural networks, convolutional neural networks (usually associated with computer vision), and an RNN + CNN hybrid.

This is an extension of an earlier article I wrote on sentiment analysis with Twitter data (ahmedbesbes.com/sentiment-a…). Back then, I built a simple model: a two-layer feed-forward neural network trained with Keras. The input tweets were represented as document vectors: the weighted average of the embeddings of the words that make up each tweet.

The embeddings came from a word2vec model trained from scratch with Gensim on the corpus. On this binary classification task, the model reached 79% accuracy.

The goal of this paper is to explore other NLP models trained on the same data set and then evaluate the performance of these models on a given test set.

We’ll look at different models, from simple ones that rely on bag-of-words representations to complex ones that deploy convolutional/recurrent networks, and see whether we can beat 79% accuracy!

We will start with simple models and gradually increase their complexity. The point is also to show that simple models can be very effective.

Here is what I will try:

  • Logistic regression with word-level n-grams
  • Logistic regression with character-level n-grams
  • Logistic regression with word-level and character-level n-grams
  • A recurrent neural network (bidirectional GRU) without pre-trained word embeddings
  • A recurrent neural network with GloVe pre-trained word embeddings
  • A multi-channel convolutional neural network
  • An RNN (bidirectional GRU) + CNN model

Boilerplate code for these NLP techniques is provided along the way. It can help you kick-start your own NLP projects and obtain strong results (some of these models are very powerful).

It also provides a comprehensive benchmark from which we can tell which model is best suited to predicting the sentiment of a tweet.

The related GitHub repository contains the different models, their predictions, and the test set. You can try them yourself and verify the results.

import os
import re

import warnings
warnings.simplefilter("ignore", UserWarning)

from matplotlib import pyplot as plt
%matplotlib inline

import pandas as pd
pd.options.mode.chained_assignment = None

import numpy as np
from string import punctuation

from nltk.tokenize import word_tokenize

from tqdm import tqdm, tqdm_notebook  # progress bars used throughout
tqdm.pandas()  # enables the .progress_map() call used below

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, roc_auc_score
from sklearn.externals import joblib

import scipy
from scipy.sparse import hstack


0. Data preprocessing

You can download the dataset from this link (thinknook.com/twitter-sen…).

Load the data and extract the variables we need (the sentiment and the sentiment text).

The dataset contains 1,578,614 labeled tweets, each row marked with a 1 (positive sentiment) or a 0 (negative sentiment).

We use 1/10 of the data for testing and train on the rest.

data = pd.read_csv('./data/tweets.csv', encoding='latin1', usecols=['Sentiment', 'SentimentText'])
data.columns = ['sentiment', 'text']
data = data.sample(frac=1, random_state=42)
print(data.shape)
(1578614, 2)

for row in data.head(10).iterrows():
    print(row[1]['sentiment'], row[1]['text'])

1 http://www.popsugar.com/2999655 keep voting for robert pattinson in the popsugar100 as well!!
1 @GamrothTaylor I am starting to worry about you, only I have Navy Seal type sleep hours.
0 sunburned... no sunbaked!  ow.  it hurts to sit.
1 Celebrating my 50th birthday by doing exactly the same as I do every other day - working on our websites.  It's just another day.
1 Leah and Aiden Gosselin are the cutest kids on the face of the Earth
1 @MissHell23 Oh. I didn't even notice.
0 WTF is wrong with me?!!!!!!!!! I'm completely miserable. I need to snap out of this
0 Was having the best time in the gym until I got to the car and had messages waiting for me... back to the down stage!
1 @JENTSYY oh what happened??
0 @catawu Ghod forbid he should feel responsible for anything!

Tweet data is noisy. To clean it up, we remove URLs, hashtags, and user mentions.

def tokenize(tweet):
    tweet = re.sub(r'http\S+', ' ', tweet)   # remove URLs
    tweet = re.sub(r"#(\w+)", ' ', tweet)    # remove hashtags
    tweet = re.sub(r"@(\w+)", ' ', tweet)    # remove user mentions
    tweet = re.sub(r'[^\w\s]', ' ', tweet)   # remove punctuation
    tweet = tweet.strip().lower()
    tokens = word_tokenize(tweet)
    return tokens

Save the cleaned data to the hard disk.

data['tokens'] = data.text.progress_map(tokenize)
data['cleaned_text'] = data['tokens'].map(lambda tokens: ' '.join(tokens))
data[['sentiment', 'cleaned_text']].to_csv('./data/cleaned_text.csv')

data = pd.read_csv('./data/cleaned_text.csv')
print(data.shape)
(1575026, 2)

data.head()

Now that the dataset is cleaned, we can split it into training and test sets to build the models.

We split the data as follows:

x_train, x_test, y_train, y_test = train_test_split(data['cleaned_text'],
                                                    data['sentiment'],
                                                    test_size=0.1,
                                                    random_state=42,
                                                    stratify=data['sentiment'])

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
(1417523,) (157503,) (1417523,) (157503,)

Store test set labels on hard disk for later use.

pd.DataFrame(y_test).to_csv('./predictions/y_true.csv', index=False, encoding='utf-8')

Now we can start applying machine learning methods.


1. Bag-of-words model with word-level n-grams

So, what is an N-gram?

As the figure shows, n-grams are all the combinations of adjacent words of length n that can be found in the source text.

Our model will use unigrams (n=1) and bigrams (n=2) as features.

The dataset is represented by a matrix, where each row is a tweet and each column is a feature (unigram or bigram) extracted from the (tokenized and cleaned) tweets. Each cell holds a TF-IDF score (simpler values could be used, but TF-IDF is more general and usually works better). We call this matrix the document-term matrix.

With 1.5 million tweets in the corpus, the number of distinct unigrams and bigrams is still very large. For computational reasons, we cap this number at a fixed value, which can be determined by cross-validation.
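For illustration, here is a minimal sketch (not code from the original article; the candidate values and the 3-fold setup are arbitrary) of how that cap could be tuned by cross-validation with a scikit-learn pipeline:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Pipeline: TF-IDF vectorization followed by logistic regression.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('lr', LogisticRegression(solver='sag')),
])

# Cross-validate a few candidate vocabulary sizes (40,000 is used below).
grid = GridSearchCV(pipeline,
                    param_grid={'tfidf__max_features': [10000, 20000, 40000]},
                    scoring='accuracy',
                    cv=3)
# grid.fit(x_train, y_train)  # expensive on 1.4M tweets; consider a subsample
# print(grid.best_params_)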

After vectorization, the corpus is shown in the figure below:

I like pizza a lot

Suppose we want the model to predict the sentiment of this sentence using the features described above.

Since we use unigrams and bigrams, the sentence is represented by the following features:

i, like, pizza, a, lot, i like, like pizza, pizza a, a lot

The sentence thus becomes a vector of size N (the total number of n-grams in the vocabulary), containing mostly zeros plus the TF-IDF scores of these n-grams. In practice, we are dealing with a large, sparse vector.
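As a quick check (a toy sketch, separate from the vectorizer configured below), scikit-learn produces exactly those unigram and bigram features for the sentence:

from sklearn.feature_extraction.text import CountVectorizer

# Toy example: a relaxed token_pattern keeps one-letter words such as "i" and "a".
toy_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")
toy_vectorizer.fit(["I like pizza a lot"])
print(toy_vectorizer.get_feature_names())
['a', 'a lot', 'i', 'i like', 'like', 'like pizza', 'lot', 'pizza', 'pizza a']

In the document-term matrix, each of these columns would hold the n-gram's TF-IDF score rather than its raw count.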

In general, linear models handle large, sparse data well. They are also faster to train than most other models.

From past experience, logistic regression works well on top of sparse TF-IDF matrices.

vectorizer_word = TfidfVectorizer(max_features=40000, min_df=5, max_df=0.5, analyzer='word', 
                             stop_words='english', 
                             ngram_range=(1, 2))

vectorizer_word.fit(tqdm_notebook(x_train, leave=False))

tfidf_matrix_word_train = vectorizer_word.transform(x_train)
tfidf_matrix_word_test = vectorizer_word.transform(x_test)

After the TF-IDF matrix is generated for the training set and test set, the first model can be built and tested.

The TF-IDF matrices serve as features for the logistic regression.

lr_word = LogisticRegression(solver='sag', verbose=2)
lr_word.fit(tfidf_matrix_word_train, y_train)

Once the model has been trained, it can be applied to test data to obtain predicted values. These values are then stored on hard disk along with the model.

joblib.dump(lr_word, './models/lr_word_ngram.pkl')

y_pred_word = lr_word.predict(tfidf_matrix_word_test)
pd.DataFrame(y_pred_word, columns=['y_pred']).to_csv('./predictions/lr_word_ngram.csv', index=False)

Get the accuracy rate:

y_pred_word = pd.read_csv('./predictions/lr_word_ngram.csv')
print(accuracy_score(y_test, y_pred_word))
0.782042246814

The first model was 78.2 percent accurate! Not bad. Let’s take a look at the second model.


2. Bag-of-words model with character-level n-grams

We never said n-grams are only for words; they can be applied to characters as well.

We will use the same code as above, but with character-level n-grams, going all the way up to 4-grams.

Basically, this means that a sentence like “I like this movie” will have the following features:

I, l, i, k, e, … , I li, lik, like, … , this, … , is m, s mo, movi, …
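A minimal sketch of the same idea (a toy illustration, not the exact vectorizer used below; note that analyzer='char' also includes spaces in the n-grams):

from sklearn.feature_extraction.text import CountVectorizer

# Character 1- to 4-grams of a single (lowercased) sentence.
char_toy = CountVectorizer(analyzer='char', ngram_range=(1, 4))
char_toy.fit(["I like this movie"])
print(char_toy.get_feature_names())
# includes entries such as 'i', 'l', 'lik', 'like', 'this', 'is m', 'movi', ...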

Character-level n-grams are very effective and can even outperform words on language modeling tasks. Tasks such as spam filtering or language identification rely heavily on character-level n-grams.

Unlike the previous model, which learned word combinations, this model learns letter combinations, so it can handle the morphological makeup of words.

One advantage of character-level representations is that they cope better with misspelled words.

Let’s run the same process:

vectorizer_char = TfidfVectorizer(max_features=40000, min_df=5, max_df=0.5, analyzer='char', 
                             ngram_range=(1, 4))

vectorizer_char.fit(tqdm_notebook(x_train, leave=False));

tfidf_matrix_char_train = vectorizer_char.transform(x_train)
tfidf_matrix_char_test = vectorizer_char.transform(x_test)

lr_char = LogisticRegression(solver='sag', verbose=2)
lr_char.fit(tfidf_matrix_char_train, y_train)

y_pred_char = lr_char.predict(tfidf_matrix_char_test)
joblib.dump(lr_char, './models/lr_char_ngram.pkl')

pd.DataFrame(y_pred_char, columns=['y_pred']).to_csv('./predictions/lr_char_ngram.csv', index=False)
y_pred_char = pd.read_csv('./predictions/lr_char_ngram.csv')
print(accuracy_score(y_test, y_pred_char))
0.80420055491

80.4% accuracy! The character-level n-gram model performs better than the word-level one.

3. Bag-of-words model with word-level and character-level n-grams

Character-level n-gram features seem to provide better accuracy than word-level n-gram features. What about combining the two?

We concatenate the two TF-IDF matrices to build a new, hybrid TF-IDF matrix. This model helps learn both the morphological structure of words and the structure of the words likely to appear next to them.

Let's combine these features.

tfidf_matrix_word_char_train =  hstack((tfidf_matrix_word_train, tfidf_matrix_char_train))
tfidf_matrix_word_char_test =  hstack((tfidf_matrix_word_test, tfidf_matrix_char_test))

lr_word_char = LogisticRegression(solver='sag', verbose=2)
lr_word_char.fit(tfidf_matrix_word_char_train, y_train)

y_pred_word_char = lr_word_char.predict(tfidf_matrix_word_char_test)
joblib.dump(lr_word_char, './models/lr_word_char_ngram.pkl')

pd.DataFrame(y_pred_word_char, columns=['y_pred']).to_csv('./predictions/lr_word_char_ngram.csv', index=False)
y_pred_word_char = pd.read_csv('./predictions/lr_word_char_ngram.csv')
print(accuracy_score(y_test, y_pred_word_char))
0.81423845895

The accuracy reached 81.4%. We gained a full point simply by combining the two feature sets, and the result beats both previous models.

About bag-of-words models

  • Pros: Considering their simplicity, bag-of-words models are surprisingly powerful; they are fast to train and easy to interpret.
  • Cons: Even though n-grams capture some context between words, bag-of-words models cannot model long-term dependencies between the words of a sequence.

Now we turn to deep learning models. They outperform bag-of-words models because they can capture the sequential dependencies between the words of a sentence, thanks to a special neural network architecture: the recurrent neural network.

This article does not cover the theory behind RNNs, but this link (colah.github.io/posts/2015-…) is worth reading. It comes from Christopher Olah’s blog and describes a particular RNN model: the Long Short-Term Memory network (LSTM).

Before starting, set up a dedicated deep learning environment that runs Keras on top of TensorFlow. To be honest, I tried running this code on my personal laptop, but it was impractical given the size of the dataset and the complexity of the RNN architecture. A good alternative is AWS. I generally use a p2.xlarge EC2 instance with the Deep Learning AMI (aws.amazon.com/marketplace…). The Amazon AMI is a pre-configured VM image with all the packages (TensorFlow, PyTorch, Keras, etc.) installed. Highly recommended!

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

from keras.models import Model
from keras.models import Sequential

from keras.layers import Input, Dense, Embedding, Conv1D, Conv2D, MaxPooling1D, MaxPool2D
from keras.layers import Reshape, Flatten, Dropout, Concatenate
from keras.layers import SpatialDropout1D, concatenate
from keras.layers import GRU, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D

from keras.callbacks import Callback
from keras.optimizers import Adam

from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import load_model
from keras.utils.vis_utils import plot_model


4. Recurrent neural network without pre-trained word embeddings

RNNs might look scary. Although their complexity makes them hard to grasp at first, they are very interesting. The RNN model encapsulates a very elegant design that overcomes the shortcomings of traditional neural networks when processing sequence data (text, time series, video, DNA sequences, etc.).

An RNN is a series of neural network modules chained together, each passing a message to its successor. I highly recommend reading Colah’s blog for a deeper look at its internals, which the figure below illustrates.

The type of sequence we are dealing with is text data. Word order is important for meaning. RNN takes this into account and can capture long-term dependencies.

To use Keras on text data, we first have to preprocess it. We can use Keras’s Tokenizer class, which takes num_words as an argument: the maximum number of words kept after tokenization, based on word frequency.

MAX_NB_WORDS = 80000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)

tokenizer.fit_on_texts(data['cleaned_text'])

Once the tokenizer is fitted on the data, we can use it to convert texts to sequences of integers.

These numbers represent each word’s position in the dictionary (think of it as a map).

As shown in the following example:

x_train[15]
'breakfast time happy time'

Here is how the tokenizer converts it into a sequence of integers:

tokenizer.texts_to_sequences([x_train[15]])
[[530, 50, 119, 50]]

We then apply the tokenizer to the training and test sets:

train_sequences = tokenizer.texts_to_sequences(x_train)
test_sequences = tokenizer.texts_to_sequences(x_test)

The tweets are now mapped to lists of integers. However, they have different lengths, so we cannot stack them into a matrix. Luckily, Keras lets us pad sequences with zeros up to a maximum length. We set this length to 35 (the maximum number of tokens in a tweet).

MAX_LENGTH = 35
padded_train_sequences = pad_sequences(train_sequences, maxlen=MAX_LENGTH)
padded_test_sequences = pad_sequences(test_sequences, maxlen=MAX_LENGTH)

padded_train_sequences
array([[    0,     0,     0, ...,  2383,   284,     9],
       [    0,     0,     0, ...,    13,    30,    76],
       [    0,     0,     0, ...,    19,    37, 45231],
       ...,
       [    0,     0,     0, ...,    43,   502,  1653],
       [    0,     0,     0, ...,     5,  1045,   890],
       [    0,     0,     0, ..., 13748, 38750,   154]])

padded_train_sequences.shape
(1417523, 35)

Now you are ready to pass the data to the RNN.

Here are some elements of the architecture I will use:

  • The embedding dimension is 300. This means that each of the 80,000 words we use is mapped to a dense (floating-point) vector of dimension 300. The mapping is adjusted during training.
  • A spatial dropout layer is applied on top of the embedding layer to reduce overfitting: it takes the 35×300 matrices of a batch and randomly drops (sets to 0) word vectors (rows) in each matrix. This prevents the model from relying on specific words and helps it generalize.
  • Bidirectional gated recurrent unit (GRU): this is the recurrent part of the network. It is a faster variant of the LSTM architecture. Think of it as two recurrent networks that scan the text sequence in both directions at once: left to right and right to left. This allows the network, when reading a given word, to use both what comes before it and what comes after it. The output dimension (number of units) of each RNN block, h_t, is set to 100. Since the GRU is bidirectional, the final output of each block is 200-dimensional.

The output of the bidirectional GRU has shape (batch size, time steps, units). With a classic batch size of 256, the shape is (256, 35, 200).

Global average pooling is then applied: it averages the output vectors over the time steps (i.e., over the words of the tweet).

  • We apply the same operation with max pooling instead of average pooling.
  • The outputs of these two pooling operations are concatenated (a small NumPy sketch of the shape bookkeeping follows this list).
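To make the shapes concrete, here is a small NumPy sketch (an illustration only; a random array stands in for the GRU output instead of the actual Keras layers):

import numpy as np

# Stand-in for the bidirectional GRU output: batch of 256 tweets,
# 35 time steps, 200 features (100 GRU units per direction).
gru_output = np.random.rand(256, 35, 200)

avg_pool = gru_output.mean(axis=1)   # average over time steps -> (256, 200)
max_pool = gru_output.max(axis=1)    # maximum over time steps -> (256, 200)
conc = np.concatenate([avg_pool, max_pool], axis=1)

print(avg_pool.shape, max_pool.shape, conc.shape)
(256, 200) (256, 200) (256, 400)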
def get_simple_rnn_model():
    embedding_dim = 300
    embedding_matrix = np.random.random((MAX_NB_WORDS, embedding_dim))

    inp = Input(shape=(MAX_LENGTH, ))
    x = Embedding(input_dim=MAX_NB_WORDS,
                  output_dim=embedding_dim,
                  input_length=MAX_LENGTH,
                  weights=[embedding_matrix],
                  trainable=True)(inp)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(GRU(100, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(1, activation="sigmoid")(conc)

    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

rnn_simple_model = get_simple_rnn_model()

The different layers of the model are shown below:

plot_model(rnn_simple_model, 
           to_file='./images/article_5/rnn_simple_model.png', 
           show_shapes=True, 
           show_layer_names=True)

We use model checkpointing during training, so that the best model so far (as measured by validation accuracy) is automatically saved to disk at the end of each epoch.

filepath="./models/rnn_no_embeddings/weights-improvement-{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

batch_size = 256
epochs = 2

history = rnn_simple_model.fit(x=padded_train_sequences, 
                    y=y_train, 
                    validation_data=(padded_test_sequences, y_test), 
                    batch_size=batch_size, 
                    callbacks=[checkpoint], 
                    epochs=epochs, 
                    verbose=1)

best_rnn_simple_model = load_model('./models/rnn_no_embeddings/weights-improvement-01-0.8262.hdf5')

y_pred_rnn_simple = best_rnn_simple_model.predict(padded_test_sequences, verbose=1, batch_size=2048)

y_pred_rnn_simple = pd.DataFrame(y_pred_rnn_simple, columns=['prediction'])
y_pred_rnn_simple['prediction'] = y_pred_rnn_simple['prediction'].map(lambda p: 1 if p >= 0.5 else 0)
y_pred_rnn_simple.to_csv('./predictions/y_pred_rnn_simple.csv', index=False)
y_pred_rnn_simple = pd.read_csv('./predictions/y_pred_rnn_simple.csv')
print(accuracy_score(y_test, y_pred_rnn_simple))
0.826219183127

The accuracy rate reached 82.6%! That’s a pretty good result! The current model already performs better than the previous bag of words model because we take into account the sequential nature of the text.

Can we do better?


5. Recurrent neural network with GloVe pre-trained word embeddings

In the last model, the embedding matrix was initialized randomly. What if we initialized it with pre-trained word embeddings instead? Take the word "pizza", for example. Following the previous scheme, it gets a 300-dimensional vector of random floats, which is perfectly fine: the embedding can be adjusted during training. But instead of a random vector, we could use the embedding of "pizza" produced by another model trained on a much larger corpus. This is a special kind of transfer learning.

Using knowledge from external embeddings can improve the accuracy of the RNN, because it integrates new information (lexical and semantic) about the words, learned and refined on a very large corpus.

The pre-trained embeddings we use are GloVe embeddings.

The official description: GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

The GloVe embeddings used in this article were trained on a very large web crawl, covering:

  • 840 billion tokens;
  • a vocabulary of 2.2 million words.

The compressed file is a 2.03 GB download. Note that it cannot easily be loaded on a standard laptop.

The GloVe embeddings have 300 dimensions.

The GloVe embeddings come as raw text: each line contains a word followed by 300 floats (its embedding). The first step is to convert this file into a Python dictionary.

def get_coefs(word, *arr):
    try:
        return word, np.asarray(arr, dtype='float32')
    except:
        return None, None

embeddings_index = dict(get_coefs(*o.strip().split()) for o in tqdm_notebook(open('./embeddings/glove.840B.300d.txt')))

embed_size=300
for k in tqdm_notebook(list(embeddings_index.keys())):
    v = embeddings_index[k]
    try:
        if v.shape != (embed_size, ):
            embeddings_index.pop(k)
    except:
        pass

embeddings_index.pop(None)

Once the embedding index is built, we extract all the vectors, stack them together, and compute their mean and standard deviation.

values = list(embeddings_index.values())
all_embs = np.stack(values)

emb_mean, emb_std = all_embs.mean(), all_embs.std()

Now we generate the embedding matrix. We initialize it with random values drawn from a normal distribution with mean emb_mean and standard deviation emb_std, then iterate over the 80,000 words of our vocabulary. For every word found in GloVe, we copy its GloVe embedding; otherwise we skip it and keep the random initialization.

word_index = tokenizer.word_index
nb_words = MAX_NB_WORDS
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

oov = 0
for word, i in tqdm_notebook(word_index.items()):
    if i >= MAX_NB_WORDS: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        oov += 1

print(oov)


def get_rnn_model_with_glove_embeddings():
    embedding_dim = 300

    inp = Input(shape=(MAX_LENGTH, ))
    x = Embedding(MAX_NB_WORDS, embedding_dim, weights=[embedding_matrix],
                  input_length=MAX_LENGTH, trainable=True)(inp)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(GRU(100, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(1, activation="sigmoid")(conc)

    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

rnn_model_with_embeddings = get_rnn_model_with_glove_embeddings()

filepath="./models/rnn_with_embeddings/weights-improvement-{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

batch_size = 256
epochs = 4

history = rnn_model_with_embeddings.fit(x=padded_train_sequences, 
                    y=y_train, 
                    validation_data=(padded_test_sequences, y_test), 
                    batch_size=batch_size, 
                    callbacks=[checkpoint], 
                    epochs=epochs, 
                    verbose=1)

best_rnn_model_with_glove_embeddings = load_model('./models/rnn_with_embeddings/weights-improvement-03-0.8372.hdf5')

y_pred_rnn_with_glove_embeddings = best_rnn_model_with_glove_embeddings.predict(
    padded_test_sequences, verbose=1, batch_size=2048)

y_pred_rnn_with_glove_embeddings = pd.DataFrame(y_pred_rnn_with_glove_embeddings, columns=['prediction'])
y_pred_rnn_with_glove_embeddings['prediction'] = y_pred_rnn_with_glove_embeddings['prediction'].map(lambda p: 1 if p >= 0.5 else 0)
y_pred_rnn_with_glove_embeddings.to_csv('./predictions/y_pred_rnn_with_glove_embeddings.csv', index=False)
y_pred_rnn_with_glove_embeddings = pd.read_csv('./predictions/y_pred_rnn_with_glove_embeddings.csv')
print(accuracy_score(y_test, y_pred_rnn_with_glove_embeddings))
0.837203100893

The accuracy reached 83.7%! Transfer learning from external word embeddings works! The rest of this tutorial uses GloVe embeddings in the embedding matrix.


6. Multi-channel convolutional neural network

In this part, I experiment with a convolutional neural network architecture I read about here (www.wildml.com/2015/11/und…). CNNs are usually used for computer vision tasks, but they have recently been applied to NLP tasks, and the results are quite promising.

Let's take a quick look at what happens when a convolutional network is used on text data. To explain it, I borrowed this well-known figure (shown below) from wildml.com, which is a great blog.

Consider the example it uses: "I like this movie very much!" (7 tokens)

  • Each word is embedded in 5 dimensions, so the sentence is represented by a matrix of shape (7, 5). You can think of it as an "image" (a matrix of numbers or floats).
  • Six filters are used: two each of sizes (2, 5), (3, 5), and (4, 5). These filters are applied to the matrix. What makes them special is that they are not square: their width always equals the width of the embedding matrix, so the result of each convolution is a column vector.
  • Each column vector produced by the convolutions is down-sampled with a max-pooling operation.
  • The outputs of the max-pooling operations are concatenated into a final vector, which is passed to a softmax function for classification.


What’s the rationale behind it?

The result of each convolution fires when a particular pattern is detected. By varying the size of the kernels and concatenating their outputs, you allow the network to detect patterns of several sizes (2, 3, or 5 adjacent words).

Patterns can be expressions such as "I hate" or "very good" (word-level n-grams!), so the CNN can detect them in a sentence regardless of their position.

def get_cnn_model():
    embedding_dim = 300

    filter_sizes = [2, 3, 5]
    num_filters = 256
    drop = 0.3

    inputs = Input(shape=(MAX_LENGTH,), dtype='int32')
    embedding = Embedding(input_dim=MAX_NB_WORDS,
                                output_dim=embedding_dim,
                                weights=[embedding_matrix],
                                input_length=MAX_LENGTH,
                                trainable=True)(inputs)

    reshape = Reshape((MAX_LENGTH, embedding_dim, 1))(embedding)
    conv_0 = Conv2D(num_filters, 
                    kernel_size=(filter_sizes[0], embedding_dim), 
                    padding='valid', kernel_initializer='normal', 
                    activation='relu')(reshape)

    conv_1 = Conv2D(num_filters, 
                    kernel_size=(filter_sizes[1], embedding_dim), 
                    padding='valid', kernel_initializer='normal', 
                    activation='relu')(reshape)
    conv_2 = Conv2D(num_filters, 
                    kernel_size=(filter_sizes[2], embedding_dim), 
                    padding='valid', kernel_initializer='normal', 
                    activation='relu')(reshape)

    maxpool_0 = MaxPool2D(pool_size=(MAX_LENGTH - filter_sizes[0] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_0)
    maxpool_1 = MaxPool2D(pool_size=(MAX_LENGTH - filter_sizes[1] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_1)
    maxpool_2 = MaxPool2D(pool_size=(MAX_LENGTH - filter_sizes[2] + 1, 1),
                          strides=(1, 1), padding='valid')(conv_2)
    concatenated_tensor = Concatenate(axis=1)(
        [maxpool_0, maxpool_1, maxpool_2])
    flatten = Flatten()(concatenated_tensor)
    dropout = Dropout(drop)(flatten)
    output = Dense(units=1, activation='sigmoid')(dropout)

    model = Model(inputs=inputs, outputs=output)
    adam = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])

    return model

cnn_model_multi_channel = get_cnn_model()

plot_model(cnn_model_multi_channel, 
           to_file='./images/article_5/cnn_model_multi_channel.png', 
           show_shapes=True, 
           show_layer_names=True)

filepath="./models/cnn_multi_channel/weights-improvement-{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

batch_size = 256
epochs = 4

history = cnn_model_multi_channel.fit(x=padded_train_sequences, 
                    y=y_train, 
                    validation_data=(padded_test_sequences, y_test), 
                    batch_size=batch_size, 
                    callbacks=[checkpoint], 
                    epochs=epochs, 
                    verbose=1)

best_cnn_model = load_model('./models/cnn_multi_channel/weights-improvement-04-0.8264.hdf5')

y_pred_cnn_multi_channel = best_cnn_model.predict(padded_test_sequences, verbose=1, batch_size=2048)

y_pred_cnn_multi_channel = pd.DataFrame(y_pred_cnn_multi_channel, columns=['prediction'])
y_pred_cnn_multi_channel['prediction'] = y_pred_cnn_multi_channel['prediction'].map(lambda p: 1 if p >= 0.5 else 0)
y_pred_cnn_multi_channel.to_csv('./predictions/y_pred_cnn_multi_channel.csv', index=False)
y_pred_cnn_multi_channel = pd.read_csv('./predictions/y_pred_cnn_multi_channel.csv')
print(accuracy_score(y_test, y_pred_cnn_multi_channel))
0.826409655689

The accuracy is 82.6%, not as high as the RNN, but still better than the bag-of-words models. Maybe tuning the hyperparameters (number and size of the filters) would bring some improvement?


7. RNN + CNN

RNNs are powerful. However, some have found that the network can be made even more robust by stacking a convolutional layer on top of the recurrent layer.

The rationale is that the RNN produces, for each word, an embedding that carries information about the preceding sequence, and the CNN can then extract local features from these embeddings. Having these two layers work together is a winning combination.

For more information, see: konukoii.com/blog/2018/0…

def get_rnn_cnn_model():
    embedding_dim = 300

    inp = Input(shape=(MAX_LENGTH, ))
    x = Embedding(MAX_NB_WORDS, embedding_dim, weights=[embedding_matrix],
                  input_length=MAX_LENGTH, trainable=True)(inp)
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(GRU(100, return_sequences=True))(x)
    x = Conv1D(64, kernel_size=2, padding="valid", kernel_initializer="he_uniform")(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(1, activation="sigmoid")(conc)

    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

rnn_cnn_model = get_rnn_cnn_model()

plot_model(rnn_cnn_model, to_file='./images/article_5/rnn_cnn_model.png', show_shapes=True, show_layer_names=True)

filepath="./models/rnn_cnn/weights-improvement-{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

batch_size = 256
epochs = 4

history = rnn_cnn_model.fit(x=padded_train_sequences, 
                    y=y_train, 
                    validation_data=(padded_test_sequences, y_test), 
                    batch_size=batch_size, 
                    callbacks=[checkpoint], 
                    epochs=epochs, 
                    verbose=1)

best_rnn_cnn_model = load_model('./models/rnn_cnn/weights-improvement-03-0.8379.hdf5')

y_pred_rnn_cnn = best_rnn_cnn_model.predict(padded_test_sequences, verbose=1, batch_size=2048)

y_pred_rnn_cnn = pd.DataFrame(y_pred_rnn_cnn, columns=['prediction'])
y_pred_rnn_cnn['prediction'] = y_pred_rnn_cnn['prediction'].map(lambda p: 1 if p >= 0.5 else 0)
y_pred_rnn_cnn.to_csv('./predictions/y_pred_rnn_cnn.csv', index=False)
y_pred_rnn_cnn = pd.read_csv('./predictions/y_pred_rnn_cnn.csv')
print(accuracy_score(y_test, y_pred_rnn_cnn))
0.837882453033

This resulted in 83.8 percent accuracy, the best result to date.


8. To summarize

After running the 7 different models, we can compare them:

import seaborn as sns
from sklearn.metrics import roc_auc_score
sns.set_style("whitegrid")
sns.set_palette("pastel")

predictions_files = os.listdir('./predictions/')

predictions_dfs = []
for f in predictions_files:
    aux = pd.read_csv('./predictions/{0}'.format(f))
    aux.columns = [f.strip('.csv')]
    predictions_dfs.append(aux)

predictions = pd.concat(predictions_dfs, axis=1)

scores = {}

for column in tqdm_notebook(predictions.columns, leave=False):
    if column != 'y_true':
        s = accuracy_score(predictions['y_true'].values, predictions[column].values)
        scores[column] = s

scores = pd.DataFrame([scores], index=['accuracy'])

mapping_name = dict(zip(list(scores.columns), 
                        ['Char ngram + LR', '(Word + Char ngram) + LR', 'Word ngram + LR', 'CNN (multi channel)',
                         'RNN + CNN', 'RNN no embd.', 'RNN + GloVe embds.']))

scores = scores.rename(columns=mapping_name)
scores = scores[['Word ngram + LR', 'Char ngram + LR', '(Word + Char ngram) + LR', 'RNN no embd.',
                 'RNN + GloVe embds.', 'CNN (multi channel)', 'RNN + CNN']]

scores = scores.T

ax = scores['accuracy'].plot(kind='bar',
                             figsize=(16, 5),
                             ylim=(scores.accuracy.min() * 0.97, scores.accuracy.max() * 1.01),
                             color='red',
                             alpha=0.75,
                             rot=45,
                             fontsize=13)
ax.set_title('Comparative accuracy of the different models')

for i in ax.patches:
    ax.annotate(str(round(i.get_height(), 3)),
                (i.get_x() + 0.1, i.get_height() * 1.002),
                color='dimgrey',
                fontsize=14)

We can also quickly check how correlated the predictions of these models are.

fig = plt.figure(figsize=(10, 5))
sns.heatmap(predictions.drop('y_true', axis=1).corr(method='kendall'), cmap="Blues", annot=True);

Conclusion

Here are a few findings I thought were worth sharing:

  • Bag-of-words models with character-level n-grams work well. Don't underestimate bag-of-words models: they are computationally cheap and easy to interpret.
  • RNNs are powerful. You can load external pre-trained embeddings such as GloVe into an RNN model, or use other common embeddings such as word2vec and FastText.
  • CNNs can be applied to text as well. Their main advantage is training speed. Besides, their ability to extract local features from text is interesting for NLP tasks.
  • RNNs and CNNs can be stacked together to take advantage of both architectures at once.

This was a long article. I hope you found it helpful.

Original link: ahmedbesbes.com/overview-an…