2 Updated Sentiment Analysis

In the previous article, we learned the basic workflow of sentiment analysis. Here we look at how to improve and optimize that model: how to use packed padded sequences, how to load and use pre-trained word vectors, how to use a different optimizer, how to choose a different RNN architecture (including bidirectional and multi-layer RNNs), and how to apply regularization.

The main contents of this chapter are as follows:

  • Packed padded sequences
  • Pre-trained word embeddings
  • LSTM
  • Bidirectional RNN
  • Multi-layer RNN
  • Regularization
  • A different optimizer

2.1 Preparing Data

First we set the random seed and split the data into training, validation, and test sets.

When preparing the data, note that we want the RNN to process only the non-padded elements of each sequence; the output for any padded position will simply be a zero tensor (after unpacking). To make this possible we set include_lengths = True, so that besides the numericalized text we also get the actual length of each sentence, which we will use later for packing.

import torch
from torchtext.legacy import data

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  include_lengths = True)

LABEL = data.LabelField(dtype = torch.float)

Load the IMDb dataset

from torchtext.legacy import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

Split off part of the training set as the validation set:

import random

train_data, valid_data = train_data.split(random_state = random.seed(SEED))

2.2 Word Vectors

Next, we initialize the embedding with pre-trained word vectors, which are loaded by passing the corresponding arguments to build_vocab.

Here we use the GloVe word vectors (GloVe stands for Global Vectors for Word Representation); a detailed introduction and many resources are available here. In this tutorial we will not explain how these vectors are created, only how to use them. We use "glove.6B.100d": 6B means the vectors were trained on 6 billion tokens, and 100d means each word vector has 100 dimensions (note that the download is over 800 MB).

Of course, you can also choose other word vectors. In theory, the distances between these pre-trained vectors in the embedding space reflect, to some extent, the semantic relationships between words; for example, the vectors for "terrible", "awful" and "dreadful" should lie close together.

TEXT.build_vocab builds the vocabulary (Vocab) from the training data and attaches the pre-trained vector for each word in it. Words that do not appear in the pre-trained vocabulary (treated as UNK, unknown) are initialized randomly from a Gaussian distribution via unk_init = torch.Tensor.normal_.

MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)
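As a quick optional sanity check of the claim that semantically similar words end up close together in the embedding space, we can compare the cosine similarity of two word vectors. This is a minimal sketch, assuming the vocabulary above has been built and that the chosen words are among the 25,000 kept:

import torch.nn.functional as F

def word_similarity(w1, w2):
    # look up the pre-trained vectors and compute their cosine similarity
    v1 = TEXT.vocab.vectors[TEXT.vocab.stoi[w1]].unsqueeze(0)
    v2 = TEXT.vocab.vectors[TEXT.vocab.stoi[w2]].unsqueeze(0)
    return F.cosine_similarity(v1, v2).item()

print(word_similarity('terrible', 'dreadful'))   # expected to be relatively high
print(word_similarity('terrible', 'wonderful'))  # expected to be lower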

2.3 Creating an Iterator + Selecting a GPU

BATCH_SIZE = 64

# Use the GPU if one is available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create the data iterators
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

2.4 Model Building

LSTM

LSTM is a variant of the standard RNN that adds a mechanism for carrying information across many time steps, which overcomes, to some extent, the vanishing-gradient problem of standard RNNs. Specifically, the LSTM adds a memory cell $c_t$, which can be regarded as the "memory" of the LSTM: it stores the state at time $t$ and can be thought of as holding all the information from the past that is needed up to time $t$. A set of gates controls how information flows into and out of this memory. The LSTM can therefore be viewed as a function of $x_t$, $h_{t-1}$ and $c_{t-1}$, rather than just $x_t$ and $h_{t-1}$:


$(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$

Thus, the structure of a model using an LSTM looks like the following (the embedding layer is omitted):

As with the initial hidden state, the initial memory state $c_0$ is initialized to an all-zero tensor. Note that the sentiment prediction uses only the final hidden state, not the final memory cell state, i.e. $\hat{y} = f(h_T)$.
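To make this concrete, here is a minimal standalone sketch (arbitrary sizes, separate from the model we build below) showing that nn.LSTM takes and returns both a hidden state and a cell state:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256)   # example sizes only
x = torch.randn(7, 64, 100)                       # [sent len, batch size, emb dim]
h0 = torch.zeros(1, 64, 256)                      # initial hidden state h_0
c0 = torch.zeros(1, 64, 256)                      # initial cell (memory) state c_0

output, (h_n, c_n) = lstm(x, (h0, c0))
print(output.shape)  # [7, 64, 256], the hidden state at every time step
print(h_n.shape)     # [1, 64, 256], the final hidden state
print(c_n.shape)     # [1, 64, 256], the final cell state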

Bidirectional RNN

A bidirectional RNN adds a second RNN layer that processes the sequence in reverse on top of the standard forward RNN. At each time step the hidden states of the two RNNs are concatenated to form the final hidden state vector. That is, at time step $t$ the forward RNN processes word $x_t$, while the backward RNN processes word $x_{T-t+1}$. With this two-way processing, the hidden state for each word aggregates information from both the left and the right context, so the resulting vectors encode more balanced information.

We use the last hidden state of the forward RNN (obtained after processing the last word of the sentence), $h_T^\rightarrow$, and the last hidden state of the backward RNN (obtained after processing the first word of the sentence), $h_T^\leftarrow$, for the sentiment prediction: $\hat{y} = f(h_T^\rightarrow, h_T^\leftarrow)$. The figure below shows a bidirectional RNN, with the forward RNN in orange, the backward RNN in green, and the linear layer in silver.
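A small standalone sketch (arbitrary sizes, separate from the model below) of where $h_T^\rightarrow$ and $h_T^\leftarrow$ come from when bidirectional=True, mirroring what the model code does later:

import torch
import torch.nn as nn

bi_lstm = nn.LSTM(input_size=100, hidden_size=256, bidirectional=True)
x = torch.randn(7, 64, 100)      # [sent len, batch size, emb dim]
_, (hidden, _) = bi_lstm(x)      # hidden: [num layers * num directions, batch size, hid dim]

h_forward = hidden[-2, :, :]     # final hidden state of the forward RNN
h_backward = hidden[-1, :, :]    # final hidden state of the backward RNN
combined = torch.cat((h_forward, h_backward), dim=1)
print(combined.shape)            # [64, 512], i.e. [batch size, hid dim * 2]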

Multi-layer RNN

A multi-layer RNN (also called a deep RNN) stacks several RNN layers on top of the standard RNN. The hidden state output by the first (bottom) RNN at time step $t$ becomes the input to the RNN above it at the same time step, and the prediction is made from the final hidden state of the final (top) layer. The figure below shows a multi-layer unidirectional RNN, where the layer number is given as a superscript. Also note that each layer needs its own initial hidden state.
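A similar sketch for stacking layers (again arbitrary sizes, separate from the model below): the per-step hidden states of layer 1 become the inputs of layer 2, and hidden holds one final hidden state per layer and direction:

import torch
import torch.nn as nn

stacked = nn.LSTM(input_size=100, hidden_size=256, num_layers=2)
x = torch.randn(7, 64, 100)
output, (hidden, cell) = stacked(x)
print(output.shape)  # [7, 64, 256], the per-step hidden states of the top layer only
print(hidden.shape)  # [2, 64, 256], one final hidden state per layer
# With bidirectional=True as well, hidden has shape [num layers * num directions, batch size, hid dim],
# ordered layer by layer with directions innermost, so hidden[-2] and hidden[-1] are the top
# layer's forward and backward final states. This is what the model below relies on.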

Regularization

We have improved the model in several ways, but note that as the number of parameters grows, so does the risk of overfitting. To counter this we add dropout regularization. Dropout works by randomly dropping (setting to zero) neurons in a layer during the forward pass. Whether each neuron is dropped is decided independently, with a probability set by a hyperparameter, unaffected by the other neurons.

One theory of why dropout works is that a model with dropped-out parameters can be viewed as a "weaker" (less parameterized) model. The final model can then be thought of as an ensemble of all these weaker models, none of which is over-parameterized, which reduces the chance of overfitting.
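A tiny self-contained illustration of this behaviour (the exact zero pattern is random, so the output will vary): in training mode roughly a fraction p of the activations are zeroed and the survivors are scaled by 1/(1-p); in evaluation mode dropout does nothing.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()     # training mode: dropout active
print(drop(x))   # e.g. tensor([2., 0., 2., 2., 0., 0., 2., 0.]), a random pattern

drop.eval()      # evaluation mode: dropout disabled
print(drop(x))   # tensor([1., 1., 1., 1., 1., 1., 1., 1.])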

Implementation Details

1. A note on training: the model should not learn anything from the pad tokens appended to bring every sample to the same length, i.e. it should not learn an embedding for the "<pad>" token, because padding has nothing to do with the sentiment of a sentence. This means the pad token's embedding (word vector) stays at its initial value (we initialize it to all zeros below). We achieve this by passing the pad token's index to nn.Embedding as the padding_idx argument.

2. Because the bidirectional LSTM used in this experiment runs both a forward and a backward pass, the final hidden vector is the concatenation of a forward and a backward hidden state, so the input dimension of the following nn.Linear layer is twice the hidden dimension.

3. Before the embeddings are fed into the RNN, we "pack" them with nn.utils.rnn.pack_padded_sequence, which ensures the RNN only processes the non-padded tokens. The outputs we get back are packed_output (a packed sequence), the hidden state, and the cell state. Without the packing step, the returned hidden state and cell state would most likely come from a pad token at the end of the sentence; with packed sequences they are taken at the last non-padded element of each sentence.

4. We then use nn.utils.rnn.pad_packed_sequence to "unpack" the output back into an ordinary tensor. Note that the outputs at padding-token positions are zero tensors. Normally we only need to unpack the output if it is used later in the model; it is not required here, but we show the step anyway (a small toy sketch of packing and unpacking follows after this list).

5. The final hidden state, hidden, has shape [num layers * num directions, batch size, hid dim]. Since we only need the final forward and backward hidden states (of the top layer), we take the last two slices, hidden[-2,:,:] and hidden[-1,:,:], concatenate them, and pass the result to the linear layer.
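To make points 3 and 4 concrete, here is the toy sketch referred to above (toy sizes, separate from the model below): we pack a padded batch, run it through an LSTM, and unpack the result.

import torch
import torch.nn as nn

# two toy sequences with true lengths 3 and 2, already padded to length 3
# embedded shape: [sent len, batch size, emb dim] = [3, 2, 4]
embedded = torch.randn(3, 2, 4)
lengths = torch.tensor([3, 2])   # must be in decreasing order here (this is why we set sort_within_batch=True)

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths.to('cpu'))
rnn = nn.LSTM(input_size=4, hidden_size=5)
packed_output, (hidden, cell) = rnn(packed)

# unpack back into a padded tensor; positions beyond each true length are zeros
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
print(output.shape)    # [3, 2, 5]
print(output[2, 1])    # all zeros: time step 3 of the length-2 sequence
print(hidden.shape)    # [1, 2, 5], taken at each sequence's last real token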

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, pad_idx):
        
        super().__init__()
        # Embedding layer (word vectors)
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        # RNN variant -- bidirectional LSTM
        self.rnn = nn.LSTM(embedding_dim,  # input_size
                           hidden_dim,  # hidden_size
                           num_layers=n_layers,  # number of stacked layers
                           bidirectional=bidirectional,  # bidirectional or not
                           dropout=dropout)  # dropout between LSTM layers
        # Linear output layer
        self.fc = nn.Linear(hidden_dim * 2, output_dim)  # forward and backward hidden states are concatenated, hence * 2
        
        # Random removal of neurons
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
        
        # text shape: [sent len, batch size]
        
        embedded = self.dropout(self.embedding(text))
        
        # embedded shape: [sent len, batch size, emb dim]
        
        # pack sequence
        # lengths need to be on CPU!
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        # output shape: [sent len, batch size, hid dim * num directions]
        # the outputs at padding-token positions are zero tensors
        
        # hidden shape: [num layers * num directions, batch size, hid dim]
        # cell shape:   [num layers * num directions, batch size, hid dim]
        
        #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        #and apply dropout
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                
        # hidden shape: [batch size, hid dim * num directions]
            
        return self.fc(hidden)

2.5 Instantiating the model + passing in parameters

To make sure the pre-trained word vectors can be loaded into the model, EMBEDDING_DIM must equal the dimensionality of the pre-trained GloVe vectors.

INPUT_DIM = len(TEXT.vocab)  # 25002: the 25,000 most frequent words plus the <unk> and <pad> tokens
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]  # index of the <pad> token

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

View the number of model parameters

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

Next, we copy the pre-trained word vectors loaded earlier into the embedding layer of our model, replacing the randomly initialized weights with the pre-trained embeddings.

The pre-trained embedding matrix should have shape [vocab size, embedding dim]:

pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

Now replace the randomly initialized weights of the embedding layer with the pre-trained embedding vectors:
model.embedding.weight.data.copy_(pretrained_embeddings)

Since our <unk> and <pad> tokens are not in the pre-trained vocabulary, they were initialized via unk_init (an $\mathcal{N}(0,1)$ distribution) when the vocabulary was built. It is better to explicitly set them to zero, since they carry no sentiment information.

We do this by manually setting their rows of the embedding weight matrix to zero.

Note: as with initializing the embeddings, this should be done on weight.data, not weight!

# Set unknown and padding token to 0
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

We can now see that the first two rows of the embedding weight matrix are zero. Note that the pad token's word vector is never updated during training, whereas the unknown token's word vector is learned.

2.6 Training the Model

Now start training the model!

We change the optimizer from stochastic gradient descent (SGD) to Adam. SGD updates all parameters with the same learning rate that we choose, whereas Adam adapts the learning rate for each parameter, giving frequently updated parameters lower learning rates and infrequently updated parameters higher learning rates. More information about Adam (and other optimizers) can be found here.

To switch from SGD to Adam we simply change optim.SGD to optim.Adam; note also that we do not provide an initial learning rate for Adam, because PyTorch supplies a sensible default.

2.6.1 Setting the optimizer

import torch.optim as optim

optimizer = optim.Adam(model.parameters())

2.6.2 Setting the loss function and GPU

The other steps of the training model remain unchanged.

criterion = nn.BCEWithLogitsLoss()  # criterion is the conventional name for the loss function

model = model.to(device)
criterion = criterion.to(device)

2.6.3 Computing Accuracy

def binary_accuracy(preds, y):
    """Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8"""

    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc

2.6.4 Defining the Training Function

Because we set include_lengths = True, our batch.text is now a tuple: the first element is the numericalized tensor and the second is the actual length of each sequence. Before passing them to the model, we split the tuple into the separate variables text and text_lengths.

Note: since we are now using dropout, we must remember to call model.train() to make sure dropout is turned on during training.

def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()  # zero the gradients
        
        text, text_lengths = batch.text # batch.text returns a tuple (a digitized tensor, the length of each sentence)
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

2.6.5 Defining the Evaluation Function

Note: because we are using dropout, we must remember to call model.eval() to ensure dropout is turned off during evaluation.

def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text  #batch.text returns a tuple (a digitized tensor, the length of each sentence)
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We can also create a function that tells us how long each epoch takes.

import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

2.6.6 Running the Training Loop

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    # keep the parameters that give the best validation loss, to load later for prediction
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

2.6.7 Final test results

model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
Test Loss: 0.334 | Test Acc: 85.28%

2.7 Using the Model for Prediction

We can now use the model to predict the sentiment of any sentence we give it; note that the sentence should be a movie review.

When using the model for actual prediction, the model should always be in the evaluation mode.

The predict_sentiment function does the following:

  • Switch the model to evaluation mode
  • Tokenize the sentence
  • Convert each token to its index in the vocabulary
  • Get the length of the sentence
  • Convert the list of indexes into a tensor
  • Add a batch dimension with unsqueeze
  • Convert the length into a tensor
  • Squash the prediction into the range 0 to 1 with the sigmoid function
  • Convert the single-value tensor into a Python float with the item() method

Negative reviews return a value close to 0; positive reviews return a value close to 1.

import spacy
nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()

Examples of negative comments:

predict_sentiment(model, "This film is terrible")
0.05380420759320259

Examples of positive comments:

predict_sentiment(model, "This film is great")
0.94941645860672

Summary

We have now built a sentiment analysis model for movie reviews. In the next section, we will implement a model that achieves higher accuracy and trains faster with fewer parameters.
