1.1 Introduction

In this section, we will use PyTorch and TorchText to build a simple machine learning model that predicts the sentiment of a sentence (i.e. whether the sentence expresses a positive or negative opinion). This series of tutorials uses a movie review dataset: the IMDb dataset.

To get you into sentiment analysis quickly, this first part avoids difficult theory and does not worry about model performance; it simply sets up a small, working sentiment analysis example. In the parts that follow, we will improve this system step by step.

IMDb dataset source:

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

1.2 Data Preprocessing

One of TorchText's main concepts is the Field, which defines how data should be processed. Our dataset is a labeled dataset: each example consists of the raw review string and a sentiment label, with "pos" for positive sentiment and "neg" for negative sentiment.

The parameters of a Field specify how the data should be processed.

We use the TEXT field to define how the reviews should be processed, and the LABEL field to process the sentiment labels.

Our TEXT field takes tokenize='spacy' as an argument. This specifies that "tokenization" (the act of breaking a string up into discrete "tokens") should be done with the spaCy tokenizer. If the tokenize argument is not set, the default is to split the string on whitespace. We also need to set tokenizer_language to tell TorchText which spaCy model to use; we use the en_core_web_sm model.

To download the en_core_web_sm model: python -m spacy download en_core_web_sm
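
As a rough sketch (not part of the tutorial code), here is the difference between spaCy tokenization and a plain whitespace split, assuming the en_core_web_sm model has already been downloaded:

import spacy

# spaCy splits punctuation and contractions into separate tokens,
# whereas str.split() keeps them attached to the neighbouring word.
nlp = spacy.load('en_core_web_sm')

sentence = "This film isn't great, is it?"
print([token.text for token in nlp(sentence)])
# e.g. ['This', 'film', 'is', "n't", 'great', ',', 'is', 'it', '?']
print(sentence.split())
# ['This', 'film', "isn't", 'great,', 'is', 'it?']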

LABEL is defined by LabelField, a special subclass of the Field class dedicated to processing labels. We'll explain the dtype parameter later.

For more information about Field, visit here.

import torch
from torchtext.legacy import data

# Set the random seed so that the random numbers are reproducible
SEED = 1234

# Set seed
torch.manual_seed(SEED)
# If this flag is set to True, cuDNN only uses deterministic convolution algorithms. Together with a fixed random seed, this makes the network produce the same output for the same input on every run.
torch.backends.cudnn.deterministic = True  

# Define the fields for the review text and the labels
TEXT = data.Field(tokenize = 'spacy', tokenizer_language = 'en_core_web_sm')
LABEL = data.LabelField(dtype = torch.float)

Another handy feature of TorchText is its support for common data sets used in natural language processing (NLP).

The following code automatically downloads the IMDb dataset and splits it into the canonical training and test sets, as torchtext.datasets objects. It processes the data using the Fields we defined earlier. The IMDb dataset contains 50,000 movie reviews, each labeled as positive or negative.

from torchtext.legacy import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

Check out our training set and test set sizes:

print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Take a look at the sample data:

print(vars(train_data.examples[0]))

The IMDb dataset comes pre-split into a training set and a test set; we still need to create a validation set, which we can do with the .split() method.

By default, the data is split 70% / 30% into a training set and a validation set. You can change this ratio with the split_ratio argument: split_ratio=0.8 means that 80% of the examples make up the training set and 20% make up the validation set.

Here we also pass the random seed SEED we set earlier to the random_state argument, ensuring that we get the same training/validation split every time.

import random

train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.seed(SEED))

Now let's see how many examples we have in the training, validation, and test sets:

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Next, we must build a vocabulary. This is a lookup table where each word in the dataset has a unique index (integer) corresponding to it.

We do this because our model cannot operate on strings, only numbers. Each index is used to construct a one-hot vector for each word; the dimensionality of that vector is the number of unique words in the vocabulary, usually denoted by V.

The number of distinct words in our training set is over 100,000, which means our one-hot vectors would have over 100,000 dimensions. That would greatly prolong training and might not even be feasible to run locally.

There are two ways to shrink our one-hot vectors: keep only the top n most frequent words, or ignore words that appear fewer than m times. Here we use the first approach, keeping the 25,000 most common words.
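
As a tiny illustration (a sketch with a toy vocabulary, not part of the tutorial code), this is what a one-hot vector looks like:

import torch
import torch.nn.functional as F

# Toy vocabulary of 5 tokens; the real vocabulary built below has 25,002 entries.
vocab = ['<unk>', '<pad>', 'film', 'great', 'love']
index = torch.tensor(vocab.index('great'))          # integer index of the word
one_hot = F.one_hot(index, num_classes=len(vocab))  # all zeros except position 3
print(one_hot)  # tensor([0, 0, 0, 1, 0])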

There is one problem: some words appear in the data but are not in the vocabulary, so they cannot be one-hot encoded directly. We encode them with a special unknown token, <unk>. For example, if our sentence is "This film is great and I love it" but the word "love" is not in the vocabulary, we convert the sentence to "This film is great and I <unk> it".

Now we build the vocabulary, keeping only the most common max_size tokens.

MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

Why build the vocabulary only on the training set? Because when testing a model, no information from the test set may influence it in any way. The validation set is excluded for the same reason: we want the validation set to mirror the test set as closely as possible.

print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Why does the vocabulary have 25,002 entries rather than 25,000? Because two additional tokens are added: <unk> and <pad>.

When we feed sentences into our model, we feed one _batch_ at a time, and all sentences in a batch need to have the same length. So a maximum length is determined for each batch, and any sentence shorter than that is padded with the <pad> token until it reaches that length.
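
As a rough sketch (with made-up token indices, not part of the tutorial code), padding a batch of two sentences of different lengths works like this:

import torch

# Two tokenized sentences of different lengths, using hypothetical vocabulary indices.
# In TorchText's default vocabulary <unk> has index 0 and <pad> has index 1.
sent_a = [12, 45, 7, 9, 3]   # length 5
sent_b = [8, 22, 4]          # length 3
pad_idx = 1
max_len = max(len(sent_a), len(sent_b))

padded = [s + [pad_idx] * (max_len - len(s)) for s in (sent_a, sent_b)]
batch = torch.tensor(padded).T  # shape [sent len, batch size], matching TorchText's layout
print(batch.shape)              # torch.Size([5, 2])

TorchText's iterators handle this padding for us automatically.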

We can also view the most common words in the vocabulary and how many times they appear in the dataset.

print(TEXT.vocab.freqs.most_common(20))

You can also use stoi (string-to-int) or itos (int-to-string) to inspect the vocabulary; here we print the first 10 tokens of TEXT.vocab.

print(TEXT.vocab.itos[:10])
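
The reverse mapping, stoi, is a dictionary from token string to index. A quick sketch of looking a few tokens up (the printed indices depend on the dataset's word frequencies):

print(TEXT.vocab.stoi['<pad>'])  # index of the padding token
print(TEXT.vocab.stoi['film'])   # index of a common word in movie reviews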

The final step of preparing the data is to create the iterators: one each for the training, validation, and test sets. Each iteration returns a batch of data.

We'll use a BucketIterator, a special type of iterator that groups examples of similar length into each batch, minimizing the amount of padding per example.

If you have a GPU, you can place the tensors returned by the iterator on it. torch.device lets us specify whether tensors should live on the GPU or the CPU.

BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

1.3 Model Building

The next step is to build the model we will train and evaluate.

When using an RNN in PyTorch, we do not use the nn.RNN class on its own; instead, we create the model as a subclass of nn.Module.

In __init__ we define the layers of the model. There are three: an embedding layer, an RNN layer, and finally a fully connected layer. The parameters of all layers are initialized randomly unless explicitly set otherwise. The embedding layer converts sparse one-hot vectors into dense embedding vectors; it is essentially a simple single fully connected layer. This also reduces the dimensionality of the input to the RNN and the amount of computation required. One theory is that words that have a similar effect on the sentiment of a review are mapped close together in this vector space. More information can be found here.
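
To make the "lookup table" idea concrete, here is a minimal sketch (independent of the model below) of nn.Embedding turning integer indices into dense vectors:

import torch
import torch.nn as nn

# A toy embedding: a vocabulary of 10 tokens, each mapped to a 4-dimensional dense vector.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

token_indices = torch.tensor([2, 5, 7])  # three token indices
dense = embedding(token_indices)         # looks up rows 2, 5 and 7 of the weight matrix
print(dense.shape)                       # torch.Size([3, 4])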

The RNN layer takes the previous hidden state h_{t-1} and the dense embedding vector of the current input word, and uses both to compute the hidden state for the current time step, h_t (by default, PyTorch's nn.RNN uses a tanh nonlinearity for this update).

Finally, the linear layer takes the final hidden state of the RNN, h_T, and transforms it via f(h_T) into an output of shape [batch size, output dim]. The forward method is called whenever we feed training, validation, or test data into the model; it passes the data through the layers and returns the model's output.

In each batch, text is a tensor of size _[sentence length, batch size]_; it is the compact, index-based form of the one-hot vectors for each sentence.

Each word is represented by its index in the vocabulary, and the one-hot vector representation of each sentence could be reconstructed from these index values.

Each input batch is passed through the embedding layer to obtain a dense vector representation of every sentence. The tensor size after embedding is _[sentence length, batch size, embedding dim]_.

In some frameworks the initial hidden state h_0 must be initialized explicitly, but not in PyTorch, which defaults it to all zeros. The RNN returns two tensors, output and hidden. output has size _[sentence length, batch size, hidden dim]_ and hidden has size _[1, batch size, hidden dim]_. output contains the hidden state at every time step, while hidden is just the final hidden state. Note: the squeeze method removes dimensions of size 1.

In practice we only use hidden and ignore output. The final prediction is produced by passing hidden through the linear layer fc.

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):

        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))

Now we can create an instance of our RNN.

The input dimension is the dimension of the one-hot vectors, which equals the size of the vocabulary.

The embedding dimension is a configurable hyperparameter, usually set to 50-250 dimensions, depending to some extent on the size of the vocabulary.

The hidden dimension is the size of the hidden states, usually set to 100-500 dimensions, depending on the vocabulary size and the complexity of the task.

The dimension of the output is the number of categories to classify.

INPUT_DIM = len(TEXT.vocab) # vocabulary size
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

We can also print the number of trainable parameters to inspect the model.

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

1.4 Training the Model

Before training, we first set up the optimizer. Here we choose SGD (stochastic gradient descent); model.parameters() supplies the parameters to be updated, and lr is the learning rate.

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

Next, the loss function is defined. BCEWithLogitsLoss is generally used for binary classification.

criterion = nn.BCEWithLogitsLoss()
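
Under the hood, BCEWithLogitsLoss applies a sigmoid to the raw model output (the logit) and then computes binary cross-entropy, in a single numerically stable step. A small sketch of that equivalence (illustrative only, not part of the tutorial code):

import torch
import torch.nn as nn

logits = torch.tensor([0.8, -1.2, 2.5])  # raw model outputs, one per example
labels = torch.tensor([1.0, 0.0, 1.0])

loss_fused = nn.BCEWithLogitsLoss()(logits, labels)
loss_manual = nn.BCELoss()(torch.sigmoid(logits), labels)
print(loss_fused.item(), loss_manual.item())  # the two values match up to floating-point error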

With .to, we can place the model and the loss function on the GPU (if one is available).

model = model.to(device)
criterion = criterion.to(device)

The criterion computes the loss value; we also need a function to compute accuracy.

We pass the raw predictions through a sigmoid and round the result to the nearest integer: values greater than 0.5 become 1 (positive), and values below 0.5 become 0 (negative).

We then count how many rounded predictions match the labels and divide by the batch size to obtain the accuracy.

def binary_accuracy(preds, y):
    """Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8"""

    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))  # round
    correct = (rounded_preds == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc

The train function iterates over all samples, one batch at a time.

model.train() puts the model in "training mode", which enables dropout and batch normalization. Within each batch, we first zero the gradients: every model parameter has a grad attribute that stores the gradient computed by the loss function, and PyTorch does not automatically clear (or "zero out") the gradients from the previous step, so they must be reset manually.

We then feed batch.text into the model. Note that we simply call the model; there is no need to call forward explicitly.

Gradients are computed with loss.backward(), and optimizer.step() updates the parameters.

The loss and accuracy are accumulated over the whole epoch; .item() extracts the Python number from a tensor that contains only a single value.

Finally, we return the loss and accuracy averaged over the whole epoch; len(iterator) gives the number of batches in the epoch.

Note that the labels must be FloatTensors rather than LongTensors when computing the loss; TorchText makes tensors LongTensors by default, which is why we defined LABEL with dtype=torch.float.

def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.text).squeeze(1) # Get a prediction
        
        loss = criterion(predictions, batch.label) # Loss calculation
        
        acc = binary_accuracy(predictions, batch.label) # Calculation accuracy
        
        loss.backward() # Calculate the gradient of backpropagation
        
        optimizer.step() # update parameters
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)  # return the average loss and accuracy across the epoch

evaluate is similar to train; we only need to modify the train function slightly.

model.eval() puts the model in "evaluation mode", which turns off dropout and batch normalization.

Inside a torch.no_grad() block, no gradients are computed, which uses less memory and speeds up computation.

The rest mirrors train, except that optimizer.zero_grad(), loss.backward(), and optimizer.step() are removed, since the parameters no longer need to be updated.

def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()

    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Next, create a function that calculates how much time each epoch will consume.

import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We then train the model over multiple epochs, where each epoch is a complete pass through all the examples in the training and validation sets.

After each epoch, if the validation loss is the best we have seen so far, we save the model's parameters; once training is complete, we use that saved model on the test set.

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

As shown above, the loss does not decrease much and the accuracy is poor. This is due to several issues with the model that we will improve on, using this as a baseline, in the next notebook.

Finally, to get the metrics we actually care about, the loss and accuracy on the test set, we load the saved parameters from the model that achieved the best validation loss.

model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

1.5 Summary

In the next article, there will be the following optimizations:

  • Packed padded sequences
  • Pre-trained word embeddings
  • Different RNN architectures
  • Bidirectional RNN
  • Multi-layer RNN
  • Regularization
  • A different optimizer

Together, these improvements will raise the accuracy to around 84%.