6: Using Transformers for Sentiment Analysis

In this notebook, we will use the Transformer model first introduced in the Attention Is All You Need paper. Specifically, we will use the BERT model from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

The Transformer model is much larger than any of the other models covered in this tutorial, so we will use the Transformers library to load a pre-trained transformer and use it as our embedding layer. We will freeze the transformer (rather than train it) and only train the rest of the model, which learns from the representations the transformer produces. In this case, we will extract features from the BERT embeddings with a bidirectional GRU, and a fully connected layer on top will produce the final output.

6.1 Data Preparation

First, as usual, we import the libraries and set the random seeds for reproducibility.

import torch

import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

The transformer has been trained with a specific vocabulary, which means we need to use exactly the same vocabulary and tokenize our data the same way the transformer was originally trained.

Fortunately, the Transformers library provides a tokenizer for every transformer model it ships. In this case we use the uncased BERT model (that is, every word is lowercased). We do this by loading the pre-trained 'bert-base-uncased' tokenizer.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The tokenizer has a vocab attribute that contains the actual vocabulary we will use. We can check how many tokens it contains by checking its length.

len(tokenizer.vocab)
30522

tokenizer.tokenize splits a string into tokens and lowercases it at the same time.

tokens = tokenizer.tokenize('Hello WORLD how ARE yoU? ')

print(tokens)
['hello', 'world', 'how', 'are', 'you', '?']

We can numericalize the tokens with our vocabulary using tokenizer.convert_tokens_to_ids. The tokens below are the tokenized, lowercased list we produced above.

indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)
[7592, 2088, 2129, 2024, 2017, 1029]

The transformer was also trained with special tokens that mark the beginning and end of a sequence, along with standard padding and unknown tokens. We can get all of these from the tokenizer as well.

Note: the tokenizer does have beginning-of-sequence and end-of-sequence attributes (bos_token and eos_token), but these are not set and are not the tokens this transformer was trained with.

init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)
[CLS] [SEP] [PAD] [UNK]

We can get the indexes of the special tokens by converting them using the vocabulary.

init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)
101 102 0 100

Or get them directly from the tokenizer.

init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)
101 102 0 100
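As an aside, the tokenizer can also insert these special tokens for us. The snippet below is a minimal sketch of ours (not how the pipeline in this notebook works, where the TEXT field adds the tokens via the indexes above): encoding a sentence with add_special_tokens=True wraps it in the [CLS] and [SEP] indexes.

print(tokenizer.encode('hello world', add_special_tokens=True))
# expected: [101, 7592, 2088, 102]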

Another thing we need to handle is that the model was trained on sequences with a defined maximum length: it does not know how to handle sequences longer than the ones it was trained on. We can get the maximum length of these inputs by checking max_model_input_sizes for the version of the transformer we want to use.

max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

print(max_input_length)
512

Previously we used the spaCy tokenizer to tokenize our examples. Now we need to define a function that we will pass to our TEXT field, which will handle all of the tokenization for us. It also cuts the number of tokens down to the maximum length. Note that our maximum length is 2 less than the actual maximum length, because we need to append two special tokens to each sequence, one at the beginning and one at the end.

def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    return tokens
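As a quick sanity check (reusing the example sentence from earlier), the helper tokenizes, lowercases, and truncates in one step:

print(tokenize_and_cut('Hello WORLD how ARE yoU?'))
# expected: ['hello', 'world', 'how', 'are', 'you', '?']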

Now we define our fields. The transformer expects the batch dimension to come first, so we set batch_first = True. Since the vocabulary for the text is already provided by the transformer, we set use_vocab = False to tell TorchText that it does not need to build one. We pass our tokenize_and_cut function as the tokenizer. The preprocessing argument is a function applied after tokenization; this is where we convert the tokens to their indexes. Finally, we define the special tokens, noting that they are given as their index values rather than their string values, i.e. 100 instead of "[UNK]", because the sequences will already have been converted to indexes.

We define the label field as before.

from torchtext.legacy import data

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)

Load the data and split the training data into training and validation sets.

from torchtext.legacy import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

train_data, valid_data = train_data.split(random_state = random.seed(SEED))
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")
Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000

Let's look at a random example and check that the text has already been numericalized: we print the list of token indexes for one of the reviews.

print(vars(train_data.examples[6]))
{'text': [1042, 4140, 1996, 2087, 2112, 1010, 2023, 3185, 5683, 2066, 1037, 1000, 2081, 1011, 2005, 1011, 2694, 1000, 3947, 1012, 1996, 3257, 2003, 10654, 1011, 28273, 1010, 1996, 3772, 1006, 2007, 1996, 6453, 1997, 5965, 1043, 11761, 2638, 1007, 2003, 2058, 13088, 10593, 2102, 1998, 7815, 2100, 1012, 15339, 14282, 1010, 3391, 1010, 18058, 2014, 3210, 2066, 2016, 1005, 1055, 3147, 3752, 2068, 2125, 1037, 16091, 4003, 1012, 2069, 2028, 2518, 3084, 2023, 2143, 4276, 3666, 1010, 1998, 2008, 2003, 2320, 10012, 3310, 2067, 2013, 1996, 1000, 7367, 11368, 5649, 1012, 1000, 2045, 2003, 2242, 14888, 2055, 3666, 1037, 2235, 2775, 4028, 2619, 1010, 1998, 2023, 3185, 2453, 2022, 2062, 2084, 2070, 2064, 5047, 2074, 2005, 2008, 3114, 1012, 2009, 2003, 7078, 5923, 1011, 27017, 1012, 2023, 2143, 2069, 2515, 2028, 2518, 2157, 1010, 2021, 2009, 21145, 2008, 2028, 2518, 2157, 2041, 1997, 1996, 2380, 1012, 4276, 3773, 2074, 2005, 1996, 2197, 2184, 2781, 2030, 2061, 1012], 'label': 'neg'}

We can convert these indexes back to readable tokens using convert_ids_to_tokens.

tokens = tokenizer.convert_ids_to_tokens(vars(train_data.examples[6])['text'])

print(tokens)
['f', '##ot', 'the', 'most', 'part', ',', 'this', 'movie', 'feels', 'like', 'a', '"', 'made', '-', 'for', '-', 'tv', '"', 'effort', '.', 'the', 'direction', 'is', 'ham', '-', 'fisted', ',', 'the', 'acting', '(', 'with', 'the', 'exception', 'of', 'fred', 'g', '##wyn', '##ne', ')', 'is', 'over', '##wr', '##ough', '##t', 'and', 'soap', '##y', '.', 'denise', 'crosby', ',', 'particularly', ',', 'delivers', 'her', 'lines', 'like', 'she', "'", 's', 'cold', 'reading', 'them', 'off', 'a', 'cue', 'card', '.', 'only', 'one', 'thing', 'makes', 'this', 'film', 'worth', 'watching', ',', 'and', 'that', 'is', 'once', 'gage', 'comes', 'back', 'from', 'the', '"', 'se', '##met', '##ary', '.', '"', 'there', 'is', 'something', 'disturbing', 'about', 'watching', 'a', 'small', 'child', 'murder', 'someone', ',', 'and', 'this', 'movie', 'might', 'be', 'more', 'than', 'some', 'can', 'handle', 'just', 'for', 'that', 'reason', '.', 'it', 'is', 'absolutely', 'bone', '-', 'chilling', '.', 'this', 'film', 'only', 'does', 'one', 'thing', 'right', ',', 'but', 'it', 'knocks', 'that', 'one', 'thing', 'right', 'out', 'of', 'the', 'park', '.', 'worth', 'seeing', 'just', 'for', 'the', 'last', '10', 'minutes', 'or', 'so', '.']

Although we have handled the vocabulary for the text, we still need to build the vocabulary for the labels.

LABEL.build_vocab(train_data)
print(LABEL.vocab.stoi)
defaultdict(None, {'neg': 0, 'pos': 1})

As before, we create the iterators. In our experience, the largest batch size that fits in memory works best for transformers, but you can experiment with other batch sizes if you have a capable graphics card.

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

6.2 Model Building

Next, we load the pre-trained model.

from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

Next, we will define our actual model.

Instead of using an embedding layer to get embeddings for our text, we will use the pre-trained transformer model. These embeddings are then fed into a GRU to produce a prediction of the sentiment of the input sentence. We get the embedding dimension size (called hidden_size) from the transformer through its config attribute. The rest of the initialization is standard.

In the forward pass, we wrap the transformer in torch.no_grad() to ensure no gradients are computed over this part of the model. The transformer actually returns the embeddings for the whole sequence as well as a pooled output. The BERT model documentation states that the pooled output "is usually not a good summary of the semantic content of the input; you are usually better off averaging or pooling the sequence of hidden states across the input sequence", so we will not use it. The rest of the forward pass is a standard implementation of a recurrent model, where we take the hidden state over the final time step and pass it through a linear layer to get our predictions.
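Before writing the model, it can help to sanity-check what the transformer returns. The snippet below is a sketch of ours (the fake batch and variable names are illustrative, not part of the original tutorial): BERT returns per-token hidden states of size hidden_size plus a pooled [CLS] representation.

with torch.no_grad():
    fake_batch = torch.randint(0, len(tokenizer.vocab), (2, 10))  # two "sentences" of 10 token ids
    sequence_output, pooled_output = bert(fake_batch)[:2]

print(bert.config.hidden_size)  # 768
print(sequence_output.shape)    # torch.Size([2, 10, 768]) -> [batch size, sent len, emb dim]
print(pooled_output.shape)      # torch.Size([2, 768]), the pooled output we will not use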

import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self, bert, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [batch size, sent len]
                
        with torch.no_grad():
            embedded = self.bert(text)[0]
                
        #embedded = [batch size, sent len, emb dim]
        
        _, hidden = self.rnn(embedded)
        
        #hidden = [n layers * n directions, batch size, hid dim]
        
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
                
        #hidden = [batch size, hid dim]
        
        output = self.out(hidden)
        
        #output = [batch size, out dim]
        
        return output

We create an instance of our model using standard hyperparameters.

HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

We can check how many parameters the model has. Our standard models had under 5M parameters, but this one has 112M. Luckily, 110M of those parameters come from the transformer, and we will not be training them.

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 112,241,409 trainable parameters

In order to freeze these parameters (and not train them), we need to set their requires_grad attribute to False. To do this, we simply loop through all of the named_parameters in the model and, if they belong to the bert transformer, we set requires_grad = False. (If we wanted to fine-tune the transformer instead, we would leave requires_grad as True.)

for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False
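As a quick check (a sketch of ours, not from the original tutorial), we can count how many parameters are now frozen inside the BERT module; the number should equal the difference between the two totals reported in this section (112,241,409 - 2,759,169).

frozen = sum(p.numel() for p in model.bert.parameters() if not p.requires_grad)
print(f'{frozen:,} parameters in the BERT module are frozen')
# expected: 109,482,240 parameters in the BERT module are frozen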

We can now see that our model has under 3M trainable parameters, making it almost comparable to the FastText model. However, the text still has to be propagated through the transformer, which makes training take considerably longer.

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 2,759,169 trainable parameters

We can double-check the names of the trainable parameters to make sure they are sensible. As we can see, they are all parameters of the GRU (rnn) and the linear layer (out).

for name, param in model.named_parameters():                
    if param.requires_grad:
        print(name)
rnn.weight_ih_l0
rnn.weight_hh_l0
rnn.bias_ih_l0
rnn.bias_hh_l0
rnn.weight_ih_l0_reverse
rnn.weight_hh_l0_reverse
rnn.bias_ih_l0_reverse
rnn.bias_hh_l0_reverse
rnn.weight_ih_l1
rnn.weight_hh_l1
rnn.bias_ih_l1
rnn.bias_hh_l1
rnn.weight_ih_l1_reverse
rnn.weight_hh_l1_reverse
rnn.bias_ih_l1_reverse
rnn.bias_hh_l1_reverse
out.weight
out.bias

6.3 Training Model

As usual, we define our optimizer and the evaluation criterion (loss function). The task is again binary classification, so we use binary cross-entropy with logits.

import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

Place the model and criterion (loss function) onto the GPU, if you have one.

model = model.to(device)
criterion = criterion.to(device)

Next, we define a function to calculate accuracy, the train and evaluate functions, and a helper to measure how long each training/evaluation epoch takes.

def binary_accuracy(preds, y):
    """Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8"""

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    #convert into float for division
    correct = (rounded_preds == y).float()
    acc = correct.sum() / len(correct)
    return acc
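A quick sanity check of binary_accuracy with made-up logits (the numbers below are ours): the first two predictions round to the correct labels and the third does not, so the accuracy is 2/3.

preds = torch.tensor([0.9, -1.2, 2.3])  # raw logits, passed through sigmoid inside the function
labels = torch.tensor([1., 0., 0.])
print(binary_accuracy(preds, labels))   # tensor(0.6667)
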
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we train our model. This takes much longer than any of the previous models due to the size of the transformer. Even though we are not training any of the transformer's parameters, we still need to pass the data through the model, which takes a lot of time on a standard GPU.

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
        
    end_time = time.time()
        
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
Epoch: 01 | Epoch Time: 7m 13s
    Train Loss: 0.502 | Train Acc: 74.41%
     Val. Loss: 0.270 |  Val. Acc: 89.15%
Epoch: 02 | Epoch Time: 7m 7s
    Train Loss: 0.281 | Train Acc: 88.49%
     Val. Loss: 0.224 |  Val. Acc: 91.32%
Epoch: 03 | Epoch Time: 7m 17s
    Train Loss: 0.239 | Train Acc: 90.67%
     Val. Loss: 0.211 |  Val. Acc: 91.91%
Epoch: 04 | Epoch Time: 7m 14s
    Train Loss: 0.206 | Train Acc: 91.81%
     Val. Loss: 0.206 |  Val. Acc: 92.01%
Epoch: 05 | Epoch Time: 7m 15s
    Train Loss: 0.188 | Train Acc: 92.63%
     Val. Loss: 0.211 |  Val. Acc: 91.92%

We load the parameters that gave the best validation loss and evaluate them on the test set, where we achieve our best results so far.

model.load_state_dict(torch.load('tut6-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
Test Loss: 0.209 | Test Acc: 91.58%

6.4 Model Verification

We will now use the model to predict the sentiment of some sentences. We tokenize the input sequence, trim it to the maximum length, add the special tokens to either side, convert it to a tensor, add a batch dimension using unsqueeze, and then pass it to our model. A prediction close to 0 indicates negative sentiment; close to 1 indicates positive sentiment.

def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()
predict_sentiment(model, tokenizer, "This film is terrible")
0.03391794115304947
predict_sentiment(model, tokenizer, "This film is great")
0.8869886994361877
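You can also try reviews of your own; we do not show a score here because the exact value depends on your trained weights.

predict_sentiment(model, tokenizer, "I expected more from this film, but the ending saved it")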