The data used in this example is an English text data set with three categories. TorchText is used to process the data and construct iterators, a TextCNN model is built and trained on the data, and the training results are obtained. No validation set is used to evaluate the model in this example.

I. Development environment and data set

1. Development environment

Ubuntu 16.04.6

Python: 3.7

PyTorch: 1.8.1

Torchtext: 0.9.1

2. Data sets

Data set: train_data_sentiment (extraction code: GW77)

II. Use TorchText to process the data set

1. Import the necessary libraries

# import common libraries
import torch
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F
import torchtext
# newer versions of torchtext (0.9+) moved the old API into torchtext.legacy.data; older versions use torchtext.data
from torchtext.legacy.data import TabularDataset 
import warnings
warnings.filterwarnings("ignore")
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu") # a GPU that happened to be free on our lab server; adjust the device index for your machine

2. Import and view the data set

Import the data set. This step is only to show what the data looks like; the CSV file can also be processed directly later, when the TorchText Dataset is constructed.
train_data = pd.read_csv('train_data_sentiment.csv')
train_data

3. Use TorchText to process data sets

TorchText data processing mainly consists of defining Fields, defining Datasets, and building iterators. It makes text preprocessing very convenient: tokenization, truncation and padding, vocabulary construction, and so on. If you are not familiar with TorchText, take a look at the official documentation or an introductory blog post.

3.1. Define Field

TEXT = torchtext.legacy.data.Field(sequential=True, lower=True, fix_length=30)  # the default tokenizer is str.split()
LABEL = torchtext.legacy.data.Field(sequential=False, use_vocab=False)
  • sequential: whether the data is represented as a sequence. If False, no tokenization is applied. Default: True.
  • lower: whether to convert the data to lowercase. Default: False.
  • fix_length: when building iterators, every piece of text is truncated or padded (with pad_token) to this length. Default: None.
  • use_vocab: whether to use a Vocab object. If False, the data must already be numeric. Default: True.
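As a quick illustration of these settings, here is a minimal sketch (an assumption about usage, not part of the original post): in the legacy API, Field.preprocess applies the tokenizer and lowercasing to a single string; padding to fix_length only happens later, when batches are built.

print(TEXT.preprocess("TorchText makes Preprocessing EASY"))
# roughly: ['torchtext', 'makes', 'preprocessing', 'easy']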

3.2. Define Dataset

TabularDataset can easily read data files in CSV, JSON, or TSV formats.

train_x = TabularDataset(path='train_data_sentiment.csv', format='csv', skip_header=True,
                         fields=[('utterance', TEXT), ('label', LABEL)])
  • skip_header = True: do not read the first row (the column names) as data.
  • fields must be given in the same order as the columns in the original data.

You can see that the text data has already been tokenized.
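For instance, printing the first sample shows the token list (the same check appears, commented out, in the complete code at the end of this post):

print(train_x[0].utterance)   # a list of lowercased tokens
print(train_x[0].label)       # the corresponding label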

3.3. Construct the word list and load the pre-trained word vector

Since computers cannot understand raw text, we need to convert the text data into numbers or vectors before it can be fed into TextCNN or another deep neural network for training. First, the text is converted into word indices by building a vocabulary. Then, after a pre-trained word vector is loaded, each word corresponds to a vector. Finally, during model training we use the word embedding matrix, i.e. the embedding layer. At that point every word has been converted into a vector that can be fed into the model for training.

I want to explain the word embedding matrix here, because it took me a long time to understand it when I was learning. Once the vocabulary has been built, a sentence can be represented by indices: for example, with the vocabulary we build, "it is you" can be expressed as 10, 9 and 3. We could feed these indices directly into the network for training, but an index carries very little information. To obtain better features, we usually use word2vec or GloVe vectors; GloVe vectors are used in this post, so that the features of each word are represented much better, which helps the network train. Each word in "it is you" is then represented by a 300-dimensional vector, which gives the network far more information about the word. In short, the word embedding matrix is obtained in three steps: (1) build the vocabulary, i.e. word -> index; (2) load the pre-trained word vectors, i.e. word -> vector; (3) combine the two to obtain the word embedding matrix, i.e. index -> vector.
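To make the index-to-vector idea concrete, here is a toy sketch using the torch and nn imports from above (random vectors and made-up indices, not the real vocabulary we build below): an embedding matrix is just a lookup table from word indices to word vectors.

# toy illustration only: pretend the vocabulary maps 'it'->10, 'is'->9, 'you'->3
pretrained = torch.randn(20, 300)                 # one 300-d vector per word index, i.e. the embedding matrix
embedding = nn.Embedding.from_pretrained(pretrained)
sentence = torch.tensor([10, 9, 3])               # "it is you" expressed as word indices
print(embedding(sentence).shape)                  # torch.Size([3, 300]): each index is replaced by its 300-d vector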

# build the vocabulary
TEXT.build_vocab(train_x)  # builds a vocabulary of 10440 words, indexed 0 to 10439
for w, i in TEXT.vocab.stoi.items():
    print(w, i)

# the GloVe vectors are downloaded automatically the first time they are used (you can also download them yourself);
# glove.6B.300d covers 400,000 words, each represented by a 300-dimensional vector
TEXT.vocab.load_vectors('glove.6B.300d', unk_init=torch.Tensor.normal_)  # words that appear in the data but not in GloVe are randomly initialized with a 300-d vector

We can check the vectors in the constructed word embedding matrix. Here we look at the word with index 3 in the vocabulary, namely "you"; it is of course represented by a 300-dimensional vector, of which only part is shown here.

For comparison, let's look up the vector for "you" directly in the GloVe vectors (again only partially shown). The two match, which shows that we can obtain a word's vector through its index; that is exactly what the word embedding matrix means.
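A minimal way to reproduce these two checks is sketched below; glove_model refers to the KeyedVectors object loaded in the complete code at the end of the post, so treat that line as optional.

# look up "you" through the vocabulary: index -> vector
idx = TEXT.vocab.stoi['you']              # 3 in this run
print(TEXT.vocab.vectors[idx][:5])        # first few of its 300 dimensions
# optional: compare with the vector looked up directly in GloVe
# print(glove_model['you'][:5])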

# Look at the word vector dimension
print(TEXT.vocab.vectors.shape) #torch.Size([10440, 300])

As you can see, our data contains a total of 10440 distinct words, each of which is represented by a 300-dimensional vector. The iterator can now be built.

3.4. Build iterators

TorchText provides two iterator classes: Iterator and BucketIterator.

  • Iterator: builds batches in the same order as the raw data.
  • BucketIterator: groups examples of similar length into the same batch, which reduces the amount of padding needed (see the sketch after this list).
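For reference, here is a sketch of how a BucketIterator could be built for this data set; the sort_key is my own assumption (sort by the tokenized length of the utterance field) and is not used later in this post.

# sketch only: BucketIterator groups samples of similar length to minimize padding
bucket_iter = torchtext.legacy.data.BucketIterator(
    dataset=train_x, batch_size=64,
    sort_key=lambda ex: len(ex.utterance),   # sort by number of tokens in the utterance
    shuffle=True, sort_within_batch=False, repeat=False, device=device)

Since fix_length=30 pads every sample to the same length anyway, the plain Iterator is sufficient for this example.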

I set batch_size = 64, which gives 9989 // 64 + 1 = 157 batches: there are 9989 samples in total, each batch holds 64 of them, and 9989 divided by 64 is 156 with a remainder of 5, so the remaining 5 samples form one final batch.
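The arithmetic can be checked directly:

print(divmod(9989, 64))   # (156, 5): 156 full batches plus 5 leftover samples
print(9989 // 64 + 1)     # 157 batches in total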

batch_size = 64
train_iter = torchtext.legacy.data.Iterator(dataset=train_x, batch_size=batch_size, shuffle=True, sort_within_batch=False, repeat=False, device=device)
len(train_iter) # 157
  • Shuffle: Indicates whether to shuffle data
  • Sort_within_batch: specifies whether to sort data within each batch
  • Repeat: Whether batch data is iterated repeatedly in different epochs

Look at the iterators built and the internal data representation:

# View the iterator built
list(train_iter)

# check the shape of each batch of data
for batch in train_iter:
    print(batch.utterance.shape)

We can see that each batch contains 64 samples (batch_size = 64), except for the last one, and that every sample consists of 30 words, so each batch has shape [30, 64]. The remaining 5 samples form the final batch, with shape [30, 5].

# view the first sample: column 0 is the first of the 64 samples in the batch (column 63 is the 64th), and each sample consists of 30 words.
# The entries that are not 1 are the vocabulary indices of the sample's words; the trailing 1s are padding.
batch.utterance[:, 0]

# check the index values of the words in the first sample, then map them back to words
list_a = []
for i in batch.utterance[:, 0]:
    if i.item() != 1:          # skip the padding index
        list_a.append(i.item())
print(list_a)
for i in list_a:
    print(TEXT.vocab.itos[i], end=' ')

# view one batch from the iterator together with its corresponding text
l = []
for batch in list(train_iter)[:1]:
    for i in batch.utterance:
        l.append(i[0].item())
    print(l)
    print(' '.join([TEXT.vocab.itos[i] for i in l]))

Now that the data is processed, let’s move on to TextCNN.

III. TextCNN basics and building the network in PyTorch

1. TextCNN basics

TextCNN is similar to the CNNs we know from image processing, except that in image CNNs the convolution kernels are usually of size k * k, whereas in NLP the kernels are usually of size k * embedding_size, where embedding_size is the dimension of the word vector used to represent each word. For example, the sentence in the figure below consists of three words, each represented by a 3-dimensional word vector. If we choose two convolution kernels of size 2*3, we get the results shown, and the pooled results are then concatenated. As can be seen from the figure, convolving a sentence with one kernel produces a result of shape [len(sentence) - k + 1, 1], where len(sentence) is the number of words in the sentence. After max pooling, each kernel contributes a single value, giving a [2, 1] result, which is what gets concatenated.
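To make the shapes concrete, here is a small sketch (a toy example with random numbers, not the model built later): a "sentence" of 30 words with 300-dimensional embeddings is passed through one Conv1d kernel of size k = 3 and then max-pooled over the 30 - 3 + 1 = 28 outputs.

import torch
import torch.nn as nn

k, embedding_size, sentence_len = 3, 300, 30
x = torch.randn(1, embedding_size, sentence_len)        # [batch=1, channels=300, length=30]

conv = nn.Conv1d(in_channels=embedding_size, out_channels=1, kernel_size=k)
out = conv(x)
print(out.shape)                                        # torch.Size([1, 1, 28]) -> len(sentence) - k + 1

pooled = nn.MaxPool1d(sentence_len - k + 1)(out)
print(pooled.shape)                                     # torch.Size([1, 1, 1]) -> one value per kernel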

After understanding the basic theoretical knowledge, we can build the network framework of TextCNN. If we want to know more about TextCNN, we can find materials for further study.

2. Use PyTorch to build TextCNN

I set up a two-layer TextCNN network, whose framework is mainly convolution, activation and pooling.

Description of parameters in the network framework:

  • Vocab_size: The number of words in the constructed word table
  • Embedding_size: The word vector dimension for each word
  • Num_channels: the number of output channels, that is, the number of convolution kernels
  • Kernel_sizes: convolution kernel size
kernel_sizes, nums_channels = [3, 4], [150, 150]
embedding_size = 300
num_class = 3
vocab_size = 10440

Here I build the two-layer TextCNN. The convolution kernels in the first layer have size 3 * 300 and there are 150 of them; the kernels in the second layer have size 4 * 300, and there are likewise 150 of them.

class TextCNN(nn.Module):
    def __init__(self, kernel_sizes, num_channels):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)  # embedding layer
        self.dropout = nn.Dropout(0.5)
        self.convs = nn.ModuleList()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.append(nn.Sequential(nn.Conv1d(in_channels=embedding_size,
                                       out_channels=c,             # number of output channels: 150 kernels of size 3*embedding_size and 150 of size 4*embedding_size
                                       kernel_size=k),              # kernel size, here 3 and 4
                             nn.ReLU(),                             # activation function
                             nn.MaxPool1d(30-k+1)))                 # pooling: pick the maximum of the 30-k+1 convolution outputs (each sample is 30 words long)
        self.decoder = nn.Linear(sum(num_channels), 3)              # fully connected layer: input is a 300-d vector, output is 3-d, i.e. the number of classes

    def forward(self, inputs):
        embed = self.embedding(inputs)    # [30, 64, 300]
        embed = embed.permute(1, 2, 0)    # [64, 300, 30]: swap dimensions to match the input expected by Conv1d
        # each of the two TextCNN branches below produces a [64, 150, 1] result; squeeze gives [64, 150],
        # and concatenating the two branches gives [64, 300]
        encoding = torch.cat([conv(embed).squeeze(-1) for conv in self.convs], dim=1)  # [64, 300]
        outputs = self.decoder(self.dropout(encoding))   # feed [64, 300] into the fully connected layer to get [64, 3]
        return outputs

Let’s take a look at the network framework

net = TextCNN(kernel_sizes, nums_channels).to(device)
net

As can be seen from the printout, the network has two layers; they differ only in the size of the convolution kernel and the pooling window, and everything else is the same.

  • In_channels = 300
  • Out_channels = 150
  • Convolution kernel size: 3 * 300 in the first layer and 4 * 300 in the second
  • The stride of the convolution is 1
  • The pooling window (and stride) is 30 - kernel_size + 1: since each sample consists of 30 words, the convolution output has shape [30 - kernel_size + 1, 1], and max pooling simply picks the largest value from it. You can work this out yourself quite easily; a quick shape check follows below.
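As that shape check, a random batch of word indices can be pushed through the network (a sketch, not part of the original training code):

# sketch: confirm the output shape with a dummy batch of indices
dummy = torch.randint(0, vocab_size, (30, 64)).to(device)   # [fix_length=30, batch_size=64]
print(net(dummy).shape)                                      # torch.Size([64, 3]): one score per class per sample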

IV. Model training and results

1. Define training function, optimizer, loss function and other parameters

I usually define a training function first and then call it to train the model; feel free to organize this however you like.

net.embedding.weight.data.copy_(TEXT.vocab.vectors)    # copy the pre-trained word embedding matrix into the model's embedding layer
optimizer = optim.Adam(net.parameters(),lr=1e-4)       # define the optimizer; lr is the learning rate and can be tuned
criterion = nn.CrossEntropyLoss().to(device)           # define the loss function
train_x_len = len(train_x)                             # total number of samples (9989), used below as the denominator when averaging loss and accuracy
# define the training function
def train(net, iterator, optimizer, criterion, train_x_len):
    epoch_loss = 0                           # accumulated loss
    epoch_acc = 0                            # accumulated number of correct predictions
    for batch in iterator:
        optimizer.zero_grad()                # zero the gradients
        preds = net(batch.utterance)         # forward pass: compute the predictions
        loss = criterion(preds, batch.label) # compute the loss
        epoch_loss += loss.item()            # accumulate the loss, used as the numerator when averaging below
        loss.backward()                      # backpropagation
        optimizer.step()                     # update the network weights
        epoch_acc += ((preds.argmax(axis=1)) == batch.label).sum().item()   # accumulate correct predictions, used as the numerator when averaging the accuracy
    return epoch_loss / train_x_len, epoch_acc / train_x_len    # return the average loss and accuracy

2. Train

The model is trained for 100 epochs in total, and the results are printed every 10 epochs.

n_epoch = 100
acc_plot=[]     # for later drawing
loss_plot=[]    # for later drawing
for epoch in range(n_epoch):
    train_loss,train_acc = train(net,train_iter,optimizer,criterion,train_x_len)
    acc_plot.append(train_acc)
    loss_plot.append(train_loss)
    if (epoch+1) % 10 == 0:
        print('epoch: %d \t loss: %.4f \t train_acc: %.4f'%(epoch+1,train_loss,train_acc))

The results are as follows:

3. Visualization results

# plot the training curves with matplotlib
plt.figure(figsize=(10, 5), dpi=80)
plt.plot(acc_plot, label='train_acc')
plt.plot(loss_plot, color='coral', label='train_loss')
plt.legend(loc=0)
plt.grid(True, linestyle='-', alpha=1)
plt.xlabel('epoch', fontsize=15)
plt.show()

V. Summary

This post mainly used TorchText to process the training data set and build iterators that feed the model. No validation set was used to evaluate the model; you could try splitting a validation set off the training set for evaluation (a sketch of this is given after the complete code below). The complete code is as follows:

# I have tested the code and there is no problem. I have commented out some of the code used for printing out.
# import common libraries
import torch
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data
import torch.nn.functional as F
import torchtext
# newer versions of torchtext (0.9+) moved the old API into torchtext.legacy.data; older versions use torchtext.data
from torchtext.legacy.data import TabularDataset 
import warnings
warnings.filterwarnings("ignore")
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu") # a GPU that happened to be free on our lab server; adjust the device index for your machine

train_data = pd.read_csv('train_data_sentiment.csv')
# train_data

# use torchtext to process the data
# define filed
TEXT = torchtext.legacy.data.Field(sequential=True,lower=True,fix_length=30)
LABEL = torchtext.legacy.data.Field(sequential=False,use_vocab=False)

train_x = TabularDataset(path='train_data_sentiment.csv', format='csv', skip_header=True,
                        fields = [('utterance',TEXT),('label',LABEL)])

# print(train_x[0].utterance)
# print(train_x[0].label)

TEXT.build_vocab(train_x)
# for w,i in TEXT.vocab.stoi.items():
# print(w,i)

TEXT.vocab.load_vectors('glove.6B.300d',unk_init=torch.Tensor.normal_)

glove_model = KeyedVectors.load_word2vec_format('glove.6B.300d.word2vec.txt', binary=False)  # presumably the GloVe file converted to word2vec format (e.g. with gensim's glove2word2vec)
# glove_model['you']

# print(TEXT.vocab.vectors.shape) #torch.Size([10440, 300])

batch_size = 64
train_iter = torchtext.legacy.data.Iterator(dataset = train_x,batch_size=64,shuffle=True,sort_within_batch=False,repeat=False,device=device)

# len(train_iter)
# list(train_iter)

# for batch in train_iter:
# print(batch.utterance.shape)

# batch.utterance[:,0]

# list_a=[]
# for i in batch.utterance[:,0]:
#     if i.item() != 1:
# list_a.append(i.item())
# print(list_a)
# for i in list_a:
# print(TEXT.vocab.itos[i],end=' ')

# l =[]
# for batch in list(train_iter)[:1]:
# for i in batch.utterance:
# l.append(i[0].item())
# print(l)
# print(' '.join([TEXT.vocab.itos[i] for i in l]))

kernel_sizes, nums_channels = [3, 4], [150, 150]
embedding_size = 300
num_class = 3
vocab_size = 10440

# set textcnn
class TextCNN(nn.Module):
    def __init__(self, kernel_sizes, num_channels):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.dropout = nn.Dropout(0.5)
        self.convs = nn.ModuleList()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.append(nn.Sequential(nn.Conv1d(in_channels=embedding_size,
                                       out_channels=c,
                                       kernel_size=k),
                             nn.ReLU(),
                             nn.MaxPool1d(30-k+1)))
        self.decoder = nn.Linear(sum(num_channels), 3)
    def forward(self, inputs):
        embed = self.embedding(inputs)
        embed = embed.permute(1, 2, 0)
        encoding = torch.cat([conv(embed).squeeze(-1) for conv in self.convs], dim=1)
        outputs = self.decoder(self.dropout(encoding))
        return outputs 

net = TextCNN(kernel_sizes, nums_channels).to(device)
# net

net.embedding.weight.data.copy_(TEXT.vocab.vectors)    
optimizer = optim.Adam(net.parameters(),lr=1e-4)      
criterion = nn.CrossEntropyLoss().to(device)          
train_x_len = len(train_x)   

# define the training function
def train(net, iterator, optimizer, criterion, train_x_len):
    epoch_loss = 0                           
    epoch_acc = 0                            
    for batch in iterator:
        optimizer.zero_grad()                
        preds = net(batch.utterance)         
        loss = criterion(preds,batch.label)  
        epoch_loss +=loss.item()             
        loss.backward()                      
        optimizer.step()                     
        epoch_acc+=((preds.argmax(axis=1))==batch.label).sum().item()   
    return epoch_loss/(train_x_len),epoch_acc/train_x_len    

n_epoch = 100
acc_plot=[]     
loss_plot=[]    
for epoch in range(n_epoch):
    train_loss,train_acc = train(net,train_iter,optimizer,criterion,train_x_len)
    acc_plot.append(train_acc)
    loss_plot.append(train_loss)
    if (epoch+1) % 10 == 0:
        print('epoch: %d \t loss: %.4f \t train_acc: %.4f'%(epoch+1,train_loss,train_acc))
        
plt.figure(figsize=(10, 5), dpi=80)
plt.plot(acc_plot,label='train_acc')
plt.plot(loss_plot,color='coral',label='train_loss')
plt.legend(loc = 0)
plt.grid(True, linestyle='-', alpha=1)
plt.xlabel('epoch',fontsize = 15)
plt.show()
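Finally, here is a hedged sketch of the validation-set suggestion from the summary (not part of the original code): the legacy Dataset API provides a split() method, so a validation split could be made before building the iterators, roughly like this. The evaluate function below is my own illustration.

# sketch only: hold out 20% of train_x as a validation set
train_subset, valid_subset = train_x.split(split_ratio=0.8)

train_iter = torchtext.legacy.data.Iterator(dataset=train_subset, batch_size=64, shuffle=True,
                                            sort_within_batch=False, repeat=False, device=device)
valid_iter = torchtext.legacy.data.Iterator(dataset=valid_subset, batch_size=64, shuffle=False,
                                            sort_within_batch=False, repeat=False, device=device)

def evaluate(net, iterator, criterion, n_samples):
    # evaluation loop: no gradient updates, just average loss and accuracy on the held-out data
    epoch_loss, epoch_acc = 0, 0
    net.eval()
    with torch.no_grad():
        for batch in iterator:
            preds = net(batch.utterance)
            epoch_loss += criterion(preds, batch.label).item()
            epoch_acc += (preds.argmax(axis=1) == batch.label).sum().item()
    net.train()
    return epoch_loss / n_samples, epoch_acc / n_samples

Ideally the vocabulary would then be built from the training split only, so that words seen only in the validation data do not leak into TEXT.build_vocab.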