The data used in this paper is a three-category English dataset. TorchText is used to process the data and to construct the Dataset and the iterator. The network model is an LSTM + self-attention model. This article does not use a validation set to evaluate the model.

I. Development environment and data set

1. Development environment

Ubuntu 16.04.6

Python: 3.7

Pytorch: 1.7.1

Torchtext: 0.8.0

2. Data sets

Data set: Train_data_sentiment

Extraction code: GW77

II. Use TorchText to process the data set

1. Import the necessary libraries

 import math
 import torch
 import pandas as pd
 import matplotlib.pyplot as plt
 import torch.nn as nn
 import torch.optim as optim
 import torch.utils.data as Data
 import torch.nn.functional as F
 import torchtext
 from torchtext.vocab import Vectors
 from torchtext.data import TabularDataset   # for building the Dataset
 import warnings
 warnings.filterwarnings("ignore")

 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

2. Import and view the data set

 train_data = pd.read_csv('train_data_sentiment.csv')
 train_data

3. Use TorchText to process data sets

Torchtext processes data in the following steps:

  • Define the Field
  • Create the Dataset
  • Create the iterator

TorchText makes it very convenient to tokenize text, truncate or pad it to a fixed length, build a vocabulary, and so on. If you are unfamiliar with TorchText, you can study the official documentation or related blog posts.

3.1. Define Field

 # define the Fields
 TEXT = torchtext.data.Field(sequential=True, lower=True, fix_length=30)
 LABEL = torchtext.data.Field(sequential=False, use_vocab=False)
  • Sequential: whether to treat the data as a sequence. If False, tokenization cannot be applied. Default: True.
  • Lower: whether to convert the data to lowercase. Default: False.
  • Fix_length: when the iterator is built, every piece of text is truncated or padded to this length, with padding done using pad_token. Default: None.
  • Use_vocab: whether to use a Vocab object. If False, the data must already be numeric. Default: True.

3.2. Create the Dataset

 train_x = TabularDataset(path = 'train_data_sentiment.csv',
                         format = 'csv',skip_header=True,
                         fields = [('utterance',TEXT),('label',LABEL)])
  • Skip_header=True: skip the header row, so that the column names are not read as data.
  • Fields: must be declared in the same order as the columns in the original data.

Looking at the data processed so far, you can see that the original text has been tokenized.
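If you want to confirm this, a quick check along the following lines should work (assuming the field names utterance and label defined above):

 # view the first processed example; the utterance field is now a list of tokens
 print(vars(train_x[0]))
 print(train_x[0].utterance)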

3.3. Build the vocabulary and load the pre-trained word vectors

 TEXT.build_vocab(train_x)   # build the vocabulary from the training data
 # view the word-to-index mapping of the vocabulary
 for w, i in TEXT.vocab.stoi.items():
     print(w, i)

 # Load the GloVe word vectors. They are downloaded automatically the first time you use them,
 # or you can download them yourself. I used the 400,000-word glove.6B.100d vectors here.
 TEXT.vocab.load_vectors('glove.6B.100d', unk_init=torch.Tensor.normal_)
 # words that appear in the data but not in GloVe are randomly initialized as 100-dimensional vectors

We can check the dimensions of the constructed word embedding matrix: each word in the vocabulary we built is represented by a 100-dimensional vector, so the embedding matrix has shape [10440, 100].

 print(TEXT.vocab.vectors.shape) #torch.Size([10440, 100])

3.4. Build iterators

TorchText provides two kinds of iterators: Iterator and BucketIterator.

  • Iterator: builds batches in the same order as the raw data.
  • BucketIterator: puts texts of similar length into the same batch, which reduces the amount of padding needed (see the sketch just below).
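For illustration only, a BucketIterator could be built roughly as shown below; this post sticks with the plain Iterator, and since fix_length=30 is set on TEXT every sequence is padded or truncated to 30 anyway, so bucketing mainly pays off when fix_length is left as None.

 # sketch: batching texts of similar length with BucketIterator (not used later in this post)
 bucket_iter = torchtext.data.BucketIterator(dataset=train_x, batch_size=64,
                                             sort_key=lambda x: len(x.utterance),
                                             sort_within_batch=True, shuffle=True, device=device)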

I set batch_size = 64, so there are 9989 // 64 + 1 = 157 batches: we have 9989 pieces of data in total, each batch holds 64 of them, and 9989 = 156 × 64 + 5, so the remaining 5 pieces of data form the last batch.

 # build the iterator with batch_size = 64
 train_iter = torchtext.data.Iterator(dataset=train_x, batch_size=64, shuffle=True,
                                      sort_within_batch=False, repeat=False, device=device)
 len(train_iter)   # 157
  • Shuffle: Indicates whether to shuffle data
  • Sort_within_batch: specifies whether to sort data within each batch
  • Repeat: Whether batch data is iterated repeatedly in different epochs

Look at the iterators built and the internal data representation:

 # check the iterator
 list(train_iter)

 # check the shape of each batch
 for batch in train_iter:
     print(batch.utterance.shape)

We can see that each batch contains 64 pieces of data (except the last one), i.e. batch_size = 64, and each piece of data consists of 30 words. We can also see that the last 5 pieces of data form a batch of their own.

 batch.utterance[:, 0]   # view the first piece of data in this batch; each piece consists of 30 words.
                         # The non-1 values are the indices of its words in the vocabulary,
                         # and the remaining 1s are the padding.

 # map the word indices of the first piece of data back to words
 list_a = []
 for i in batch.utterance[:, 0]:
     if i.item() != 1:               # skip the padding index
         list_a.append(i.item())
 print(list_a)
 for i in list_a:
     print(TEXT.vocab.itos[i], end=' ')

 # view the first piece of data in the first batch and map it back to words
 l = []
 for batch in list(train_iter)[:1]:
     for i in batch.utterance:
         l.append(i[0].item())
 print(l)
 print(' '.join([TEXT.vocab.itos[i] for i in l]))

At this point the data processing is finished; the next step is to build the network.

III. Build the LSTM + self-attention network model

1. Network model structure

2. Self-Attention

The structure of the model in this paper is relatively simple and adopts the attention calculation used in the Transformer. I will briefly explain the self-attention part.

First, we use the outputs of the LSTM output layer (call them x1, x2, x3) as the input to self-attention. These inputs are passed through the linear layers W_Q, W_K, W_V to get a q, k, v for each output, and the q, k, v vectors are then combined for attention. One additional note: in the code there is no explicit step that combines the individual q, k, v, because x1, x2, x3 all use the same linear layer W_Q to obtain their q; we can therefore stack x1, x2, x3 and obtain Q directly from the linear layer W_Q (K and V are obtained in the same way in the code).

Secondly, according to the formula

Attention(Q,K,V) = softmax(QK^T / \sqrt{d_k}) V

we obtain the vector representation produced by the attention mechanism.
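As a toy illustration of this formula (independent of the model code below; the tensor shapes mirror the [batch_size, seq_len, hidden_dim] layout used later, with made-up sizes):

 # toy self-attention: softmax(Q K^T / sqrt(d_k)) V on random tensors
 q = torch.rand(2, 5, 8)                   # [batch_size=2, seq_len=5, d_k=8]
 k = torch.rand(2, 5, 8)
 v = torch.rand(2, 5, 8)
 d_k = k.size(-1)
 scores = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(d_k)   # [2, 5, 5]
 alpha = F.softmax(scores, dim=-1)         # attention weights; each row sums to 1
 context = torch.matmul(alpha, v)          # [2, 5, 8], the attended representation
 print(context.shape)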

3. Build the model with PyTorch

Parameter description in the network model:

  • Vocab_size: The number of words in the constructed word table
  • Embedding_size: The word vector dimension for each word
  • Hidden_dim: number of hidden layer cells in LSTM
  • N_layers: number of hidden layers in LSTM
  • Num_class: indicates the number of categories
 vocab_size = 10440
 embedding_size = 100
 hidden_dim = 128
 n_layers = 1
 num_class = 3
 class LSTM_Attention(nn.Module):
     def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers, num_class):
         super(LSTM_Attention, self).__init__()
         # linear layers that produce Q, K, V; here attention_size is simply hidden_dim
         self.W_Q = nn.Linear(hidden_dim, hidden_dim, bias=False)
         self.W_K = nn.Linear(hidden_dim, hidden_dim, bias=False)
         self.W_V = nn.Linear(hidden_dim, hidden_dim, bias=False)
         # embedding layer
         self.embedding = nn.Embedding(vocab_size, embedding_dim)
         # LSTM
         self.rnn = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers)
         # fully connected layer for classification
         self.fc = nn.Linear(hidden_dim, num_class)
         # dropout
         self.dropout = nn.Dropout(0.5)

     def attention(self, Q, K, V):
         d_k = K.size(-1)
         scores = torch.matmul(Q, K.transpose(1, 2)) / math.sqrt(d_k)   # [batch_size,seq_len,seq_len]
         alpha_n = F.softmax(scores, dim=-1)
         context = torch.matmul(alpha_n, V)   # context.shape = [batch_size,seq_len,hidden_dim]
         output = context.sum(1)              # output.shape = [batch_size,hidden_dim]
         return output, alpha_n

     def forward(self, x):
         # x.shape = [seq_len,batch_size] = [30,64]
         embedding = self.dropout(self.embedding(x))
         # embedding.shape = [seq_len,batch_size,embedding_dim=100]
         output, (h_n, c) = self.rnn(embedding)
         # output.shape = [seq_len,batch_size,hidden_dim=128]
         output = output.transpose(0, 1)
         # output.shape = [batch_size,seq_len,hidden_dim]
         # get Q, K, V
         Q = self.W_Q(output)   # [batch_size,seq_len,hidden_dim]
         K = self.W_K(output)
         V = self.W_V(output)
         # self-attention
         attn_output, alpha_n = self.attention(Q, K, V)
         # attn_output.shape = [batch_size,hidden_dim=128]
         # alpha_n.shape = [batch_size,seq_len,seq_len]
         out = self.fc(attn_output)
         # out.shape = [batch_size,num_class]
         return out
 net = LSTM_Attention(vocab_size=vocab_size, embedding_dim=embedding_size, hidden_dim=hidden_dim,
                      n_layers=n_layers, num_class=num_class).to(device)
 net
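Before training, a quick sanity check with a fake batch of random word indices (shapes only, not real data; this is just an optional check) confirms the output dimensions:

 # fake batch of word indices with shape [seq_len=30, batch_size=64]
 fake_batch = torch.randint(0, vocab_size, (30, 64)).to(device)
 print(net(fake_batch).shape)   # expected: torch.Size([64, 3])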

IV. Model training and results

1. Define the training function, optimizer, loss function, and other parameters

Generally, I define a training function first and then call it to train the model; you can organize this however you like.

 net.embedding.weight.data.copy_(TEXT.vocab.vectors)   # load our word embedding matrix into the model's embedding layer
 optimizer = optim.Adam(net.parameters(), lr=1e-3)     # define the optimizer
 criterion = nn.CrossEntropyLoss().to(device)          # define the loss function
 train_x_len = len(train_x)                            # total number of training examples, i.e. 9989
 # define the training function
 def train(net, iterator, optimizer, criterion, train_x_len):
     epoch_loss = 0      # accumulated loss
     epoch_acc = 0       # accumulated number of correct predictions
     for batch in iterator:
         optimizer.zero_grad()                 # clear the gradients
         preds = net(batch.utterance)          # forward pass
         loss = criterion(preds, batch.label)  # compute the loss
         epoch_loss += loss.item()             # accumulate the loss (numerator for the average loss)
         loss.backward()                       # backpropagation
         optimizer.step()                      # update the network weights
         epoch_acc += ((preds.argmax(axis=1)) == batch.label).sum().item()  # accumulate correct predictions
     return epoch_loss / len(iterator), epoch_acc / train_x_len   # return the average loss and the accuracy

2. Train

 acc_plot = []      # record training accuracy per epoch for plotting
 loss_plot = []     # record training loss per epoch for plotting
 for epoch in range(n_epoch):          # n_epoch is the chosen number of training epochs
     train_loss, train_acc = train(net, train_iter, optimizer, criterion, train_x_len)
     acc_plot.append(train_acc)
     loss_plot.append(train_loss)
     if (epoch + 1) % 10 == 0:
         print('epoch: %d \t loss: %.4f \t train_acc: %.4f' % (epoch + 1, train_loss, train_acc))

The results are as follows:

3. Visualization results

 plt.figure(figsize=(10, 5), dpi=80)
 plt.plot(acc_plot, label='train_acc')
 plt.plot(loss_plot, color='coral', label='train_loss')
 plt.legend(loc=0)
 plt.grid(True, linestyle='--', alpha=1)
 plt.xlabel('epoch', fontsize=15)
 plt.show()

V. Summary

This paper mainly uses a three-category data set for a classification task. The model is an LSTM with a self-attention mechanism. A validation set is not used to evaluate the model, but you can try splitting part of the training set off as a validation set and using it for evaluation.
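As a rough sketch of that idea (using the legacy torchtext API from this post; the 0.8/0.2 split ratio and variable names are just examples), the training Dataset could be split before the iterators are built:

 # sketch: hold out 20% of the training data as a validation set
 train_subset, valid_subset = train_x.split(split_ratio=0.8)
 train_iter = torchtext.data.Iterator(dataset=train_subset, batch_size=64, shuffle=True,
                                      sort_within_batch=False, repeat=False, device=device)
 valid_iter = torchtext.data.Iterator(dataset=valid_subset, batch_size=64, shuffle=False,
                                      sort_within_batch=False, repeat=False, device=device)
 # after each training epoch, evaluate on valid_iter with net.eval() and torch.no_grad()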