Additional recommended reading:

1. “Take You to AI”: a hands-on introduction to AI and TensorFlow 2 for fast deep learning development

2. Baseline (based on Keras, val_acc: 0.88)

3. DC competition bearing fault detection baseline (based on Keras 1D convolution, val_acc: 0.99780)

4. The author's deep learning public account “Minimalist AI”:

Transformer theory section

In the paper “Attention Is All You Need”, Google Brain proposed the Transformer, an encoder-decoder model based entirely on the attention mechanism. It completely abandons the recurrent and convolutional structures that other models retained after introducing attention, and it brought significant improvements in task performance, parallelism, and trainability. The Transformer has since become an important baseline model for machine translation and many other text understanding tasks.
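
To make the attention mechanism concrete, here is a minimal pure-Python sketch of the scaled dot-product attention described in the paper, softmax(Q·Kᵀ / sqrt(d_k))·V. The function names and the toy matrices are illustrative assumptions and are not taken from the author's code; the Keras layer further down adds learned projections and multiple heads on top of this core step.

```python
import math

def softmax(row):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors; returns one output row per query."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# Toy example (hypothetical values): two positions, dimension 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Each output row is a convex combination of the rows of V, weighted toward the value whose key best matches the query.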

Model introduction

Analysis of model paper

GitHub:github.com/xiaosongshi…

Transformer model code implementation (based on Keras)

  • Position_Embedding
#! -*- coding: utf-8 -*-
#%%
from __future__ import print_function
from keras import backend as K
from keras.engine.topology import Layer

class Position_Embedding(Layer):

    def __init__(self, size=None, mode='sum', **kwargs):
        self.size = size  # must be even
        self.mode = mode
        super(Position_Embedding, self).__init__(**kwargs)

    def call(self, x):
        # In 'sum' mode the encoding must match the input's last dimension
        if (self.size is None) or (self.mode == 'sum'):
            self.size = int(x.shape[-1])
        batch_size, seq_len = K.shape(x)[0], K.shape(x)[1]
        # Inverse frequencies 1 / 10000^(2j / size) for j = 0 .. size/2 - 1
        position_j = 1. / K.pow(10000.,
                                2 * K.arange(self.size // 2, dtype='float32')
                                / self.size)
        position_j = K.expand_dims(position_j, 0)
        # K.arange does not support variable lengths, so positions 0..seq_len-1 are built with cumsum
        position_i = K.cumsum(K.ones_like(x[:, :, 0]), 1) - 1
        position_i = K.expand_dims(position_i, 2)
        position_ij = K.dot(position_i, position_j)
        position_ij = K.concatenate([K.cos(position_ij), K.sin(position_ij)], 2)
        if self.mode == 'sum':
            return position_ij + x
        elif self.mode == 'concat':
            return K.concatenate([position_ij, x], 2)

    def compute_output_shape(self, input_shape):
        if self.mode == 'sum':
            return input_shape
        elif self.mode == 'concat':
            return (input_shape[0], input_shape[1], input_shape[2] + self.size)
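
The values that Position_Embedding produces can be checked by hand. The sketch below recomputes the same table in plain Python, assuming the layout used in the layer above: the first size/2 columns hold the cosines and the last size/2 columns hold the sines (the paper interleaves them instead). The helper name `position_encoding` is hypothetical.

```python
import math

def position_encoding(seq_len, size):
    """Return a seq_len x size table: cosines first, then sines."""
    assert size % 2 == 0, "size must be even"
    half = size // 2
    # Same inverse frequencies as position_j in the layer
    freqs = [1.0 / (10000.0 ** (2.0 * j / size)) for j in range(half)]
    table = []
    for i in range(seq_len):  # i plays the role of position_i
        angles = [i * f for f in freqs]
        table.append([math.cos(a) for a in angles] + [math.sin(a) for a in angles])
    return table

pe = position_encoding(seq_len=4, size=6)
print(pe[0])  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

Position 0 always encodes to ones followed by zeros, because cos(0) = 1 and sin(0) = 0; later positions rotate each column at its own frequency.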
  • Attention

class Attention(Layer):

    def __init__(self, nb_head, size_per_head, **kwargs):
        self.nb_head = nb_head
        self.size_per_head = size_per_head
        self.output_dim = nb_head * size_per_head
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # One projection matrix each for queries, keys, and values
        self.WQ = self.add_weight(name='WQ',
                                  shape=(input_shape[0][-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        self.WK = self.add_weight(name='WK',
                                  shape=(input_shape[1][-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        self.WV = self.add_weight(name='WV',
                                  shape=(input_shape[2][-1], self.output_dim),
                                  initializer='glorot_uniform',
                                  trainable=True)
        super(Attention, self).build(input_shape)

    def Mask(self, inputs, seq_len, mode='mul'):
        if seq_len is None:
            return inputs
        else:
            # 1 for positions before seq_len, 0 from seq_len onward
            mask = K.one_hot(seq_len[:, 0], K.shape(inputs)[1])
            mask = 1 - K.cumsum(mask, 1)
            for _ in range(len(inputs.shape) - 2):
                mask = K.expand_dims(mask, 2)
            if mode == 'mul':
                return inputs * mask
            if mode == 'add':
                return inputs - (1 - mask) * 1e12

    def call(self, x):
        # If only Q_seq, K_seq, V_seq are passed, no masking is done
        # If Q_seq, K_seq, V_seq, Q_len, V_len are passed, the padded part is masked
        if len(x) == 3:
            Q_seq, K_seq, V_seq = x
            Q_len, V_len = None, None
        elif len(x) == 5:
            Q_seq, K_seq, V_seq, Q_len, V_len = x
        # Apply a linear transformation to Q, K, and V, then split into heads
        Q_seq = K.dot(Q_seq, self.WQ)
        Q_seq = K.reshape(Q_seq, (-1, K.shape(Q_seq)[1], self.nb_head, self.size_per_head))
        Q_seq = K.permute_dimensions(Q_seq, (0, 2, 1, 3))
        K_seq = K.dot(K_seq, self.WK)
        K_seq = K.reshape(K_seq, (-1, K.shape(K_seq)[1], self.nb_head, self.size_per_head))
        K_seq = K.permute_dimensions(K_seq, (0, 2, 1, 3))
        V_seq = K.dot(V_seq, self.WV)
        V_seq = K.reshape(V_seq, (-1, K.shape(V_seq)[1], self.nb_head, self.size_per_head))
        V_seq = K.permute_dimensions(V_seq, (0, 2, 1, 3))
        # Calculate the inner product, then mask, then softmax
        A = K.batch_dot(Q_seq, K_seq, axes=[3, 3]) / self.size_per_head**0.5
        A = K.permute_dimensions(A, (0, 3, 2, 1))
        A = self.Mask(A, V_len, 'add')
        A = K.permute_dimensions(A, (0, 3, 2, 1))
        A = K.softmax(A)
        # Compute the output and mask it
        O_seq = K.batch_dot(A, V_seq, axes=[3, 2])
        O_seq = K.permute_dimensions(O_seq, (0, 2, 1, 3))
        O_seq = K.reshape(O_seq, (-1, K.shape(O_seq)[1], self.output_dim))
        O_seq = self.Mask(O_seq, Q_len, 'mul')
        return O_seq

    def compute_output_shape(self, input_shape):
        return (input_shape[0][0], input_shape[0][1], self.output_dim)
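
The Mask method relies on a small trick: a one-hot vector at index seq_len, followed by a cumulative sum, yields 0 before that index and 1 from it onward, so `1 - cumsum` is 1 for real tokens and 0 for padding. The following plain-Python sketch reproduces that logic for a single sequence; the helper names `length_mask` and `apply_mask` are hypothetical and only mirror the 'mul' and 'add' modes of the layer.

```python
def length_mask(length, max_len):
    """1 for positions < length, 0 for positions >= length."""
    one_hot = [1 if t == length else 0 for t in range(max_len)]
    running = 0
    mask = []
    for v in one_hot:
        running += v              # cumulative sum of the one-hot vector
        mask.append(1 - running)
    return mask

def apply_mask(values, mask, mode="mul"):
    if mode == "mul":
        # Zero out padded positions (used on the layer output)
        return [v * m for v, m in zip(values, mask)]
    if mode == "add":
        # Push padded scores toward -infinity so softmax gives them ~0 weight
        return [v - (1 - m) * 1e12 for v, m in zip(values, mask)]
    raise ValueError("mode must be 'mul' or 'add'")

mask = length_mask(3, 5)
print(mask)  # [1, 1, 1, 0, 0]
print(apply_mask([0.5, 0.2, 0.1, 0.9, 0.7], mask, "mul"))  # [0.5, 0.2, 0.1, 0.0, 0.0]
```

The 'add' mode is applied before the softmax, while the 'mul' mode is applied to the final output, which matches the order of operations in `call`.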

Save the above two pieces of code to attention_keras.py

 

Training model

  • Import packages and load the text data
#%%
from keras.preprocessing import sequence
from keras.datasets import imdb
from matplotlib import pyplot as plt
import pandas as pd

max_features = 20000  # keep only the 20,000 most frequent words

print('Loading data...')

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Convert the labels to one-hot encoding
y_train, y_test = pd.get_dummies(y_train), pd.get_dummies(y_test)

print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Output (the first run downloads the dataset; here it had already been downloaded, so it loads directly):

Using TensorFlow backend.
Loading data...
25000 train sequences
25000 test sequences
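
The one-hot conversion applied to the labels turns each 0/1 sentiment label into a two-column row, which is the shape the two-unit softmax output of the model expects. A minimal pure-Python sketch of the same transformation, with a hypothetical helper name `one_hot`:

```python
def one_hot(labels, num_classes):
    """Map each integer label y to a row with a 1 in column y."""
    return [[1 if c == y else 0 for c in range(num_classes)] for y in labels]

print(one_hot([0, 1, 1], 2))  # [[1, 0], [0, 1], [0, 1]]
```

In a Keras pipeline, `keras.utils.to_categorical(y, 2)` should produce the same layout as a NumPy array.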
  • Data normalization
#%% Normalize the data: pad or truncate every review to maxlen tokens
maxlen = 64

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Output (pad_sequences truncates sequences longer than maxlen and pads sequences shorter than maxlen with zeros up to maxlen):

Pad sequences (samples x time)
x_train shape: (25000, 64)
x_test shape: (25000, 64)
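
As a rough illustration of what pad_sequences does with its default settings (padding and truncating both happen at the front, and the fill value is 0), here is a pure-Python sketch for a single sequence. The helper `pad_sequence` is hypothetical, assumes maxlen > 0, and is not the Keras implementation.

```python
def pad_sequence(seq, maxlen, value=0):
    """Keep the last maxlen items, then left-pad with `value` up to maxlen."""
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_sequence([1, 2, 3], 5))           # [0, 0, 1, 2, 3]
print(pad_sequence([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

With maxlen = 64, every review therefore ends up as exactly 64 token ids, which gives the (25000, 64) shapes printed above.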
  • Defining the network model

batch_size = 5
from keras.models import Model
from keras.optimizers import SGD, Adam
from keras.layers import *
# The two custom layers saved in attention_keras.py
from attention_keras import Position_Embedding, Attention

S_inputs = Input(shape=(None,), dtype='int32')

embeddings = Embedding(max_features, 128)(S_inputs)
embeddings = Position_Embedding()(embeddings)  # adding Position_Embedding slightly improves accuracy

# Self-attention with 8 heads of size 16: Q, K, and V are all the embeddings
O_seq = Attention(8, 16)([embeddings, embeddings, embeddings])

O_seq = GlobalAveragePooling1D()(O_seq)

O_seq = Dropout(0.5)(O_seq)

outputs = Dense(2, activation='softmax')(O_seq)

model = Model(inputs=S_inputs, outputs=outputs)
# try using different optimizers and different optimizer configs
opt = Adam(lr=0.0005)
loss = 'categorical_crossentropy'
model.compile(loss=loss,
              optimizer=opt,
              metrics=['accuracy'])

print(model.summary())

Model summary output (a simple model with few parameters):

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 128)    2560000     input_1[0][0]
__________________________________________________________________________________________________
position__embedding_1 (Position (None, None, 128)    0           embedding_1[0][0]
__________________________________________________________________________________________________
attention_1 (Attention)         (None, None, 128)    49152       position__embedding_1[0][0]
                                                                 position__embedding_1[0][0]
                                                                 position__embedding_1[0][0]
__________________________________________________________________________________________________
global_average_pooling1d_1 (Glo (None, 128)          0           attention_1[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 128)          0           global_average_pooling1d_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 2)            258         dropout_1[0][0]
==================================================================================================
Total params:
Trainable params:
Non-trainable params: 0
__________________________________________________________________________________________________
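
The per-layer parameter counts in the summary can be reproduced from the layer sizes. The short calculation below is a sanity check based on the hyperparameters in the model definition; the variable names are illustrative, and the total is simply the sum of the per-layer counts shown above rather than a value quoted from the original output.

```python
max_features = 20000            # vocabulary size of the Embedding layer
embed_dim = 128                 # embedding width
nb_head, size_per_head = 8, 16  # arguments passed to Attention
num_classes = 2                 # units in the final Dense layer

embedding_params = max_features * embed_dim                 # one vector per word
attention_out = nb_head * size_per_head                     # 128
attention_params = 3 * embed_dim * attention_out            # WQ, WK, WV, no biases
dense_params = attention_out * num_classes + num_classes    # weights plus biases

total = embedding_params + attention_params + dense_params
print(embedding_params, attention_params, dense_params, total)
# 2560000 49152 258 2609410
```

Almost all of the parameters sit in the embedding table; the attention layer itself contributes fewer than 50,000.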
  • Train, save the model

#%%
print('Train...')

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,
          validation_data=(x_test, y_test))

model.save("imdb.h5")

Output (two epochs of training are enough to reach over 80% validation accuracy, a strong result for such a small model):

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 95s 4ms/step - loss: 0.4826 - acc: 0.7499 - val_loss: 0.3663 - val_acc: 0.8353
Epoch 2/2
25000/25000 [==============================] - 93s 4ms/step - loss: 0.3084 - acc: 0.8680 - val_loss: 0.3983 - val_acc: 0.8163

Save the above code to train.py