I have been working with deep learning for over a year and have recently started on NLP (natural language processing). I am taking this opportunity to write a series of hands-on deep learning courses on NLP machine translation.

This series goes from principles and data processing to hands-on practice and application deployment, and includes the following content (still being updated):

  • NLP Machine Translation Deep Learning Practical Course · Zero (Basic Concepts)
  • NLP Machine Translation Deep Learning Practical Course · One (RNN Base)
  • NLP Machine Translation Deep Learning Practical Course · Two (RNN+Attention Base)
  • NLP Machine Translation Deep Learning Practical Course · Three (CNN Base)
  • NLP Machine Translation Deep Learning Practical Course · Four (Self-Attention Base)
  • NLP Machine Translation Deep Learning Practical Course · Five (Application Deployment)

For this tutorial, see the blog: me.csdn.net/chinateleco…

Open source: github.com/xiaosongshi…

Personal homepage: www.yansongsong.cn/

0. Project background

The previous article gave a brief introduction to NLP machine translation. This time we introduce the RNN-based translation model in a hands-on way.

 

0.1 Introduction to the RNN-based seq2seq translation model

Seq2seq structure

The RNN-based seq2seq architecture consists of an encoder and a decoder, and the decoder is used differently in training and in inference. The specific structure is shown in the following two figures:

As you can see, the structure is very simple compared with the CNN-based and attention-based models. We will now explore the internals of the model by implementing it in code.


1. Data preparation

 

1.1 Downloading Data

Translation data for many language pairs is available at www.manythings.org/anki/. This tutorial uses the Chinese-English data set.

Training data download address: www.manythings.org/anki/cmn-en…

After unzipping cmn-eng.zip you will find the file cmn.txt. Its contents look like this:

# ======== read original data ========
with open('cmn.txt', 'r', encoding="utf-8") as f:
    data = f.read()
data = data.split('\n')
data = data[:100]
print(data[-5:])

['Tom died.\t died. ', 'Tom quit. ', 'Tom swam. ', 'Trust me.

Each translation pair sits on one line, with English on the left and Chinese on the right, separated by a tab character (\t).
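Since every line holds an English sentence and its Chinese translation separated by a tab, splitting the data into pairs takes only a few lines. The sketch below is one possible way to do it, not the author's loader: the helper name `parse_pairs` is hypothetical, the sample lines are made up, and it assumes that any fields after the second tab-separated column can be ignored.

```python
def parse_pairs(lines):
    """Split tab-separated lines into (english, chinese) tuples.

    Hypothetical helper for illustration: lines without at least two
    tab-separated fields are skipped, and extra fields are ignored.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


# made-up sample lines in the same "english<TAB>chinese" layout
sample = ["Hi.\t嗨。", "", "Wait!\t等等！\textra field"]
for eng, chn in parse_pairs(sample):
    print(eng, "->", chn)
```

Skipping lines with fewer than two fields also takes care of the empty string that `split('\n')` leaves behind when the file ends with a newline.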

 

1.2 Data Preprocessing

To train a network, we first need to convert the data into a format the network can accept.

For this data set, that means converting the tokens into numbers (sentence digitization) and normalizing the sentence length.

Sentence digitization

For the data preprocessing implementation, you can refer to my blog post “Deep Application: NLP Named Entity Recognition (NER) Open Source Practical Tutorial”.

English and Chinese are processed separately.

English processing

English words are separated by spaces (abbreviations are treated as one word), but punctuation marks are attached to the neighboring word, so they need special treatment.

Here I use a simple method that inserts a space before each punctuation mark:

def split_dot(strs,dots=", . ! ?"):
    # insert a space before every punctuation mark listed in dots
    for d in dots.split(" "):
        #print(d)
        strs = strs.replace(d," "+d)
        #print(strs)
    return(strs)
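A quick way to check the behavior is to run the function on a few sentences. The sketch below restates `split_dot` in self-contained form, assuming the punctuation list is the space-separated string `", . ! ?"`, and then splits the result on spaces as the dictionary code does.

```python
def split_dot(text, dots=", . ! ?"):
    """Insert a space before each punctuation mark listed in `dots`."""
    for d in dots.split(" "):
        text = text.replace(d, " " + d)
    return text


for sentence in ["Wait!", "Tom died.", "Well, is it true?"]:
    print(split_dot(sentence).split(" "))
# ['Wait', '!']
# ['Tom', 'died', '.']
# ['Well', ',', 'is', 'it', 'true', '?']
```

This simple replacement also splits periods inside tokens such as decimal numbers, which is acceptable for short sentence pairs but worth keeping in mind.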

With the text split this way, build the word dictionary:

def get_eng_dicts(datas):
    # count how often each space-separated token occurs
    w_all_dict = {}
    for sample in datas:
        for token in sample.split(" "):
            if token not in w_all_dict.keys():
                w_all_dict[token] = 1
            else:
                w_all_dict[token] += 1

    # sort tokens by frequency, most frequent first
    sort_w_list = sorted(w_all_dict.items(),  key=lambda d: d[1], reverse=True)

    # keep the 7000-2 most frequent tokens; ids 0 and 1 are reserved
    w_keys = [x for x,_ in sort_w_list[:7000-2]]
    w_keys.insert(0, "<PAD>")
    w_keys.insert(0, "<UNK>")

    # token -> id and id -> token lookups (<UNK> = 0, <PAD> = 1)
    w_dict = { x:i for i,x in enumerate(w_keys) }
    i_dict = { i:x for i,x in enumerate(w_keys) }
    return w_dict,i_dict
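The same frequency-based dictionary can be written more compactly with `collections.Counter`. This is an alternative sketch, not the code used in the rest of the tutorial; the function name `build_vocab` and its parameters are made up for illustration.

```python
from collections import Counter


def build_vocab(sentences, max_size, specials=("<UNK>", "<PAD>")):
    """Map the most frequent tokens to integer ids, special tokens first."""
    counts = Counter(tok for s in sentences for tok in s.split(" "))
    keep = [tok for tok, _ in counts.most_common(max_size - len(specials))]
    words = list(specials) + keep
    word2id = {w: i for i, w in enumerate(words)}
    id2word = {i: w for i, w in enumerate(words)}
    return word2id, id2word


# toy corpus: "Tom" is the rarest token and falls outside the size limit
word2id, id2word = build_vocab(["I am here .", "I am Tom .", "I am here"], max_size=6)
print(word2id["<UNK>"], word2id["<PAD>"])  # 0 1
print("Tom" in word2id)                    # False
```

Tokens that fall outside the size limit are later mapped to `<UNK>` when sentences are converted to ids.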

Chinese language processing

The Chinese data contains both Traditional and Simplified characters, so it is best to convert everything to one form (reference address):

# installation
pip install opencc-python-reimplemented

# t2s - Traditional Chinese to Simplified Chinese
# s2t - Simplified Chinese to Traditional Chinese
# mix2t - Mixed to Traditional Chinese
# mix2s - Mixed to Simplified Chinese

To convert Traditional Chinese to Simplified Chinese:

import opencc
cc = opencc.OpenCC('t2s')
s = cc.convert('這是什麼？')
print(s)
# 这是什么？

Then use jieba to segment each sentence into words:

import jieba

def get_chn_dicts(datas):
    # count how often each jieba token occurs
    w_all_dict = {}
    for sample in datas:
        for token in jieba.cut(sample):
            if token not in w_all_dict.keys():
                w_all_dict[token] = 1
            else:
                w_all_dict[token] += 1

    sort_w_list = sorted(w_all_dict.items(),  key=lambda d: d[1], reverse=True)

    # keep the 10000-4 most frequent tokens; ids 0 to 3 are reserved
    w_keys = [x for x,_ in sort_w_list[:10000-4]]
    w_keys.insert(0, "<EOS>")
    w_keys.insert(0, "<GO>")
    w_keys.insert(0, "<PAD>")
    w_keys.insert(0, "<UNK>")
    # <UNK> = 0, <PAD> = 1, <GO> = 2, <EOS> = 3
    w_dict = { x:i for i,x in enumerate(w_keys) }
    i_dict = { i:x for i,x in enumerate(w_keys) }
    return w_dict,i_dict

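Next, pad the sequences to a fixed length:

LENS = 30  # default target length (assumed value; the calls below always pass an explicit length)

def get_val(keys,dicts):
    # look up a token id, falling back to <UNK> for unknown tokens
    if keys in dicts.keys():
        val = dicts[keys]
    else:
        keys = "<UNK>"
        val = dicts[keys]
    return(val)

def padding(lists,lens=LENS):
    list_ret = []
    for l in lists:
        # pad short sequences with 1, the id of <PAD>
        while(len(l)<lens):
            l.append(1)
        # truncate long sequences
        if len(l)>lens:
            l = l[:lens]
        list_ret.append(l)

    return(list_ret)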
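To make the effect of padding and truncation concrete, here is a small self-contained sketch. The helper name `pad_to_length` is hypothetical. Unlike the `padding` function above, it builds new lists instead of appending to the input lists in place; the pad id of 1 matches the position of `<PAD>` in the dictionaries.

```python
PAD_ID = 1  # id of "<PAD>" in the dictionaries built earlier


def pad_to_length(seqs, length, pad_id=PAD_ID):
    """Return copies of `seqs` padded or truncated to exactly `length` ids."""
    padded = []
    for seq in seqs:
        seq = list(seq[:length])               # truncate long sequences
        seq += [pad_id] * (length - len(seq))  # pad short ones
        padded.append(seq)
    return padded


print(pad_to_length([[5, 6], [7, 8, 9, 10, 11]], length=4))
# [[5, 6, 1, 1], [7, 8, 9, 10]]
```

Because every row ends up with the same length, the result can be passed directly to `np.array` to form a rectangular batch.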

Finally, run all the processing steps together:

import numpy as np
import jieba

if __name__ == "__main__":
    # read2df is not shown in this article; from its usage it returns a table with "eng" and "chn" columns
    df = read2df("cmn-eng/cmn.txt")
    eng_dict,id2eng = get_eng_dicts(df["eng"])
    chn_dict,id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    # encoder input: English token ids
    enc_in = [[get_val(e,eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    # decoder input: <GO> + Chinese token ids + <EOS>
    dec_in = [[get_val("<GO>",chn_dict)]+[get_val(e,chn_dict) for e in jieba.cut(chn)]+[get_val("<EOS>",chn_dict)] for chn in df["chn"]]
    # decoder target: Chinese token ids + <EOS>
    dec_out = [[get_val(e,chn_dict) for e in jieba.cut(chn)]+[get_val("<EOS>",chn_dict)] for chn in df["chn"]]

    enc_in_ar = np.array(padding(enc_in,32))
    dec_in_ar = np.array(padding(dec_in,30))
    dec_out_ar = np.array(padding(dec_out,30))
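Note how `dec_in` and `dec_out` differ only by the leading `<GO>` token: the decoder target is the decoder input shifted left by one position, so at each step the network sees the previous correct token and is asked to predict the next one (teacher forcing). The toy example below illustrates this; the dictionary and the helper `make_decoder_pair` are made up for the illustration and reuse the special-token ids from `get_chn_dicts`.

```python
# Toy dictionary using the same special-token ids as get_chn_dicts
toy_dict = {"<UNK>": 0, "<PAD>": 1, "<GO>": 2, "<EOS>": 3, "我": 4, "很": 5, "好": 6}


def make_decoder_pair(tokens, vocab):
    """Build (decoder input, decoder target) id lists for one sentence."""
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
    dec_in = [vocab["<GO>"]] + ids + [vocab["<EOS>"]]
    dec_out = ids + [vocab["<EOS>"]]
    return dec_in, dec_out


dec_in, dec_out = make_decoder_pair(["我", "很", "好"], toy_dict)
print(dec_in)   # [2, 4, 5, 6, 3]
print(dec_out)  # [4, 5, 6, 3]
```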

The following output is displayed:

(TF_GPU) D:\Files\Prjs\Pythons\Kerases\MNT_RNN>C:/Datas/Apps/RJ/Miniconda3/envs/TF_GPU/python.exe d:/Files/Prjs/Pythons/Kerases/MNT_RNN/mian.py
Using TensorFlow backend.
eng chn
0 Hi . A: hi.
1 Hi.
2. You have to.
3 Wait ! Wait a minute!
4 Hello ! hello
save csv
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\xiaos\AppData\Local\Temp\jieba.cache
Loading model cost 0.788 seconds.
Prefix dict has been built succesfully.
['<UNK>', '<PAD>', '.', 'I', 'to', 'the', 'you', 'a', '?', 'is', 'Tom', 'He', 'in', 'of', 'me', ',', 'was', 'for', 'have', 'The']
['<UNK>', '<PAD>', '<GO>', '<EOS>', '.', '我', 'the', '了', 'you', 'he', '?', 'in', 'Tom', 'is', 'she', 'it', '我们', ',', 'no', '很']

2. Model building and training

2.1 Model building and hyperparameters

A two-layer LSTM network is used

# ======= Predefined model parameters ========
EN_VOCAB_SIZE = 7000
CH_VOCAB_SIZE = 10000
HIDDEN_SIZE = 256

LEARNING_RATE = 0.001
BATCH_SIZE = 50
EPOCHS = 100

# ======================================keras model==================================
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding,CuDNNLSTM
from keras.optimizers import Adam
import numpy as np

def get_model():
    # ==============encoder=============
    encoder_inputs = Input(shape=(None,))
    emb_inp = Embedding(output_dim=128, input_dim=EN_VOCAB_SIZE)(encoder_inputs)
    encoder_h1, encoder_state_h1, encoder_state_c1 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)(emb_inp)
    encoder_h2, encoder_state_h2, encoder_state_c2 = CuDNNLSTM(HIDDEN_SIZE, return_state=True)(encoder_h1)

    # ==============decoder=============
    decoder_inputs = Input(shape=(None,))

    emb_target = Embedding(output_dim=128, input_dim=CH_VOCAB_SIZE)(decoder_inputs)
    lstm1 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)
    lstm2 = CuDNNLSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)
    decoder_dense = Dense(CH_VOCAB_SIZE, activation='softmax')

    decoder_h1, _, _ = lstm1(emb_target, initial_state=[encoder_state_h1, encoder_state_c1])
    decoder_h2, _, _ = lstm2(decoder_h1, initial_state=[encoder_state_h2, encoder_state_c2])
    decoder_outputs = decoder_dense(decoder_h2)

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    # The inference encoder is the same as the training encoder
    encoder_model = Model(encoder_inputs, [encoder_state_h1, encoder_state_c1, encoder_state_h2, encoder_state_c2])

    # In the prediction model, the decoder's initial state must be passed in as new inputs
    decoder_state_input_h1 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_c1 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_h2 = Input(shape=(HIDDEN_SIZE,))
    decoder_state_input_c2 = Input(shape=(HIDDEN_SIZE,))

    # Initialize the state of the decoder LSTMs with the values passed in
    decoder_h1, state_h1, state_c1 = lstm1(emb_target, initial_state=[decoder_state_input_h1, decoder_state_input_c1])
    decoder_h2, state_h2, state_c2 = lstm2(decoder_h1, initial_state=[decoder_state_input_h2, decoder_state_input_c2])
    decoder_outputs = decoder_dense(decoder_h2)

    decoder_model = Model([decoder_inputs, decoder_state_input_h1, decoder_state_input_c1, decoder_state_input_h2, decoder_state_input_c2], 
                        [decoder_outputs, state_h1, state_c1, state_h2, state_c2])


    return(model,encoder_model,decoder_model)

2.2 Model configuration and training

A custom accuracy metric is defined to make the training progress easier to read, because the built-in accuracy metric of Keras cannot be used here.

import keras.backend as K
from keras.models import load_model
 
def my_acc(y_true, y_pred):
    # 1.0 where the argmax of the predicted distribution equals the sparse label, else 0.0
    acc = K.cast(K.equal(K.max(y_true,axis=-1),K.cast(K.argmax(y_pred,axis=-1),K.floatx())),K.floatx())
    return acc


Train = True

if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict,id2eng = get_eng_dicts(df["eng"])
    chn_dict,id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    enc_in = [[get_val(e,eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    dec_in = [[get_val("<GO>",chn_dict)]+[get_val(e,chn_dict) for e in jieba.cut(eng)]+[get_val("<EOS>",chn_dict)] for eng in df["chn"]]
    dec_out = [[get_val(e,chn_dict) for e in jieba.cut(eng)]+[get_val("<EOS>",chn_dict)] for eng in df["chn"]]

    enc_in_ar = np.array(padding(enc_in,32))
    dec_in_ar = np.array(padding(dec_in,30))
    dec_out_ar = np.array(padding(dec_out,30))

    #dec_out_ar = covt2oh(dec_out_ar)


    
    if Train:


        model,encoder_model,decoder_model = get_model()

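        # resume from previously saved weights (skip this line on the first run, when e2c1.h5 does not exist yet)
        model.load_weights('e2c1.h5')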

        opt = Adam(lr=LEARNING_RATE, beta_1=0.9, beta_2=0.99, epsilon=1e-08)
        model.compile(optimizer=opt, loss='sparse_categorical_crossentropy',metrics=[my_acc])
        model.summary()
        print(dec_out_ar.shape)
        model.fit([enc_in_ar, dec_in_ar], np.expand_dims(dec_out_ar,-1),
                batch_size=50,
                epochs=64,
                initial_epoch=0,
                validation_split=0.1)
        model.save('e2c1.h5')
        encoder_model.save("enc1.h5")
        decoder_model.save("dec1.h5")
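What `my_acc` computes may be easier to see without the Keras backend calls. The following is a plain-Python restatement on nested lists, written only for illustration (the name `token_hits` and the toy values are made up): for every time step it checks whether the index of the largest predicted probability equals the integer label.

```python
def token_hits(y_true, y_pred):
    """Per-token hits: 1.0 where argmax(prediction) equals the label, else 0.0.

    y_true has shape (batch, time, 1) and y_pred has shape (batch, time, vocab),
    mirroring the tensors passed to my_acc.
    """
    hits = []
    for labels, steps in zip(y_true, y_pred):
        row = []
        for label, probs in zip(labels, steps):
            predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
            row.append(1.0 if predicted == max(label) else 0.0)
        hits.append(row)
    return hits


y_true = [[[2], [0], [1]]]          # one sentence, three time steps
y_pred = [[[0.1, 0.2, 0.7],         # argmax 2 -> correct
           [0.8, 0.1, 0.1],         # argmax 0 -> correct
           [0.6, 0.3, 0.1]]]        # argmax 0 -> wrong, label is 1
print(token_hits(y_true, y_pred))   # [[1.0, 1.0, 0.0]]
```

Keras averages these per-token values to produce the number shown during training. Since no mask is applied, padded positions are counted as well.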

The results after 64 epochs of training are as follows:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 128)    896000      input_1[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 128)    1280000     input_2[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)        [(None, None, 256),  395264      embedding_1[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_3 (CuDNNLSTM)        [(None, None, 256),  395264      embedding_2[0][0]
                                                                 cu_dnnlstm_1[0][1]
                                                                 cu_dnnlstm_1[0][2]
__________________________________________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM)        [(None, 256), (None, 526336      cu_dnnlstm_1[0][0]
__________________________________________________________________________________________________
cu_dnnlstm_4 (CuDNNLSTM)        [(None, None, 256),  526336      cu_dnnlstm_3[0][0]
                                                                 cu_dnnlstm_2[0][1]
                                                                 cu_dnnlstm_2[0][2]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 10000)  2570000     cu_dnnlstm_4[0][0]
==================================================================================================
Non-trainable params: 0
__________________________________________________________________________________________________
...
19004/19004 [==============================] - 98s 5ms/step - loss: 0.1371 - my_acc: 0.9832 - val_loss: 2.7299 - val_my_acc: 0.7412
Epoch 58/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1234 - my_acc: 0.9851 - val_loss: 2.7378 - val_my_acc: 0.7410
Epoch 59/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1132 - my_acc: 0.9867 - val_loss: 2.7477 - val_my_acc: 0.7419
Epoch 60/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.1050 - my_acc: 0.9879 - val_loss: 2.7660 - val_my_acc: 0.7426
Epoch 61/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0983 - my_acc: 0.9893 - val_loss: 2.7569 - val_my_acc: 0.7408
Epoch 62/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0933 - my_acc: 0.9903 - val_loss: 2.7775 - val_my_acc: 0.7414
Epoch 63/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0885 - my_acc: 0.9911 - val_loss: 2.7885 - val_my_acc: 0.7420
Epoch 64/64
19004/19004 [==============================] - 96s 5ms/step - loss: 0.0845 - my_acc: 0.9920 - val_loss: 2.7914 - val_my_acc: 0.7423
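The parameter counts in the model summary can be checked by hand from the hyperparameters. The sketch below assumes that `CuDNNLSTM` keeps two bias vectors per gate (input and recurrent), which is consistent with the numbers reported in the summary; the helper function names are made up for illustration.

```python
EN_VOCAB_SIZE, CH_VOCAB_SIZE = 7000, 10000
EMB_DIM, HIDDEN_SIZE = 128, 256


def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim


def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and two bias vectors (assumed)
    return 4 * units * (input_dim + units + 2)


def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units


print(embedding_params(EN_VOCAB_SIZE, EMB_DIM))     # 896000  (embedding_1)
print(embedding_params(CH_VOCAB_SIZE, EMB_DIM))     # 1280000 (embedding_2)
print(cudnn_lstm_params(EMB_DIM, HIDDEN_SIZE))      # 395264  (first LSTM layer)
print(cudnn_lstm_params(HIDDEN_SIZE, HIDDEN_SIZE))  # 526336  (second LSTM layer)
print(dense_params(HIDDEN_SIZE, CH_VOCAB_SIZE))     # 2570000 (dense_1)
```

The output layer is the largest single block because its size grows with the Chinese vocabulary.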

3. Model application and prediction

Select some data from the training set for testing:

Train = False

if __name__ == "__main__":
    df = read2df("cmn-eng/cmn.txt")
    eng_dict,id2eng = get_eng_dicts(df["eng"])
    chn_dict,id2chn = get_chn_dicts(df["chn"])
    print(list(eng_dict.keys())[:20])
    print(list(chn_dict.keys())[:20])

    enc_in = [[get_val(e,eng_dict) for e in eng.split(" ")] for eng in df["eng"]]
    dec_in = [[get_val("<GO>",chn_dict)]+[get_val(e,chn_dict) for e in jieba.cut(eng)]+[get_val("<EOS>",chn_dict)] for eng in df["chn"]]
    dec_out = [[get_val(e,chn_dict) for e in jieba.cut(eng)]+[get_val("<EOS>",chn_dict)] for eng in df["chn"]]

    enc_in_ar = np.array(padding(enc_in,32))
    dec_in_ar = np.array(padding(dec_in,30))
    dec_out_ar = np.array(padding(dec_out,30))

    #dec_out_ar = covt2oh(dec_out_ar)

    if Train:
        pass
    else:
        encoder_model,decoder_model = load_model("enc1.h5",custom_objects={"my_acc":my_acc}),load_model("dec1.h5",custom_objects={"my_acc":my_acc})

        for k in range(16000-20,16000):
            test_data = enc_in_ar[k:k+1]
            # encode the source sentence into the initial decoder states
            h1, c1, h2, c2 = encoder_model.predict(test_data)
            target_seq = np.zeros((1,1))

            outputs = []
            # start decoding from the <GO> token
            target_seq[0, len(outputs)] = chn_dict["<GO>"]
            while True:
                output_tokens, h1, c1, h2, c2 = decoder_model.predict([target_seq, h1, c1, h2, c2])
                sampled_token_index = np.argmax(output_tokens[0, -1, :])
                #print(sampled_token_index)
                outputs.append(sampled_token_index)
                #target_seq = np.zeros((1, 30))
                # feed the predicted token back in as the next decoder input
                target_seq[0, 0] = sampled_token_index
                #print(target_seq)
                if sampled_token_index == chn_dict["<EOS>"] or len(outputs) > 28: break

            print("> "+df["eng"][k])
            print("< "+' '.join([id2chn[i] for i in outputs[:-1]]))
            print()

The test results are as follows; almost all of the translations are correct:

> I can…
< I can't recall the last time we met.

> I can't remember which…
< I don't remember which is my racket.

> I can…
< I can't stand that noise any longer.

> I can'…
< I can't stand this noise any longer.

> I could not afford to buy a bicycle.

> I couldn…
< I can't answer all the questions.

> I couldn'…
< I can't think of anything to say.

> I did not participate in the dialog.
< I'm not part of the conversation.

> I didn…
< I don't really feel like going out.

> I don'…
< I don't care about the future.
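The prediction loop in section 3 is a greedy decoder: it starts from `<GO>`, takes the most likely token at each step, feeds that token back in, and stops at `<EOS>` or when the output gets too long. The framework-free sketch below isolates that control flow. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` only stands in for `decoder_model.predict` by replaying a fixed script.

```python
GO, EOS = 2, 3  # ids of "<GO>" and "<EOS>" in the Chinese dictionary


def greedy_decode(step_fn, state, max_len=28):
    """Feed the best token back into the decoder until <EOS> or max_len."""
    token, outputs = GO, []
    while True:
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        outputs.append(token)
        if token == EOS or len(outputs) > max_len:
            break
    return outputs


def toy_step(token, state):
    """Stand-in for the decoder: emits token 4, then 5, then <EOS>."""
    script = [4, 5, EOS]
    scores = [0.0] * 6
    scores[script[state]] = 1.0
    return scores, state + 1


print(greedy_decode(toy_step, state=0))  # [4, 5, 3]
```

In the real loop the state is the tuple of LSTM states `(h1, c1, h2, c2)` returned by the encoder and updated by every decoder call, and the final `<EOS>` is dropped before printing.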