“NLP by Hand” is a series of hands-on projects based on PaddleNLP. Written by several senior Baidu engineers, the series walks through the full workflow of practical projects covering word vectors, language model training, information extraction, sentiment analysis, question answering, structured-data question answering, text translation, machine simultaneous interpretation, and dialogue systems. It is designed to help developers gain a more complete and clear grasp of how Baidu's PaddlePaddle framework is used in the NLP field, and to carry out NLP deep learning practice flexibly with the Paddle framework and PaddleNLP.

In June, Baidu PaddlePaddle and the Natural Language Processing department jointly launched 12 NLP video courses in which the hands-on projects are explained in detail.

To watch the course replays, please visit: aistudio.baidu.com/aistudio/co…

You are also welcome to join the QQ group (group number: 758287592) to communicate and exchange ideas.

Machine translation is the process of converting one natural language (source language) into another natural language (target language) using a computer.

This project is the PaddlePaddle implementation of Transformer, the mainstream model in the field of machine translation. Come and build your own translation model based on this project.

Transformer is a network structure proposed in the paper Attention Is All You Need for Seq2Seq learning tasks such as machine translation. It relies entirely on the attention mechanism to achieve sequence-to-sequence modeling.

Compared with the recurrent neural networks (RNNs) previously widely used in Seq2Seq models, self-attention has the following advantages when converting an input sequence into an output sequence:

- Lower computational complexity. For a sequence with feature dimension d and length n, the per-layer complexity of an RNN is O(n · d²) (n time steps, each computing a d-dimensional matrix-vector product), while the complexity of self-attention is O(n² · d) (at each of the n time steps, d-dimensional dot products or other relevance functions are computed pairwise with every position); n is usually smaller than d.
- Higher parallelism. In an RNN, the computation at the current time step depends on the result of the previous time step; in self-attention, the computation at each time step depends only on the input, not on the output of earlier time steps, so all time steps can be computed fully in parallel.
- Easier learning of long-range dependencies. In an RNN, it takes n steps to build an association between two positions that are n apart; in self-attention, any two positions are directly connected, and the shorter the path, the more easily the signal propagates.

The Transformer structure has been widely used in semantic representation models such as BERT and has achieved remarkable results.
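To make the parallelism point concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (an illustration only, not the project's code): all n positions are attended to through a single pair of matrix multiplications, with no sequential dependence between time steps, and the (n, n) score matrix is where the O(n² · d) cost comes from.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d) matrices for a length-n sequence, one attention head.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (n, n): every position scores every position at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d): weighted sums of the value vectors

n, d = 6, 8
x = np.random.rand(n, d)
print(scaled_dot_product_attention(x, x, x).shape)    # (6, 8); self-attention uses Q = K = V = x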

This example shows how a Transformer model can accomplish machine translation tasks through fine-tuning. The project is based on PaddleNLP.
GitHub address: github.com/PaddlePaddl…
PaddleNLP official documentation: paddlenlp.readthedocs.io
Full code: github.com/PaddlePaddl…

Figure 2: Deep learning task Pipeline

2.1 Data Preprocessing

This course uses the Chinese-English data from the CWMT dataset as the training corpus. The CWMT dataset contains over 9 million samples of high quality, making it well suited for training Transformer machine translation models. The Chinese side requires Jieba word segmentation followed by BPE; the English side requires BPE only. Byte Pair Encoding (BPE) has two advantages: it compresses the vocabulary, and it alleviates the OOV (out-of-vocabulary) problem to some extent.

Figure 3: Learn BPE

Figure 4: Apply BPE
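To make the learn-BPE/apply-BPE steps above more tangible, here is a toy, self-contained sketch of the core BPE merge loop (following the example from the original BPE paper by Sennrich et al.). It is only an illustration, not the project's preprocess.sh, which runs jieba segmentation and BPE on the real corpus.

import re
import collections

def get_stats(vocab):
    # Count the frequency of every adjacent symbol pair in the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Merge the chosen symbol pair into a single symbol everywhere it occurs.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

# Toy corpus: words as space-separated symbols with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for i in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print('merge %d: %s' % (i, str(best)))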

# Custom function for reading the local dataset
def read(src_path, tgt_path, is_predict=False):
    if is_predict:
        # Prediction: only the source file is read; the target is left empty
        with open(src_path, 'r', encoding='utf8') as src_f:
            for src_line in src_f.readlines():
                src_line = src_line.strip()
                if not src_line:
                    continue
                yield {'src': src_line, 'tgt': ''}
    else:
        # Training/validation: read the source and target files in parallel
        with open(src_path, 'r', encoding='utf8') as src_f, open(tgt_path, 'r', encoding='utf8') as tgt_f:
            for src_line, tgt_line in zip(src_f.readlines(), tgt_f.readlines()):
                src_line = src_line.strip()
                if not src_line:
                    continue
                tgt_line = tgt_line.strip()
                if not tgt_line:
                    continue
                yield {'src': src_line, 'tgt': tgt_line}

# Filter out samples shorter than min_len or longer than max_len
# (+1 accounts for the special token added to src and tgt)
def min_max_filer(data, max_len, min_len=0):
    data_min_len = min(len(data[0]), len(data[1])) + 1
    data_max_len = max(len(data[0]), len(data[1])) + 1
    return (data_min_len >= min_len) and (data_max_len <= max_len)

# Run the data preprocessing script
!bash preprocess.sh
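A quick, toy check of the length filter defined above (the token lists here are made up for illustration):

print(min_max_filer((['we', 'eat', 'fish', '.'], ['我们', '吃', '鱼', '。']), max_len=256))  # True: max length 4 + 1 <= 256
print(min_max_filer((['a'] * 300, ['b'] * 10), max_len=256))                                # False: 300 + 1 > 256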

2.2 Constructing the Dataloader

We define create_data_loader to create the DataLoader objects for the training set and the validation set; a DataLoader produces data batch by batch. The PaddleNLP built-in utilities called in this function are briefly explained below:

paddlenlp.data.Vocab.load_vocabulary: the Vocab class wraps a vocabulary together with methods for mapping between text tokens and ids; vocabularies can be built from files, dictionaries, JSON, and other sources.
paddlenlp.datasets.load_dataset: when creating a dataset from local files, the recommended way is to write a read function matching the format of the local data and pass it into load_dataset() to create the dataset.
paddlenlp.data.Pad: the padding operation, used to align the lengths of sentences within the same batch.
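As a minimal sketch of what Pad does on a batch of toy id lists (pad_val stands in for the padding id; the project's collate function below uses it the same way):

from paddlenlp.data import Pad

word_pad = Pad(pad_val=0)
print(word_pad([[2, 5, 7, 9], [2, 5], [2, 5, 7]]))
# [[2 5 7 9]
#  [2 5 0 0]
#  [2 5 7 0]]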

Figure 6: The process for constructing a Dataloader

Figure 7: Dataloader details

# Create the dataloaders for the training and validation sets; the test-set dataloader is similar.
def create_data_loader(args):
    # Create the datasets from local files via paddlenlp.datasets.load_dataset
    train_dataset = load_dataset(
        read,
        src_path=args.training_file.split(',')[0],
        tgt_path=args.training_file.split(',')[1],
        lazy=False)
    dev_dataset = load_dataset(
        read,
        src_path=args.training_file.split(',')[0],
        tgt_path=args.training_file.split(',')[1],
        lazy=False)

    # Create the vocabularies from local files via paddlenlp.data.Vocab.load_vocabulary
    src_vocab = Vocab.load_vocabulary(
        args.src_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])
    trg_vocab = Vocab.load_vocabulary(
        args.trg_vocab_fpath,
        bos_token=args.special_token[0],
        eos_token=args.special_token[1],
        unk_token=args.special_token[2])

    # Pad the vocabulary size up to a multiple of pad_factor to speed up the Transformer.
    padding_vocab = (
        lambda x: (x + args.pad_factor - 1) // args.pad_factor * args.pad_factor
    )
    args.src_vocab_size = padding_vocab(len(src_vocab))
    args.trg_vocab_size = padding_vocab(len(trg_vocab))

    def convert_samples(sample):
        source = sample['src'].split()
        target = sample['tgt'].split()
        # Convert tokens to ids
        source = src_vocab.to_indices(source)
        target = trg_vocab.to_indices(target)
        return source, target

    # Dataloaders for the training set and the validation set
    data_loaders = []
    for i, dataset in enumerate([train_dataset, dev_dataset]):
        # Convert sample tokens to ids via the dataset's map method;
        # filter out unqualified samples via the filter method.
        dataset = dataset.map(convert_samples, lazy=False).filter(
            partial(min_max_filer, max_len=args.max_length))
        # BatchSampler groups samples into batches
        batch_sampler = BatchSampler(
            dataset, batch_size=args.batch_size, shuffle=True, drop_last=False)
        # Construct the Dataloader used to iterate over data during training/validation
        data_loader = DataLoader(
            dataset=dataset,
            batch_sampler=batch_sampler,
            collate_fn=partial(
                prepare_train_input,
                bos_idx=args.bos_idx,
                eos_idx=args.eos_idx,
                pad_idx=args.bos_idx),
            num_workers=0,
            return_list=True)
        data_loaders.append(data_loader)
    return data_loaders

def prepare_train_input(insts, bos_idx, eos_idx, pad_idx):
    # paddlenlp.data.Pad is used to align the sample lengths within the same batch
    word_pad = Pad(pad_idx)
    src_word = word_pad([inst[0] + [eos_idx] for inst in insts])
    trg_word = word_pad([[bos_idx] + inst[1] for inst in insts])
    # Expand the dimensions for the subsequent loss computation
    lbl_word = np.expand_dims(
        word_pad([inst[1] + [eos_idx] for inst in insts]), axis=2)
    data_inputs = [src_word, trg_word, lbl_word]
    return data_inputs

2.3 Building the Model

PaddleNLP provides the following APIs for Transformer:

paddlenlp.transformers.TransformerModel: the implementation of the Transformer model
paddlenlp.transformers.InferTransformerModel: the Transformer model used for generation tasks
paddlenlp.transformers.CrossEntropyCriterion: computation of the cross-entropy loss
paddlenlp.transformers.position_encoding_init: initialization of the Transformer positional encoding

Figure 8: Model building

Figure 9: Schematic diagram of Encoder-Decoder
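The training code below instantiates TransformerModel directly; for prediction, the InferTransformerModel listed above is used instead. The following is only a rough sketch of that call, reusing the argument names from do_train and assuming two extra generation settings (beam_size, max_out_len); check the project's full prediction script for the exact signature.

# A hedged sketch, not the project's predict code; beam_size and max_out_len are assumed names.
transformer = InferTransformerModel(
    src_vocab_size=args.src_vocab_size,
    trg_vocab_size=args.trg_vocab_size,
    max_length=args.max_length + 1,
    n_layer=args.n_layer,
    n_head=args.n_head,
    d_model=args.d_model,
    d_inner_hid=args.d_inner_hid,
    dropout=args.dropout,
    weight_sharing=args.weight_sharing,
    bos_id=args.bos_idx,
    eos_id=args.eos_idx,
    beam_size=args.beam_size,
    max_out_len=args.max_out_len)
transformer.eval()
# At inference time only src_word is fed in; the model returns the id sequences
# of the candidate translations, which are then converted back to tokens.
# finished_seq = transformer(src_word=src_word)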

2.4 Training the Model

Run the do_train function. In do_train, the optimizer, the loss function, and the evaluation metric (perplexity) are configured. Perplexity is commonly used to measure the quality of a language model, i.e., the fluency of sentences, and is widely used in fields such as machine translation and text generation. The smaller the perplexity, the more fluent the sentences and the better the language model.
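For reference, perplexity is the exponential of the average per-token negative log-likelihood, which is why the training log below reports ppl alongside avg loss:

\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(w_i \mid w_{<i}\right)\right) = \exp(\text{avg loss})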

Figure 10: Training model

def do_train(args):
    random_seed = eval(str(args.random_seed))
    if random_seed is not None:
        paddle.seed(random_seed)

    # Get the dataloaders
    (train_loader), (eval_loader) = create_data_loader(args)

    # Declare the Transformer model
    transformer = TransformerModel(
        src_vocab_size=args.src_vocab_size,
        trg_vocab_size=args.trg_vocab_size,
        max_length=args.max_length + 1,
        n_layer=args.n_layer,
        n_head=args.n_head,
        d_model=args.d_model,
        d_inner_hid=args.d_inner_hid,
        dropout=args.dropout,
        weight_sharing=args.weight_sharing,
        bos_id=args.bos_idx,
        eos_id=args.eos_idx)

    # Define the loss
    criterion = CrossEntropyCriterion(args.label_smooth_eps, args.bos_idx)

    # Define the learning-rate decay strategy
    scheduler = paddle.optimizer.lr.NoamDecay(
        args.d_model, args.warmup_steps, args.learning_rate, last_epoch=0)

    # Define the optimizer
    optimizer = paddle.optimizer.Adam(
        learning_rate=scheduler,
        beta1=args.beta1,
        beta2=args.beta2,
        epsilon=float(args.eps),
        parameters=transformer.parameters())

    step_idx = 0
    # Training loop
    for pass_id in range(args.epoch):
        batch_id = 0
        for input_data in train_loader:
            # Get one batch of data from the dataloader
            (src_word, trg_word, lbl_word) = input_data
            # Get the logits output by the model
            logits = transformer(src_word=src_word, trg_word=trg_word)
            # Compute the loss
            sum_cost, avg_cost, token_num = criterion(logits, lbl_word)
            # Compute gradients
            avg_cost.backward()
            # Update parameters
            optimizer.step()
            # Clear gradients
            optimizer.clear_grad()
            batch_id += 1
            step_idx += 1
            scheduler.step()

do_train(args)

[2021-06-18 22:38:55,597] [INFO] - step_idx: 0, epoch: 0, batch: 0, avg loss: 10.513082, ppl: 36793.687500
[2021-06-18 22:38:56,783] [INFO] - step_idx: 9, epoch: 0, batch: 9, avg loss: 10.506249, ppl: 36543.164062
[2021-06-18 22:38:58,032] [INFO] - step_idx: 19, epoch: 0, batch: 19, avg loss: 10.464736, ppl: 35057.187500
[2021-06-18 22:38:59,032] [INFO] - validation, step_idx: 19, avg loss: 10.454649, ppl: 34705.347656
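A quick sanity check of the relation between the logged avg loss and ppl, using the numbers from the first log line above:

import math
print(math.exp(10.513082))  # ≈ 36793.7, matching the logged ppl of 36793.687500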

2.5 Prediction and Evaluation

The final quality of the trained model is usually measured on a test set; in machine translation, the BLEU score is commonly computed. Each line of the prediction output is the highest-scoring translation for the corresponding input line. Because the data are BPE-encoded, the predicted translations are also in BPE form and must be restored to the original data (i.e., the tokenized data) before they can be evaluated correctly.

Figure 11: Prediction and evaluation
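A minimal sketch of that restoration step, assuming the standard '@@ ' continuation marker that subword-nmt inserts between subword units (an illustration only; the example sentence is made up, not taken from the project's data):

def restore_bpe(line):
    # Join subword pieces back into whole tokens by removing the '@@ ' marker.
    return line.replace('@@ ', '')

pred = 'the two friend@@ s are talking about the hol@@ iday .'
print(restore_bpe(pred))  # the two friends are talking about the holiday .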

Wouldn’t it be fun to give it a try? We strongly recommend that beginners type out the code above by hand, because that is the best way to deepen your understanding of it. The project's code: aistudio.baidu.com/aistudio/pr… Customize your own translation system. For more information on PaddleNLP, visit GitHub and give it a star: github.com/PaddlePaddl…