Project introduction

“Hand in Hand to Learn NLP” is a series of hands-on tutorials based on PaddleNLP. Produced by senior Baidu engineers, the series covers the full range of practical projects, from word vectors and language model training to information extraction, sentiment analysis, text question answering, structured-data question answering, text translation, machine simultaneous interpretation, and dialogue systems. It is designed to help developers gain a comprehensive grasp of how to use Baidu's PaddlePaddle framework in the NLP field, and to apply the PaddlePaddle framework and PaddleNLP flexibly in their own NLP deep learning practice.

In June, Baidu PaddlePaddle and the Natural Language Processing Department jointly launched 12 NLP video lessons that explain these practical projects in detail.

To watch the course replays, please visit: https://aistudio.baidu.com/aistudio/course/introduce/24177

You are also welcome to join the course QQ group (group number: 758287592) to communicate ~

What is intent recognition

Intent recognition refers to analyzing the core of a user's need and returning the information most relevant to the input query. For example, in search, needs such as looking for a movie, tracking a package, or handling municipal affairs call for very different underlying retrieval strategies, so a misidentified intent almost certainly returns content that cannot satisfy the user, producing a very poor user experience. Likewise, understanding exactly what the other party is trying to express during a conversation is a very challenging task.

For example, if a user enters the query “Xianjian Qixia Zhuan” (The Legend of Sword and Fairy), we know the name may refer to news, pictures, a game, a TV series, and so on. If intent recognition tells us that the user wants to watch the “Xianjian Qixia Zhuan” TV series, we can return the TV series directly as the result, saving the user clicks and search time and greatly improving the user experience. Similarly, if the other party says, “My Apple never lags,” intent recognition tells us that the “Apple” here is an electronic device rather than a fruit, and the conversation can proceed smoothly.

In short, the accuracy of intent recognition largely determines the accuracy of search and the intelligence of a dialogue system.

This example shows how to use the ERNIE pre-trained model to complete the slot filling and intent recognition tasks in task-oriented dialogue; these two tasks are the building blocks of a pipeline-based task-oriented dialogue system.

The dataset used in this example is CrossWOZ, a Chinese dialogue dataset that covers multiple domains, including attractions, restaurants, hotels, transportation, and more.

A quick practice

This project is based on PaddleNLP. Remember to give PaddleNLP a small Star⭐

Open source is not easy, and we hope for your support ~

GitHub address:

https://github.com/PaddlePaddle/PaddleNLP

PaddleNLP documentation:

https://paddlenlp.readthedocs.io

As with most NLP tasks, this example proceeds in the following four steps:

2.1 Data preparation

The data preparation process is as follows:

1. Customize the dataset with load_dataset()

The dataset preprocessed with the official script has been uploaded to this project on AI Studio (the project is linked at the end of this article). After inspecting the dataset's format, we can write a data-file reading function and pass it into load_dataset() to create the dataset.
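As a minimal sketch of this step (the file path and field names below are hypothetical illustrations; check the actual format of the preprocessed files in the AI Studio project):

```python
import json
from paddlenlp.datasets import load_dataset

# Hypothetical reader for a JSON-lines file: one utterance per line with
# its tokens, per-token slot labels, and sentence-level intent labels.
def read_example(data_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            example = json.loads(line)
            yield {
                'words': example['words'],
                'slots': example['slots'],
                'intents': example['intents'],
            }

# lazy=False makes load_dataset() return a MapDataset (used in step 3 below)
train_ds = load_dataset(read_example, data_path='train.json', lazy=False)
```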

2. Load paddlenlp.transformers.ErnieTokenizer for data processing. Before the input text can be fed into the pre-trained model, it must be processed into features. This usually includes steps such as tokenization, token-to-id conversion, and adding special tokens.

PaddleNLP has a built-in tokenizer for each of its pre-trained models; specify the name of the model you want to use to load the corresponding tokenizer.

All of this can be done simply by calling the tokenizer's methods.
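For example, a minimal sketch using the ernie-1.0 tokenizer:

```python
from paddlenlp.transformers import ErnieTokenizer

# Load the tokenizer that matches the pre-trained model by name
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

# One call performs tokenization, token-to-id conversion,
# and insertion of the special [CLS]/[SEP] tokens.
encoded = tokenizer('我想找一家性价比高的酒店')
print(encoded['input_ids'])       # token ids, including [CLS] and [SEP]
print(encoded['token_type_ids'])  # segment ids (all 0 for a single sentence)
```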

3. Call the map() method to process the dataset in bulk

Since we passed in lazy=False, the dataset we customized with load_dataset() is a MapDataset object.

MapDataset is an enhanced version of paddle.io.Dataset. Its built-in map() method is well suited to bulk dataset processing.

The map() method takes a data-processing function as an argument, and it works seamlessly with the tokenizer, as the sketch below shows.
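A sketch of this step, assuming the hypothetical fields from the reader above; convert_example here is an illustrative processing function, not the project's exact one:

```python
from functools import partial

def convert_example(example, tokenizer, slot2id, intent2id, max_seq_len=128):
    # Tokenize the pre-split words and convert them to ids
    encoded = tokenizer(example['words'],
                        is_split_into_words=True,
                        max_seq_len=max_seq_len)
    # Map slot labels to ids (alignment with [CLS]/[SEP] omitted for brevity)
    slot_ids = [slot2id[s] for s in example['slots']][:max_seq_len - 2]
    # Multi-hot intent vector for the multi-label intent task
    intent_labels = [0.0] * len(intent2id)
    for name in example['intents']:
        intent_labels[intent2id[name]] = 1.0
    return encoded['input_ids'], encoded['token_type_ids'], slot_ids, intent_labels

trans_func = partial(convert_example, tokenizer=tokenizer,
                     slot2id=slot2id, intent2id=intent2id)
train_ds.map(trans_func)
```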

4. Batchify and data loading

Use paddle.io.BatchSampler and the methods provided in paddlenlp.data to group the data into batches.

Then use the paddle.io.DataLoader interface to load the data asynchronously with multiple workers.

(Figure: batchify features in detail)
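Under the same assumed four fields as the sketches above, batching and loading might look like this:

```python
import paddle
from paddlenlp.data import Stack, Pad, Tuple

# One batchify function per field: pad the variable-length sequences,
# stack the fixed-size multi-hot intent labels.
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Pad(axis=0, pad_val=0),                            # slot_ids
    Stack(dtype='float32'),                            # intent_labels
): fn(samples)

batch_sampler = paddle.io.BatchSampler(train_ds, batch_size=32, shuffle=True)
train_data_loader = paddle.io.DataLoader(dataset=train_ds,
                                         batch_sampler=batch_sampler,
                                         collate_fn=batchify_fn,
                                         return_list=True)
```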

The dataset preparation is now complete. Next, we need to build the network and design the loss function.

2.2 Model structure

Taking ERNIE as an example, this project describes how to fine-tune a pre-trained model with multi-task learning to accomplish the intent recognition and slot filling tasks at the same time.

Intent recognition and slot filling in this example are essentially a sentence classification task and a sequence labeling task, respectively. Multi-task learning is realized by combining the losses of the two.

1. Load the JointErnie model.

```python
from src.models import JointErnie

model = JointErnie.from_pretrained('ernie-1.0',
                                   intent_dim=len(intent2id),
                                   slot_dim=len(slot2id),
                                   dropout=0.1,
                                   use_history=use_history)
```

2. Design the loss function. The JointErnie model takes the sequence_output of ErnieModel and feeds it to a linear layer whose output dimension is the number of slot classes, producing slot_logits; it feeds pooled_output to a linear layer whose output dimension is the number of intent classes, producing intent_logits.
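The following is a simplified sketch of what such a joint head looks like; the project's actual JointErnie (in src.models) also supports dialogue history via use_history, which is omitted here:

```python
import paddle.nn as nn
from paddlenlp.transformers import ErnieModel

class JointErnieSketch(nn.Layer):
    def __init__(self, ernie, intent_dim, slot_dim, dropout=0.1):
        super().__init__()
        self.ernie = ernie
        self.dropout = nn.Dropout(dropout)
        hidden_size = ernie.config['hidden_size']
        self.intent_classifier = nn.Linear(hidden_size, intent_dim)
        self.slot_classifier = nn.Linear(hidden_size, slot_dim)

    def forward(self, input_ids, token_type_ids=None):
        sequence_output, pooled_output = self.ernie(
            input_ids, token_type_ids=token_type_ids)
        # per-token slot scores: [batch_size, seq_len, slot_dim]
        slot_logits = self.slot_classifier(self.dropout(sequence_output))
        # sentence-level intent scores: [batch_size, intent_dim]
        intent_logits = self.intent_classifier(self.dropout(pooled_output))
        return slot_logits, intent_logits

# e.g. JointErnieSketch(ErnieModel.from_pretrained('ernie-1.0'),
#                       intent_dim=len(intent2id), slot_dim=len(slot2id))
```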

The loss in this example therefore consists of slot_loss and intent_loss, and we need to define the loss function ourselves.

Slot filling is equivalent to a multi-class classification task at each token position, while intent recognition is a multi-label classification task over the whole sentence (an utterance may carry several intents at once). The designed loss function is as follows:

```python
class NLULoss(paddle.nn.Layer):
    def __init__(self, pos_weight):
        super(NLULoss, self).__init__()
        self.intent_loss_fn = paddle.nn.BCEWithLogitsLoss(
            pos_weight=paddle.to_tensor(pos_weight))
        self.slot_loss_fct = paddle.nn.CrossEntropyLoss()

    def forward(self, logits, slot_labels, intent_labels):
        slot_logits, intent_logits = logits
        slot_loss = self.slot_loss_fct(slot_logits, slot_labels)
        intent_loss = self.intent_loss_fn(intent_logits, intent_labels)
        return slot_loss + intent_loss
```

After choosing the network structure, we need to set up the fine-tune optimization strategy.

2.3 Set the fine-tune optimization strategy

Transformer models such as ERNIE/BERT are typically fine-tuned with a dynamic learning rate schedule that includes a warmup phase.

(Figure: dynamic learning rate schedule)

```python
# maximum learning rate during training
learning_rate = 3e-5
# number of training epochs
epochs = 10
# proportion of steps used for learning rate warmup
warmup_proportion = 0.0
# weight decay coefficient, similar to a model regularization term, to avoid overfitting
weight_decay = 0.0
max_grad_norm = 1.0
num_training_steps = len(train_data_loader) * epochs

# learning rate decay schedule
lr_scheduler = paddlenlp.transformers.LinearDecayWithWarmup(
    learning_rate, num_training_steps, warmup_proportion)

# exclude bias and layer-norm parameters from weight decay
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params,
    grad_clip=paddle.nn.ClipGradByGlobalNorm(max_grad_norm))
```

Now everything is ready for us to start training the model.

2.4 Model training and evaluation

The process of model training usually has the following steps:

  • Take a batch of data from the dataloader;
  • Feed the batch to the model for the forward computation;
  • Pass the forward computation results to the loss function to compute the loss;
  • Backpropagate the loss and update the parameters; then repeat the above steps (see the sketch after this list).
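A minimal sketch of this loop under the setup above (the batch field order follows the batchify sketch; the project's actual training script adds logging, checkpointing, and the dialogue-history inputs):

```python
criterion = NLULoss(pos_weight)  # pos_weight prepared for the intent labels

model.train()
for epoch in range(1, epochs + 1):
    for batch in train_data_loader:
        input_ids, token_type_ids, slot_labels, intent_labels = batch
        # forward computation
        logits = model(input_ids, token_type_ids=token_type_ids)
        # compute the joint loss
        loss = criterion(logits, slot_labels, intent_labels)
        # backpropagate and update parameters
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
```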

After each epoch of training, the program calls the evaluation() method to compute the F1 score for both tasks.
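For reference, here is a hedged sketch of how the slot F1 could be computed with paddlenlp.metrics.ChunkEvaluator; the project's evaluation() may differ, for example in how the intent F1 is scored:

```python
import paddle
from paddlenlp.metrics import ChunkEvaluator

slot_metric = ChunkEvaluator(label_list=list(slot2id.keys()))

@paddle.no_grad()
def evaluate(model, data_loader):
    model.eval()
    slot_metric.reset()
    for input_ids, token_type_ids, slot_labels, intent_labels in data_loader:
        slot_logits, _ = model(input_ids, token_type_ids=token_type_ids)
        preds = slot_logits.argmax(axis=-1)
        # valid token count per sequence (ERNIE's pad token id is 0)
        lengths = paddle.sum(paddle.cast(input_ids != 0, 'int64'), axis=-1)
        n_infer, n_label, n_correct = slot_metric.compute(lengths, preds, slot_labels)
        slot_metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
    precision, recall, f1 = slot_metric.accumulate()
    model.train()
    return f1
```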

Give it a try

Did you find this interesting? We strongly recommend that beginners follow the code above and type it out themselves, because that is the best way to deepen your understanding of the code.

Corresponding codes of this project:

https://aistudio.baidu.com/aistudio/projectdetail/2017202

For more information about PaddleNLP, please visit and Star the project on GitHub:

https://github.com/PaddlePaddle/PaddleNLP

The Baidu AI developer community (https://ai.baidu.com/forum) provides developers across the country with a platform to communicate, share, and solve problems, so that developers no longer walk the R&D road “alone” and can find better technical solutions through constant exchange and discussion. If you want to try a variety of artificial intelligence technologies and explore application scenarios, join the Baidu AI community as soon as possible. Everything you imagine about AI can be realized here!

Scan the QR code below to add the assistant on WeChat and receive benefits such as JD gift cards, custom merchandise, mystery gift boxes, and suitcases ~