**Abstract:** This article focuses on the core technologies of dialogue robots and their application in industry, and introduces the latest progress of Huawei Cloud's dialogue robots in the directions of multimodality, few-shot learning, and pre-training.

It has been nearly 70 years since the Turing test was proposed in 1950, and dialogue system technology has developed rapidly over that time. Methods have evolved from the original hand-written rules to today's deep learning approaches, greatly improving the robustness and accuracy of dialogue systems. In 2020, the number of papers related to dialogue systems at ACL, the top conference in natural language processing, was the highest in its history, further confirming the great attention dialogue systems have received in recent years.

This article focuses on the core technologies of dialogue robots and their application in industry, and introduces the latest progress of Huawei Cloud's dialogue robots in multimodality, few-shot learning, and pre-training. It is organized into the following five parts:

1. Introduction and brief history of dialogue robots
2. Natural language understanding in dialogue robots
3. Dialogue management in dialogue robots
4. Progress of multimodal dialogue robots
5. Future directions and summary

1. Introduction and brief history of dialogue robots

Looking back at the development of dialogue robots, the first thing to mention is the famous Turing test. In 1950, Turing published a paper called "Computing Machinery and Intelligence", which proposed the first criterion for evaluating artificial intelligence, known as the Turing test. The idea is that a tester and a testee, usually a person and a machine, are separated from each other, and the tester asks random questions through some device. If, after a period of communication, more than 30% of testers cannot tell whether the answers came from a human or a machine, the machine passes the test and is considered to possess some human intelligence. Although there is much controversy about using the Turing test to evaluate dialogue systems, its idea has guided the development of dialogue systems for decades.

The first human-computer dialogue system after the Turing test was ELIZA, written by Weizenbaum at MIT between 1964 and 1966. ELIZA was primarily used in clinical practice, to mimic a therapist and provide counseling to patients. It relied only on simple keyword matching, yet its responses were surprisingly convincing. Later, in 1995, ALICE, a very clever and popular dialogue robot, was born. ALICE won the Loebner Prize three times; the Loebner Prize is a major AI competition that uses a standardized Turing test to select the most human-like program. Why were ALICE's results so impressive? The main reason is that it used the AIML language, which gave it a great competitive advantage over similar products at the time.

To sum up the dialogue robots of this period: they were essentially built on keyword recognition or the rules of expert systems. However, as these rule systems evolved, their bottlenecks gradually appeared, and data-driven approaches were widely studied and gradually applied to dialogue systems. The dialogue systems of the next period were driven primarily by a natural language understanding module combined with dialogue management based on reinforcement learning. Two typical works illustrate this direction: from 2005 to 2013, Professor Steve Young of Cambridge University proposed dialogue management based on POMDPs and a dialogue system based on a pipeline architecture.

During this period, natural language understanding methods based on machine learning flourished, and many classical machine learning models emerged. Steve Young, mentioned above, whose work underpins Apple's Siri, laid a very solid foundation for subsequent research on deep learning dialogue systems, including their application and deployment, and formalized many of the classic problems of dialogue systems. However, traditional machine learning soon hit a bottleneck of its own: accuracy, especially in speech recognition and image classification, could not be greatly improved. So the third generation of research turned to technologies based on big data and deep learning. The assistant robots we are now familiar with, such as Amazon Alexa, Google Home, and Siri, mainly rely on deep learning for intent recognition and language understanding. Deep learning also made end-to-end dialogue systems feasible.

In recent years, end-to-end dialogue systems have received more and more attention and investment. Since 2017, dialogue systems have been applied in industry on a large scale; some call 2017 the first year of the dialogue robot.

So why do we need dialogue systems and dialogue robots? What is the point of a conversation bot, and why study it? The answer starts with the huge demand for dialogue robots.

Demand comes mainly from two directions: to-B and to-C. A typical to-B scenario is enterprise customer service, where much of the work is simple and repetitive, so human agents can be replaced by automatic customer-service robots. The second is the office assistant. Huawei WeLink, office software similar to DingTalk and WeChat Work, can help people access applications: such an office assistant can book flights and set up schedules. Another direction is marketing, where robots help enterprises promote, sell, and introduce products.

A typical to-C application is the personal assistant, especially smart speakers in the home, which now have a large user base. There are also emotional-care needs for specific groups such as the elderly and children, for which emotional-care robots have been developed. There are even robots that can study alongside children, teach them lessons, and join in recreational activities.

So what is a dialogue robot? The word "robot" may first bring to mind a physical robot. Indeed, physical robots can carry out human-machine dialogue: USTC's KeJia robot, for example, supports multimodal interaction and can provide emotional care, and it can even control household appliances on instruction, operating a refrigerator or a microwave and manipulating objects in its environment according to a person's commands. Japan's ASIMO robot is another example of this kind of physical hardware robot. Then there are virtual software robots that can be deployed on operating systems, like Microsoft's Cortana, or on hardware and phones, like Siri or Amazon Alexa.

To sum up, the main purpose of a dialogue robot is to help users complete tasks through multiple rounds of dialogue, or to maintain continuous and effective communication with users, and it can be deployed on a wide range of hardware devices.

Here, dialogue robots are divided into two categories: task-completion dialogue robots and chitchat dialogue robots. The diagram above compares the two types, which we might call rational robots and emotional robots.

A task-completion dialogue robot is the more rational of the two: it needs to complete tasks, and it often has to invoke a knowledge repository or the APIs behind a service. The emotional robot is the chitchat robot: at the product level it is more affective and needs to understand some of the user's emotions. Task-completion robots tend to have specific goals, since they must complete specific tasks; a chitchat robot usually has no specific goal and will keep talking with you. As for the number of dialogue turns, a task-completion robot wants as few turns as possible, because fewer turns means achieving the goal faster, while a chitchat robot wants to hold longer and longer conversations and keep the communication going.

A task-completion dialogue robot usually contains multiple modules and can adopt rule-based or statistical learning methods, while a chitchat robot usually adopts retrieval or sequence-to-sequence generation methods; this is the difference between the two. The following focuses on task-completion robots.

Seventy years after the Turing test, the field of dialogue robots is still very challenging. The challenges can be summarized as follows:

First, the diversity of language is very complex: one meaning can be expressed in many ways. Conversely, the same expression may mean different things in different contexts, which is linguistic ambiguity.

The diversity and ambiguity of language pose great challenges to the development of dialogue robots.

Then there is semantic representation. The machine must first understand language, but the symbols of language cannot be understood directly; they must be converted into the machine's internal representation. So how should an internal representation be defined, and how rich should it be? The richer the representation, the weaker the learning ability may be; the weaker the representation, the faster the learning may be. How should this be balanced?

Next is the robustness of the system, i.e. the balance between precision and recall. Dialogue robots also face a data problem, especially in to-B scenarios where data is extremely scarce: without data, how do you train, how do you tune the model, and how do you ensure robustness? There is also the current interpretability problem of deep learning and the bridging of symbolic and contextual knowledge. When a robot talks to a human, the conversation usually rests on common knowledge: everyone knows that the capital of China is Beijing, but if a robot does not know this, how can it continue to communicate with a human?

The figure above shows the framework commonly used by dialogue robots, which consists of three main modules. The first is natural language understanding, whose purpose is to translate natural language text into the machine's internal semantic representation. Task-oriented dialogue usually assumes that the semantic representation is composed of three semantic elements: a domain, an intent, and a slot. A domain usually has multiple intents: in the weather domain, for example, you might check the weather, the temperature, or the wind direction, all different intents. An intent usually has multiple slots: when I say "check the weather", what exactly am I checking? There could be a time and a place. In a task-oriented dialogue, a slot can be thought of as a key information concept, like a keyword: a time, a place, or any user-defined entity type.

For example, when a user says "What's the weather like in Shenzhen today?", the task of natural language understanding is to identify the domain and intent of the sentence. The output domain is weather, the intent is to check the weather, the time is today, and the place is Shenzhen. In real applications, "today" often needs to be translated into a concrete time expression, such as August 26, 2020, to make it easy to interface with backend systems.
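To make this concrete, here is a minimal sketch of such a semantic frame in Python; the class and slot names are illustrative, not Huawei Cloud's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    """Domain / intent / slots, the three semantic elements described above."""
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)

# NLU output for "What's the weather like in Shenzhen today?",
# with "today" already normalized to a concrete date.
frame = SemanticFrame(
    domain="weather",
    intent="check_weather",
    slots={"time": "2020-08-26", "location": "Shenzhen"},
)
print(frame)
```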

After the natural language understanding module comes the dialogue management module, which contains two sub-modules: dialogue state tracking and dialogue policy. The input to dialogue management is the output of the natural language understanding module; the output is an action that indicates what the system should do and what should be replied to the user. This action is usually formal, structured content, so it then passes through a natural language generation module.

The purpose of the natural language generation module is to translate the output of dialogue management into a natural language description the user can understand. At this point it generates a reply such as: "OK, the weather in Shenzhen today is sunny, with a temperature of 20-30 degrees Celsius." This completes a very typical dialogue robot framework.

To elaborate, dialogue management is subdivided into dialogue state tracking and dialogue policy modules. Dialogue state tracking takes the result of natural language understanding and updates the machine's internal state: where the dialogue has moved to and what has happened to the value of each slot. In the example, we now know the time is today and the location is Shenzhen; before this information arrived, time and place were empty and unknown, and on receiving it the tracker updates time to today and location to Shenzhen. That is what dialogue state tracking does. The dialogue policy then selects an action according to this state, and the action is fed back to the user. As shown in the figure, an Inform action is generated from the resulting state.
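A minimal sketch of this tracking-plus-policy loop, with illustrative slot names and action formats (not a real product API):

```python
REQUIRED_SLOTS = ["time", "location"]

def track_state(state: dict, nlu_result: dict) -> dict:
    """Dialogue state tracking: overwrite slot values with the new NLU output."""
    state = dict(state)
    state.update({k: v for k, v in nlu_result["slots"].items() if v})
    return state

def select_action(state: dict) -> dict:
    """Dialogue policy: request the first unfilled slot, otherwise inform."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return {"action": "request", "slot": slot}
    return {"action": "inform", "slots": state}

state = track_state({}, {"slots": {"time": "today", "location": "Shenzhen"}})
print(select_action(state))  # all slots filled, so an Inform action
```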

2. Natural language understanding in dialogue robots

So what are Huawei Cloud's practice and progress in natural language understanding? Let's start with the natural language understanding module of the dialogue robot.

The natural language understanding module comprises three tasks: domain recognition, intent recognition, and slot filling.

Domain recognition and intent recognition are in fact the same kind of task: classification. The circles in the figure above show some of the typical algorithms involved in domain and intent recognition. In the lower left corner are rule-based methods, mentioned earlier in the history of dialogue robots, mainly keyword recognition, regular expressions, and context-free grammars. These are still used by industrial robot platforms today.

In the upper left corner of the figure are traditional machine learning methods such as SVMs, decision trees, and even logistic regression. Later, deep learning brought TextCNN, FastText, and RCNN. In recent years pre-training has become popular, and even the paradigm of the classification task has changed: pre-trained models plus fine-tuning, like BERT and Huawei's NEZHA, handle it very well.

For platform-level scenarios, especially to-B scenarios, there are many different situations: some enterprises may have no data, some may have little, and some do accumulate a lot of data as logs are generated. Given such varied data conditions, it is not feasible to apply BERT or another pre-trained model from the start.

We explored these different situations. First, with no samples at all, how can domain recognition be done? Huawei Cloud's dialogue robot platform provides custom rules: once a rule is configured, it can generalize to recognize a large amount of text. The rules support wildcards and allow configuring slot fields, including common built-in fields and the user's own dictionaries. The right side of the figure gives some examples. With these rule configurations, we can do a cold start, which is a great help even when the user has no training data.
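As a rough illustration of how such rule-based cold start might work, here is a sketch using regular expressions with a slot-dictionary placeholder; the pattern syntax is an assumption, not the platform's actual rule language:

```python
import re

CITY_DICT = ["Shenzhen", "Beijing", "Shanghai"]  # a user-supplied dictionary

RULES = {
    "check_weather": [r".*weather.*(in|at)\s+({city}).*"],
    "book_flight":   [r".*(book|buy).*(ticket|flight).*({city}).*"],
}

def match_intent(text: str):
    city_alt = "|".join(CITY_DICT)
    for intent, patterns in RULES.items():
        for pat in patterns:
            if re.match(pat.format(city=city_alt), text, re.IGNORECASE):
                return intent
    return None  # no rule fires; fall back to a statistical model later

print(match_intent("What is the weather in Shenzhen today?"))  # check_weather
```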

In the second case, when you have a lot of data, how do you choose the best method? Here we use the approach that has become familiar in recent years: pre-training plus fine-tuning, as shown on the right. The basic structure is a Transformer, whose [CLS] token output is followed by a fully connected layer for prediction and classification.
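A minimal fine-tuning sketch of this structure, using the open-source Transformers library; "bert-base-chinese" and the label set are illustrative stand-ins (the production model described here is NEZHA, which follows the same recipe):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=3
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["今天深圳天气怎么样", "帮我订明天去北京的机票"]
labels = torch.tensor([0, 1])  # e.g. check_weather=0, book_flight=1

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # cross-entropy on the [CLS] logits
loss.backward()
optimizer.step()
```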

A large number of experiments confirmed that this approach does work better, for example on Huawei WeLink, the office software on the cloud. WeLink has a set of assistant intents; among more than 80 intents, we allocated 10, 50, then 100 corpus samples per intent, and finally all the corpus. The accuracy kept increasing, and the final result basically reached over 95%. If you have more data, it really does work well. One problem, however, is that deployment costs are high: if every user runs BERT, the cost pressure is very large. And although fine-tuning starts from a pre-trained model, it still requires substantial data.

Is there another way? Yes: model distillation, such as the Tiny-NEZHA distillation in the image above, which distills large models into smaller ones. NEZHA itself is not very different from BERT: both are based on the Transformer structure, but NEZHA has some slight differences. One is that it uses relative position encoding; the second is whole-word masking, masking at the word rather than the character level; it also increases the batch size, uses mixed-precision training, and adopts the LAMB optimizer.

The second is our distillation technology, TinyBERT, which performs distillation in two places: general distillation during pre-training, meaning distillation is done while training the base model, and task-specific distillation, together with some data augmentation. The Chinese model series NEZHA has been open-sourced, and the code and models are publicly downloadable.

How does distillation work? First figure out what to learn, and second, how to learn it. The large teacher model produces many vector representations that the student can learn to imitate, including its hidden states, and each layer can be learned in a different way. At the output layer, the student's prediction logits are fitted to the teacher's logits. In the intermediate layers, including the embedding layer, MSE can be used to keep the student's representations close to the teacher's.
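A sketch of these two objectives in PyTorch, assuming teacher and student expose hidden states and logits (as in TinyBERT-style distillation); a projection layer handles the dimension mismatch between student and teacher:

```python
import torch
import torch.nn.functional as F

def distill_loss(t_hidden, s_hidden, t_logits, s_logits, proj, T=2.0):
    # Intermediate-layer distillation: MSE between (projected) student
    # hidden states and the corresponding teacher layer.
    hidden_loss = F.mse_loss(proj(s_hidden), t_hidden)
    # Prediction-layer distillation: soft cross-entropy against the
    # teacher's temperature-scaled logits.
    soft_targets = F.softmax(t_logits / T, dim=-1)
    logit_loss = -(soft_targets * F.log_softmax(s_logits / T, dim=-1)).sum(-1).mean()
    return hidden_loss + logit_loss

# Toy shapes: batch 8, seq 16, teacher dim 768, student dim 312, 3 classes.
proj = torch.nn.Linear(312, 768)
loss = distill_loss(torch.randn(8, 16, 768), torch.randn(8, 16, 312),
                    torch.randn(8, 3), torch.randn(8, 3), proj)
loss.backward()
```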

Using these methods, many distillation experiments were run for the NLPCC task, including large and small models, tall-thin and short-fat models, with the corresponding results for 4, 6, and 8 layers shown in the table below. The final result is in the upper right corner of the figure: in the NLPCC small-model track, our score reached 77.7 and took first place.

If you need a lightweight model, is there yet another way? For industry, you can combine traditional features with deep features. Traditional features include language models, parts of speech, and entities, together with synonyms and stop words; deep features include Word2vec and shallow deep-learning encoders.

The second problem is how to handle domain and intent recognition in few-shot scenarios without a large amount of data. New categories may be added at any time, each containing only a few samples that cannot be trained together with the previous data.

For this, researchers proposed the concept of few-shot learning, whose goal is to learn from only a handful of samples (perhaps 15) and judge what the category is.

Few-shot learning is divided into two processes. The first is the meta-training stage, which is simple: the base training data is divided into two sets, a support set and a query set. In the support set, each category has very limited data, K samples per category, where K is usually very small, perhaps 1 to 5; the query set can be chosen freely. In the final meta-testing stage, we randomly pick 1 to 5 samples per category, input a query, and see whether the model can predict correctly from these few samples.

There are three types of few-shot learning approaches: model-based like the one we just saw, optimization-based, and metric-based. We explored the metric-based direction. There are many metric methods: MatchingNet is a matching network; ProtoNet differs mainly in how distance is calculated; and there is also RelationNet. We compared them with the traditional BERT pre-training plus fine-tuning approach in the few-shot setting. In comparative experiments with ten categories and five samples each, the accuracy of traditional BERT classification was 83.2%, while the few-shot learning method reached about 93%, so the improvement is really quite large. With ten categories and ten samples, accuracy reached 96%.
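As an illustration of the metric-based idea, here is a minimal prototypical-network episode in PyTorch; the random tensors stand in for sentence embeddings produced by an encoder:

```python
import torch

def proto_classify(support, support_labels, query, n_way):
    """support: (N*K, d) embeddings; query: (Q, d) embeddings."""
    # Each class prototype is the mean of its support embeddings.
    prototypes = torch.stack(
        [support[support_labels == c].mean(0) for c in range(n_way)]
    )                                        # (n_way, d)
    dists = torch.cdist(query, prototypes)   # Euclidean distance to prototypes
    return (-dists).softmax(-1)              # closer prototype = higher probability

# Toy 5-way 5-shot episode with random "embeddings" of dim 64.
support = torch.randn(25, 64)
labels = torch.arange(5).repeat_interleave(5)
probs = proto_classify(support, labels, torch.randn(3, 64), n_way=5)
print(probs.argmax(-1))  # predicted class per query
```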

However, a question arose during the experiments: why can it achieve 96% accuracy? There is a trick behind this, a phenomenon that exists in few-shot learning generally. Under the existing framework, the training and test data of each epoch are randomly sampled: with 2,000 categories, 5 samples are drawn at random. But when the data itself contains a large number of easy samples, such sampling rarely covers the hard samples, so the apparent effect is doubtful. To address this, we ran experiments and proposed a method combining few-shot learning with curriculum learning. The method has several parts. The first is difficulty assessment: we can use BM25 or TF-IDF to calculate the gap between samples and select the hard ones to study together. The other part is data partitioning, which groups data of similar difficulty together.
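A sketch of the difficulty-assessment step, using TF-IDF similarity to samples of *other* classes as a difficulty proxy (near-duplicates across classes count as hard); the exact scoring and bucketing here are illustrative assumptions, not the paper's procedure:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts  = ["check the weather", "weather in Shenzhen", "book a flight", "flight to Beijing"]
labels = [0, 0, 1, 1]

tfidf = TfidfVectorizer().fit_transform(texts)
sim = cosine_similarity(tfidf)

# Difficulty = max similarity to any sample with a different label.
difficulty = [
    max(sim[i][j] for j in range(len(texts)) if labels[j] != labels[i])
    for i in range(len(texts))
]
# Curriculum: visit easy samples first, then progressively harder ones.
order = sorted(range(len(texts)), key=lambda i: difficulty[i])
print(order)
```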

In previous experiments, training directly on hard samples performed very poorly, as shown in the figure above. Another idea is to train while keeping the test set relatively hard, but the effect still declines quickly. As mentioned above, metric-based methods may reach 96% accuracy, but after this analysis and experimentation, it turns out that true few-shot learning does not perform that well. To resolve this, curriculum learning must be brought in, going from easy to hard.

Finally, as shown in the figure above, accuracy improved by three to six points, and this work is ongoing. The conclusions are: on easy data, curriculum learning does not dramatically change the picture, but it still raises accuracy by 3 to 6 points and reduces variance (previously, the harder the training and test data, the larger the gap between the best and worst runs); directly applying traditional few-shot learning cannot achieve good results on hard samples, so the earlier accuracy figures around 95% are not actually credible; and combining few-shot learning with a curriculum improves performance on hard samples.

Now let's look at slot filling. For example, if a user wants to book a flight to Beijing tomorrow, the robot needs to extract the time as tomorrow and the destination as Beijing. In actual use, "tomorrow" usually needs to be converted into a concrete time expression. The task can thus be transformed into a sequence labeling task.
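For instance, in BIO form the example above might be tagged as follows; the tag names are illustrative:

```python
tokens = ["book", "a", "flight", "to", "Beijing", "tomorrow"]
tags   = ["O",    "O", "O",      "O",  "B-dest",  "B-time"]

def extract_slots(tokens, tags):
    """Collect contiguous B-/I- spans into slot values."""
    slots, current = {}, None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = tag[2:]
            slots[current] = tok
        elif tag.startswith("I-") and current == tag[2:]:
            slots[current] += " " + tok
        else:
            current = None
    return slots

print(extract_slots(tokens, tags))  # {'dest': 'Beijing', 'time': 'tomorrow'}
```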

In addition to CRF, LSTM-CRF, and BERT models, online scenarios generally use a complete pipeline. Usually some entities are built before the dialogue: first, custom entity recognition is performed, which normalizes entities and extracts fine-grained features that are then fed into the model to improve its generalization ability. At the same time, rule-based slot filling is fused with the model output to produce the final result.

What problems does slot filling face in application scenarios? The first is time normalization: there are many ways to express time. In addition, customer names vary, and name expressions are diverse, so recognizing different user names brings difficulties. There are also challenges in fusing models and rules. Finally, there are slot filling problems across multiple turns. The platform needs built-in slots so that users can work more conveniently and simply.

As can be seen from the above, separating domain recognition, intent recognition, and slot filling brings problems when the tasks are performed in sequence:

Domain recognition and intent recognition produce errors, and so does slot filling, and a layer-by-layer pipeline stacks these errors. Here a multi-task model can be adopted: the three kinds of label information and their corpora are put into one model to learn jointly.

The model above integrates joint modeling of domain, intent, and slot with BERT and CRF, and the experimental results prove that it does bring a large improvement. The traditional CRF model indeed does not work very well: its final chunk F1 only reaches 0.79. With BERT it reaches 0.87, and adding joint domain recognition to slot filling improves the final chunk F1 by about two more points.
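A toy sketch of this joint setup: one shared encoder with three heads whose losses are summed. The embedding-plus-LSTM encoder here is a stand-in for brevity; the system described uses BERT as the encoder with a CRF layer on the slot head:

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    def __init__(self, vocab=1000, dim=128, n_dom=3, n_int=10, n_tag=7):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.LSTM(dim, dim, batch_first=True)
        self.dom_head = nn.Linear(dim, n_dom)   # domain classification
        self.int_head = nn.Linear(dim, n_int)   # intent classification
        self.tag_head = nn.Linear(dim, n_tag)   # per-token slot tags

    def forward(self, ids):
        h, _ = self.enc(self.emb(ids))          # (B, T, dim)
        sent = h.mean(1)                        # pooled sentence vector
        return self.dom_head(sent), self.int_head(sent), self.tag_head(h)

model, ce = JointNLU(), nn.CrossEntropyLoss()
ids = torch.randint(0, 1000, (4, 12))
dom_logits, int_logits, tag_logits = model(ids)
loss = (ce(dom_logits, torch.randint(0, 3, (4,)))
        + ce(int_logits, torch.randint(0, 10, (4,)))
        + ce(tag_logits.reshape(-1, 7), torch.randint(0, 7, (4 * 12,))))
loss.backward()
```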

3. Dialogue management in dialogue robots

Why does a dialogue robot need a dialogue management module? Why not just use natural language understanding to interface directly with the service API? Dialogue management is very necessary and is the core of the dialogue system, because users often do not express their whole intention at once, and at the same time no module of the system can guarantee 100% accuracy: either speech recognition or natural language understanding can go wrong, leading to incorrect responses or not knowing how to respond at all. In both cases, the robot needs several exchanges with the user to obtain the complete intention. In other words, a dialogue management module is needed to do this work.

Dialogue management is generally divided into two parts: state tracking and dialogue policy learning. State tracking is used to track the user's goal, that is, what the user is saying now and what was said before. On the left side of the figure above is a simple set of states, together with the possible jumps between them, showing prior knowledge of how states usually transition. The structure on the right accompanies a conversation between the user and the robot. When the user says "I want to book a flight to Beijing tomorrow", the dialogue jumps from an empty state to one where the destination is filled, and the system keeps asking until all slots are filled. This process is what state tracking accomplishes.

Then comes the dialogue policy, whose purpose is to tell the robot what to say. Take the red box in the figure above as an example: from the input information of the current state, the user has already given the destination and the departure time. The system should judge this, find that the departure point is unknown, and ask where the user is departing from, rather than "Where to, please?".

What are the problems and difficulties of the dialogue management task?

First, the user's intention cannot be known in advance; the user may say something else at any time, or even tease the robot. It is therefore difficult for the robot to capture the user's real intention, and it must also face the possibility that the user may change intention at any time.

Second, there is a lot of noise in real environments, so the information dialogue management receives may not be the user's real meaning.

Third, most domains have many intents and slots, and values such as time and other numeric information are continuous. If you want to track all possible states with an explicit model, traditional methods are largely unusable: modeling all possible states and the jumps between them means enumerating all possible utterances, which is itself a combinatorial problem.

There are many ways to implement dialogue management. From the history above, the first that comes to mind is the state machine method. For example, for S1 on the left of the figure above, the designer defines what actions can be taken in S1: forward, backward, left, or right. When forward is executed, state S3 is reached and one round of interaction completes. This is dialogue management implemented with a state machine: it defines very clearly how states jump and what behavior should occur in each state, as sketched below. Its drawback is scale: with 10 slots, each taking many values, the combinations make the space so large that it is difficult to maintain. One way around this is the slot-based frame approach. The general idea is a simple formalization: slots are assumed independent of each other and of the values that fill them; unfilled slots are asked about, filled slots are not, and after several rounds of questions the interaction completes. Many enterprises today, both mature large companies and startups, adopt this slot-frame approach.
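A minimal sketch of such a state machine, using the S1/S3 example from the text (the transition table is illustrative):

```python
# Each state declares which actions it allows and where each action leads.
TRANSITIONS = {
    "S1": {"forward": "S3", "backward": "S2", "left": "S4", "right": "S5"},
    "S3": {"backward": "S1"},
}

def step(state: str, action: str) -> str:
    allowed = TRANSITIONS.get(state, {})
    if action not in allowed:
        raise ValueError(f"action {action!r} not allowed in state {state}")
    return allowed[action]

print(step("S1", "forward"))  # S3: one round of interaction completed
```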

Whether based on a state machine or a slot frame, these are essentially sets of rules. Later in the field's history, Professor Steve Young proposed a data-driven dialogue management method that treats dialogue management as a partially observable Markov decision process, POMDP. If you are interested, read Steve Young's classic 2013 review paper on POMDP-based dialogue management.

Continuing through the history, deep learning brought many approaches. Currently the most effective and classic model is TRADE, which won an ACL 2019 Outstanding Paper award. The authors modeled dialogue state tracking as a generation task: the dialogue history is encoded into a vector, the domain-slot pair is encoded at the same time, and the value of the corresponding slot is generated from their fusion. The paper was a great success and the results are genuinely good. Another typical example, on the right of the figure above, is reinforcement-learning-based dialogue management, which models the dialogue policy as a deep reinforcement learning problem.

After pre-training arrived, BERT could also address dialogue state tracking: BERT-based reading comprehension can predict the start and end positions of each slot value in the user's utterance and extract the value, combined with a classification task in a joint model. However, real scenarios have no annotated data, so we usually interact with the robot through a user simulator we built, and this interaction can generate a large amount of dialogue data. With the simulator, given our intents and slots, more than 7,000 dialogue samples were generated, of which more than 3,000 are in the training set. The final tracking accuracy of the BERT reading comprehension plus classification model reaches about 90%. But there is a problem: the generated data may not simulate the real world well.
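A sketch of this reading-comprehension formulation with the Transformers QA interface; the model name and question phrasing are illustrative, and a real system would first fine-tune on slot-annotated data (an off-the-shelf base model would give an arbitrary span):

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "bert-base-chinese"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

# Ask for a slot as the "question"; the user utterance is the "context".
inputs = tok("目的地是哪里?", "帮我订明天去北京的机票", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
start = out.start_logits.argmax()          # predicted start position
end = out.end_logits.argmax()              # predicted end position
value = tok.decode(inputs["input_ids"][0][start : end + 1])
print(value)  # ideally "北京" after task-specific fine-tuning
```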

In terms of dialogue policy, most of the industry now uses hand-designed dialogue logic and dialogue flows. In the to-B scenario there are many dialogue choices: starting from a given state, the user may express any intention and jump to any state, so there are many possible behaviors. Modeling this as a real reinforcement learning problem first requires a large amount of data; even though data can be generated through simulation, the need remains large. Second, the action space of real scenarios is very large, making it hard to handle with reinforcement learning.

However, there are many practical problems to solve when designing such dialogue flows. One is slot memory, which must support sharing slots across different intents: if the user has already given the time and place while booking a ticket, then when they ask to check the weather, the system should not ask again which weather to check. The second is intent memory, which must support intent recognition over multiple turns: when a user has asked about the weather and then says "What about Shanghai?", the system should use multi-turn information to recognize the weather intent.

A dialogue system can be broken down into language understanding, state tracking, and dialogue policy. Natural language understanding can also be folded into dialogue management, and deep learning has made it possible to model dialogue systems end to end. There are two classic efforts. One is HRED, which sets up the dialogue system as a two-level end-to-end network: the first level encodes the text history of the dialogue, and the second level encodes the dialogue state. This is a rough approach in which text is encoded as it arrives and the history is passed along. The other classic work is from Professor Steve Young's team. It looks end-to-end but retains separate modules, quite similar to the pipeline approach: it first detects an intent with an intent network, then fills slots with a belief tracker, then searches a database, and integrates the three kinds of information into a policy network, which can be seen as the dialogue policy network, before finally generating a reply. Such a partially end-to-end task-oriented dialogue system is easier to understand than the fully end-to-end method, and much more interpretable.

These are two classic end-to-end dialogue management designs. For future human-machine dialogue, how should a better end-to-end dialogue system architecture be designed? Will we still use the two approaches above? What will human-computer conversation look like in the future?

4. Progress of multimodal dialogue robots

A multimodal natural human-computer interaction system is the development trend of the next generation of human-computer interaction; it can integrate vision, hearing, touch, smell, and even taste, and its expressive power is richer than vision or text alone. Multimodal natural language dialogue is the most natural and ideal mode of human interaction. The reason for studying multimodal dialogue systems is that errors from the speech recognition engine are hard to avoid in real environments, and the resulting semantic ambiguity is large. Can we integrate information from other modalities, such as video and images, on top of language understanding, and introduce multimodal information fusion to improve the accuracy of the computer's understanding of the user's intent?

There are not yet many applications of multimodal dialogue, but there is published research. One example is the emotion-aware dialogue system for driving: while driving, the driver must concentrate on road conditions and can hardly free a hand to operate an interface. This is a classic multimodal problem: the system can take the driver's spoken or visual cues. Even for voice and text, speech recognition may degrade while driving, so whether the system can understand through visual and gesture information makes this a very typical scenario.

The Chinese Academy of Sciences has developed a multimodal spoken dialogue system that can incorporate human expressions and gestures into the conversation. In essence, however, these applications still handle the modalities in series and have not achieved good modality fusion. So we investigated whether modality fusion is possible. Through this investigation, we found that e-commerce actually has such a scenario: a user says "I want to buy pants, I want to buy clothes" and also sends some sample pictures, and the robot replies with pictures as well. This natural mixture of text and pictures forms a multimodal dialogue process.

A simple definition of the multimodal task: given a multimodal dialogue context, including the user's query, the goal is to generate the corresponding system's text response. For the e-commerce scene above, only text and pictures may be provided; of course this can be extended later, adding voice or other information, and a turn may not contain pictures at all. Formally, you input the historical context plus the user's query and generate the system's response.

For end-to-end dialogue modeling, the HRED model can also be used; it is very simple, but supports only one modality. In multimodal HRED, it is only necessary to add the picture information: encode the picture, then fuse it with the text. The text is turned into a vector by an RNN, the two are concatenated, and the result is passed to the upper-level RNN. This multimodal HRED, built on the HRED structure, is what is mostly used now.
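A sketch of this fusion step in PyTorch, with illustrative dimensions and a placeholder vector standing in for CNN image features:

```python
import torch
import torch.nn as nn

text_rnn = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
img_proj = nn.Linear(2048, 256)          # project e.g. CNN features to 256-d
context_rnn = nn.GRU(input_size=512, hidden_size=256, batch_first=True)

tokens = torch.randn(1, 10, 128)         # one utterance: 10 token embeddings
img_feat = torch.randn(1, 2048)          # one image feature vector

_, h_text = text_rnn(tokens)             # utterance encoding: (1, 1, 256)
# Concatenate text and image encodings into one turn vector.
turn_vec = torch.cat([h_text[-1], img_proj(img_feat)], dim=-1)   # (1, 512)
# Feed the fused turn vector to the higher-level context RNN.
ctx_out, _ = context_rnn(turn_vec.unsqueeze(1))                  # (1, 1, 256)
print(ctx_out.shape)
```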

Later, the model was improved. First, generation could be controlled: after understanding the intent, the model could produce a simple general reply, or a multimodal, knowledge-related reply. Second, knowledge such as triples or attribute tables can be incorporated into the generation process to better control the quality of generation. But these models also have big problems.

The methods in the two classic papers listed here are based on hierarchical recurrent neural networks. Such methods are weak at modality fusion: encoding a whole sentence as a single vector loses the fine-grained information in the sentence, such as keywords and entities. On the other hand, although attribute triples are used, this knowledge cannot be exploited effectively, so knowledge utilization is low. Huawei therefore adopts a model called MATE, a context-dependent multimodal dialogue system based on multimodal element sets. Breaking the model apart: on the left is an encoder for a set of multimodal elements, which encodes records from the dialogue history, including all images mentioned by the user, stored in a dialogue memory module. Why does the image memory module exist? Because the current text may refer to an image that appeared earlier, an attention operation is applied: through attention over the text embeddings and images, the model selectively decides whether to attend to certain pictures.

Finally, all the embeddings are concatenated into a set of multimodal semantic elements, so that each element can interact well with the elements in the images. The second block is the right half, the decoding process, which is divided into two steps. The first step attends to the encoder output, focusing only on what has been generated so far. In the second stage, after decoding, an attention operation is performed in combination with domain knowledge, which further exploits that knowledge together with the encoder output to optimize the quality of the system response.

The figure above shows the experimental results from our paper, which found improvement when both the first and second decoders were used. Meanwhile, our first-stage encoder improves BLEU-1 by 6 points and BLEU-4 by 9 points over the best of all previous methods, a very large absolute improvement. In the table below, different modules are ablated for further analysis, including comparisons without image position, without previous images, and without knowledge.

An example is shown above, focusing on the information in the semantic element set in the lower left and lower right parts: "formal shoes" can attend to some higher-level, more critical elements of the set, including "star".

5. Future directions and summary of dialogue robots

That concludes our progress and work on dialogue robots. For the robot industry, we hope everyone can enjoy the fun of human-computer interaction: even across oceans, robots will be able to communicate better with, and even serve, their users. The picture above shows Kevin Kelly on the monitor and, on the right, the Jia Jia robot from USTC, carrying out a trans-oceanic, cross-lingual dialogue. But doing something like this well is a big challenge.

First, machines need to understand users, even their many open questions, which requires a lot of common knowledge. For example, as mentioned above, the capital of China is Beijing; how would a robot know this? There is so much knowledge in the real world that, to understand users' varied questions, a robot needs a great deal of common sense to enrich its ability.

Of equal importance is the now-popular demand for personalization. Every person's characteristics differ, and so do robots'. How to give different personalized responses according to each user's personality is a direction with relatively more research at present and good prospects. In addition, few-shot learning remains a problem to be solved, especially in to-B scenarios, where the challenges are severe: in real scenarios enterprises have little or even no data, so few-shot learning is a problem enterprises will focus on.

Multimodality, multi-domain, and pre-training: pre-training will likely remain mainstream for some time to come. From current practice, pre-training plus fine-tuning really does work much better than training traditional deep learning models from scratch. Combined with the current interpretability problem of deep learning, some researchers are studying the combination of neural networks and symbolic methods to explain deep learning and better model real AI problems.

Then there is unsupervised learning. Unsupervised learning and few-shot learning both face problems in enterprise scenarios: customers may have no labeled data, and there may be unstructured data. Finally, most current corpora, even dialogue robot corpora, date from before 2014, and mostly only monolingual data is available. Recently, multilingual datasets have been released, so multilingual dialogue robots will also be a good direction.

This article is shared from the Huawei Cloud community post "70 Years After the Turing Test: A Review of Classic Practices and the Latest Progress in Dialogue Robots, Taking Huawei Cloud's Dialogue Robot as an Example", original author: listen2Bot.
