Introduction: Today, I would like to introduce the practice of OPPO's Xiaobu Assistant in building its dialogue-system skill platform, covering four aspects:

  • Business domain modeling: building a map of common capabilities
  • Semantic understanding: initial explorations, with support for multiple scenario types
  • Dialogue management: multi-mode and easily extensible flows
  • End-to-end, one-stop offline platform: a visible skill life cycle

Business domain modeling: building a map of common capabilities

1. What does it take to implement an intelligent assistant

After the user speaks a sentence, we first perform speech recognition to identify the user's actual query. We then perform semantic understanding to identify the user's intent, slots, and so on. Once the intent is identified, we manage the whole conversation according to the user's context and define the dialogue strategies, relying on knowledge to guide the overall policy. After deciding which action to perform, we generate the reply.
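
As a minimal sketch of this pipeline (every name and stub below is hypothetical, purely to make the four stages concrete):

```python
from dataclasses import dataclass, field

# ASR -> NLU -> DM -> NLG, with each stage stubbed for illustration.

@dataclass
class NLUResult:
    intent: str
    slots: dict = field(default_factory=dict)

def asr(audio: bytes) -> str:
    return "play a song by Jay Chou"            # stub: speech -> text

def understand(query: str, context: dict) -> NLUResult:
    return NLUResult("play_music", {"artist": "Jay Chou"})  # stub NLU

def decide(nlu: NLUResult, context: dict) -> str:
    # dialogue management: combine intent, slots, context, and knowledge
    return "call_music_app:" + nlu.slots.get("artist", "")

def generate(action: str, context: dict) -> str:
    return "Now playing songs by Jay Chou."     # stub reply generation

def handle_utterance(audio: bytes, context: dict) -> str:
    query = asr(audio)
    nlu = understand(query, context)
    action = decide(nlu, context)
    return generate(action, context)

print(handle_utterance(b"...", {}))
```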

Xiaobu Assistant is an intelligent assistant built into OPPO smartphones and IoT devices. It is a conversation system that integrates multiple dialogue types.

For task-based dialogues, the assistant operates system applications, such as opening an app or playing a song. For question-and-answer dialogues, it returns the information the user asked for, such as where Beijing is or what day of the week it is. Finally, there is chit-chat, which mainly covers small talk.

The whole dialogue system runs from semantic understanding of the user's input through to generating a card that carries the reply. We also have several different entry points, each of which can display different responses.

2. What is a skills platform

The skill platform introduced today is the low-code dialogue-system management platform built for Xiaobu Assistant. We want developers to create, configure, and train skills in a zero-code or low-code way, then bring skills online automatically and iterate on them automatically, so that the entire skill life cycle stays zero-code or low-code. The platform has to be designed to be fairly generic, because different dialogue logic needs to be reused. In terms of scalability, we need to support a variety of business scenarios; as mentioned above, Xiaobu Assistant is a multi-scenario dialogue system. In addition, the platform is not only for internal use but also needs to be opened to external developers, so we must guarantee its security and stability without affecting Xiaobu Assistant's main business.

Based on the scenarios described above and the business capabilities to be established, we defined the overall capability map. The online platform provides data-editing capabilities; the offline platform provides model training and model evaluation; semantic understanding provides intent understanding, slot parsing, and general text processing; and dialogue management provides support for different conversation modes, dialogue-policy generation, and dialogue execution.

Next, I’ll expand on how we’re building each of these capabilities.

Semantic understanding: initial explorations, with support for multiple scenario types

1. Generalizing the NLU process

Semantic understanding is essentially about identifying, through various means, what the user wants to do. We define three generic NLU processes:

  • Model-based: we automatically train the user's corpus into a model and publish it with the skill;
  • Matching-based: when the user's corpus is very small, we recommend this scheme to achieve high-precision matching;
  • Knowledge-based: for question-and-answer corpora, we recommend knowledge-based NLU recognition.

The whole process is divided into four stages: text preprocessing, feature preprocessing, intent recognition, and slot extraction.

The first stage is text preprocessing, which mainly covers text error correction, text normalization, and query rewriting. Normalization handles special characters, case conversion, and so on. Sometimes the query produced by speech recognition does not fit the user's context, and we need to rewrite it.
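
A toy version of this stage might look as follows; the real error-correction and rewriting modules are model-based, and every rule below is an illustrative stand-in:

```python
import re
import unicodedata

def normalize(query: str) -> str:
    query = unicodedata.normalize("NFKC", query)   # unify character forms
    query = query.lower()                          # case conversion
    query = re.sub(r"[^\w\s]", " ", query)         # strip special characters
    return re.sub(r"\s+", " ", query).strip()

def correct(query: str) -> str:
    return query.replace("teh", "the")             # stand-in for a real corrector

def rewrite(query: str, context: dict) -> str:
    # e.g. resolve "it" against the previous round when ASR output lacks context
    if "it" in query.split() and "last_song" in context:
        return query.replace("it", context["last_song"])
    return query

print(rewrite(correct(normalize("Play IT  again!!")), {"last_song": "Nocturne"}))
```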

Next is feature preprocessing. For model prediction we need numerical inputs, so we convert the text into character or word vectors. As part of this numericalization, we also embed predefined knowledge into the model.

After feature preprocessing, we pass the numerical vectors to the model for prediction. When no model is used, we let users edit rules and perform exact matching through a rule engine. For this kind of matching we also support vector retrieval, which matches at the semantic level.

Once the user's intent is identified, we process the slots. We define several kinds of slot extraction. The first is dictionary-based: we have our own DAG+DP dictionary extraction method. We have also defined dozens of common slots, such as city, number, and name, so external users do not need to provide their own dictionaries for conventional slot extraction; they simply select the slots they need. We also support third-party slots: if developers have NLU capabilities of their own, we can plug their slots in; and for teams able to train models, we support model-based slots as well.
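
To illustrate the DAG+DP idea: every dictionary hit over the query becomes an edge in a DAG, and dynamic programming selects the best-scoring path. The toy dictionary and the length-squared scoring below are assumptions, not the production weights:

```python
DICT = {"北京": "city", "天气": "weather_kw", "北京天气": "weather_query"}

def extract_slots(query: str) -> list:
    n = len(query)
    # dag[i]: end positions j such that query[i:j] is a dictionary entry
    dag = {i: [j for j in range(i + 1, n + 1) if query[i:j] in DICT] or [i + 1]
           for i in range(n)}
    # DP from right to left: prefer paths covered by longer dictionary matches
    best = {n: (0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(((j - i) ** 2 + best[j][0], j) for j in dag[i])
    slots, i = [], 0
    while i < n:                      # walk the optimal path, emit hits
        j = best[i][1]
        if query[i:j] in DICT:
            slots.append((query[i:j], DICT[query[i:j]]))
        i = j
    return slots

print(extract_slots("北京天气"))      # [('北京天气', 'weather_query')]
```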

2. Model-based intent recognition

The process of model-based intent recognition is divided into four steps:

The first step is data standardization for the skill, including intent configuration. For example, a skill may define three intents, each with its own slots. For each intent, the developer provides standardized corpora, can optionally provide negative corpora, and can also configure slot dictionaries, here primarily custom dictionaries.

This information is fed into the offline training system, which performs data augmentation. Configured negative corpora are augmented, extracted slots are augmented, and any configured rules are also augmented and generalized. After augmentation, preprocessing mainly covers entity recognition and knowledge embedding. Once the basic data has been processed, it is fed into one of several predefined general models for intent-recognition training. After training, knowledge distillation and similar techniques produce a model that can actually be used for inference. Small models are run with local inference, which is more efficient; large models are served with TF Serving for prediction.
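
One of these augmentation steps, slot substitution, can be sketched as follows; the template syntax and dictionaries are illustrative assumptions:

```python
import itertools
import re

SLOT_DICT = {
    "city": ["Beijing", "Shanghai", "Chengdu"],
    "date": ["today", "tomorrow"],
}

def augment(template: str) -> list:
    """Expand a corpus template against the skill's slot dictionaries."""
    names = re.findall(r"\{(\w+)\}", template)
    combos = itertools.product(*(SLOT_DICT[n] for n in names))
    return [template.format(**dict(zip(names, c))) for c in combos]

# 3 cities x 2 dates = 6 generalized training utterances
for utterance in augment("what is the weather in {city} {date}"):
    print(utterance)
```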

After the model is generated, we integrate it into the standardized NLU process: the text preprocessing, feature preprocessing, intent recognition, and slot extraction mentioned earlier.

3. Retrieval-based intent recognition

Retrieval-based intent recognition also has its own intents, configured corpora, and slot dictionaries. Here, however, the corpus does not need to be large; retrieval is driven more by slots and sentence patterns. The preprocessing is slightly different: we normalize and do basic processing of adjectives, prefixes, and suffixes. The heavier work is slot processing: we integrate dictionary slots, common slots, and external slots to make the slot decision. We then apply our own pretrained semantic-vector model to obtain a semantic vector representation. To guarantee retrieval quality, we do not rely on semantic-vector retrieval alone: the text is also normalized, slot-enhanced text is generated, and synonym- and segmentation-enhanced text is computed. All of this goes into the search engine, giving us features along four dimensions.

We rely on two search engines: the first is based on semantic vectors, and the other is a text search engine, so intents are retrieved along two dimensions. After retrieval, we apply our custom rank process to select which representation the result is based on. The resulting intent is likewise integrated into the standardized NLU process.
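
The fusion step can be sketched like this: each engine returns intent candidates with scores, and the custom rank merges them. The linear weighting is an illustrative stand-in for the real rank process:

```python
def rank(vector_hits: dict, text_hits: dict, w_vec: float = 0.6) -> list:
    """Merge (intent -> score) maps from the two engines into one ranking."""
    intents = set(vector_hits) | set(text_hits)
    scored = [(w_vec * vector_hits.get(i, 0.0)
               + (1 - w_vec) * text_hits.get(i, 0.0), i) for i in intents]
    return sorted(scored, reverse=True)

vector_hits = {"weather.query": 0.92, "music.play": 0.40}   # semantic engine
text_hits = {"weather.query": 0.85, "alarm.set": 0.30}      # text engine
print(rank(vector_hits, text_hits)[0])   # best intent after fusion
```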

4. Question-based intent recognition

Question-based intent recognition differs from the previous one in that it does not involve many intents. There may be a single class of intent, but with a very large number of questions inside it. Q&A consists of standard questions and similar questions; we ask users to configure their questions and responses, and then run a retrieval-based process that normalizes adjectives and handles slots. Here we also focus on synonyms, because many of the questions are quite similar. After synonym processing, the pretrained model is called to obtain semantic vector representations. Text processing is then applied again, the results go into the search engine, and everything is finally integrated into the standardized NLU process. The NLU stage here only retrieves the question; the response is generated later by the dialogue management module.

5. Componentized core functions and operator-orchestrated services

As mentioned above, we have defined many generic NLU processes. To better reuse them and enable further NLU extensions, we componentized the core functions of the NLU business and orchestrated the whole service as operators.

At the bottom are component services. The text engine covers the preprocessing mentioned above: normalization, prefix and suffix processing, adjective processing, and so on. The search engines are the vector search and text search mentioned earlier. The model engine covers local inference and Serving inference. Rules include regex-based rules and the like.

On top of these sit generic business operators: for example, slot operators such as the dictionary slot operator and the common slot operator, plus various preprocessing slot operators. We also define general multi-classification models and business operators for knowledge embedding. In total we define dozens of general-purpose operators.

The upper layer orchestrates a process as a DAG graph: it defines how each operator, in its different states, flows to the next module, then executes the business logic of the next operator, finally realizing our NLU process.
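
A minimal sketch of this kind of operator orchestration, assuming a trivial scheduler and toy operators (the real engine also handles operator states and branching):

```python
from graphlib import TopologicalSorter   # Python 3.9+

def preprocess(ctx):   ctx["query"] = ctx["query"].lower()
def dict_slots(ctx):   ctx["slots"] = {"city": "beijing"} if "beijing" in ctx["query"] else {}
def intent_model(ctx): ctx["intent"] = "weather.query" if "weather" in ctx["query"] else "other"

# each operator maps to the set of operators it depends on
GRAPH = {"preprocess": set(),
         "dict_slots": {"preprocess"},
         "intent_model": {"preprocess", "dict_slots"}}
OPERATORS = {"preprocess": preprocess, "dict_slots": dict_slots,
             "intent_model": intent_model}

def run(query: str) -> dict:
    ctx = {"query": query}
    for name in TopologicalSorter(GRAPH).static_order():  # dependency order
        OPERATORS[name](ctx)
    return ctx

print(run("Weather in Beijing"))
```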

6. Large-scale text processing service

One of the big challenges we face is the sheer volume of data in this kind of text processing service. A single skill may require dictionaries with millions of entries. As the number of skills grows, it becomes difficult for one text processing service to carry them all.

Therefore, we split the text processing service by domain, separating the functions that support slot, word, and sentence processing. Each process integrates the algorithms that define its domain, and the upper layer can support preprocessing, word segmentation, slot extraction, and normalization. To handle the large data volume, we borrowed the hash-slot sharding idea from Redis: each skill is hashed to compute its slot, and skills are grouped by slot. When the overall skill set grows, we can scale out automatically and recalculate slot assignments. Because we use hashing, the existing slots stay on their original nodes, and only the recalculated slots are sharded onto the new nodes.
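
The hash-slot idea can be sketched as follows; the slot count, hash function, and naive rebalancing are illustrative assumptions:

```python
import zlib

NUM_SLOTS = 1024

def skill_slot(skill_id: str) -> int:
    return zlib.crc32(skill_id.encode()) % NUM_SLOTS

def initial_assignment(nodes: list) -> dict:
    return {s: nodes[s % len(nodes)] for s in range(NUM_SLOTS)}

def add_node(assignment: dict, new_node: str) -> dict:
    # hand the new node its quota of slots; every other slot stays put
    # (a real rebalancer would steal evenly from the busiest nodes)
    quota = NUM_SLOTS // (len(set(assignment.values())) + 1)
    out = dict(assignment)
    for slot in list(out)[:quota]:
        out[slot] = new_node
    return out

before = initial_assignment(["node-a", "node-b", "node-c"])
after = add_node(before, "node-d")
moved = sum(before[s] != after[s] for s in range(NUM_SLOTS))
print(f"{moved}/{NUM_SLOTS} slots moved; skill 'weather' -> slot {skill_slot('weather')}")
```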

7. Vector retrieval

Vector retrieval integrates a graph-based clustering search engine. There are many skills to support, and each faces a different situation: some skills have large corpora, others very small ones. For this scenario we investigated different vector retrieval engines and finally chose a graph-based method for semantic vector retrieval. This graph-based approach has an open-source implementation, so instead of having to run it in standalone memory, we can easily separate the entire semantic vector retrieval from our business logic.

We ported the HNSW graph-retrieval algorithm into an ES plugin, so that ES can handle clustering well. Beyond semantic vector retrieval, our text retrieval can also reuse ES's native text retrieval capabilities. Since we can complete the whole business process in a single pass through ES, we chose ES as the search engine.
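
Their HNSW lives inside an ES plugin; as a standalone illustration of graph-based vector retrieval, here is a small example using the open-source hnswlib library (dimensions and parameters are arbitrary):

```python
import hnswlib
import numpy as np

dim, n = 128, 10_000
corpus = np.float32(np.random.random((n, dim)))   # stand-in sentence vectors

index = hnswlib.Index(space="cosine", dim=dim)    # HNSW graph index
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(corpus, np.arange(n))
index.set_ef(50)                                  # recall/latency trade-off

query = np.float32(np.random.random((1, dim)))
labels, distances = index.knn_query(query, k=5)   # top-5 nearest neighbors
print(labels, distances)
```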

Dialogue management: multi-mode and easily extensible flows

So now that we’ve done some work in the semantics module, we’re going to talk about how we’re going to do dialog management, that is, how we’re going to perform actions.

1. Dialogue management overview

Xiaobu Assistant's dialogue management is divided into three parts.

The first part is the dialogue policy. The assistant has many skills, both platform skills and non-platform skills. We first aggregate the platform's dialogue policy with those of the non-platform skills, then let the unified dialogue policy influence the outcome of the conversation at different levels. Next comes dialogue execution; there are likewise platform and non-platform executions, which the central control system collects as a whole. After collection, the results are optimized: different optimization strategies apply, including our platform's customized ones, before the final result is returned to the user.

Based on these three parts, we defined the platform's dialogue management. First, we defined template policies, including pre-checking, slot handling, and intent handling; for example, across multiple rounds a skill can inherit the intent, hold the dialogue, or jump to another intent. Sometimes the template design doesn't meet users' needs, so we also defined editable dialogue flows and online programming, which let users modify their own dialogue policies, influence the overall system's dialogue policy, and raise their own priority. In the dialogue execution part, we define conditional judgments and reply templates to handle resources and protocol conversion.

2. Unified dialogue protocol and custom dialogue state control

In the dialogue protocol part, we define how dialogue state is kept alive. We define general conversation information that carries the user's contextual intent and history, such as the user's interests and the commands executed in each round. By passing this general conversation information along, we can distinguish intent-confirmation states at the dialogue-policy stage, and then make fine-grained rank decisions once we have these states.

With this mode we can achieve weak multi-round dialogue, implemented mainly through retaining and passing contextual information. For slot-filling rounds, we fill the slots to trigger the jump. For immersive strong multi-round dialogue, we hold the dialogue and inherit the intent.
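
A hypothetical shape for this general conversation information and its round-to-round transition, just to make the three modes concrete:

```python
from dataclasses import dataclass, field

@dataclass
class DialogState:
    session_id: str
    round_no: int = 0
    last_intent: str = ""                               # inherited in strong rounds
    pending_slots: list = field(default_factory=list)   # still to be filled
    history: list = field(default_factory=list)         # commands per round
    keep_session: bool = False                          # immersive mode flag

def next_round(state: DialogState, intent: str, slots: dict) -> DialogState:
    if not intent and state.keep_session:
        intent = state.last_intent                      # intent inheritance
    state.round_no += 1
    state.last_intent = intent
    state.pending_slots = [s for s in state.pending_slots if s not in slots]
    state.history.append((intent, slots))
    return state

s = next_round(DialogState("sess-1", keep_session=True, pending_slots=["date"]),
               "weather.query", {"city": "Chengdu"})
print(s.last_intent, s.pending_slots)                   # weather.query ['date']
```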

3. Custom conversation support

For editable dialogue flows, we implemented streamlined dialogue design based on a state machine. On the interface, the user can edit the utterance input at each node; when the input conditions or utterance conditions are met, the flow decides whether to switch nodes, producing a dynamic conversation flow. For example, I can define that at node 1, if the user mentions a Chengdu slot, the flow jumps to the next node, and otherwise jumps to another node. A node can also specify whether to save state after the user performs its action, and how the multi-round state should be managed, so that different strong and weak multi-round logic can be realized and users can shape the overall dialogue flow more flexibly. For each action and each condition we define different extensibility options, letting the user express conditions along every dimension.
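
Rendered as a minimal state machine, the Chengdu example might look like this (node names and conditions are hypothetical):

```python
FLOW = {
    "node1": [
        # if the user mentioned the Chengdu city slot, jump to node2
        (lambda nlu: nlu.get("slots", {}).get("city") == "Chengdu", "node2"),
        (lambda nlu: True, "node3"),   # otherwise, fall through to node3
    ],
    "node2": [],                       # e.g. reply with Chengdu weather
    "node3": [],                       # e.g. ask the user which city they mean
}

def step(node: str, nlu: dict) -> str:
    for condition, target in FLOW.get(node, []):
        if condition(nlu):
            return target
    return node                        # no transition fired: hold the state

print(step("node1", {"slots": {"city": "Chengdu"}}))   # node2
print(step("node1", {"slots": {}}))                    # node3
```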

The online programming pattern minimizes access costs through functional programming. First, we define a general DST (dialogue state tracking) interface for state transitions, so that users can customize behavior once they get their NLU information. We also support embedding Python code in our Java code to implement the custom interfaces.
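
A sketch of what such a functional DST interface could look like; the signature is an assumption, and in production the Python runs embedded in the Java service:

```python
def my_skill_dst(nlu: dict, state: dict) -> tuple:
    """Developer-supplied transition: (NLU result, old state) -> (new state, action)."""
    if nlu["intent"] == "weather.query" and "city" not in nlu["slots"]:
        return state, {"action": "ask", "prompt": "Which city?"}
    state["city"] = nlu["slots"].get("city", state.get("city"))
    return state, {"action": "answer", "city": state["city"]}

state, action = my_skill_dst({"intent": "weather.query", "slots": {}}, {})
print(action)   # {'action': 'ask', 'prompt': 'Which city?'}
```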

4. General conversation protocol

The platform supports different types of conversation replies, such as text, audio, quick apps, and so on. The protocols of the different reply types are complicated, and when third parties integrate with us, their protocols add further complexity.

To cope with converting between these protocols, we defined a template syntax tree: content is filled into our templates and ultimately generates our own client protocol. In this way we can handle the proliferation of reply protocols and the complexity of third-party access protocols, keep overall template management under control, and also apply transformations on the tree's nodes.
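
A toy version of the template-filling step, using string.Template in place of a full syntax tree; the card fields are illustrative:

```python
import string

CARD_TEMPLATE = string.Template(
    '{"type": "$card_type", "title": "$title", "tts": "$tts"}'
)

def render(card_type: str, title: str, tts: str) -> str:
    """Fill the reply template and emit the client card protocol."""
    return CARD_TEMPLATE.substitute(card_type=card_type, title=title, tts=tts)

print(render("text", "Chengdu weather", "Sunny in Chengdu, 25 degrees."))
```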

End-to-end, one-stop offline platform: a visible skill life cycle

Another thing we've done with Xiaobu Assistant is to manage skills across their whole life cycle, so that our platform users can continuously optimize their skills.

1. Data synchronization across environments

We defined several environments. First, after editing their data, users can run basic tests in their own test environment. Once basic testing is complete, we push the data to the evaluation environment and integrate the user's data with Xiaobu's various internal data so that it can run end to end. After the run, we know whether the skills configured on the platform affect any of the skills we developed internally, and if so, whether the effect is a regression or a semantic conflict: should we prompt the developer to change the corpus, or is something wrong with our own strategy? All of this is done in the evaluation environment.

Once the evaluation environment is verified, we push to the internal test environment, where dedicated testers verify the skill's performance on each device. When the data is validated in the test environment, we push the official skill to the online environment. This ensures that everything users configure on the platform is manageable and becomes a usable skill.

All of this rests on our custom data distribution service. After the user edits on the online platform, we store the data in the basic data service, which pushes it to object cloud storage by calling the data distribution service. The data distribution service then notifies the offline data distribution services, through the API gateway, to pull the corresponding data. When an online service is notified of new data, it pulls it from object cloud storage and loads it. These processes benefit from the many standardized processes we defined in the NLU, intent recognition, and dialogue management sections, all of which are expressed as data. This way, pushing a single piece of data makes the overall NLU and DM work, and lets us validate in different environments.

2. Online data mining

After verification, the skill goes online, and after launch we want to keep optimizing it. We therefore defined a continuous-optimization process based on online data mining, since many skills go live without us knowing how they affect real online user traffic.

If users enable our data mining functions, we screen the online data and collect corpora suspected of being missed recalls or false recalls. Next, we use our own pretrained model for coarse recall, then call a refined small model to make predictions; this yields suspected bad cases. Often we still need human annotation to confirm a valid bad case; if it is valid, we label it positive or negative. Through this annotation, we automatically produce positive and negative samples and let the user choose whether to add them to the corpus. If they can be added, we help the user optimize the whole data model and corpus. In other words, if a user configured little data, we can enrich their corpus this way; if their corpora for different intents are misconfigured, we can correct them this way too. And because manual annotation yields positive and negative samples, we can keep optimizing the whole model and the skills. In this way the process works end to end, making the skills on the Xiaobu skill platform more complete.
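
The mining loop can be outlined as follows, with every stage stubbed; thresholds and model calls are placeholder assumptions:

```python
def coarse_recall(q: dict) -> bool:
    return True                        # stub: pretrained-model rough recall

def refined_model(q: dict) -> str:
    # stub for the refined small model; returns a suspected label or None
    return "negative" if "oops" in q["text"] else None

def mine(logs: list) -> list:
    suspects = [q for q in logs if q["confidence"] < 0.5]        # screen logs
    candidates = [q for q in suspects if coarse_recall(q)]       # rough recall
    predicted = [(q, refined_model(q)) for q in candidates]      # fine predict
    return [(q, lbl) for q, lbl in predicted if lbl is not None]

bad_cases = mine([{"text": "oops wrong song", "confidence": 0.3},
                  {"text": "play music", "confidence": 0.9}])
print(bad_cases)   # suspected bad cases, pending human annotation
```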