After the traditional PC, PC Internet and mobile Internet eras, dialogue interaction is a key technology direction with great potential for the next era, and it receives close attention from both academia and industry. As one of the key nodes in OPPO’s strategy of integrating all things, it also carries an important and demanding mission.

Algorithms are one of the core capabilities of dialogue interaction: they determine the level of intelligence a voice assistant can reach and carry high technical value. This article introduces the goals of dialogue interaction, the key problems the algorithms must solve, the current state and trends of the industry, the main practice and progress of OPPO Xiaobu Assistant, and the challenges and outlook.

Part 1: Introduction to the dialogue system and the engineering practice of OPPO Xiaobu Assistant

1. Objectives and key issues of dialogue interaction

Generally speaking, the goal of dialogue interaction is to complete human-computer interaction such as task execution, information acquisition and emotional communication through natural dialogue in speech or text. For example, intelligent assistants such as Jarvis and Baymax in science fiction movies represent people’s expectations of the ideal state of dialogue interaction.

Dialogue interaction has received more and more attention in recent years. Why? Looking back at the development of information technology over the past 40 years makes the reason clear. Information technology has passed through several major eras: the traditional PC, the PC Internet and the mobile Internet. Each was closely tied to a type of device, and each gave rise to a revolution in the entry point and the mode of interaction.

Now we are moving towards the AIoT era with high expectations. Dialogue interaction holds great potential as a new generation of search engine, a super service distribution center and a new interaction mode, and it carries the mission and vision of the next revolution in entry-level interaction.

However, the ideal dialogue interaction experience is very difficult to achieve, mainly because it requires a leap from relatively mature perceptual intelligence to cognitive intelligence. At present, many problems in cognitive intelligence have not been fundamentally solved or even clearly defined. Typical cognitive challenges include how to represent and understand common sense, how to make machines capable of reasoning and planning, and how to give machines human-like imagination and autonomy.

To some extent, solving the problem of cognitive intelligence is basically equivalent to realizing strong artificial intelligence, which shows how difficult dialogue interaction is.

The main process of dialogue interaction is shown in the figure below. Almost every key node in it depends on algorithms, which are the core capability for achieving a better dialogue interaction experience.

Semantic understanding and dialogue ability are the focus of this article. Their main task is to first understand what the user wants, then decide what to give the user, and finally assemble appropriate resources to satisfy the user. The semantic algorithm system, composed of semantic understanding and dialogue ability, exists to achieve these goals. It mainly faces two categories of problems, systemic and technical, as shown in the figure below.

Systemic problems include: how to decouple a complex system that must support queries across all domains, hundreds of skills, multiple devices and multiple channels; how to iterate efficiently given many product requirements, many modules, long processes and high algorithmic uncertainty; how to guarantee the experience through effect monitoring when diverse spoken queries cannot be enumerated; and how to avoid low-level defects, irrelevant answers, excessive fallback responses and other “artificial stupidity” experiences.

Technical problems include algorithm selection, modeling and solving of key problems, multi-round dialogue control, performance assurance, etc.

2. Industry status and algorithm trends

First of all, the application scenarios of dialogue interaction have become increasingly mature, covering fields such as smart home, in-vehicle systems, daily life and travel, and professional services. Conciseness and speed are the natural advantages of natural language dialogue, and more and more users are accepting it. It is estimated that more than 7 billion devices will be equipped with voice assistants in 2020.

In addition, in terms of development trends, the top technology companies have never stopped investing in this direction over the past decade. Foreign companies, represented by Apple, Amazon and Google, all regard dialogue interaction as a very important direction. The domestic situation is similar: Baidu, Xiaomi and Alibaba are all actively positioning themselves, aiming to seize dialogue interaction as a future traffic entry point.

A notable trend is that intelligent assistants oriented toward third-party devices are gradually fading out, with companies focusing mainly on their own devices. Besides the tight coupling between the technology and the device, a more important reason is that the entry point is too valuable: no leading device manufacturer with its own technical capability is willing to hand it over entirely to a third party.

Dialogue interaction is also a hot topic in academic research. A trend analysis of ACL papers shows that dialogue interaction has risen rapidly over the past five years and became the most popular research direction in 2019 and 2020.

Reference: the Trends of ACL: https://public.flourish.studi…

For the core cognitive understanding algorithms, the solution paradigm has evolved from the traditional multi-module pipeline, which depends heavily on the language, the problem type and hand-crafted experience, to a simpler, more general and more efficient end-to-end scheme. This shift greatly simplifies problem solving: it avoids the accumulation of errors across modules, makes it possible to apply big data, large models and large-scale computing power, and significantly improves results.

In the past two years, large-scale pre-trained models represented by Google’s BERT have emerged, topping the leaderboards of major language understanding tasks and opening up great potential for developing more advanced semantic understanding models. This will undoubtedly provide solid technical support for the development of dialogue interaction.

To sum up, both industry and academia are paying close attention to dialogue interaction, which reflects their shared expectation of the future trend. Breakthroughs in algorithms are further accelerating the rollout of conversational products, bringing that future closer.

3. The practice and progress of the algorithm system of Xiaobu Assistant

As mentioned earlier, semantic understanding and dialogue ability together form the core semantic algorithm system of OPPO Xiaobu Assistant. The following sections present our practice and key progress in this direction in detail.

First of all, in terms of business requirements, we mainly consider four dimensions: business boundary, dialogue ability, user volume, and evaluation index.

  • In terms of business boundary, Xiaobu Assistant is a full-scenario, open-domain dialogue interaction system. The fields to be supported include system control, information query, video and audio entertainment, life services and intelligent chat, covering hundreds of skills. The range of user queries is very broad.
  • In terms of dialogue ability, in addition to simple command control and single-round questions, it also needs to support strong multi-round task-oriented dialogue, weak multi-round dialogue and context understanding, as well as higher-level abilities such as dialogue recommendation and proactive dialogue.
  • In terms of user volume, Xiaobu needs to cover the company’s mobile phones, watches, headphones, TVs and other devices, which number in the hundreds of millions, with daily active users in the tens of millions.
  • In terms of evaluation metrics, we mainly consider demand coverage, intent invocation accuracy, skill satisfaction, response time and so on.

In summary, the mission of Xiaobu Assistant is to build a dialogue connection between the large user base of the company’s device ecosystem at one end and high-quality conversational services at the other, realizing both user value and technical value.

In order to support the above business requirements, we abstract four design principles to guide the design of the algorithm system.

  • Domain divide and conquer: the complex full-domain problem is decomposed by domain into simpler sub-problems that are solved separately, which reduces the difficulty of each and improves the controllability of the system.
  • Effect first: to avoid “artificial stupidity” experiences as much as possible, we do not commit rigidly to any single technology; algorithm design is driven by effect first so as to avoid low-level defects.
  • Closed-loop monitoring: establish a complete closed-loop monitoring mechanism. In the development stage, improve test coverage with test cases designed from multiple perspectives such as product, testing and R&D; online, use real-time dynamic test-set monitoring and manual evaluation to guarantee the experience.
  • Platform benefits: to cope with the large number of medium- and long-tail skills, promote the construction of a skills platform, and reduce their development and maintenance costs with consistent, shared platform solutions.

Based on these business requirements and design principles, the overall architecture of the current algorithm system of Xiaobu Assistant is shown in the figure below.

First of all, in terms of platform and tools, the basic algorithms are mainly the mainstream deep learning algorithms in the industry. On top of them, algorithm schemes are built for different types of problems and further encapsulated into modules such as the NLU framework, general graph-based question answering, the skills platform and the open platform.

Then, in terms of business, the top layer applies symbolic, structured and numerical processing to the query in a general way. The business is then divided into system applications, life services, video and audio entertainment, information query and intelligent chat, and each business line iterates independently. Finally, dialogue generation and fusion ranking are combined to select the best skill to meet the user’s needs.

The processing flow can be divided into several stages: preprocessing, intent recognition, multi-way ranking, resource acquisition and post-processing. The first three stages are mainly responsible for intent recall, the last two are responsible for resource coverage and result relevance, and the whole flow is responsible for the final skill execution satisfaction.
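The staged flow described above can be pictured as a chain of functions that each enrich a shared turn object. This is a minimal sketch under stated assumptions, not the Xiaobu implementation: the `Turn` class, the keyword scorers, the scores and the resource table are all hypothetical stand-ins for real models and services.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Turn:
    """State carried through the pipeline for one user query (hypothetical)."""
    query: str
    candidates: List[Dict] = field(default_factory=list)  # scored intents
    resource: str = ""
    response: str = ""


def preprocess(turn: Turn) -> Turn:
    # Normalize case and whitespace before any understanding step.
    turn.query = " ".join(turn.query.lower().split())
    return turn


def recognize_intents(turn: Turn) -> Turn:
    # Toy keyword scorers standing in for per-domain intent models.
    keywords = {"play_music": "play", "query_weather": "weather"}
    for intent, word in keywords.items():
        if word in turn.query:
            turn.candidates.append({"intent": intent, "score": 0.9})
    turn.candidates.append({"intent": "chat", "score": 0.3})  # fallback intent
    return turn


def rank(turn: Turn) -> Turn:
    # Order candidates from all domains by score, best first.
    turn.candidates.sort(key=lambda c: c["score"], reverse=True)
    return turn


def fetch_resource(turn: Turn) -> Turn:
    # Assumed resource table; a real system would call a content service.
    resources = {"play_music": "song://demo", "query_weather": "sunny, 25C"}
    turn.resource = resources.get(turn.candidates[0]["intent"], "")
    return turn


def postprocess(turn: Turn) -> Turn:
    top = turn.candidates[0]["intent"]
    turn.response = f"[{top}] {turn.resource}".strip()
    return turn


STAGES: List[Callable[[Turn], Turn]] = [
    preprocess, recognize_intents, rank, fetch_resource, postprocess,
]


def run_pipeline(query: str) -> Turn:
    turn = Turn(query)
    for stage in STAGES:
        turn = stage(turn)
    return turn
```

Keeping each stage as a plain function makes it easy to measure recall after ranking and relevance after resource acquisition separately, which matches the division of responsibility described in the text.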

The key algorithm modules of the semantic algorithm system are shown in the figure below. The following sections introduce its three core modules: semantic understanding, dialogue management and dialogue generation.

Intent recognition is the core module of semantic understanding. Its main task is to infer what the user wants to do by analyzing the user’s current query and interaction history. Typical scenarios include closed-domain, open-domain and context-dependent recognition.

Slot extraction is closely related to intent recognition. Its main task is to extract key information from the user’s current query and interaction history, which helps to retrieve exactly the answer or content the user needs.

Intent recognition and slot extraction together constitute the semantic understanding module. The difficulties lie in the diversity of spoken language (hundreds of millions of distinct queries), ambiguity (e.g. Peppa Pig is both an animated cartoon and an app) and dependence on knowledge (e.g. “Can’t” is also the title of a song).

Dialogue management is another key module of the semantic algorithm system. Its task is to infer the state of the dialogue from the current query and the dialogue context, and then decide the best next response of the dialogue system.

After semantic understanding and dialogue management are complete, dialogue generation is needed to deliver the final skill response appropriately. The task of dialogue generation is to produce a suitable reply, in a suitable way, from the parsing result of semantic understanding and the action to be performed.

In terms of algorithm models, Xiaobu is mainly driven by deep learning. On the one hand, such models perform well; on the other hand, the technical schemes are relatively mature and there are many successful cases.

However, it is worth emphasizing that there is basically no silver-bullet algorithm that solves every technical problem in this field. Generally, a main model based on deep learning guarantees the baseline effect, and custom rules are still needed to handle bad cases in corner situations.

For system-control and application skills, we mainly adopt a scheme that combines rules with deep learning models to improve semantic understanding: reverse rules quickly reject out-of-domain queries, forward rules cover strongly indicative phrasings, and the deep learning model handles generalized recognition of ordinary cases. In addition, multi-task joint learning is introduced to improve the joint accuracy of intent and slots.
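The order of checks in such a rule-plus-model scheme can be sketched as follows. This is an illustrative sketch only: the regular expressions, the `toy_model` scorer, the `make_call` intent name and the 0.5 threshold are all assumptions made for the example, with `toy_model` standing in for a trained deep learning classifier.

```python
import re
from typing import Callable, Optional, Tuple

# Reverse rules: queries that clearly belong to other domains are rejected early.
REJECT_RULES = [re.compile(r"\bweather\b"), re.compile(r"\bplay\b")]

# Forward rules: unambiguous phrasings are accepted without calling the model.
ACCEPT_RULES = [(re.compile(r"^call (?P<contact>\w+)$"), "make_call")]


def toy_model(query: str) -> Tuple[str, float]:
    # Hypothetical stand-in for a neural intent classifier.
    score = 0.8 if ("phone" in query or "ring" in query) else 0.1
    return "make_call", score


def recognize(
    query: str,
    model: Callable[[str], Tuple[str, float]] = toy_model,
    threshold: float = 0.5,
) -> Optional[Tuple[str, str]]:
    """Return (intent, source) or None when the query is out of domain."""
    q = query.lower().strip()
    if any(rule.search(q) for rule in REJECT_RULES):
        return None                      # rejected by a reverse rule
    for pattern, intent in ACCEPT_RULES:
        if pattern.match(q):
            return intent, "rule"        # covered by a forward rule
    intent, score = model(q)             # generalization is left to the model
    return (intent, "model") if score >= threshold else None
```

Returning the source of the decision ("rule" or "model") alongside the intent is a design choice for the sketch: it makes it simple to tell, when analyzing bad cases, whether a rule or the model needs fixing.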

Multi-task joint learning lets intent and slot predictions disambiguate each other. It is mainly applied to skills such as phone calls, text messages and schedules. Compared with learning each task independently, overall accuracy generally improves by 1% to 3%. Combined with detailed data-driven optimization and rule verification, intent invocation accuracy can exceed 95%.

For knowledge-dependent skills such as music, radio, and film and television, we mainly adopt a knowledge-enhanced intent recognition scheme, as shown in the figure below. The main difficulty of such skills is that the intent cannot be determined from the sentence pattern alone. It is crucial to extract the resource field accurately from the query; performing intent recognition after incorporating the resource matching results significantly reduces the difficulty of the problem.

Unlike closed-domain recognition, open-domain intent recognition is hard to model as a classification problem and generally requires a semantic matching scheme. For this kind of problem, we mainly adopt deep semantic matching, as shown in the figure below.

This improves on traditional matching based on text symbols, and the matching accuracy can exceed 95%. However, problems such as subject recognition errors and semantic inclusion remain, and they need to be controlled with downstream verification strategies. At present, the method is mainly used for information query and chat QA matching.
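The matching step itself can be illustrated with a small sketch: encode the query and each stored question as vectors, pick the closest by cosine similarity, and reject matches below a threshold. To stay self-contained, the sketch swaps the deep encoder for a bag-of-words `encode` function, so it only demonstrates the retrieval-and-threshold logic; the FAQ entries and the 0.5 threshold are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Optional, Tuple


def encode(text: str) -> Counter:
    # Placeholder encoder: a real system would use a neural sentence encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match(
    query: str, faq: Dict[str, str], threshold: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Return (answer, similarity) for the closest question, or None."""
    q_vec = encode(query)
    best_question, best_score = None, 0.0
    for question in faq:
        score = cosine(q_vec, encode(question))
        if score > best_score:
            best_question, best_score = question, score
    if best_question is None or best_score < threshold:
        return None  # below threshold: hand over to a fallback strategy
    return faq[best_question], best_score
```

The threshold plays the role of a first, coarse verification step; the subject and inclusion checks mentioned in the text would sit after `match` returns.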

In addition, to further improve semantic understanding, we are also exploring the deployment of large-scale complex models. For large-scale pre-trained language models, the team has improved, retrained and fine-tuned open-source models and achieved rapid gains. The team currently ranks fifth on the overall leaderboard of the Chinese Language Understanding Evaluation benchmark (CLUE).

However, the computational complexity of such models is very high, and they generally cannot meet the latency requirements of online inference. They therefore need to be combined with knowledge distillation and other acceleration schemes before they can be applied.

Common knowledge distillation schemes can be divided into data distillation and model distillation. Data distillation assumes that a simple model is inferior to a complex model because it lacks annotated data: if the complex model is used to produce enough pseudo-labeled data, the simple model can gradually approach the performance of the complex model.

Model distillation assumes that a simple model lacks not only sufficient data but also good guidance: if the intermediate results produced while training the complex model are used to guide the training of the simple model, the simple model can come closer to the performance of the complex model. Both data distillation and model distillation are applied in the Xiaobu Assistant business.
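Both ideas can be made concrete in a few lines. The sketch below is an assumption-laden illustration rather than the scheme used in production: `pseudo_label` shows data distillation with a teacher that returns logits, and `soft_target_loss` shows one common form of model distillation, matching temperature-softened teacher outputs, which may differ from the intermediate signals the team actually uses. The temperature value of 2.0 is arbitrary.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(
    teacher: Callable[[str], Sequence[float]], unlabeled: Sequence[str]
) -> List[Tuple[str, int]]:
    """Data distillation: the teacher labels raw text for the student to train on."""
    labeled = []
    for text in unlabeled:
        probs = softmax(teacher(text))
        labeled.append((text, max(range(len(probs)), key=probs.__getitem__)))
    return labeled


def soft_target_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """Model distillation: cross-entropy between softened teacher and student outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

In practice the soft-target term is usually added to the ordinary supervised loss with a weighting factor, so the student learns from both the true labels and the teacher.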

The dialogue system is also regarded as the next generation of search engine. Users have many knowledge question-answering needs and expect accurate answers. To meet these needs, we build our own knowledge base through data acquisition and data mining, and then provide question-answering services by combining online semantic matching with KBQA.

In addition, to answer factual questions in vertical domains accurately, we have also built a general question-answering capability based on knowledge graphs. For fine-grained vertical categories, the domain graph is built through data partnerships and in-house crawling, and precise question answering is then performed based on templates and the graph.
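A template-plus-graph lookup of this kind can be sketched as follows: a question template extracts the entity and fixes the relation, and the answer is read from the graph. The tiny `GRAPH`, the single template and the relation name `capital` are hypothetical examples, not part of any real knowledge base.

```python
import re
from typing import Dict, Optional, Tuple

# Toy domain graph stored as (entity, relation) -> value.
GRAPH: Dict[Tuple[str, str], str] = {
    ("france", "capital"): "Paris",
    ("japan", "capital"): "Tokyo",
}

# Each template maps a question pattern to one graph relation.
TEMPLATES = [
    (re.compile(r"^what is the capital of (?P<entity>[\w ]+?)\??$"), "capital"),
]


def answer(question: str) -> Optional[str]:
    """Return the graph value for a templated question, or None if unsupported."""
    q = question.lower().strip()
    for pattern, relation in TEMPLATES:
        found = pattern.match(q)
        if found:
            return GRAPH.get((found.group("entity"), relation))
    return None  # no template matched: defer to another QA strategy
```

Because an answer is only returned when both the template and the graph entry match, this style favors precision over coverage, which suits factual questions where a wrong answer is worse than no answer.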

For dialogue management, the commonly used schemes are based on finite state machines, on slot filling, or on end-to-end models. The difficulties include flexible process control, inheriting and forgetting context, intent switching and exception handling. Currently, slot filling is the main approach.
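The slot-filling approach mentioned above can be illustrated with a minimal sketch: the manager keeps the slots gathered so far, asks for the first missing one, and executes the skill once all required slots are present. The `set_alarm` intent, its slot names and the prompt texts are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical skill definition: required slots and the question for each.
REQUIRED_SLOTS = {"set_alarm": ["date", "time"]}
PROMPTS = {"date": "Which day?", "time": "What time?"}


@dataclass
class DialogueState:
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


def next_action(state: DialogueState, new_slots: Dict[str, str]) -> str:
    """Merge slots from the current turn, then ask or execute."""
    state.slots.update(new_slots)            # slots from earlier turns are inherited
    for name in REQUIRED_SLOTS[state.intent]:
        if name not in state.slots:
            return f"ask:{PROMPTS[name]}"    # prompt for the first missing slot
    return f"execute:{state.intent}"
```

A usage example: starting from `DialogueState("set_alarm")`, a first turn that supplies only the time yields a question about the day, and a second turn that supplies the day triggers execution. Intent switching and forgetting stale slots would require extra logic that this sketch leaves out.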

To achieve better context understanding across multiple rounds, Xiaobu Assistant implements a context understanding scheme based on reference resolution, which handles the common problems of coreference and omission in multi-round dialogue.

Reference: ACL 2019, Improving Multi-turn Dialogue Modelling with Utterance ReWriter

With dialogue management and context understanding in place, Xiaobu Assistant supports an immersive strong multi-round mode, a freely switchable weak multi-round mode and a contextual-reasoning multi-round mode, covering task-oriented dialogue, information query, multi-round chat and other business scenarios.

For dialogue generation, there are mainly three types of scheme in the industry: template-based, retrieval-based and model-based (generative). Because generative models are hard to control, Xiaobu currently relies mainly on template-based and retrieval-based schemes, and generative models are still being studied.
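The controllability of the template-based approach is easy to see in a small sketch: every reply comes from a reviewed template, and anything that cannot be rendered falls back to a fixed message. The template table, the action names and the fallback text are hypothetical.

```python
from typing import Dict

# Hypothetical reply templates keyed by (intent, dialogue action).
TEMPLATES = {
    ("set_alarm", "success"): "OK, alarm set for {time} {date}.",
    ("set_alarm", "missing_time"): "What time should I set the alarm for?",
}
FALLBACK = "Sorry, I can't help with that yet."


def generate(intent: str, action: str, slots: Dict[str, str]) -> str:
    """Render the reply for an intent/action pair, or a safe fallback."""
    template = TEMPLATES.get((intent, action))
    if template is None:
        return FALLBACK
    try:
        return template.format(**slots)
    except KeyError:
        return FALLBACK  # a slot required by the template was not provided
```

The trade-off is coverage: each new reply needs a new template, which is one reason retrieval-based and generative schemes are attractive for open-ended chat.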

In terms of algorithm engineering, in the early stage a Python-based service framework was used in order to go online quickly, and multiple instances were deployed to compensate for the weak concurrency of a single service. At present, operator-level engineering refactoring and optimization are being explored for services with high computational complexity, and simpler, more efficient serving modes are being explored together with the machine learning platform team.

In terms of skill building, in the early stage we mainly developed customized skills in order to go online quickly. At the end of last year, we started building the skills platform. The main idea is to standardize offline model generation and online inference, package key algorithms as reusable operators, and complete skill development through data import and process configuration, thereby reducing the cost of supporting and maintaining medium- and long-tail skills.

Finally, to safeguard the dialogue interaction experience, we worked with the data team and the evaluation team to build a closed-loop monitoring scheme for the whole process. First, self-testing by the R&D team ensures that the algorithm model meets expectations; then a round of batch testing is run when a version is released to ensure that no new risks are introduced. After launch, routine monitoring and real-time monitoring ensure, respectively, the overall effect and the normal operation of key functions. In addition, manual sampling and third-party reviews are introduced to further monitor the experience.

4. Challenges and future thinking

Although algorithms for dialogue interaction have advanced greatly in recent years, many challenges remain before they can deliver the Jarvis- and Baymax-like experience that users expect.

First of all, in terms of semantic understanding, the current model is essentially based on statistical induction of data and lacks robustness and completeness when confronted with extreme cases.

Secondly, as a potential replacement for search engines, the assistant is bound to take on the role of a “know-it-all”. However, low-frequency question answering is open-domain, has a pronounced long-tail effect and depends heavily on knowledge content, so it is very difficult and costly to build.

In addition, unlike the relatively mature search and recommendation scenarios, the iterative optimization of dialogue interaction still relies mainly on manual work. This makes it hard to plug into a fast, big-data-driven engine of self-feedback and self-learning and thus to improve rapidly.

The challenges ahead go far beyond these. The OPPO Xiaobu Assistant team will continue to actively explore stronger semantic understanding, richer knowledge, more fluent dialogue management, and learning from feedback, weak supervision and self-evolution, and will make unremitting efforts to build the intelligent assistant with the best user experience in the Chinese-language field.