OPPO’s AI assistant, Xiaobu Assistant, recently surpassed 100 million monthly active users, making it the first mobile phone voice assistant in China to reach 100 million active users.

After more than two years of growth, Xiaobu Assistant has been significantly upgraded in capability and has been integrated into the convenient services around us. The Xiaobu team has overcome many technical difficulties along the way to bring more intelligent services to users. To share this work, the team has written a series of articles detailing the technology behind Xiaobu Assistant. This article is the first in the series and mainly introduces the design and evolution of the system architecture.

1. Industry value

1.1 Introduction

Dialogue systems have nearly 30 years of research history and represent the future of human-computer interaction. In the past decade, with successive breakthroughs in speech and NLP and the maturing of industrial applications, their user value and industry scale have grown rapidly.

In terms of scenarios, dialogue systems fall into three categories:

  • Task-oriented: precise answers in a limited domain. The goal is to satisfy the user with as few interactions as possible, such as setting an alarm.

  • Question answering: broad answers in a limited domain. The goal is also to satisfy the user with as few interactions as possible, such as an encyclopedia lookup.

  • Chitchat: broad answers in an open domain. The goal is to sustain as many conversation turns as possible.

An intelligent assistant is a product form that integrates task-oriented, question-answering and chitchat dialogue systems, and its potential industry value is large.

1.2 Intelligent Assistant

With the arrival of the AIoT era and the interconnection of all things, groups of smart devices increasingly rely on intelligent assistants for natural human-computer interaction. Intelligent assistants will cover thousands of devices, which leaves a great deal of room for growth.

Juniper Research forecasts that the number of devices equipped with smart assistants will rise to 8 billion by 2023 from 2.5 billion at the end of 2018.

At the user level, the intelligent assistant is still a niche feature. As smart devices become more widespread and early users are gradually cultivated, familiarity and awareness are rising, and there is still a lot of room for improvement.

The user value that intelligent assistants bring is threefold:

  1. Efficiency

  2. Personalization

  3. Emotion

As the industry becomes more widespread, intelligent devices are gradually extending beyond small-screen, screenless and large-screen devices to vertical scenarios and specific groups of users, such as educational smart screens, story machines and AI learning machines.

Xiaobu Assistant is OPPO’s intelligent assistant. It covers all kinds of the company’s terminal devices, keeps adding new entry points, and supports many task-oriented, question-answering and chitchat skills.

As the “brain” of the intelligent assistant, the dialogue system is one of its core technologies. With a dialogue system, an intelligent assistant can understand what users ask for and meet their efficiency, personalization and emotional needs through conversational services.

2. Industry architectures

2.1 Overview

This section first introduces the typical architectures of a dialogue system. Academia distinguishes two architectures: pipeline and end-to-end (E2E). The pipeline architecture is widely used in industry, while E2E is still at the exploratory stage.

Pipeline modular architecture

ASR (Automatic Speech Recognition)

Receives audio input and outputs the transcribed text. It generally consists of four blocks: signal processing, acoustic model, decoder and post-processing. The sound is first captured and passed through signal processing, which transforms the speech signal into the frequency domain and extracts a feature vector from every N milliseconds of speech. The feature vectors are fed to the acoustic model, which classifies the audio into phonemes. The decoder then finds the most probable word sequence, and post-processing finally assembles the words into easy-to-read text.

NLU (Natural Language Understanding)

Represents natural language as structured data that a computer can process. It receives text as input and outputs a structured triple of Domain + Intent + Slot. Semantic analysis mainly relies on word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing and coreference resolution.
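As a rough illustration of the Domain + Intent + Slot output, the triple can be sketched as a small data class. The class name, field names and example utterance are assumptions made for illustration, not the format used by any particular system.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NluResult:
    """Structured NLU output: one domain, one intent, and zero or more slots."""
    domain: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


# Hypothetical parse of the utterance "set an alarm for 7 am".
result = NluResult(domain="alarm", intent="create_alarm", slots={"time": "07:00"})
print(result)
```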

DM (Dialog Management)

Controls the flow of the conversation. It takes the output of the NLU, maintains the context state, and applies a dialog policy to decide which action to perform, such as asking the user a follow-up question to obtain missing information. DM is the core of the dialogue system and has two important modules: Dialog State Tracking (DST) and Dialog Policy (DP). DST records the states at time t-1, or even as far back as t-n, together with the state at the current time t, and determines the current dialog state from this context. DP decides which action to perform based on the dialog state and the specific task.
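The split between DST and DP can be sketched in a few lines. This is a minimal rule-based illustration, assuming a hypothetical `book_flight` intent with two required slots. Production systems track far richer state and often use learned policies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DialogState:
    """State carried across turns: the active intent and the slots filled so far."""
    intent: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


# Hypothetical slot requirements per intent.
REQUIRED_SLOTS: Dict[str, List[str]] = {"book_flight": ["destination", "date"]}


def track_state(state: DialogState, intent: Optional[str], slots: Dict[str, str]) -> DialogState:
    """DST: merge the NLU output of the current turn into the state from earlier turns."""
    merged = dict(state.slots)
    merged.update(slots)
    return DialogState(intent=intent or state.intent, slots=merged)


def choose_action(state: DialogState) -> str:
    """DP: ask for the first missing slot, or execute once every slot is filled."""
    for name in REQUIRED_SLOTS.get(state.intent or "", []):
        if name not in state.slots:
            return f"ask:{name}"
    return f"execute:{state.intent}"


state = track_state(DialogState(), "book_flight", {"destination": "Shenzhen"})
print(choose_action(state))  # ask:date
state = track_state(state, None, {"date": "tomorrow"})
print(choose_action(state))  # execute:book_flight
```

The second turn carries no intent of its own, so the tracker keeps the intent from the previous turn. That is the context dependence that DST is responsible for.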

ASR and NLU determine the lower limit of voice interaction, and DM determines the upper limit of voice interaction.

NLG (Natural Language Generation)

Generates the reply from the system action output by the DM. Common approaches are rule-based templates and deep learning.
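The template-based approach can be sketched with the standard library. The action names and template wording below are hypothetical and chosen only to show the mechanism.

```python
from string import Template
from typing import Dict

# Hypothetical reply templates keyed by system action.
TEMPLATES: Dict[str, Template] = {
    "ask:date": Template("Which day would you like to travel to $destination?"),
    "execute:book_flight": Template("Booking a flight to $destination on $date."),
}


def generate_reply(action: str, slots: Dict[str, str]) -> str:
    """Fill the template registered for the system action with the known slot values."""
    return TEMPLATES[action].substitute(slots)


print(generate_reply("ask:date", {"destination": "Shenzhen"}))
```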

TTS (Text To Speech)

Converts the reply text into speech. It needs to control the pronunciation of polyphonic characters and the prosody, such as where to pause and which words to stress or de-emphasize.

Summary: the modular pipeline architecture is highly interpretable and easy to deploy, and most industrial task-oriented dialogue systems are built on it. Its disadvantage is that the modules are relatively independent, which makes joint tuning difficult, and errors accumulate from one module to the next.

E2E (end-to-end) architecture

In recent years, with the development of end-to-end neural generative models, end-to-end trainable frameworks for dialogue systems have been built. This kind of architecture aims to train a single mapping from the user’s natural-language input to the machine’s natural-language output (that is, NLU, DM and NLG are combined into one module). It generalizes and transfers well, and it breaks the isolation between modules in the traditional pipeline architecture. However, end-to-end models place high demands on the quantity and quality of data, their behavior is hard to control, and they do not model processes such as slot filling and API calls clearly enough. Their effective application in industry is still being explored.

Next, typical industry implementations of different types of conversation systems are presented.

2.2 Microsoft Xiaoice: chat-based dialogue system

Microsoft Xiaoice is a social chatbot for open-domain chat whose defining feature is emotional intelligence (“EQ”). CPS (conversation turns per session) is generally used to evaluate the effectiveness of a chatbot. The larger the CPS, the better the chatbot is at keeping users engaged in conversation. Xiaoice’s average CPS was 23 (data from April 2017).
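As a small worked example, CPS is simply the mean number of turns over a set of sessions. The turn counts below are invented for illustration and are not Xiaoice data.

```python
from typing import List


def conversation_turns_per_session(turns_per_session: List[int]) -> float:
    """Average number of conversation turns over a set of sessions."""
    if not turns_per_session:
        raise ValueError("at least one session is required")
    return sum(turns_per_session) / len(turns_per_session)


# Made-up turn counts for four sessions.
print(conversation_turns_per_session([10, 30, 25, 27]))  # 23.0
```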

The following figure shows the overall architecture of Xiaoice. It consists of three layers: user experience layer, conversation engine layer and data layer.

User Experience layer

This layer connects Xiaoice to popular chat platforms (such as WeChat and QQ) and communicates with users in two modes: full-duplex and turn-taking. It also includes a set of components for processing user input and Xiaoice’s responses, such as speech recognition and synthesis, image understanding and text normalization.

Conversation engine layer

It consists of a dialog manager, an empathetic computing module, the core chat and dialog skills. The dialog manager consists of DST and DP. The empathetic computing module takes inputs such as user data and Xiaoice’s persona, and the features it computes are used as input to the DM and the skills. The core chat and the skills combine two kinds of approaches: generative and retrieval-based.

Data layer

Stores the collected conversational data (text pairs or text-image pairs), the non-conversational data and knowledge graphs used by the core chat and the skills, and the profiles of Xiaoice and all registered users.

Related information can be found at arxiv.org/pdf/1812.08…

2.3 AliMe (Alibaba’s customer service robot): question-answering dialogue system

AliMe uses a classic pipeline architecture. Because customer service robots interact through text on web pages, the ASR and TTS modules are not involved.

It is built as a domain-oriented platform that serves the Alibaba ecosystem, the merchant ecosystem and the enterprise ecosystem, and it is offered as both PaaS and SaaS. The entire dialogue management flow is modularized, forming a parallel architecture in which algorithms and business modules are pluggable.

For details, see zhuanlan.zhihu.com/p/33596423

2.4 Duer, Xiao Ai, Alexa and other intelligent assistants

They are mainly task-oriented, and also include chitchat and question answering. Duer and Xiao Ai are based on the classic pipeline architecture. The following is a brief introduction using Xiao Ai as an example.

Xiao Ai

1. Multi-channel dialogue management recall: each vertical domain has its own complete NLU and Action

2. Traffic is distributed to all vertical domains, and an intent prediction model reduces the fan-out traffic

3. The DM Policy of the central control module selects the intent and returns the result

2.5 Open source solution: Rasa

Rasa is based on the pipeline architecture:

1. The Interpreter takes on the responsibilities of the NLU, and Tracker + Policy + Action take on the responsibilities of the DM

2. The design is modular, and the Interpreter pipeline in particular is customizable

3. Actions, which vary the most, are isolated and can be hosted on an external server

4. Much of the design is configuration-driven, so dialogue flows can be developed through rule configuration

5. Rasa X provides a conversation-driven development solution, together with evaluation, annotation and testing platforms

3. Xiaobu Assistant engineering practice

3.1 Dialogue system architecture design and evolution

OPPO’s overall system is layered as follows:

Within it, the dialogue system corresponds to the user domain, dialogue domain and semantic domain on the left, and is built with reference to the classic pipeline architecture.

Apart from the basic experience related to voice output, the evolution goals of the dialogue system can be roughly divided into two stages:

1. Improve skill coverage and skill intent recognition

2. Explore and improve skill satisfaction, and build highlight skills

Stage 1 focuses on rapid iteration of vertical domains, while Stage 2 focuses on building shared capabilities and optimizing dialogue semantics within vertical domains.

Stage 1: Rapid iteration of vertical domains

Skill coverage and single-turn intent recognition are the main objectives. At this stage the dialogue system only needs to provide basic strong and weak multi-turn capabilities. Each vertical domain sets its own goals and iterates rapidly, with low coupling between vertical domains.

The design principles are:

1. Conway’s Law: [vertical domain (algorithm + engineering)]. Services are divided by feature team, and the server side of each vertical domain is split into algorithm and engineering. Each service is responsible for the complete dialogue management and semantic understanding of its vertical domain

2. Low coupling: there is no engineering coupling between vertical domains. Apart from the global ranking decision, the NLU of each vertical domain is also decoupled

3. High cohesion: the framework abstracts common dialogue management functions, the central controller is responsible for global scheduling, and vertical domain services focus on business logic

Stage 2: Shared capability building and vertical domain optimization

Once skill coverage and single-turn intent recognition have been optimized to a certain level, improving skill satisfaction depends more on the dialogue product experience and on building highlight skills.

At this stage there is strong demand for shared dialogue-semantics capabilities. Building them once helps reduce the cost of repeated development and maintenance across vertical domains, keeps the dialogue experience consistent, and safeguards quality and performance.

The dialog management components are currently under construction and are gradually being decoupled.

The design principles are:

1. Inversion of control: vertical domain DM services do not control the conversation directly. Instead they provide the necessary information through an abstract protocol, and the framework and the common dialogue management control the conversation and make decisions. The same applies to other dialog management components.

2. Single responsibility: the atomic capabilities of dialogue management are split into dialogue components, which the central control service orchestrates, reducing complexity and improving reusability.

3. Backward compatibility: DM services previously performed the full set of dialogue management functions. Protocol extensions preserve backward compatibility, so a DM service can either hand conversation management over or keep managing the conversation itself.
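The inversion-of-control principle can be sketched as follows. Each vertical DM only reports what it could do through a protocol object, and a central function makes the decision. The `DmReply` fields and the selection rule (prefer the domain that holds an ongoing multi-turn dialogue, otherwise take the highest confidence) are assumptions for illustration. The source does not describe the actual protocol or policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DmReply:
    """What a vertical-domain DM reports back through the (hypothetical) protocol."""
    domain: str
    action: str
    confidence: float
    expects_follow_up: bool = False  # the vertical asks to stay active on the next turn


def central_decide(replies: List[DmReply], active_domain: Optional[str]) -> DmReply:
    """Central control: prefer the domain holding an ongoing dialogue, else the best score."""
    for reply in replies:
        if reply.domain == active_domain:
            return reply
    return max(replies, key=lambda r: r.confidence)


replies = [
    DmReply("music", "play_song", 0.70),
    DmReply("alarm", "ask_time", 0.65, expects_follow_up=True),
]
chosen = central_decide(replies, active_domain=None)
# The central control, not the vertical, records which domain stays active.
next_active = chosen.domain if chosen.expects_follow_up else None
print(chosen.domain, next_active)
```

Because the verticals never write conversation state themselves, the decision rule can change in one place without touching any vertical service.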

In addition to the strong and weak multi-turn dialogue and intent recognition already supported in Stage 1, the following dialogue capabilities will be built gradually as product features are implemented, in order to shape the dialogue product experience and the highlight skills.

3.2 Dialogue framework

In the past, the most frequently iterated business services in the dialogue system were DM and NLU, which implement dialogue logic and semantic understanding respectively.

To solve the common problems of DM service development and NLU service development, two frameworks were abstracted: the DM framework and the DAG framework.

DM framework

The DM service takes the domain, intent, slots and dialogue state as input, and outputs a skill action and the new dialogue state.

The DM service of Xiaobu Assistant has gone through two stages:

1. In the multi-channel dialogue management stage, each DM service is responsible for complete dialogue management

2. In the centralized dialogue management stage, the DM service is responsible only for outputting actions, and dialogue management is delegated to the upper-level central control service

An analysis of the common problems of business DM services across the two stages found the following:

1. Business processes are highly similar, so there is a basis for unifying them

2. Dialogue management capabilities are built repeatedly

3. Code structure differs so much between services that it is hard for newcomers to read

4. Each DM service provides its own SDK for upper layers to call, so interfaces and protocols cannot be managed centrally

The DM service development framework solves these problems. Its design principles are as follows:

1. Use layered design to decouple business logic and reduce coupling and mutual interference between businesses

2. Use Spring EL expressions plus annotations to standardize code style and improve readability

3. Rely on dependency inversion, the Liskov substitution principle and interface-oriented programming so that each business can implement its differentiated logic in the upper layer
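The combination of a unified business flow with per-business differences can be sketched with an abstract base class. The framework described here is Java-based (it uses Spring EL), so this Python sketch only illustrates the principle. The class names, hook names and action strings are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

State = Dict[str, str]


class DmHandler(ABC):
    """Framework layer: a fixed business flow. Verticals implement only the hooks."""

    def handle(self, intent: str, slots: Dict[str, str], state: State) -> Tuple[str, State]:
        new_state = {**state, **slots}  # common step: merge new slots into the state
        missing = [s for s in self.required_slots(intent) if s not in new_state]
        if missing:  # common step: multi-turn slot filling
            return f"ask:{missing[0]}", new_state
        return self.execute(intent, new_state), new_state

    @abstractmethod
    def required_slots(self, intent: str) -> List[str]:
        """Slots the vertical needs before it can act."""

    @abstractmethod
    def execute(self, intent: str, state: State) -> str:
        """Vertical-specific business logic that produces the skill action."""


class AlarmHandler(DmHandler):
    def required_slots(self, intent: str) -> List[str]:
        return ["time"]

    def execute(self, intent: str, state: State) -> str:
        return f"alarm_set:{state['time']}"


handler: DmHandler = AlarmHandler()  # callers depend only on the abstract interface
action, state = handler.handle("create_alarm", {}, {})
print(action)  # ask:time
action, state = handler.handle("create_alarm", {"time": "07:00"}, state)
print(action)  # alarm_set:07:00
```

Any subclass can stand in for `DmHandler` without the caller changing, which is the substitution property the third principle relies on.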

DAG framework

The NLU of each vertical domain was initially prototyped in Python and served through a Java sidecar proxy.

Some engineering problems gradually emerged:

1. The operators of each algorithm group have similar capabilities, but the call order differs considerably. The same operator capabilities are built repeatedly, operator maintenance costs are high, and operators are not shared between groups

2. Python allowed teams to iterate in an agile way, but service performance was a problem

To reuse capabilities across skill NLU domains, improve monitoring, improve performance and efficiency, and bring new skill NLU domains online quickly, operators were consolidated into layers and orchestrated with the DAG framework.

Operator hierarchy design

The basic class library layer provides the low-level capabilities, and the operators in the layer above are implemented on top of it. The business layer uses the DAG framework to combine operators into the topology graph of the process to be executed (as shown below), and thereby builds a domain NLU quickly.
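A minimal sketch of operator orchestration with a DAG, using the standard-library `graphlib` module (Python 3.9+). The four operators and their dependencies are invented for illustration. This version runs operators sequentially in topological order, whereas a real framework would likely run independent operators in parallel.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


# Hypothetical reusable operators: each reads the shared context and returns its output.
def normalize(ctx: Context) -> str:
    return ctx["query"].strip().lower()


def tokenize(ctx: Context) -> List[str]:
    return ctx["normalize"].split()


def match_intent(ctx: Context) -> str:
    return "alarm.create" if "alarm" in ctx["tokenize"] else "unknown"


def extract_slots(ctx: Context) -> Dict[str, str]:
    times = [t for t in ctx["tokenize"] if t.endswith(("am", "pm"))]
    return {"time": times[0]} if times else {}


OPERATORS: Dict[str, Callable[[Context], Any]] = {
    "normalize": normalize,
    "tokenize": tokenize,
    "match_intent": match_intent,
    "extract_slots": extract_slots,
}

# Business-layer configuration: operator name -> operators it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "normalize": [],
    "tokenize": ["normalize"],
    "match_intent": ["tokenize"],
    "extract_slots": ["tokenize"],
}


def run_dag(query: str) -> Context:
    """Run every operator after its dependencies, storing each output under its name."""
    ctx: Context = {"query": query}
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        ctx[name] = OPERATORS[name](ctx)
    return ctx


result = run_dag("  Set an ALARM for 7am ")
print(result["match_intent"], result["extract_slots"])
```

A new domain NLU would then be a new `DEPENDENCIES` mapping over the same operator library rather than new code.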

Pilot business benefits:

1. Average response time reduced by 71.8%

2. Single-instance concurrency increased 50-fold

3. Operator code reuse rate of 95.7% for a single skill

3.3 Performance optimization practices

Xiaobu Assistant pursues the best possible user experience, and fluency is one of its most important dimensions.

A high-speed camera films Xiaobu Assistant and comparable products as they start an interaction at the same time, and the time at which each displays the skill result is compared. The win rate, weighted by the actual share of each query type in online traffic, serves as the core indicator of fluency.
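The text does not give the exact formula, so the following is only one plausible reading: a per-skill win rate from head-to-head trials, weighted by the share of online queries for that skill. All numbers are made up.

```python
from typing import Dict, Tuple


def weighted_win_rate(results: Dict[str, Tuple[int, int, float]]) -> float:
    """results maps skill -> (wins, trials, share of online queries)."""
    total_share = sum(share for _, _, share in results.values())
    weighted = sum(wins / trials * share for wins, trials, share in results.values())
    return weighted / total_share


# Hypothetical trial outcomes and traffic shares.
sample = {
    "weather": (8, 10, 0.5),
    "alarm": (6, 10, 0.3),
    "music": (5, 10, 0.2),
}
# 0.8 * 0.5 + 0.6 * 0.3 + 0.5 * 0.2 = 0.68
print(round(weighted_win_rate(sample), 2))  # 0.68
```

Weighting by traffic share means that a skill with frequent queries moves the indicator more than a rarely used one.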

The following describes the engineering practices for fluency optimization.

Problem analysis

1. The server side accounts for the largest share of time. Within server-side time, third-party resource execution takes the largest share (80%+)

2. Server-side speech recognition takes the second largest share

3. Client-side rendering and interaction can be simplified. For some vertical skills the client interaction can be more concise and execute faster

Overall solution

1. Parallelization: prediction, converting serial steps to parallel ones, and merging

2. Pruning: fast and slow path layering, multi-level caching

3. Acceleration: self-built replacements for third-party resources, cloud VAD, simplified interaction and execution optimization
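The multi-level cache mentioned under pruning could look roughly like the sketch below. It assumes two levels, an in-process dictionary and a stand-in for a shared cache service, in front of a slow third-party resource. The source does not describe the actual cache layout, and expiry and eviction are omitted.

```python
from typing import Callable, Dict


class TwoLevelCache:
    """Look up level 1, then level 2, and only then call the slow loader."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self.local: Dict[str, str] = {}   # level 1: in-process memory
        self.shared: Dict[str, str] = {}  # level 2: stand-in for a shared cache service
        self.loader = loader              # slow path, e.g. a third-party resource
        self.loads = 0                    # number of slow-path calls, for inspection

    def get(self, key: str) -> str:
        if key in self.local:
            return self.local[key]
        if key not in self.shared:
            self.loads += 1
            self.shared[key] = self.loader(key)
        self.local[key] = self.shared[key]
        return self.local[key]


cache = TwoLevelCache(loader=lambda key: key.upper())  # placeholder loader
cache.get("weather:shenzhen")
cache.get("weather:shenzhen")
print(cache.loads)  # 1
```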

General idea of prediction

Prediction is the feature with the greatest architectural complexity, so the practice of Xiaobu Assistant is described in more detail here.

During a voice interaction, intermediate results from streaming ASR are continuously displayed on screen until VAD detects the end of speech and the complete user audio input has been obtained.

Taking advantage of this characteristic, prediction lets the system “think while listening”: it runs recognition and execution in parallel and shortens the serial waiting time.

There are two strategies:

1. Execute in parallel with the VAD phase: high accuracy, low gain.

2. Execute in parallel with the recognition phase: low accuracy, high gain.

The first strategy is currently the main one in use, balancing the cost of amplified back-end requests against the time saved.
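The first strategy can be sketched with a thread pool. A predictive request is started on the latest transcript while VAD is still waiting to confirm the end of speech, and its result is reused only if the final transcript matches. The class, method names and the exact-match hit rule are assumptions for illustration.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional, Tuple


def execute_skill(text: str) -> str:
    """Stand-in for the slow downstream NLU, DM and resource calls."""
    return f"result for '{text}'"


class Predictor:
    def __init__(self, pool: ThreadPoolExecutor) -> None:
        self.pool = pool
        self.predicted_text: Optional[str] = None
        self.future: Optional[Future] = None

    def on_partial(self, text: str) -> None:
        """Speech seems to have stopped: start a predictive request on the transcript so far."""
        self.predicted_text = text
        self.future = self.pool.submit(execute_skill, text)

    def on_final(self, text: str) -> Tuple[str, bool]:
        """VAD confirmed the end: reuse the prediction on a hit, else run the formal request."""
        if self.future is not None and text == self.predicted_text:
            return self.future.result(), True
        return execute_skill(text), False


with ThreadPoolExecutor(max_workers=2) as pool:
    predictor = Predictor(pool)
    predictor.on_partial("what is the weather today")
    reply, hit = predictor.on_final("what is the weather today")
print(hit)  # True
```

On a hit, the time spent waiting for VAD overlaps with execution. On a miss, the predictive work is wasted, which is the request amplification cost mentioned above.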

Prediction has a large impact on the architecture and is difficult to implement. One request is split into N-1 predictive (non-formal) requests and 1 formal request, and downstream services cannot know whether a given request is the formal one. Stateful services therefore introduce side effects that lead to incorrect results.

There are three ways to solve the problem:

1. Roll back the state after each predictive request

2. Commit the state after the formal request completes

3. Make the services stateless

Prediction solution: roll back the state for each predictive request

The difficulty is that ordering is hard to guarantee, and a distributed transaction is needed to ensure that the following steps run within one transaction:

1. Roll back (undo)

2. Execute the dialog business logic

3. Write the dialog state

Prediction solution: commit the state after the formal request completes

The implementation difficulties are as follows:

1. It is highly intrusive to business logic: every place where a service maintains state has to be modified to implement try, confirm and cancel

2. Requests are amplified, and back-end write requests increase by 1/N. In general, the number of predictive requests N is small
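The try, confirm and cancel steps can be sketched on an in-memory store. Every request writes its state as pending, and only the formal request is confirmed. The class and method names are hypothetical. In a real service each state write path would need this treatment, which is the intrusiveness noted above.

```python
from typing import Dict, Tuple

State = Dict[str, str]


class TccStateStore:
    """Dialog-state store where writes stay pending until they are confirmed."""

    def __init__(self) -> None:
        self.committed: Dict[str, State] = {}             # session id -> state
        self.pending: Dict[str, Tuple[str, State]] = {}   # request id -> (session id, state)

    def try_write(self, request_id: str, session_id: str, state: State) -> None:
        """Try: record the intended write without making it visible."""
        self.pending[request_id] = (session_id, state)

    def confirm(self, request_id: str) -> None:
        """Confirm: make the pending write of the formal request visible."""
        session_id, state = self.pending.pop(request_id)
        self.committed[session_id] = state

    def cancel(self, request_id: str) -> None:
        """Cancel: drop the pending write of a predictive request."""
        self.pending.pop(request_id, None)


store = TccStateStore()
store.try_write("predict-1", "session-a", {"last_query": "turn on the"})
store.try_write("formal", "session-a", {"last_query": "turn on the light"})
store.confirm("formal")
store.cancel("predict-1")
print(store.committed["session-a"])
```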

Prediction solution: make the services stateless

1. State persistence is unified upstream, and state reads and writes are carried in the request protocol. The dialogue state is smaller than 1 KB

2. Services that cannot be made stateless reject predictive requests

This solution suits the data volume of Xiaobu Assistant. The architecture is simpler and more elegant, and it is friendlier to performance and availability.
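A minimal sketch of the stateless approach: the downstream handler is a pure function of the request, the dialogue state travels in the request and response, and only the upstream persists state, and only for the formal request. The field names and the `Upstream` class are assumptions for illustration.

```python
from typing import Any, Dict


def handle_stateless(request: Dict[str, Any]) -> Dict[str, Any]:
    """Downstream service: reads state from the request and returns the new state."""
    state = dict(request["state"])
    state["turns"] = state.get("turns", 0) + 1
    state["last_query"] = request["query"]
    return {"reply": f"echo: {request['query']}", "state": state}


class Upstream:
    """Owns persistence. Predictive requests never change the stored state."""

    def __init__(self) -> None:
        self.sessions: Dict[str, Dict[str, Any]] = {}

    def call(self, session_id: str, query: str, formal: bool) -> str:
        # The state rides along in the request (small in practice, under 1 KB per the text).
        request = {"query": query, "state": self.sessions.get(session_id, {})}
        response = handle_stateless(request)
        if formal:
            self.sessions[session_id] = response["state"]
        return response["reply"]


upstream = Upstream()
upstream.call("session-a", "turn on the", formal=False)       # predictive: no side effect
upstream.call("session-a", "turn on the light", formal=True)  # formal: state persisted
print(upstream.sessions["session-a"])
```

Because a predictive request leaves nothing behind downstream, no rollback or confirmation step is needed.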

Prediction benefits

Some skills with high hit rates reach a hit rate of 70%+ and take 60%+ less time

Across all enabled skills, the overall hit rate is 42.3% and the time taken is 43% lower

4. Challenges and Prospects

As the algorithm schemes and product scenarios of the dialogue system keep expanding, the processing chain becomes more and more complex, and the engineering architecture faces major challenges in scalability, performance and availability.

  • Algorithm schemes: NLU optimization moves from single-turn to multi-turn, dialogue decisions from rules to models, and from standardized to personalized

  • Product scenarios: multi-device, multi-entry, multimodal

In the future, Xiaobu Assistant will consider the following directions:

  • Dialogue system component decoupling: for cloud-side scalability, a microkernel central control, components that respond to changes in algorithms and products, and a common component library that governs performance and availability

  • End-cloud interaction mechanism optimization: for end-side scalability, the dialogue system responds asynchronously to end-side change events and adapts to complex multi-device, multi-entry, multimodal interaction

  • Open protocol and SDK: provide internal business extension points and pool the company’s resources to build the Xiaobu Assistant technology brand; expand the skill ecosystem by integrating with external skill platforms