With the popularity of mobile intelligent devices, the demand for chatbots and intelligent personal assistants is becoming increasingly urgent. One industry view is that chatbots powered by artificial intelligence will become the mobile interface of the future, fundamentally changing the experience of human-computer interaction. We have already seen products such as Amazon Echo and Google Home, which are widely used in everyday life, e-commerce, information access, and more. But socialbots (also known as chatbots or chit-chat bots) remain an open field in AI research, with a number of challenges for industry and academia to tackle.

The development of deep learning in recent years, and in particular the rise of deep reinforcement learning (DRL) over the past year, offers a possible technical path toward open-domain human-computer interaction. The salient feature of reinforcement learning is that the agent receives feedback from the user in the form of rewards and, through learning, produces responses that maximize the expected overall reward. The success of AlphaGo has driven great advances in reinforcement learning for sequential decision making. These advances have in turn pushed DRL research in automatic speech recognition and natural language understanding toward the challenges of understanding and responding in natural-language conversation. Bots based on deep reinforcement learning can expand into areas that are currently inaccessible and are well suited to open-domain chatbot scenarios.

This article introduces the model, experiments, and final system of MILABOT, proposed by Yoshua Bengio’s group at the University of Montreal, Canada, and accepted to the NIPS 2017 Demonstration track. MILABOT combines multiple NLP models using deep learning and performed well in the open-domain socialbot competition (the Amazon Alexa Prize) launched by Amazon in 2016, outperforming any single, non-combined model. What makes MILABOT distinctive is its use of reinforcement learning for response selection, combining at scale many of the successful NLP models and algorithms of the past decade while minimizing the need for manually crafted rules and states. Second, in training the parameterized models, the team used the opportunity provided by the Amazon competition to train and test state-of-the-art machine learning algorithms on real users. The trained system achieved significantly improved results in A/B tests.

Below, we introduce the main ideas and innovations of the paper.

An overview of the system

Early conversational systems were based on states and rules created manually by experts. Modern conversational systems typically use a combined learning architecture, folding manually crafted states and rules into statistical machine learning algorithms. Because of the complexity of human language, one of the biggest challenges in building an open-domain conversational bot is that it is impossible to enumerate all possible states.

Using an approach based entirely on statistical machine learning, MILABOT makes as few assumptions as possible about how to process and generate natural human conversation. Each component of the model is designed and optimized by machine learning, and the outputs of the components are combined and optimized by reinforcement learning. The design is inspired by ensemble (combinatorial) machine learning systems, which combine multiple independent statistical models into a stronger learning system. For example, the system that won the 2009 Netflix Prize used a combination of hundreds of models to predict users’ movie preferences; another example is IBM’s Watson, which won the quiz show Jeopardy! in 2011. These examples illustrate the advantages of combined learning.

In MILABOT, the dialogue manager (DM) combines a series of response models, with the DM acting as the agent in reinforcement learning; its control structure is shown in Figure 1. The DM combines the responses of all models according to a certain policy. In MILABOT’s design, the response models use a variety of strategies to generate responses on a variety of topics. This article details the design considerations behind these models and the selection policy.

Figure 1. Control structure of the dialogue manager.

As shown in Figure 1, the DM produces a response in three steps. First, the DM invokes the various response models to generate a set of candidate responses. If the candidate set contains a priority response, that priority response is returned. If there is no priority response, the system uses the model selection policy to score the candidates and selects a response from the set. If the confidence of the selected response falls below a given threshold, the user is asked to repeat their last utterance.
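As a rough illustration of this three-step flow (not the authors’ actual implementation), the sketch below uses a hypothetical Candidate type and a placeholder confidence threshold; the response models and selection policy are assumed to be provided elsewhere.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    confidence: float       # scalar confidence emitted by the response model
    priority: bool = False  # e.g. a template model may flag a priority response


def select_response(dialogue_history: List[str],
                    response_models: List[Callable[[List[str]], Candidate]],
                    policy: Callable[[List[str], List[Candidate]], Candidate],
                    confidence_threshold: float = 0.25) -> str:
    # Step 1: every response model proposes a candidate response.
    candidates = [model(dialogue_history) for model in response_models]

    # Step 2: if any model produced a priority response, return it directly.
    priority = [c for c in candidates if c.priority]
    if priority:
        return priority[0].text

    # Step 3: otherwise let the learned selection policy pick a candidate;
    # if its confidence is too low, ask the user to repeat themselves.
    best = policy(dialogue_history, candidates)
    if best.confidence < confidence_threshold:
        return "Sorry, I didn't catch that. Could you say that again?"
    return best.text
```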

Below, we describe the various response models used by MILABOT and the design considerations of the policy model in generating the response.

Response model

Each response model takes the dialogue as input and generates a response in natural-language form. In addition, each response model outputs one or more scalar values indicating the confidence of the given response (a minimal interface sketch is given at the end of this section). MILABOT uses a combination of 22 response models that draw on some of the outstanding NLP research of the last decade. The models can be divided into:

  • Template-based models, including Alicebot, Elizabot, and Storybot.
  • Knowledge-based question answering systems, including Evibot and BoWMovies.
  • Retrieval-based neural networks, including the VHRED models, Skip-Thought Vector models, Dual Encoder models, and bag-of-words retrieval models.
  • Retrieval-based logistic regression, including BoWEscapePlan, etc.
  • Search-engine-based neural networks, including LSTMClassifierMSMarco, etc.
  • Generative neural networks, including GRUQuestionGenerator, etc.

For descriptions and training details of the individual models, please refer to the detailed technical report accompanying the paper.
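To make the contract shared by these models concrete, here is a minimal, hypothetical interface sketch; the class names and the toy template logic are illustrative and are not taken from the MILABOT codebase.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class ResponseModel(ABC):
    """Common interface assumed for the 22 response models (hypothetical sketch).

    Each model maps the dialogue history to a natural-language response
    plus one or more scalar confidence values.
    """

    @abstractmethod
    def generate(self, dialogue_history: List[str]) -> Tuple[str, List[float]]:
        ...


class ToyTemplateModel(ResponseModel):
    """Toy stand-in for a template-based model such as Alicebot or Elizabot."""

    def generate(self, dialogue_history: List[str]) -> Tuple[str, List[float]]:
        last_utterance = dialogue_history[-1] if dialogue_history else ""
        if "movie" in last_utterance.lower():
            return "What kind of movies do you like?", [0.8]
        return "Tell me more about that.", [0.3]
```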

Model selection strategy

After the candidate response set has been generated by the response models, the DM uses the policy model to select, from the candidate set, the response to return to the user. The DM must select responses that improve overall user satisfaction, which requires a trade-off between responding in real time and maximizing user satisfaction, as well as a trade-off between immediate and long-term user satisfaction. Following the classical reinforcement learning framework of Richard Sutton and Andrew Barto, the paper treats this as a sequential decision-making problem: at time t the dialogue history is h_t, and the agent must choose a response a_t from a set of K candidate responses {a_t^1, ..., a_t^K}, receiving a reward r_t. The system then moves to the next dialogue history h_{t+1}, selects a response a_{t+1}, receives a reward r_{t+1}, and so on. The goal of reinforcement learning is to maximize the expected cumulative discounted reward R = Σ_t γ^t r_t, where γ is a discount factor. Factors to consider in building the reinforcement learning model include:

  • Parameterization of the action-value function: the action-value function Q_θ(h_t, a_t) is defined by parameters θ and estimates the expected return of choosing response a_t given dialogue history h_t. The parameters are learned so that Q_θ estimates the expected return accurately, and the agent selects the action that maximizes it.
  • Parameterization of the stochastic policy: assuming the policy is stochastic, actions are sampled from a parameterized distribution π_θ(a_t | h_t) ∝ exp(f_θ(h_t, a_t) / τ), where f_θ is a scoring function with parameters θ and τ is a temperature. A greedy variant can also be used, which selects the action with the highest probability (a minimal sketch follows this list).
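The following sketch illustrates the stochastic-policy parameterization described above: a softmax with temperature over the scores f_θ(h_t, a_t^k) of the K candidates, plus a greedy variant. The function names and the example scores are hypothetical.

```python
import numpy as np


def policy_distribution(scores: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Stochastic policy: a softmax over the scoring function f_theta(h_t, a_t^k)."""
    logits = scores / temperature
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def select_action(scores: np.ndarray, greedy: bool = True) -> int:
    """Choose a candidate index greedily or by sampling from the policy."""
    probs = policy_distribution(scores)
    if greedy:
        return int(np.argmax(probs))
    return int(np.random.choice(len(probs), p=probs))


# Example: scores f_theta for K = 4 candidate responses.
scores = np.array([1.2, -0.3, 0.7, 2.1])
print(policy_distribution(scores))  # action probabilities
print(select_action(scores))        # 3, the index of the highest-scoring response
```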

Figure 2. Computational graph of the scoring model used by the model selection policy. The computation is based on the action-value function and stochastic policy parameterizations.

The scoring function and action-value function are parameterized as a five-layer neural network, shown in Figure 2. The first layer is the input layer, which represents the dialogue history and a candidate response using features extracted from them. The features are based on word embeddings, dialogue acts, part-of-speech tags, unigram word overlap, bigram word overlap, and several model-specific features, for a total of 1,458 features (see the detailed report). The second layer contains 500 hidden units, computed by applying a linear transformation followed by a ReLU activation to the input features. The third layer contains 20 hidden units, computed by applying a linear transformation to the previous layer. The fourth layer contains 5 output probabilities, obtained by applying a linear transformation to the previous layer followed by a softmax; these correspond to the labels collected through Amazon Mechanical Turk (AMT). The fifth layer is the final output layer and produces a single scalar value, computed by a linear transformation of the units in the third and fourth layers. To learn the parameters of each layer, five different machine learning approaches were investigated; they are described after the following sketch.
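The sketch below is a rough PyTorch rendering of the five-layer architecture just described; the layer sizes (1,458 input features, 500 and 20 hidden units, 5 AMT label outputs, 1 scalar) follow the description above, while the class and variable names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScoringNetwork(nn.Module):
    """Illustrative sketch of the five-layer scoring model described above."""

    def __init__(self, n_features: int = 1458):
        super().__init__()
        self.hidden1 = nn.Linear(n_features, 500)  # layer 2: 500 hidden units + ReLU
        self.hidden2 = nn.Linear(500, 20)          # layer 3: 20 hidden units (linear)
        self.amt_head = nn.Linear(20, 5)           # layer 4: 5 AMT label probabilities
        # layer 5: a single scalar computed from layers 3 and 4 concatenated
        self.score_head = nn.Linear(20 + 5, 1)

    def forward(self, features: torch.Tensor):
        h1 = F.relu(self.hidden1(features))                   # layer 2
        h2 = self.hidden2(h1)                                  # layer 3
        amt_probs = F.softmax(self.amt_head(h2), dim=-1)       # layer 4
        score = self.score_head(torch.cat([h2, amt_probs], dim=-1))  # layer 5
        return score.squeeze(-1), amt_probs
```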

  • Supervised learning using crowdsourced labels. This method (called Supervised AMT) is the first stage of learning the scoring model, and the resulting parameters serve as initialization for the other methods. It uses supervised learning on crowdsourced label data to estimate the action-value function. The training data were collected through AMT, where human annotators scored candidate responses on a scale of 1 to 5. The team collected 199,678 labels from real Alexa user conversations, split into training (137,549), development (23,298), and test (38,831) sets. The scoring model parameters were optimized by maximizing the log-likelihood of the AMT labels, which correspond to the fourth layer of the neural network, using first-order SGD. Figure 3 compares the performance of several policies across the five label classes (ranking responses from best to worst). The results show that Supervised AMT achieves better performance than the other baselines (random, Alicebot, Evibot + Alicebot).

Figure 3. Frequency of AMT label classes for responses selected under different policies.

  • Supervised learning of a reward function. This method learns a reward function and then uses it to learn the model parameters. Given the dialogue history at a given time and the corresponding selected response, the reward is modeled with linear regression to predict the score a user would give the response; the learning objective is to predict this score as accurately as possible. The model parameters were optimized using mini-batch SGD, and bagging was used when combining models to improve efficiency. To avoid over-fitting during training, the model was initialized with the parameters of the Supervised AMT scoring model and then further optimized to minimize the squared error.
  • Off-policy reinforcement learning. This approach parameterizes the policy as a discrete probability distribution over actions, so that a stochastic policy can be learned directly from recorded conversations between the system and real users. MILABOT uses a reweighted (importance-sampled) off-policy reinforcement learning algorithm (a minimal sketch of such an update appears after this list), with the model again initialized from the Supervised AMT parameters. The training data consist of roughly 5,000 conversations recorded between the system and real users over a period of time. The policy parameters are optimized on the training set with SGD, while hyperparameters and early stopping are determined on the development set.
  • Off-policy reinforcement learning with a learned reward function. This approach combines the previous two: it applies the same off-policy reinforcement learning algorithm, but trains against the learned reward model instead of the recorded user scores. First, the learned reward function gives a more accurate prediction of the return for the conversation at a given moment. Then the model parameters are trained with mini-batch SGD, combining the regression-based reward model with policy-gradient reinforcement learning. The training data are the same recorded conversations used for off-policy reinforcement learning.
  • Q-learning with the Abstract Discourse Markov decision process (MDP). All of the methods above trade off variance against bias. The Supervised AMT method uses a large training set and therefore has low variance, but it introduces substantial bias. Off-policy reinforcement learning, on the other hand, uses only a few thousand dialogues and learns from real user scores, so its variance is very high; but because it directly optimizes the objective of interest, its bias is small. To address this, MILABOT’s team proposed a new method, the Abstract Discourse MDP, which learns the policy approximately in a Markov decision process in order to reduce variance while introducing only a reasonable amount of bias.
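As a minimal sketch of the importance-reweighted off-policy policy-gradient update mentioned above (assuming a PyTorch implementation and simplified tensor shapes; this is not the authors’ code):

```python
import torch


def offpolicy_reinforce_loss(logits: torch.Tensor,
                             actions: torch.Tensor,
                             behavior_logprobs: torch.Tensor,
                             returns: torch.Tensor) -> torch.Tensor:
    """Importance-weighted REINFORCE loss (simplified sketch).

    logits:            (T, K) scores f_theta for the K candidates at each turn
    actions:           (T,)   int64 index of the response actually chosen in the log
    behavior_logprobs: (T,)   log-probability the logged (behavior) policy gave it
    returns:           (T,)   discounted return observed from each turn onward
    """
    logprobs = torch.log_softmax(logits, dim=-1)
    chosen_logprobs = logprobs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Importance weights correct for the mismatch between the current policy
    # and the policy that generated the logged conversations.
    weights = torch.exp(chosen_logprobs - behavior_logprobs).detach()
    # REINFORCE: maximize weighted (log-prob * return), i.e. minimize its negative.
    return -(weights * chosen_logprobs * returns).mean()
```

In this sketch the returns would be the discounted user scores propagated back through each recorded conversation, while the behavior log-probabilities come from the policy that was running when the data were logged.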

Figure 4. Directed probabilistic graphical model of the Abstract Discourse MDP.

The directed probabilistic graphical model of the Abstract Discourse MDP is shown in Figure 4. At each time step t, z_t is a discrete variable representing the abstract state of the conversation, h_t is the dialogue history, a_t is the action taken by the system (the selected response), y_t is a sampled AMT label, and r_t is a sampled reward. The abstract state z_t is defined as a triple of discrete values: a dialogue-act state (accept, reject, request, question, and so on), a sentiment state (positive, negative, neutral), and a genericness state (true, false). Rollouts simulated from this model can be used directly for training; the training method is Q-learning with experience replay, with the policy parameterization converted into an action-value function. The evaluation of the various policies on AMT is shown in Table 1.
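To illustrate Q-learning with experience replay over the abstract state space, here is a deliberately simplified tabular sketch. The real system parameterizes the action-value function with the neural scoring model and fills the replay buffer by rolling out the Abstract Discourse MDP; the buffer size, action set, and learning rate below are toy assumptions.

```python
import random
from collections import deque, namedtuple

# Abstract state: (dialogue act, sentiment, whether the utterance is generic).
AbstractState = namedtuple("AbstractState", ["dialogue_act", "sentiment", "generic"])
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

replay_buffer = deque(maxlen=10_000)   # filled with rollouts sampled from the MDP
q_table = {}                           # (state, action) -> estimated value
alpha, gamma = 0.1, 0.99
N_ACTIONS = 5                          # toy fixed action set for illustration


def q_value(state, action) -> float:
    return q_table.get((state, action), 0.0)


def q_learning_step(batch_size: int = 32) -> None:
    """One Q-learning update over a mini-batch sampled from the replay buffer."""
    if len(replay_buffer) < batch_size:
        return
    for t in random.sample(list(replay_buffer), batch_size):
        best_next = max(q_value(t.next_state, a) for a in range(N_ACTIONS))
        target = t.reward + gamma * best_next
        q_table[(t.state, t.action)] = q_value(t.state, t.action) + alpha * (
            target - q_value(t.state, t.action))
```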

Table 1. Mean and standard deviation of policy scores on AMT, with 90% confidence intervals.

The experimental evaluation

The team used A/B testing to verify the effectiveness of the DM’s policy models for response selection. The tests were carried out in the Amazon competition environment: when an Alexa user started a conversation with the system, a policy was assigned at random, and the conversation content and score were recorded. A/B testing makes it possible to compare different policies in the same system environment while accounting for the fact that user behavior varies across time periods. The team’s testing took place in three phases.

The first phase tested the five policy-learning approaches and compared them against the heuristic baseline Evibot + Alicebot. The second phase focused on the off-policy reinforcement learning and Q-learning methods. In the third phase, off-policy reinforcement learning and Q-learning were tested further using models and training sets with optimized parameters. The test results are shown in Table 2.

Table 2. A/B test results with 95% confidence intervals. * indicates statistical significance at the 95% level.

The test results show that off-policy reinforcement learning and Q-learning perform better than the other policies, with Q-learning receiving the best ratings on average. Overall, the experiments demonstrate the effectiveness of the combined approach: MILABOT continuously improves its policy by combining the responses of multiple NLP models and using the policy model to select the best-rated response.

Conclusion

The paper proposes MILABOT, a new dialogue system based on large-scale combined learning, and validates it in the Amazon Alexa Prize competition. MILABOT uses a range of machine learning methods, including deep learning and reinforcement learning, and the team proposed a novel reinforcement learning approach. Compared against existing reinforcement learning methods through A/B testing, it achieved better conversational performance on real Alexa user data.

The paper proposes two directions for further work. One is personalization, so that the chatbot can provide a better experience for each user; a possible technical approach is to learn an embedding vector for each user. The other is text-based evaluation, to eliminate the impact of speech recognition errors on the chatbot.

A Deep Reinforcement Learning Chatbot

Thanks to CAI Fangfang for proofreading this article.