1 Overview

"Guess You Like" is the recommendation slot with the largest traffic on Meituan. Located at the bottom of the home page, it takes the form of an information feed and is responsible for helping users convert their intent, helping them discover new interests, and directing traffic to Meituan's various business lines. After years of iteration, the baseline ranking model of "Guess You Like" is an industry-leading Wide&Deep model [1] with streaming updates. However, a point-wise model does not describe the correlations between candidate items, and the product does not capture user intent sufficiently well, so there is still room to improve the recommendation experience and effect by working on the model and features and understanding users' intent over time more deeply. In recent years, reinforcement learning has achieved remarkable results in games, control, and other fields. We therefore tried to use reinforcement learning to address these problems, with the optimization goal of maximizing the long-term return over the multi-round interaction between the recommendation system and the user.

In our previous work, we started from basic Q-learning and made technical attempts along the path from low-dimensional to high-dimensional states, from discrete to continuous actions, and from offline to real-time model updates. This article introduces the algorithmic and engineering experience of applying reinforcement learning to Meituan's "Guess You Like" slot. Section 2 introduces the MDP modeling based on multi-round interaction, which is strongly tied to the business scenario; we invested heavily in modeling user intent, which laid the foundation for reinforcement learning to achieve positive gains. Section 3 introduces the optimization of the network structure: to address the problems that reinforcement learning training is unstable, hard to converge, sample-inefficient, and demands massive training data, we improved the DDPG model in combination with the online A/B test setting and achieved stable positive gains. Section 4 introduces a lightweight real-time DRL framework, in which we made optimizations for TensorFlow's limited support for online learning and for the latency spikes that occur when TF Serving updates a model.

2 MDP Modeling

In the "Guess You Like" slot, users can turn pages, forming multiple rounds of interaction with the recommendation system. During this process the recommendation system senses users' real-time behavior, so that it can understand them better and provide a better experience in subsequent interactions. The distribution of pages viewed per user in "Guess You Like" is long-tailed; in Figure 2 the user count is plotted on a logarithmic scale. It can be seen that multi-round interaction does naturally exist in the recommendation scenario.

Figure 2: Page-turning statistics of "Guess You Like" users

In this multi-round interaction, we regard the recommendation system as the Agent and the user as the Environment, and model the interaction process between them as an MDP <S, A, R, P>:

  • State: the Agent's observation of the Environment, namely the user's intent and the context.
  • Action: an adjustment of the recommendation list at list-wise granularity, with the long-term return taken into account in the current decision.
  • Reward: derived from user feedback; the Agent is directly accountable for the business objective.
  • P(s, a): the state-transition probability when the Agent takes action a in state s.

Figure 3: Interaction between the recommendation system and the user

Our optimization goal is to maximize the Agent's cumulative return over multiple rounds of interaction:
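In standard notation (a generic form; the exact discounting scheme used here is an assumption), this objective is the expected cumulative discounted reward over an interaction session:

$$\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T}\gamma^{t}\, r_t\Big]$$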

Specifically, we model the MDP <S, A, R, P> of the interaction process as follows:

2.1 State Modeling

The state comes from the Agent's observation of the Environment, which in the recommendation scenario means the user's intent and the context. We designed the network structure shown in Figure 4 to extract the state representation. The network has two main parts: the item embeddings of the user's real-time behavior sequence are fed into a one-dimensional CNN to learn a representation of the user's real-time intent; and, since the recommendation scenario still relies heavily on traditional feature engineering, dense and embedding features are used to represent the user's time, location, and scene, as well as behavior habits mined over a longer period.

Figure 4: State modeling network structure
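As a rough sketch (all layer sizes and dimensions below are illustrative assumptions, not the production values), the two-part state network described above might look like this:

```python
import tensorflow as tf

# Illustrative dimensions; the real sequence length and feature sizes differ.
SEQ_LEN, EMB_DIM, DENSE_DIM = 20, 32, 100

behavior_seq = tf.keras.Input(shape=(SEQ_LEN, EMB_DIM), name="item_emb_seq")
dense_feats = tf.keras.Input(shape=(DENSE_DIM,), name="dense_and_emb_feats")

# 1-D CNN over the real-time behavior sequence -> real-time intent vector.
x = tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation="relu")(behavior_seq)
intent = tf.keras.layers.GlobalMaxPooling1D()(x)

# Concatenate with dense / embedding features describing time, location, scene,
# and longer-term behavior habits (e.g. the Binary Sequence codes below).
state = tf.keras.layers.Concatenate(name="state")([intent, dense_feats])

state_model = tf.keras.Model([behavior_seq, dense_feats], state)
```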

Here we introduce the Binary Sequence [2] method, which uses embedding features to represent mined user behavior. Through feature engineering, we abstract the user's behavior sequence along various dimensions into discrete N-ary codes, where each digit has N states. For example, whether the user clicked within the 1-hour / 6-hour / 1-day / 3-day / 1-week windows can be encoded as a 5-bit binary number, and the representation of these numbers is then learned as a discrete feature; this serves as one kind of feature-processing method. Other examples include transitions between clicked categories, the gaps between clicks, and so on. These features achieved good results both in the "Guess You Like" ranking model and in the state modeling for reinforcement learning. The reason is that even with very rich behavior data, sequence models are limited by complexity and efficiency and cannot fully exploit such information, so Binary Sequence is a good supplement.
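A minimal sketch of the 5-bit click-window code from the example above (the window set comes from the example; everything else is an assumption):

```python
from typing import Sequence

# 1h / 6h / 1d / 3d / 1w windows, in seconds.
WINDOWS_SECONDS = [3600, 6 * 3600, 24 * 3600, 3 * 24 * 3600, 7 * 24 * 3600]

def click_window_code(click_ts: Sequence[int], now_ts: int) -> int:
    """Encode the click history as a 5-bit number: bit i is 1 if at least one
    click happened within the i-th time window. The code is then treated as a
    discrete feature whose embedding is learned by the model."""
    code = 0
    for i, window in enumerate(WINDOWS_SECONDS):
        if any(now_ts - t <= window for t in click_ts):
            code |= 1 << i
    return code  # e.g. 0b10011
```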

The left part of Figure 5 compares the offline effect of the sequence model under different pooling methods and the one-dimensional CNN; the right part shows the offline effect of adding dense and embedding features covering the user's high-frequency behaviors, distance, time intervals between behaviors, behavior counts, and intent transitions, all of which were significantly positive.

2.2 Action Design

The ranking model currently used by "Guess You Like" consists of two Wide&Deep models with the same structure, trained with click and payment as their respective targets; the outputs of the two models are then fused. The fusion is shown in the figure below:

Figure 6: Schematic diagram of the ranking model

The physical meaning of the hyperparameter θ is to adjust the trade-off between the click model and the order model over the full data set. It is chosen by jointly considering the AUC of the click task and of the order task, and it contains no personalized factor. Taking this as the starting point, we let the Agent's action adjust the fusion so that:

Here a is generated by the Agent's policy, which has two advantages. First, we know that a = 1 is a reasonable solution at which the reinforcement learning strategy coincides with the baseline ranking strategy; since reinforcement learning is a trial-and-error process, we can simply initialize the Agent's policy at a = 1 and thus avoid hurting online performance in the early stage of the experiment. Second, it allows us to clip the action according to its physical meaning, which limits the real impact of the unstable reinforcement learning update process.
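A minimal sketch of these two tricks, with a hypothetical clipping range (the actual bounds used online are not given here):

```python
import numpy as np

# Hypothetical bounds around the baseline value a = 1.
A_MIN, A_MAX = 0.8, 1.2

def init_output_layer(hidden_dim: int):
    """Initialize the Actor's output layer with zero weights and bias 1, so the
    initial policy outputs a = 1 and is identical to the baseline ranking."""
    return np.zeros((hidden_dim, 1)), np.ones(1)

def act(hidden: np.ndarray, w: np.ndarray, b: np.ndarray) -> float:
    """Output-layer forward pass plus clipping by physical meaning, which caps
    the online damage an unstable update can do."""
    raw = float(hidden @ w + b)
    return float(np.clip(raw, A_MIN, A_MAX))
```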

2.3 Reward shaping

The core metrics of the "Guess You Like" slot are click-through rate and order rate, and the denominators of these metrics are essentially the same across experiment buckets, so the business objective can be treated as maximizing the number of clicks and the number of orders. We initially shaped the reward accordingly:

Compared with a point-wise ranking model, which focuses on the conversion efficiency of each individual item, reinforcement learning maximizes the cumulative reward over multiple rounds of interaction and is directly accountable for the business objective.

Figure 7: Relative metric changes before and after adding the penalty terms

During the experiment we found that the reinforcement learning strategy could perform well right after launch, with some improvement in clicks and orders, but the gains gradually declined afterwards, as shown in the first half of Figure 7. Analyzing the layer-by-layer conversion efficiency, we found that in the reinforcement learning bucket the device exposure rate and the UV-level click-through rate dropped while dwell time and browsing depth increased steadily, indicating that the Agent had learned a strategy of making users interact more with the recommendation system in order to obtain more exposure and conversion opportunities. However, this strategy hurts the experience of users with a strong intent to order, because diverting their intent costs them more and their expectation of this slot is lower. We therefore added two penalty terms to the reward shaping:

  1. Penalize intermediate pages on which no conversion (click/order) behavior occurred (penalty1), pushing the model to shorten the path of user intent conversion;
  2. Penalize pages on which no conversion occurred and the user left (penalty2), to protect the user experience.

The corrected reward is:
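As an illustrative sketch only (the coefficients below are symbolic placeholders, not the values used in production), the shaped reward for a page can be written as:

$$r_t \;=\; \#\mathrm{click}_t \;+\; w_o\cdot \#\mathrm{order}_t \;-\; p_1\cdot \mathbb{1}[\text{intermediate page with no conversion}] \;-\; p_2\cdot \mathbb{1}[\text{user leaves with no conversion}]$$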

Because user experience unfolds continuously over time, UV-level metrics lag somewhat in the reports. About a week later, the click-through rate and the number of orders returned to positive, while dwell time and browsing depth improved further, indicating that the Agent had really learned a strategy that obtains more conversions from multi-round interaction without hurting users, as shown in the second half of Figure 7.

In this section we introduced the MDP modeling work. MDP modeling is highly specific to the business scenario, and the experience does not transfer easily. For our scenario, we spent most of the effort on features for the state representation, which is what first enabled reinforcement learning to achieve positive gains on its own objective, so this part is described in detail. The action design targets multi-objective model fusion, a setting that is common in industry, not well suited to supervised learning, and therefore a good showcase for reinforcement learning. Reward shaping bridges the gap between the reinforcement learning objective and the business objective and requires some data insight and business understanding. After this work, reinforcement learning achieved some positive effect on its own objective and on business metrics, but it was not stable enough. In addition, since policy iteration is an online learning process, after an experiment went online it took a week of real-time training before the policy converged and we could observe the effect, which seriously limited our iteration speed. We therefore made the following improvements to the model.

3 Improved DDPG Model

On the model side, we tried Q-learning, DQN [3], and DDPG [4] while continuously refining the MDP modeling, but we also ran into the usual reinforcement learning problems of unstable updates, non-convergent training, and low learning efficiency (meaning low sample efficiency, so a large number of samples is required). Specifically, in the recommendation scenario there are far fewer list-wise samples than point-wise samples, and training requires real actions and real feedback, so we could only train in real time on the small traffic of the experiment group. The resulting training data volume was small, only a few hundred thousand samples per day, and iteration was slow. We therefore made several improvements to the network structure, including introducing a specific advantage function, state weight sharing, an on-policy optimization, roughly tenfold data augmentation by exploiting the online A/B test framework, and support for pre-training. We first review the DDPG model that these improvements build on.

Figure 8: The DDPG model

As shown in Figure 8, the basic DDPG uses the Actor-Critic architecture. Online, the Actor network predicts the best action a for the current state s, and random noise generated by an Ornstein-Uhlenbeck process is added to the predicted action to obtain a', so that the policy explores in the neighborhood of the current optimum. a' is applied online and the corresponding return is obtained from the user (the Environment). During training, the Critic learns to estimate the return of taking action a in state s, with MSE as the loss function:

Taking the gradient with respect to the Critic's parameters:

The Actor uses the policy gradient back-propagated through the Critic and applies gradient ascent to maximize the Q estimate, continuously improving the policy:

In the deterministic policy gradient formula, θ is the policy parameter; the Agent uses the policy μ_θ(s) to generate action a in state s, and ρ^μ denotes the state distribution under this policy. During learning we never need to estimate the value of the policy itself: the Critic keeps refining its estimate of Q(s, a), and the Actor improves the policy by following the gradient provided by the Critic. This repeats until the Actor converges to the optimal policy and the Critic converges to an accurate estimate of Q(s, a).
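For reference, in a standard formulation (target-network details omitted), the Critic loss and the deterministic policy gradient described above can be written as:

$$L(w)=\mathbb{E}\Big[\big(r_t+\gamma\,Q_w(s_{t+1},\mu_\theta(s_{t+1}))-Q_w(s_t,a_t)\big)^2\Big]$$

$$\nabla_\theta J(\mu_\theta)\approx\mathbb{E}_{s\sim\rho^\mu}\Big[\nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\;\nabla_\theta\mu_\theta(s)\Big]$$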

The following subsections describe our improvements to this DDPG model.

3.1 Advantage function

Borrowing the idea of the advantage function from the Dueling DQN work [5], we decompose the estimated Q(s, a) into two parts: a value function V(s) that depends only on the state, and an advantage function A(s, a) that depends on both the state and the action, so that Q(s, a) = V(s) + A(s, a). This decomposition alleviates the Critic's over-estimation of Q. Moreover, in our recommendation setting the policy only adjusts the fusion parameter of the ranking model, so the return is mainly determined by the state.

Figure 9: Comparison of Q values between the experiment group and the baseline

As shown in Figure 9, the ratio between the mean of V(s) and the mean of A(s, a) observed in the actual experiment is about 97:3, which confirms our judgment. In the actual training process we first fit V(s) from states and returns, and then train A(s, a) on the residual Q(s, a) − V(s). This greatly improves training stability, and the residual lets us directly observe whether the current policy is better than the baseline: as long as A(s, a) is stably greater than 0, we can consider that reinforcement learning has achieved a stable positive gain on its own objective.

3.2 State weight sharing

Inspired by the A3C [6] network, we note that in DDPG the state is represented separately in the Actor network and the Critic network. In our scenario, however, most parameters are concentrated in the state part, on the order of a hundred thousand, while the remaining parameters number only in the thousands. We therefore share the weights of the state part between the two networks, which cuts the number of training parameters roughly in half.

Figure 10: Network structure with the advantage function and state weight sharing

The improved network structure is shown in Figure 10. Notice that it contains a V(s) branch that is independent of the action, which means we can learn the expected Q of a state without a specific action. This allows us to pre-train offline on tens of millions of samples produced by the baseline policy, and then update in real time online using both baseline and experiment traffic, improving training quality and stability. Because this update path covers all of the state parameters, most of the model can be fully pre-trained, and only the action-related parameters have to rely on online learning, which greatly improves our experiment iteration speed: previously we had to wait a week of real-time training before observing the effect, whereas after the improvement we could start observing it the day after going online.
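A sketch of this structure, with illustrative layer sizes (the real network in Figure 10 is not reproduced here): a shared state tower feeds an action-independent V(s) head, an A(s, a) head, and the Actor head.

```python
import tensorflow as tf

STATE_DIM, ACTION_DIM = 128, 1  # illustrative only

state_in = tf.keras.Input(shape=(STATE_DIM,), name="state")
action_in = tf.keras.Input(shape=(ACTION_DIM,), name="action")

# Shared state representation: most parameters live here, so sharing it
# between Actor and Critic roughly halves the trainable parameters.
shared = tf.keras.layers.Dense(256, activation="relu", name="shared_state")(state_in)

# V(s): action-independent, so it (and the shared tower) can be pre-trained
# offline on baseline-policy data.
v_out = tf.keras.layers.Dense(1, name="V")(
    tf.keras.layers.Dense(64, activation="relu")(shared))

# A(s, a) and the Actor head: updated by online learning on experiment traffic.
a_hidden = tf.keras.layers.Concatenate()([shared, action_in])
a_out = tf.keras.layers.Dense(1, name="A")(
    tf.keras.layers.Dense(64, activation="relu")(a_hidden))
actor_out = tf.keras.layers.Dense(ACTION_DIM, name="actor")(shared)

q_out = tf.keras.layers.Add(name="Q")([v_out, a_out])

value_net = tf.keras.Model(state_in, v_out)             # pre-trained offline
critic = tf.keras.Model([state_in, action_in], q_out)   # Q(s,a) = V(s) + A(s,a)
actor = tf.keras.Model(state_in, actor_out)

# Training order per Section 3.1: fit V(s) against observed returns first,
# then fit A(s, a) on the residual return - V(s).
```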

3.3 On-policy

The A2C [7] authors argue that their synchronous implementation performs better than the asynchronous A3C, and that they saw no evidence that the noise introduced by asynchrony provides any performance benefit. To improve training efficiency we take the same approach and, in addition, use the same set of parameters to estimate Q_{t+1} and to update Q_t, which cuts the model parameters roughly in half again.
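Concretely (standard notation; no separate target networks are kept), the TD target and the Critic update use the same parameters w and θ:

$$y_t = r_t + \gamma\, Q_w\big(s_{t+1},\, \mu_\theta(s_{t+1})\big), \qquad L(w) = \big(y_t - Q_w(s_t, a_t)\big)^2$$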

3.4 Extension to Multi-Group Parallel Policies

Considering that several reinforcement learning experiments are online at the same time, and taking advantage of the A/B test environment, we extended the above network to the multi-Agent case.

Figure 11: DDPG model supporting multiple concurrent online experiments

As shown in Figure 11, the online experiment groups share the state representation and the V(s) estimate, while each policy trains its own A(s, a) network and can converge quickly. This structure makes the training process more stable and, at the same time, paves the way for rolling the reinforcement learning strategy out to full traffic.

Figure 12: Daily click-through-rate effect of the experiment

In the DDPG improvement work, the advantage function gives us a more stable training process and policy gradient; state weight sharing and the on-policy change reduce the model parameters by 75%; and combining the advantage function with state weight sharing lets us use baseline-policy samples for data augmentation, expanding the daily training data from roughly one hundred thousand to a million samples, while sufficient pre-training ensures that a policy converges quickly after going online. Through these efforts, the online reinforcement learning experiment achieved a stable positive effect: with the order rate flat, the weekly click-through rate increased by 0.5%, average dwell time by 0.3%, and browsing depth by 0.3%. The main difference between the modified model and A2C is that we still use a deterministic policy gradient, so we do not have to estimate the action distribution; this is the special case of a stochastic policy whose variance drops to 0. Figure 12 shows that the effect of reinforcement learning is stable, and since the "Guess You Like" ranking model is already an industry-leading streaming-updated DNN model, we consider this improvement significant.

4 Lightweight Real-Time DRL System Based on TensorFlow

Reinforcement learning usually learns through trial and error, and updating the policy in real time while getting prompt feedback greatly improves learning efficiency, especially for continuous policies. This is easy to see in games; correspondingly, we built real-time deep reinforcement learning into the recommendation system to make policy updates more efficient. To support a DRL model with real-time updates and efficient experimentation, we made improvements and optimizations on top of TensorFlow and TF Serving to meet the requirements of online learning, and designed and implemented a DRL framework with configurable features and real-time model updates. Over the course of our experiments, DQN, DDQN, DDPG, A3C, A2C, PPO [8], and other models have been accumulated in this framework. The system architecture is shown in Figure 13:

Figure 13: The real-time-updated reinforcement learning framework

The training workflow is as follows:

  1. The Online Joiner collects features and user feedback from Kafka in real time, joins them into point-wise label-feature samples, and writes the samples to both Kafka and HDFS, supporting online and offline model updates respectively.
  2. The Experience Collector gathers these samples, merges them into list-wise request granularity, and splices them by request timestamp into MC episodes of the form [<s, a, r>]. After the state transitions are computed, each episode is split into TD instances of the form <s, a, r, s'>; samples are output in MC or TD format to support the corresponding RL training (see the sketch after this list).
  3. The Trainer preprocesses the input features and trains the DRL model with TensorFlow.
  4. The Version Controller schedules training tasks to guarantee timeliness and quality, and pushes models that meet the metric requirements to TF Serving and Tair; only the Actor-related parameters are needed for serving. Tair serves as an auxiliary parameter server to make up for TensorFlow's shortcomings in online learning and is described in more detail below.
  5. The Monitor tracks and records the data volume and training metrics throughout the training process and raises online alarms when they do not meet expectations.
  6. Before a new model goes online, it is pre-trained offline on baseline-policy data to learn the state representation and the Value network; after it goes online, the Actor, Advantage, and Value parameters are updated in real time.
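A minimal sketch of the Experience Collector logic in step 2 (the record layout, field names, and discount factor are assumptions, not the production schema):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Step:
    timestamp: int
    state: list     # state features of this page request
    action: float   # fusion adjustment chosen by the Agent
    reward: float   # shaped reward observed for this page

def to_mc_episode(steps: List[Step], gamma: float = 0.9) -> List[Tuple[list, float, float]]:
    """Sort a user's page requests by timestamp and compute MC returns."""
    steps = sorted(steps, key=lambda s: s.timestamp)
    ret, episode = 0.0, []
    for step in reversed(steps):
        ret = step.reward + gamma * ret
        episode.append((step.state, step.action, ret))
    return list(reversed(episode))

def to_td_instances(steps: List[Step]) -> List[Tuple[list, float, float, Optional[list]]]:
    """Split an episode into <s, a, r, s'> transitions for TD training."""
    steps = sorted(steps, key=lambda s: s.timestamp)
    instances = []
    for cur, nxt in zip(steps, steps[1:] + [None]):
        next_state = nxt.state if nxt is not None else None  # terminal page
        instances.append((cur.state, cur.action, cur.reward, next_state))
    return instances
```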

For online prediction, the Agent of the recommendation system fetches the preprocessing parameters from Tair, feeds the processed features to TF Serving for a forward pass, obtains the Action, and intervenes accordingly in the ranked results presented to the user.

Because TensorFlow's support for online learning is weak and TF Serving's processing efficiency was not high enough, we made several improvements:

  • The online feature distribution changes over time. For dense features we maintain our own incremental Z-Score normalization to preprocess them (see the sketch after this list).
  • The input dimension of embedding features also changes frequently, and TF does not support a variable-length input dimension, so we maintain a full ID-to-embedding mapping and, for each training run, have the model load only the high-frequency embeddings present in the current sample set.
  • Item embeddings at the multi-million scale would greatly reduce training and prediction efficiency, so we perform this lookup during preprocessing and feed the resulting matrix directly into the CNN.
  • To improve the efficiency of feature-engineering experiments, the model structure can be generated from a feature configuration.
  • In addition, TF Serving's response time rises sharply in the first minute or two after the serving model is updated, causing many requests to time out. There are two causes: model loading and request handling share one thread pool, so switching models blocks request processing; and graph initialization is lazy, so the first requests after a model update must wait for the graph to initialize. This has a large impact on online learning scenarios, where models are updated frequently. We addressed it by splitting the thread pools and warming up the graph at load time; for the detailed solution and results, see another Meituan technical blog post [9].
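A minimal sketch of an incremental Z-Score normalizer for dense features, based on Welford's online algorithm (the production implementation is not shown here, so treat the details as assumptions):

```python
import math

class IncrementalZScore:
    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0       # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x: float) -> None:
        """Fold one newly observed feature value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x: float) -> float:
        """Return (x - mean) / std using the statistics accumulated so far."""
        var = self.m2 / self.count if self.count > 1 else 0.0
        return (x - self.mean) / math.sqrt(var + self.eps)
```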

5 Summary and Outlook

Reinforcement learning is one of the fastest-developing directions in deep learning, and its combination with recommendation systems and ranking models still has much value to be explored. This article introduced our practice of reinforcement learning in Meituan's "Guess You Like" ranking scenario: MDP modeling that was continuously adjusted to the business scenario, which enabled reinforcement learning to achieve positive gains; improvements to DDPG that enable data augmentation and raise the model's robustness and the experiment efficiency, yielding stable positive gains; and a real-time DRL framework based on TensorFlow that provides the foundation for efficient, parallel policy iteration.

After this period of iterative optimization, we have accumulated some experience with reinforcement learning. Compared with traditional supervised learning, its value mainly shows in the following aspects:

  1. Flexible reward shaping can model a variety of business objectives, including but not limited to click-through rate, conversion rate, GMV, dwell time, and browsing depth; it supports fusing multiple objectives and is directly accountable for the business goal.
  2. Action design leaves plenty of room for imagination: it requires no direct labels, but generates and evaluates strategies through the network, which makes it a good complement to supervised learning and shares some common ground with GANs.
  3. Optimizing for the effect of long-term return on the current decision means that scenarios with frequent interaction between Agent and Environment show the value of reinforcement learning best.

At the same time, reinforcement learning is a branch of machine learning, and much of the usual machine learning experience still applies. For example, data and features determine the upper bound of the effect, while models and algorithms merely approach it; in reinforcement learning the feature work mostly goes into state modeling, so we strongly recommend investing more effort in state modeling and trusting the model's ability to learn from it. Likewise, the idea of using more training data to reduce empirical risk and fewer parameters to reduce structural risk still applies, which is why we believe the DDPG improvements can be extended to the online A/B test scenarios of other businesses. In addition, we also encountered reinforcement learning's sensitivity to randomness during training [10]; to cope with it, we train several groups with different random seeds online at the same time and select the best-performing group of parameters for the actual parameter update.

In the current scheme, the action we chose is adjusting the model fusion parameter, mainly because this is a fairly common setting in ranking problems and is well suited to demonstrating the ability of reinforcement learning; in practice, however, its power to intervene in the ranked results is rather limited. In the future we will explore actions that are strongly tied to user intent, such as the recall quotas of different categories, locations, and price ranges, and adjusting the hidden-layer parameters of the ranking model. We also plan to try Prioritized Sampling to improve sample efficiency and curiosity-driven networks to improve exploration efficiency, addressing the problem of low learning efficiency. We welcome anyone interested in reinforcement learning to contact us to discuss the application and development of reinforcement learning in industry, and we welcome comments and corrections on any errors or omissions in this article.

References

[1] Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., Shah, H. Wide & Deep Learning for Recommender Systems. CoRR, 2016.
[2] Yan, P., Zhou, X., Duan, Y. E-commerce Item Recommendation Based on Field-aware Factorization Machine. In: Proceedings of the 2015 International ACM Recommender Systems Challenge, 2015.
[3] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level Control through Deep Reinforcement Learning. Nature, 2015.
[4] Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D. Continuous Control with Deep Reinforcement Learning. In: International Conference on Learning Representations, 2016.
[5] Wang, Z., de Freitas, N., Lanctot, M. Dueling Network Architectures for Deep Reinforcement Learning. Technical report, 2015.
[6] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016.
[7] Wu, Y., Mansimov, E., Liao, S., Grosse, R., Ba, J. Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation. arXiv preprint arXiv:1708.05144, 2017.
[8] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
[9] Chongda, Hongjie, et al. Online Prediction of Deep Learning Models Based on TensorFlow Serving. Meituan Technical Blog, 2018.
[10] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D. Deep Reinforcement Learning that Matters. arXiv preprint arXiv:1709.06560, 2017.

About the author

Duan Jin joined Meituan-Dianping in 2015 and is currently responsible for applying reinforcement learning in recommendation scenarios.