Recommendation systems face ever higher demands on real-time performance. The real-time recommendation loop can be summarized as follows: the system generates recommendations in response to a user's request, the user gives feedback on the results (buy/click/leave, etc.), and the system makes new recommendations based on that feedback. Two things are worth noting about this process:

  • This can be seen as a process of continuous interaction between the recommender system and the user.
  • The recommendation system needs to respond to user feedback quickly and in a timely manner.

These two aspects are addressed with reinforcement learning and Flink, respectively, but before getting there, some background concepts are needed.

Reinforcement learning

Reinforcement Learning: An Introduction, a well-known textbook in the field of Reinforcement Learning, begins:

When we think about the nature of learning, the first thing that probably comes to mind is the idea that we learn by interacting with our environment. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment.

The core of reinforcement learning is to build an agent that learns continuously by interacting with the environment, in order to maximize the expected reward. It is a very general learning paradigm that can model a wide variety of problems, such as games, robotics, autonomous driving, human-computer interaction, recommendation, health care, and more. The main difference from supervised learning is that reinforcement learning learns by trial and error from delayed feedback, while supervised learning receives explicit feedback at every step.

The following figure shows the interaction between a recommender agent and the environment. Here the user is regarded as the environment, and the user's behavioral data is fed into the agent as the state. The agent takes an action based on this data, i.e., recommends an item to the user. The user's response to the recommendation, such as click/no click or buy/no buy, serves as the reward for a new round. From this we can see that recommendation is a dynamic sequential decision process: the recommended items influence the user, and the user's feedback in turn influences the recommendation system's own decisions. This process continues and forms a sequence.

The word "decision" is actually very representative of reinforcement learning itself. Imagine making decisions in real life: often you have to assess a rapidly changing situation and act accordingly, and you also have to decide with long-term goals in mind rather than just short-term gains. These two points are exactly the criticisms of traditional recommendation methods raised in almost every paper that applies reinforcement learning to recommendation, namely that recommendation is treated as a static prediction process and only short-term benefits are considered. Of course, this framing mainly serves to highlight the papers' own contributions; the real situation is far less simple.

The ultimate goal of reinforcement learning is to learn a policy $\pi(a \mid s)$ that maximizes the expected reward. A policy is how the agent decides its next action $a$ given the environment state $s$. In the recommendation setting, it decides the next recommended item based on the user's past behavior.

At present there are two main ways to train toward the optimal policy: on-policy and off-policy. Unlike supervised learning, where data is collected and labeled manually in advance, reinforcement learning collects data by continuously interacting with the environment and then updates the model with it. Two policies are therefore involved: the one used to interact with the environment, and the one being updated during training. On-policy means the interaction policy and the policy being updated are the same policy; off-policy means they are different. In the figure below, the left panel is on-policy and the middle panel is off-policy (the right panel is the offline setting, described later).

The idea of on-policy is intuitive: the agent learns by trial and error directly in the environment. Its main problem is low sample efficiency, which leads to slow training. After a policy interacts with the environment and the collected data is used to update the model, a new policy is produced; the data collected under the old policy may then no longer match the distribution induced by the new policy, so it can no longer be used and is discarded.

Off-policy methods alleviate this problem by storing data collected under previous policies in an experience replay buffer and sampling from it for training. Why can off-policy methods train on data generated by old policies? Since data from a different distribution cannot be used directly, the trick is to adjust the distribution so that it is close to that of the current policy. Off-policy algorithms therefore generally use importance sampling to assign different weights to different samples; this comes up again in the discussion of YouTube's recommender below.

So which kind of reinforcement learning does this article use? It is hard to say. There is no interactive environment here, only a static dataset, so off-policy seems more suitable; but even off-policy methods usually still interact with the environment to keep generating new training data. The method in this article therefore belongs to batch reinforcement learning, also called offline reinforcement learning. I prefer the word batch, because offline and off-policy are easily confused. The right panel of the figure above shows batch (offline) reinforcement learning: a batch of data is collected once, only that batch is used for training, and there is no interaction with the environment before deployment.

We know that deep learning has made great progress in computer vision and NLP in recent years, one main reason being the explosion of computing power and data. However, mainstream deep reinforcement learning algorithms need to keep interacting with the environment to obtain data, which usually means collecting data while training. Many domains cannot interact with the environment as freely as standard reinforcement learning settings, because the cost is high, safety is a concern, or ethical problems arise; typical examples are autonomous driving and healthcare.

So it is natural to ask: can reinforcement learning be trained by collecting a fixed dataset once, reusing it without collecting new data, and working wonders on fixed datasets the way deep learning does? Batch reinforcement learning has therefore attracted growing attention in academia and industry in recent years and is widely regarded as an effective path toward large-scale practical applications of reinforcement learning. Recommendation is well suited to this mode: direct online exploration is too costly and hurts the user experience, but user behavior logs are relatively easy to collect and plentiful.

Flink

On the other hand, a recommendation system is a system; the algorithm alone is not enough. As mentioned above, batch reinforcement learning does not interact with the environment and can be trained from a dataset alone, so interaction with the environment only happens after the model is trained and actually goes online. In this article, Flink is mainly used for that online part. Flink is designed for "Stateful Computations over Data Streams."

In other words, as data continuously flows in, Flink can store earlier data and intermediate results and access them for computation once certain conditions are met. For our reinforcement learning model, a certain amount of user behavior must accumulate before it can be used as model input for recommendation, so recent behavior data needs to be kept in Flink in real time; this relies on Flink's powerful state management.
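For illustration only, here is a rough PyFlink-style sketch of this pattern. It is not taken from the FlinkRL repository (which may use a different API entirely); the state name, tuple layout, and window size are made up:

```python
from pyflink.common.typeinfo import Types
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ListStateDescriptor

class BehaviorAccumulator(KeyedProcessFunction):
    """Keeps the last n item ids per user in Flink keyed state and emits
    the sequence once enough behavior has accumulated for the model."""

    def __init__(self, n_items: int = 10):
        self.n_items = n_items
        self.recent = None

    def open(self, runtime_context: RuntimeContext):
        # keyed ListState: one list of recent item ids per user
        self.recent = runtime_context.get_list_state(
            ListStateDescriptor("recent_items", Types.LONG()))

    def process_element(self, value, ctx):
        user_id, item_id = value                   # assumed (user_id, item_id) tuples
        items = list(self.recent.get() or []) + [item_id]
        items = items[-self.n_items:]              # sliding window of recent behavior
        self.recent.update(items)
        if len(items) == self.n_items:
            yield user_id, items                   # ready to be sent to the model
```

Keyed state like this is what lets Flink remember each user's recent behavior between events without any external storage.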

In addition, the deep learning framework used for offline training is PyTorch, which is not as easy to deploy as TensorFlow, so the popular FastAPI is used to wrap the model as an API service. Once features that meet the conditions are assembled in Flink, the service is called directly for inference. The generated recommendations are stored in a database, and the server can fetch them directly from the database on the user's next request. The overall architecture is shown in the figure below; see the complete code and workflow in the repositories (a rough sketch of such an inference service follows the links):

FlinkRL: github.com/massquantit…
DBRL: github.com/massquantit…
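As an illustration only (not the actual code in the repositories above), a minimal FastAPI inference service wrapping the trained PyTorch policy might look roughly like this; the endpoint path, request schema, and model file are hypothetical:

```python
from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# hypothetical: a TorchScript policy exported after offline training
model = torch.jit.load("policy.pt")
model.eval()

class State(BaseModel):
    user_id: int
    item_seq: List[int]   # recent item ids accumulated in Flink

@app.post("/recommend")
def recommend(state: State, k: int = 10):
    with torch.no_grad():
        seq = torch.tensor([state.item_seq], dtype=torch.long)
        scores = model(seq)                              # scores over all candidate items
        top_items = torch.topk(scores, k, dim=-1).indices.squeeze(0)
    return {"user_id": state.user_id, "items": top_items.tolist()}
```

The recommendations returned here would then be written to the database, as described above.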

The following three algorithms are introduced. Due to space limitations, only the main ideas are sketched here; for details, please refer to the original papers. The main symbol table is shown below:

YouTube Top-K (REINFORCE)

This method mainly follows the paper Top-K Off-Policy Correction for a REINFORCE Recommender System published by YouTube in 2018. The authors claim in a talk video that this method produced the largest gains they had seen in almost two years, which I am honestly a little skeptical of. The experiment section mentions that this reinforcement learning model is only one of many candidate-generation (recall) models, and all recalled items then go through an independent ranking module before being shown to users. The paper does not say what model the ranking module uses, so there is considerable room for maneuver in that module.

The paper uses REINFORCE, the oldest algorithm in the policy gradient family, and adapts it to their business setting. Let us first look at the basic REINFORCE framework.

Assume we execute a stochastic policy, and the agent's interaction with the environment produces a trajectory $\tau = (s_0, a_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, s_T, r_T)$. In deep reinforcement learning, the policy $\pi$ is generally parameterized with a neural network, and multiple trajectories are sampled from the environment. The expected total return of the policy is then:
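The equation here was an image in the original post; reconstructed from the surrounding text, expression (1.1) should be the standard expected return:

$$
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big] \tag{1.1}
$$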

where $\theta$ is the parameter of the neural network and $R(\tau)$ is the total return of trajectory $\tau$. Since trajectories are random, we maximize the expected return to obtain the optimal policy $\pi^*$:

The idea of REINFORCE, and of policy gradient methods in general, is the same as many algorithms in supervised learning: use gradient ascent (descent) to obtain the parameters $\theta$, with formula (1.1) as the objective function. Once the gradient is available, the familiar update rule can be used:
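The update itself was shown as an image in the original; it is presumably the standard gradient ascent step:

$$
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\pi_\theta)
$$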

It is very difficult to compute the gradient of $J(\pi_\theta)$ directly, but through the policy gradient theorem we can obtain an approximation:
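This formula was also an image; reconstructed from the surrounding definitions (the standard REINFORCE estimator over sampled trajectories), it should read roughly:

$$
\nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{|\tau_n|} R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$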

where $R_t = \sum_{t'=t}^{|\tau|} \gamma^{\,t'-t} \, r(s_{t'}, a_{t'})$, meaning that the return attributed to the action taken at time $t$ only involves the rewards obtained afterwards and has nothing to do with earlier rewards. A proof of the policy gradient theorem is given in Appendix A.

The original REINFORCE algorithm is on-policy, i.e., the policy interacting online is the same one being optimized, whereas the paper states that their data come from a different policy, or even a mixture of different policies. This makes the data distributions inconsistent, and naively applying REINFORCE would introduce a large bias. The paper turns the algorithm into an off-policy one through importance sampling:

where $\beta$ is the actual interaction (behavior) policy. This equation can also be derived directly via the policy gradient; see Appendix B for details. After a series of trade-offs, the authors argue that the following formula strikes a reasonable balance between bias and variance:
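The formula was an image in the original; based on the description in the next paragraph (trajectories sampled from $\beta$ with a per-step importance weight), equation (1.3) should be approximately:

$$
\nabla_\theta J \approx \mathbb{E}_{\tau \sim \beta}\left[ \sum_{t=0}^{|\tau|} \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)} \, R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \tag{1.3}
$$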

The main differences are that trajectories are now sampled from $\beta$, and that an importance weight $\frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)}$ is attached at each step $t$; this weight is relatively easy to compute.

Another problem considered in the paper is that so far only a single action has been considered, i.e., recommending one item, whereas in practice $k$ items are usually recommended to a user at once. Considering all $k$ items jointly would cause a combinatorial explosion. The paper assumes that the total reward of showing $k$ non-repeated items at once equals the sum of the rewards of the individual items, so the problem can be reduced to optimizing for individual items; the authors note that this assumption holds only if the user examines each item in the recommended list independently.

In 2019 YouTube published another reinforcement learning paper, Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology. Compared with the 2018 method, the main differences are that the on-policy SARSA algorithm is used, and that it is applied in the ranking stage rather than the recall stage.

This paper also makes an assumption about the $k$ recommended items: a user consumes only one item from a recommended list, and if the user returns to the list and consumes a second item, that is treated as a separate event and logged independently. In essence, both papers' assumptions say that a user facing a list of $k$ items focuses on only one of them and ignores the rest, whereas in many cases the user is interested in more than one item but, after consuming one, leaves or forgets the rest. The extreme case: you recommend 10 items and the user is interested in all of them, but after consuming one of them the user leaves or gets drawn into that item's click-through chain, so the other 9 end up treated as negative samples.

Here we can see the fundamental reason both papers have to make such assumptions: the algorithms they use output discrete actions, i.e., a single item, while the recommendation setting differs from typical reinforcement learning applications that only need to output one action at a time. Hence the somewhat awkward-looking assumptions. The two algorithms that follow output continuous actions, so other solutions are available.

Following the above assumption, the $k$-item case can be reduced to single items; combined with the importance weight, Equation (1.3) becomes:

The main change in (1.4) is that $\alpha_\theta(a_t \mid s_t)$ replaces the original $\pi_\theta(a_t \mid s_t)$, because the $k$ items are sampled independently from the policy $\pi_\theta$. Here $\alpha_\theta(a_t \mid s_t) = 1 - (1 - \pi_\theta(a_t \mid s_t))^K$ is the probability that item $a_t$ appears in the final list of $K$ items, since $(1 - \pi_\theta(a_t \mid s_t))^K$ is the probability of never sampling it in $K$ draws.

Comparing (1.3) and (1.4), the only difference is the extra multiplier $\frac{\partial \alpha_\theta(a_t \mid s_t)}{\partial \pi_\theta(a_t \mid s_t)} = K \big(1 - \pi_\theta(a_t \mid s_t)\big)^{K-1}$. So we only need to sample a trajectory, compute $\pi_\theta(a \mid s)$ and $\beta(a \mid s)$ at each step $t$ with the neural network (both are actually softmax probabilities output by the network), plug in the discounted return $R_t = \sum_{t'=t}^{|\tau|} \gamma^{\,t'-t} r(s_{t'}, a_{t'})$, and the algorithm follows. Finally, see the code (a small sketch of the loss is given after the link):

Github.com/massquantit…
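To make the update concrete, here is a minimal PyTorch sketch of the per-step loss implied by (1.3) and (1.4). It is not the repository's code; the function name and tensor layout are my own assumptions:

```python
import torch

def topk_off_policy_loss(log_pi, log_beta, actions, returns, k):
    """Top-K off-policy corrected REINFORCE loss for one batch of logged steps.

    log_pi:   (batch, n_items) log-softmax output of the trained policy pi_theta
    log_beta: (batch, n_items) log-softmax output of the behavior policy beta
    actions:  (batch,) long tensor of item indices actually taken in the log
    returns:  (batch,) discounted returns R_t
    k:        size of the recommended list
    """
    idx = actions.unsqueeze(1)
    log_pi_a = log_pi.gather(1, idx).squeeze(1)
    log_beta_a = log_beta.gather(1, idx).squeeze(1)
    pi_a = log_pi_a.exp()

    # importance weight pi/beta and top-K multiplier K(1-pi)^(K-1); both are
    # treated as constants, so gradients flow only through log_pi_a
    weight = (log_pi_a - log_beta_a).exp().detach()
    topk_mult = (k * (1.0 - pi_a) ** (k - 1)).detach()

    # REINFORCE maximizes the weighted sum of R_t * log pi, so minimize the negative
    return -(weight * topk_mult * returns * log_pi_a).mean()
```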

DDPG

In a recommendation setting, discrete actions are the natural choice, with each action corresponding to an item. However, the actual number of items can easily be in the millions, which means a huge action space and high computational cost for the softmax. The YouTube paper handles this with sampled softmax during training; in this article we instead output a continuous vector $a \in \mathbb{R}^d$ and take its dot product with each item's embedding, then sort to obtain the recommendation list. Online, efficient nearest-neighbor search can be used to retrieve items. For continuous actions, DDPG is a fairly common choice, and it is the algorithm I have seen most often in recommendation-related papers, e.g., two papers from JD.com [1][2], one from Alibaba [1], and one from Huawei [1].

DDPG, short for Deep Deterministic Policy Gradient, is an off-policy algorithm suited to continuous actions. Unlike REINFORCE, it uses a deterministic policy which, as the name suggests, produces a unique action $a$ for the same state $s$, so it is written $\mu(s)$ here. Since the policy is deterministic, there is no probability distribution over actions $a$, and importance sampling is not needed.

DDPG adopts the actor-critic framework. The actor is the policy itself: its input is a state $s$ and its output an action $a$. The critic evaluates how good the policy is: its input is the concatenation of $s$ and $a$ and its output is a scalar. Both can be parameterized with neural networks; denote the actor network $\mu(s \mid \theta^\mu)$ and the critic network $Q(s, a \mid \theta^Q)$. Their objective functions and gradients are:
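The objectives were images in the original; the standard DDPG formulation, which the surrounding text (target networks $\mu'$, $Q'$) appears to use, is:

$$
L(\theta^Q) = \mathbb{E}\Big[\big( r + \gamma \, Q'\big(s', \mu'(s' \mid \theta^{\mu'}) \mid \theta^{Q'}\big) - Q(s, a \mid \theta^Q) \big)^2\Big]
$$

$$
\nabla_{\theta^\mu} J \approx \mathbb{E}\Big[ \nabla_a Q(s, a \mid \theta^Q)\big|_{a = \mu(s)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \Big]
$$

where $\mu'$ and $Q'$ denote the target actor and target critic networks.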

The core of the algorithm is then to optimize these two objectives by gradient ascent (descent) to obtain the final parameters and hence the optimal policy. Other DDPG implementation details such as target networks and soft updates are not covered here. Because we use a fixed dataset, it suffices to convert the data into the input format DDPG expects and then train in mini-batches as in supervised learning, without interacting with the environment. See the final code (a small sketch of one update step follows the link):

Github.com/massquantit…
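As a rough sketch of that training loop (again not the repository's actual code; the network and optimizer objects are assumed to exist, and `batch` is a tuple of tensors sampled from the logged data), one mini-batch DDPG update might look like this:

```python
import torch
import torch.nn.functional as F

def ddpg_batch_update(batch, actor, critic, target_actor, target_critic,
                      actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One mini-batch DDPG update on logged data, with no environment interaction.
    `batch` is assumed to be (s, a, r, s_next) tensors, with r of shape (batch, 1)."""
    s, a, r, s_next = batch

    # critic: minimize the TD error against the target networks
    with torch.no_grad():
        target_q = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # actor: maximize Q(s, mu(s)), i.e. minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # soft-update the target networks toward the online networks
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```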

BCQ

The full name of the BCQ algorithm is Batch-Constrained Deep Q-Learning, from the paper Off-Policy Deep Reinforcement Learning without Exploration. BCQ can be seen as a modified version of DDPG for the batch (offline) scenario. As mentioned above, batch (offline) reinforcement learning means learning on a fixed dataset without interacting with the environment. The authors point out that popular off-policy algorithms such as DQN and DDPG may not work well under these conditions, mainly because of extrapolation error.

Extrapolation error is mainly caused by a mismatch between the distribution of state-action pairs $(s, a)$ in the dataset and the distribution induced by the current policy: the sampling (behavior) policy differs greatly from the current policy, so the value function is estimated inaccurately and learning suffers. Take the objective function of the DDPG critic network above as an example (with some symbols changed):

Since $\mu'$ is itself a neural network, if $\mu'(s')$ outputs an action that does not appear in the dataset, then $Q'(s', \mu'(s'))$ is likely to misestimate the value of that state-action pair, and no good policy will be learned. As shown in the figure below (source), if the action $a$ falls outside the distribution of the behavior policy $\beta$, the $Q$ value may be overestimated, and the error accumulates in subsequent updates.

I did run into a similar situation when actually training DDPG: the critic loss sometimes exploded to the order of 1e8, and no matter how low I set the learning rate it did not help. It later turned out the likely cause was that, at first, I averaged the multiple item vectors a user had interacted with to represent the state $s$. The averaged vector may not look similar to any single item vector, i.e., it lies far from the original data distribution, so the action output by $\mu'(s')$ naturally differs greatly from the actions in the dataset, and the $Q$ value blows up link by link. After switching to directly concatenating the item vectors, the problem went away.

The authors also note that common off-policy algorithms such as DQN and DDPG do not account for extrapolation error. Why, then, are they effective in standard reinforcement learning tasks? Because these algorithms essentially use a growing-batch training scheme: after training offline on a batch of data for a while, the trained policy is still used to collect new data in the environment for further training. The sampled data is thus never very different from the current policy, so extrapolation error can be ignored. In a fully offline setting, however, the dataset may well have been collected by a completely different policy, and the impact of extrapolation error becomes significant.

The crux of the matter is therefore how to avoid state-action combinations unseen in the data that give rise to extrapolation error. The method in the paper uses a generative model to produce state-action combinations similar to those in the dataset, plus a perturbation model that adds noise to the generated actions to increase diversity. The generative model is a variational auto-encoder (VAE); the perturbation model is an ordinary fully connected network whose input is a generated action $a$ and whose output is a new action within the range $[-\Phi, \Phi]$, where $\Phi$ is the maximum action value. For continuous actions we usually set such an upper bound to avoid outputting overly large action values.

Together, these two models can be regarded as the policy used by the BCQ algorithm, i.e., the actor, while the critic part is not much different from DDPG. See the complete code (a small sketch of the action-selection step follows the link):

Github.com/massquantit…
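For intuition, here is a rough sketch of BCQ-style action selection at inference time. It is not the repository's code, and all names (including the `latent_dim` attribute) are hypothetical:

```python
import torch

@torch.no_grad()
def bcq_select_action(state, vae_decoder, perturb_net, critic,
                      n_candidates=10, max_action=1.0):
    """BCQ-style action selection (a sketch):
    1) the VAE decoder generates candidate actions that resemble the dataset,
    2) the perturbation network nudges them slightly (output assumed in [-phi, phi]),
    3) the critic picks the candidate with the highest Q value."""
    s = state.unsqueeze(0).repeat(n_candidates, 1)          # score several candidates at once
    z = torch.randn(n_candidates, vae_decoder.latent_dim).clamp(-0.5, 0.5)
    candidates = vae_decoder(s, z)                           # actions close to the data distribution
    candidates = (candidates + perturb_net(s, candidates)).clamp(-max_action, max_action)
    q_values = critic(s, candidates).squeeze(-1)
    return candidates[q_values.argmax()]
```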

Appendix A: Policy Gradient Theorem

According to Equation (1.1), the objective function is:
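The derivation itself appeared as images in the original; the standard argument, which presumably matches what was shown, goes as follows:

$$
J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] = \sum_\tau P(\tau \mid \theta)\, R(\tau)
$$

$$
\nabla_\theta J(\pi_\theta)
= \sum_\tau \nabla_\theta P(\tau \mid \theta)\, R(\tau)
= \sum_\tau P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)
= \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \big]
$$

Since $P(\tau \mid \theta) = \rho_0(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$ and the initial-state and transition terms do not depend on $\theta$,

$$
\nabla_\theta \log P(\tau \mid \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big],
$$

and replacing $R(\tau)$ with the reward-to-go $R_t$ gives the estimator used in the main text.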

Appendix B: Importance Sampling

Assume the trajectory distribution of the actual interaction policy $\beta$ is $Q_\phi(\tau)$, and apply importance sampling to Equation (A.2):
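The formula was an image; the standard importance sampling identity, which the text appears to refer to, is:

$$
J(\pi_\theta) = \mathbb{E}_{\tau \sim P_\theta}\big[R(\tau)\big]
= \mathbb{E}_{\tau \sim Q_\phi}\left[ \frac{P_\theta(\tau)}{Q_\phi(\tau)}\, R(\tau) \right]
$$

where $P_\theta(\tau)$ is the trajectory distribution under $\pi_\theta$.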

The rest of the derivation then proceeds as in Appendix A.

References

  • Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction
  • Minmin Chen, et al. Top-K Off-Policy Correction for a REINFORCE Recommender System
  • Xiangyu Zhao, et al. Deep Reinforcement Learning for List-wise Recommendations
  • Scott Fujimoto, et al. Off-Policy Deep Reinforcement Learning without Exploration
  • Sergey Levine, Aviral Kumar, et al. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
  • Eugene Ie, Vihan Jain, et al. Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

The article is reprinted from the blog by Massquantity.

The original link: www.cnblogs.com/massquantit…