[Editor’s Note] Researchers from the Social Computing group at Microsoft Research Asia forecast the future development of recommendation systems from five perspectives: deep learning, knowledge graphs, reinforcement learning, user profiling, and explainable recommendation.

In the first two articles, we introduced the applications of deep learning and knowledge graphs in recommendation systems, along with possible future research directions. In today’s article, we introduce the application of reinforcement learning in recommendation systems.


By combining deep learning and knowledge graph techniques, the performance of recommendation systems has improved greatly. However, most recommendation systems are still built in a one-shot fashion: with full access to historical data, a specific supervised model is designed and trained to gauge users’ preferences for different items. Once deployed, the trained model identifies the most attractive items for each user and makes personalized recommendations accordingly. This approach assumes that user data has been fully collected and that user behavior will remain stable over a long period, so that a model built in this way can meet actual needs.



However, in many realistic scenarios, such as e-commerce or online news platforms, users interact with the recommendation system continuously and closely. In this process, user feedback compensates for missing data and powerfully reveals users’ current behavioral characteristics, providing an important basis for the system to make more accurate personalized recommendations.

Reinforcement learning provides strong support for solving this problem. Based on the behavioral characteristics of users, we divide recommendation scenarios into static and dynamic ones, and discuss each in turn.

1. Reinforcement learning for recommendation in static scenarios

In a static scenario, a user’s behavioral characteristics remain stable throughout the interaction with the system. Recommendation systems based on the contextual multi-armed bandit provide an effective solution for overcoming the cold-start problem in such scenarios.

In many practical applications, users’ historical behavior tends to follow a long-tail distribution: most users generate only limited historical data, while a few users generate abundant data. The resulting data sparsity makes it difficult for traditional models to achieve satisfactory results in practice.

A direct solution is therefore to actively explore user behavior, that is, to issue a large number of trial recommendations to users in order to fully collect their behavioral data and ensure the usability of the recommendation system. Unfortunately, this naive approach incurs heavy exploration overhead, making it impractical in the real world.

To make active exploration feasible at an acceptable utility cost, one can draw inspiration from the multi-armed bandit problem. The multi-armed bandit problem aims to make an optimal tradeoff between exploration and exploitation, and many classical algorithms have been proposed for it. Although different algorithms use different mechanisms, they are all designed around a common principle.

Specifically, when making a recommendation, the system considers both an item’s estimated utility and its cumulative number of attempts. Higher estimated utility implies lower exploration cost, while fewer cumulative attempts imply higher uncertainty. Each algorithm then designs a specific mechanism to integrate the two, so that items with high estimated utility and high uncertainty are tried first.
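The principle above can be illustrated with a minimal UCB1-style sketch (the item names and the exploration constant `c` are illustrative assumptions, not from the original text): each item’s score is its estimated utility plus an uncertainty bonus that shrinks as the item accumulates attempts.

```python
import math

def ucb1_select(utilities, counts, total_plays, c=2.0):
    """Pick the item maximizing estimated utility plus an uncertainty
    bonus; the bonus shrinks as an item's attempt count grows."""
    best_item, best_score = None, float("-inf")
    for item, utility in utilities.items():
        if counts[item] == 0:
            return item  # untried items are explored first
        bonus = math.sqrt(c * math.log(total_plays) / counts[item])
        score = utility + bonus
        if score > best_score:
            best_item, best_score = item, score
    return best_item
```

With comparable utilities, the rarely tried item wins on its uncertainty bonus; once attempt counts even out, the higher-utility item dominates, which is exactly the exploration–exploitation tradeoff described above.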


2. Reinforcement learning for recommendation in dynamic scenarios

In the multi-armed bandit setting, users’ real-time characteristics are assumed to be fixed, so the algorithms do not model dynamic shifts in user behavior. However, in many realistic recommendation scenarios, user behavior changes during the interaction. This requires the recommendation system to accurately track how the user’s state evolves based on feedback, and to formulate an optimal recommendation policy accordingly.

Specifically, an ideal recommendation system should satisfy two properties. On the one hand, recommendation decisions should be fully based on users’ past feedback data; on the other hand, the system needs to optimize the global return over the entire interaction process. Reinforcement learning provides powerful technical support for achieving these goals.

Under the reinforcement learning framework, the recommendation system is regarded as an agent, the current behavioral characteristics of a user are abstracted into a state, and the objects to be recommended (such as candidate news articles) are regarded as actions. In each recommendation interaction, the system selects an appropriate action according to the user’s state in order to maximize a specific long-term objective (such as total clicks or dwell time). The behavioral data generated during the interaction between the system and users are organized into experiences, which record the rewards and state transitions produced by the corresponding actions. From accumulated experience, the reinforcement learning algorithm learns a policy that guides optimal action selection in a given state.
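The agent–state–action–experience mapping above can be sketched as a tabular Q-learning recommender with an experience replay buffer. This is a minimal illustration, not the DRN model discussed below; the state labels, item names, and hyperparameters are all assumptions for the example.

```python
import random
from collections import defaultdict, deque

class ReplayRecommender:
    """Tabular Q-learning agent: states are user-behavior summaries,
    actions are candidate items, rewards are click signals."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)          # Q[(state, action)]
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.replay = deque(maxlen=10_000)   # experience buffer

    def act(self, state):
        if random.random() < self.epsilon:   # explore occasionally
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def store(self, state, action, reward, next_state):
        """Record one interaction as an experience tuple."""
        self.replay.append((state, action, reward, next_state))

    def learn(self, batch_size=32):
        """Sample past experiences and update Q toward the
        reward plus discounted value of the next state."""
        batch = random.sample(self.replay, min(batch_size, len(self.replay)))
        for s, a, r, s2 in batch:
            target = r + self.gamma * max(self.q[(s2, a2)] for a2 in self.actions)
            self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```

Feeding the agent experiences where one item is consistently clicked and another ignored makes its greedy policy converge to recommending the clicked item, which mirrors the policy-learning loop described above.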

Recently, we successfully applied this framework to news recommendation (DRN: A Deep Reinforcement Learning Framework for News Recommendation, WWW 2018). Thanks to its sequential decision-making ability and its optimization of long-term objectives, reinforcement learning will serve an ever wider range of real-world scenarios, greatly improving the perceptiveness and personalization of recommendation systems.

Opportunities and challenges for reinforcement learning in recommendation

Many challenging problems in reinforcement-learning-based recommendation algorithms remain to be solved.

Current mainstream deep reinforcement learning algorithms avoid modeling the environment and learn the policy directly (i.e., they are model-free). This requires a large amount of experience data to obtain an optimal recommendation policy. However, the interaction data available in recommendation scenarios are often limited in scale and sparse in rewards, which makes it difficult to obtain satisfactory results by simply applying existing algorithms. How to learn an effective decision model from limited user interactions will be a major direction for further algorithmic improvement.

In addition, in practice one often needs to learn an independent policy for each recommendation scenario. Policies vary from scenario to scenario, which requires considerable effort to collect sufficient data for each one. Moreover, because they lack generality, existing policies are difficult to adapt quickly to new recommendation scenarios. Facing these challenges, it is necessary to propose learning mechanisms for policies that are as general as possible, so as to break down the barriers between different recommendation scenarios and enhance robustness under changing conditions.

In the next article, we will discuss research on user profiling in recommendation systems. If you want to learn more about research hotspots in recommendation systems, stay tuned.

Related reading:

Five research hotspots of personalized recommendation systems (2): Knowledge graphs

Five research hotspots of personalized recommendation systems (1): Deep learning

Build a recommendation system quickly, in just five steps!

If you found this useful, please like, bookmark, and share it with friends; your recognition guides our efforts.

This account is the official account of 4Paradigm’s intelligent recommendation product. It focuses on the computer science field, especially cutting-edge research related to artificial intelligence, aiming to share AI knowledge with the public and promote public understanding of artificial intelligence from a professional perspective. At the same time, we hope to provide an open platform for discussion, communication, and learning for those working in artificial intelligence, so that everyone can benefit from the value it creates as soon as possible.

Every member of 4Paradigm has contributed to bringing artificial intelligence into real-world use. On this account, you can read about academic frontiers, technical knowledge, and industry news from the computer science field.

For more information, please search for and follow our official Weibo and WeChat account (ID: DSFSXJ).