Editor’s note: Reinforcement learning is an area of machine learning concerned with how an agent should act in its environment to maximize expected reward. In recent years there has been a wave of large-scale research on reinforcement learning, and achievements such as AlphaGo have not only stirred the academic community but also attracted media attention. So is reinforcement learning really the new hope for artificial intelligence? Over the Spring Festival, software engineer Alex Irpan drew on papers from Berkeley, Google Brain, DeepMind, and OpenAI from the past few years to detail the weaknesses and limitations of deep reinforcement learning.

Once, I posted this on Facebook:

Whenever people ask me if reinforcement learning can solve their problems, I say “no”. And I’ve found that this answer is correct at least 70 percent of the time.

Deep reinforcement learning (Deep RL) is surrounded by a lot of media hype these days. On the positive side, reinforcement learning (RL) is close to a catch-all, and when it works, it works incredibly well. In theory, a powerful, high-performance RL system should be able to solve almost any problem, and it is natural to combine it with the ideas of deep learning. As things stand, Deep RL is one of the fields that looks closest to AGI, fueling the “AI dream” that has attracted billions of dollars in investment.

Unfortunately, Deep RL still has many limitations.

I believe Deep RL has a great future; if I didn’t, I wouldn’t have chosen it as my research direction. But to be honest, Deep RL still has a lot of problems, and many of them are hard to fix at the root. On the surface it may look as if an agent has done its job beautifully, but only we know the blood, sweat and tears behind it.

On several occasions I have seen people intrigued by recent Deep RL work take the plunge, and they invariably underestimate its difficulties. The “toy problems” are not as easy as they look, and an introduction to reinforcement learning will probably mean a string of total failures until those who stumble learn how to set realistic research expectations.

I hope to see more deep RL research in the future and a steady flow of new blood into the field, but I also hope these newcomers really know what kind of world they are entering.

The worrying sample efficiency of deep reinforcement learning

Atari games are one of the best known benchmarks for deep reinforcement learning. As the Deep Q-Networks paper shows, if Q-learning is combined with a reasonably sized neural network, along with some optimization techniques, researchers can achieve human or superhuman performance in several Atari games.
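
As a reminder of what these agents are actually optimizing, here is a minimal sketch of the one-step Q-learning target that a DQN-style network is regressed toward. This is my own simplified illustration, not DeepMind’s code, and the array names are assumptions:

```python
import numpy as np

def dqn_td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step TD targets for Q-learning: r + gamma * max_a' Q(s', a').

    rewards:       (batch,) array of rewards
    next_q_values: (batch, n_actions) Q-values from the target network at s'
    dones:         (batch,) 1.0 where the episode ended, else 0.0
    """
    max_next_q = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next_q

# Toy usage with made-up numbers, purely for illustration.
rewards = np.array([1.0, 0.0])
next_q = np.array([[0.5, 2.0], [1.0, -1.0]])
dones = np.array([0.0, 1.0])
print(dqn_td_targets(rewards, next_q, dones))  # [2.98, 0.0]
```

The network is then trained to make Q(s, a) match these targets, which is what makes the approach “Q-learning with a reasonably sized neural network”.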

Atari games run at 60 frames per second, so just imagine: how many frames would it take for the most advanced DQN to reach human performance?

The answer depends on the game. Let’s take a look at a recent DeepMind paper, Rainbow: Combining Improvements in Deep Reinforcement Learning. This paper makes incremental improvements to the original DQN architecture and shows that the combined RainbowDQN performs better. In the experiment, the agents played 57 Atari games and outperformed human players in 40 of them.

The color curve at the top is RainbowDQN

The Y-axis is the median human-normalized score. The researchers computed it from the DQN variants’ performance across the 57 games, scaling each score against human performance. As you can see, the RainbowDQN curve crosses 100% at about 18 million frames, that is, the point where it surpasses humans. That corresponds to roughly 83 hours of play experience, plus however long it takes to train the model, whereas most human players can pick up an Atari game within a few minutes.
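
To make the two numbers on that axis concrete, here is the arithmetic behind the 83-hour figure, together with the human-normalized score the y-axis is assumed to use (the standard Atari convention; the exact normalization is my assumption, not spelled out above):

```latex
\frac{18 \times 10^{6}\ \text{frames}}{60\ \text{frames/s} \times 3600\ \text{s/h}} \approx 83\ \text{hours},
\qquad
\text{normalized score} \;=\; \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}}.
```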

It should be noted that RainbowDQN’s 18 million frames is quite a breakthrough compared to the 70 million frames of Distributional DQN (orange line). Just three years ago, Nature published a reinforcement learning paper describing the original DQN (yellow line), whose performance did not reach 100% even after 200 million frames.

Kahneman and Tversky once proposed a concept called the planning fallacy: people are optimistic about getting something done and underestimate how long it will take. Deep RL has its own planning fallacy – learning a policy often requires far more samples than previously thought.

In fact, Atari games aren’t the only problem. Another popular benchmark in reinforcement learning is the MuJoCo benchmark, a set of tasks in the MuJoCo physics simulator. In these tasks, the input to the system is usually the position and velocity of each joint of a simulated robot. But even for such simple tasks, the system typically takes 10^5 to 10^7 steps to learn, and the amount of experience it requires is staggering.

The following is a demonstration of DeepMind’s parkour robot. The researchers describe in their paper, Emergence of Locomotion Behaviours in Rich Environments, that the experiment used 64 workers for over 100 hours. Although they don’t explain what a worker is, I assume a worker is equivalent to one CPU.

What DeepMind has done is amazing. When this video was first released, I was surprised that reinforcement learning could teach a robot to run. But after reading the paper, the 6,400 hours of CPU time is a little frustrating. It’s not that I think it takes too long in absolute terms; it’s that Deep RL’s actual sample efficiency is several orders of magnitude worse than expected, which is even more disappointing.

Here’s the question: what happens if we simply ignore sample efficiency? Sometimes experience is cheap to generate, and games are a good example. But when it isn’t, reinforcement learning is in a tough spot. Unfortunately, most real-world tasks fall into the latter category.

Other approaches work better if you only care about final performance

Researchers often have to make trade-offs when tackling a research problem. On the one hand, they can optimize purely for the best performance on the task at hand; on the other, they can optimize for scientific value by building on prior work, even if the final performance is not the best. The ideal is to have both optimal performance and maximal contribution, but you can’t always have your cake and eat it too.

When it comes to raw final performance, Deep RL’s record is somewhat disappointing, because it is regularly beaten by other methods. Below is a video of a MuJoCo robot controlled by online trajectory optimization, where the calculations are performed online in near real time without offline training. It should be noted that this is a result from 2012.

This video can be compared with the parkour video. The biggest difference between the two papers is that this one uses model predictive control, which plans against a ground-truth world model (the physics simulator), whereas the model-free reinforcement learning system does no such planning and therefore has a much harder time learning. In other words, if planning directly against a model works better, why bother training an RL policy?

Similarly, off-the-shelf Monte Carlo tree search easily outperforms DQN in Atari games. In 2014, the paper Deep Learning for Real-time Atari Game Play Using Offline Monte-Carlo Tree Search Planning by the University of Michigan was accepted at NIPS. It looked at the effect of using offline Monte Carlo tree search planning in real-time Atari games. As shown in the chart below, the researchers compared the scores of DQN with those of a UCT agent (the standard version of MCTS today) and found that the latter performed better.
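
For context, the UCT agent referenced here descends the search tree using the standard UCB1 selection rule. Below is a minimal sketch of that selection score, my own simplified version rather than the paper’s implementation:

```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c=1.414):
    """UCB1-style score used by UCT to pick which child node to descend into.

    child_value_sum: total return accumulated through this child
    child_visits:    number of times this child has been visited
    parent_visits:   number of times the parent node has been visited
    c:               exploration constant (sqrt(2) is the classic choice)
    """
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# At each step of tree descent, pick the child with the highest score.
scores = [uct_score(10.0, 5, 20), uct_score(2.0, 1, 20)]
print(max(range(len(scores)), key=scores.__getitem__))
```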

Note that this is again an unfair comparison, because DQN does not search, while MCTS searches against a ground-truth model (the Atari simulator). But this kind of unfairness sometimes doesn’t matter if all you want is a good result.

Reinforcement learning can in theory be applied to anything, including environments where the world model is unknown. However, this generality comes at a cost: it is hard to exploit any problem-specific information that would help learning. This forces us to use huge numbers of samples to learn things that could have been hard-coded simply.

Therefore, with a few exceptions, domain-specific algorithms tend to work better and faster than reinforcement learning. That’s fine if you’re getting into reinforcement learning for its own sake, but be prepared when you compare your results against other methods. And if you’re still wondering how large the gap is between Deep RL-trained robots and those built with classical robotics, take a look at the products of well-known robotics companies like Boston Dynamics.

The Atlas biped uses no reinforcement learning. Reading their paper, it relies on classical techniques: time-varying LQR, QP solvers, and convex optimization. So when used correctly, classical techniques can perform better on specific problems.
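
To give a flavor of that “classical” toolbox, here is a minimal finite-horizon, time-varying LQR backward Riccati recursion. This is a generic textbook sketch, not Boston Dynamics’ controller, and the double-integrator dynamics below are placeholder values:

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, horizon):
    """Backward Riccati recursion for x' = A x + B u with cost x'Qx + u'Ru.

    Returns a list of time-varying feedback gains K_t, so that u_t = -K_t x_t.
    """
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return list(reversed(gains))

# Toy double-integrator example (placeholder dynamics, illustration only).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.eye(1) * 0.1
K_t = finite_horizon_lqr(A, B, Q, R, horizon=50)
print(K_t[0])  # feedback gain applied at the first timestep
```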

Reinforcement learning usually requires a reward function

An important assumption of reinforcement learning is the existence of a reward function that guides the agent in the “right” direction. This reward function, whether set by the researcher or hand-tuned offline, is usually fixed during learning. I say “usually” because there are occasional exceptions, such as imitation learning and inverse RL, but most reinforcement learning methods treat the reward as given.

What’s more, for the agent to do the right thing, the reward function must capture exactly what the researcher wants. And I do mean exactly. Reinforcement learning has an annoying tendency to overfit to your reward, and the agent will happily slip through any loophole and produce unexpected results. That’s why Atari games are such an ideal benchmark: not only do they provide a large sample size, but the goal in every game is to maximize the score, so we never have to worry about defining the reward.

Similarly, MuJoCo runs in a simulation, so we know all the states of the target, and the reward function is easy to design, which is the main reason for its popularity.

In the Reacher task above, we need to control an arm connected to a center point (blue) so that the end of the arm (orange) reaches the red target. Since all the positions are known, we can define the reward as the negative distance from the end of the arm to the target, plus a small control cost. In theory, if the sensors are precise enough, we could do this in real life too. But for most tasks where we actually care about what the system ends up doing, a good reward is much harder to design.
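
A reward of that shape might look like the following sketch. This is my own illustration of the idea, not the exact MuJoCo Reacher code, and the coefficient is an assumed value:

```python
import numpy as np

def reacher_style_reward(fingertip_pos, target_pos, action, ctrl_cost=0.1):
    """Reward = negative distance to the target, minus a small control penalty."""
    dist = np.linalg.norm(fingertip_pos - target_pos)
    return -dist - ctrl_cost * np.sum(np.square(action))

# Closer to the target and smaller torques -> reward closer to zero.
print(reacher_style_reward(np.array([0.1, 0.0]),
                           np.array([0.0, 0.0]),
                           np.array([0.05, -0.02])))
```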

Of course, by itself, having a reward function isn’t a big deal, but it creates a ripple effect later on.

The reward function is hard to design

Writing down a reward function isn’t hard; the difficulty comes when you try to design one that encourages the desired behavior while still being learnable.

In the HalfCheetah environment, we have a two-legged robot confined to a vertical plane, which means it can only run forward or backward.

The goal is to learn a running gait, and the reward is forward velocity. This is a shaped reward: the closer the robot gets to the goal, the more the system rewards it. This is in contrast to sparse rewards, where reward is given only in the goal state and nowhere else. Shaped rewards often make learning easier, because the policy gets positive feedback even when it hasn’t found a complete solution to the problem.
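
The contrast between the two reward styles can be written down directly. A hedged sketch (not the actual HalfCheetah reward code; `x_velocity` and `goal_x` are assumed quantities):

```python
def shaped_reward(x_velocity):
    """Shaped reward: faster forward motion earns more reward at every step."""
    return x_velocity

def sparse_reward(x_position, goal_x=10.0):
    """Sparse reward: nothing at all until the goal is actually reached."""
    return 1.0 if x_position >= goal_x else 0.0
```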

Unfortunately, shaped rewards can also bias learning. As mentioned earlier, they can cause the robot to behave differently than expected. A typical example is OpenAI’s blog post “Faulty Reward Functions in the Wild”. The intended goal of the boat racing game shown below is to finish the race. You can imagine a sparse reward that gives +1 for finishing within a given time and 0 otherwise. Instead, the researchers’ setup rewarded two things: finishing the race and collecting score targets placed around the track. In the end, the OpenAI agent found a loop where it could collect score targets over and over. It never finished the race, but it scored higher.

To be honest, when this post first came out, I was a little annoyed, because the problem wasn’t reinforcement learning, it was the reward design. If you give strange rewards, the results of reinforcement learning will be strange too. But in writing this article, I’ve found it useful to have such a compelling example of failure, because every time the subject comes up, the video can serve as a demonstration. So on that basis, I will grudgingly admit that it is a “good” blog post.

Reinforcement learning algorithms sit on a continuum of how much they are allowed to assume about their environment. The broadest category, model-free reinforcement learning, is close to black-box optimization: it assumes little more than being in an MDP, where the agent is simply told that this action earned +1 reward and the rest is left for it to figure out. And like black-box optimization, model-free RL suffers from the same problem: the agent treats every +1 as good, even if the +1 was earned for the wrong reasons.

A classic non-RL example is a genetic algorithm applied to circuit design, which produced a circuit whose final design depended on an unconnected logic gate.

All the grey cells in the diagram are required for the circuit to behave correctly, including the isolated grey cell in the upper left corner

A more recent example is Salesforce’s 2017 work on automatic text summarization. They trained their baseline model with supervised learning and then evaluated it with an automated metric called ROUGE. ROUGE is non-differentiable, but reinforcement learning can handle non-differentiable rewards, so they also tried optimizing ROUGE directly with reinforcement learning. It got good ROUGE scores, but the summaries themselves were not great. Here is an example:

Button denied 100th race start for McLaren after ERS failure. Button then spent much of the Bahrain Grand Prix on Twitter delivering his verdict on the action as it unfolded. Lewis Hamilton has out-qualified and finished ahead of Mercedes team-mate Nico Rosberg at every race this season. Bernie Ecclestone confirms F1 will make its bow in Azerbaijan next season.

Even though the reinforcement learning model got the highest score, they ended up using another model…
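
The trick that makes a non-differentiable metric like ROUGE usable as a training signal is the score-function (REINFORCE) estimator: sample an output, score it, and weight the log-probability of that sample by its score. A minimal sketch of that loss, my own illustration rather than Salesforce’s code, with made-up numbers standing in for sampled summaries:

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=0.0):
    """Score-function estimator for maximizing a non-differentiable reward.

    log_probs: (batch,) log-probability of each sampled sequence
    rewards:   (batch,) non-differentiable scores (e.g. ROUGE) for each sample
    """
    advantages = rewards - baseline
    # Minimizing this loss pushes up the probability of high-reward samples.
    return -(advantages.detach() * log_probs).mean()

# Toy usage: two sampled summaries with made-up log-probs and ROUGE scores.
log_probs = torch.tensor([-2.3, -1.1], requires_grad=True)
rouge_scores = torch.tensor([0.42, 0.18])
loss = reinforce_loss(log_probs, rouge_scores, baseline=rouge_scores.mean())
loss.backward()
print(log_probs.grad)
```

Note how nothing in the loss cares whether the reward actually measures summary quality, which is exactly how a model can win on ROUGE while producing summaries nobody wants.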

Another interesting example is the “Lego stacking” paper by Popov et al. from 2017. The researchers used a distributed version of DDPG to learn a grasping policy. In their experiment, the robot’s goal was to grab a red Lego block and stack it on top of a blue one.

Their approach was mostly successful, but there were some failure cases. For the initial lifting motion, the reward was defined by the height of the red block, specifically the z-coordinate of the block’s bottom face. During learning, the robot found that simply flipping the block over, studs facing down, earned just as much reward.

One fix is to make the reward sparse and reward the robot only after it has stacked the blocks. Sometimes this works, because sparse rewards are still learnable. But often it isn’t an option, because the lack of positive reward makes useful experience too rare and training too hard. The other fix is to shape the reward more carefully, adding new reward terms or adjusting the existing coefficients until the robot stops taking shortcuts. But this approach is essentially a battle of wits against reinforcement learning, an endless one, and while such “patching” is sometimes necessary, I never feel like I’ve learned anything from it.

If you don’t believe me, check out the “Lego Stack” reward function for a reference:

I don’t know how much time they spent designing this reward, but based on the number of terms and the number of different coefficients, my guess is “a lot.”
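
For a sense of what such a “patched” reward ends up looking like, here is a hypothetical multi-term shaped reward in the spirit of the stacking task. The terms and coefficients below are invented for illustration; they are not the paper’s actual function:

```python
def stacking_reward(grasped, block_height, dist_to_target, stacked,
                    w_grasp=0.1, w_lift=1.0, w_reach=0.5, w_stack=10.0):
    """A hypothetical shaped reward with several hand-tuned terms."""
    reward = 0.0
    reward += w_grasp if grasped else 0.0   # encourage grasping at all
    reward += w_lift * block_height         # encourage lifting the block
    reward -= w_reach * dist_to_target      # encourage moving toward the stack
    reward += w_stack if stacked else 0.0   # the thing we actually want
    return reward
```

Every coefficient is another knob to tune, and every knob is another opportunity for the agent to find a shortcut you didn’t anticipate.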

In conversations with researchers in other areas of reinforcement learning, I have also heard plenty of anecdotes about inappropriate reward settings:

  • Someone was training a robot to navigate indoors. If the agent walked out of the room, the episode ended; the system counted this as the robot “committing suicide”, with no negative reward. By the end of training, the robot chose to “die” almost every time, because positive rewards were so hard to obtain and negative rewards so abundant that a quick termination at zero reward looked preferable.

  • Someone was training a simulated robot arm to reach a point above a table. The problem was that the target point was defined relative to the table, and the table wasn’t anchored to anything. After long training, the best strategy the agent learned was to slam the table so that it toppled over and the target point automatically rolled to the end of the arm.

  • Someone was training a robot to hammer in a nail. Initially, they defined the reward as how far the nail was driven into the hole, so the robot ignored the hammer completely and kept pounding the nail with its own limbs. They then added a reward term encouraging the robot to pick up the hammer. After retraining, the robot learned to pick up the hammer, but it immediately dropped it again and went back to pounding with its limbs.

Admittedly, these are all anecdotes, with no video or paper to back them up, but they all ring true from my years of being burned by reinforcement learning. I also know people who like to tell “paperclip maximizer” stories, and I understand their fears, but spinning failures like these into a “destroy humanity” tale about a surreal AGI is tiresome, especially when such silly failure cases come up again and again.

Even the most reasonable reward cannot avoid local optima

The previous examples are often called “reward hacking,” and to me they are really clever out-of-the-box solutions in which the agent ends up with more reward than the researchers expected. Reward hacking is actually the exception, though. A more common problem in reinforcement learning is getting stuck in a poor local optimum somewhere in the observe-act loop.

Normalized Advantage Function in a HalfCheetah environment

From an outsider’s point of view, this robot looks a bit silly. But we can only call it “silly” because we have a god’s-eye view and a great deal of prior knowledge. We all know that running on your feet makes more sense, but reinforcement learning doesn’t know that: all it sees is a state vector, the action vector it is about to take, and the rewards it has received so far.

During the learning process, the agent thinks like this:

  • During random exploration, lurching forward was better than standing still;

  • It could keep doing this, so it kept moving forward;

  • After settling into that behavior, it found that applying a lot of force at once produced a backflip, which gave even more reward;

  • After enough backflips, it decided this was a good way to rack up reward, so backflipping was baked into the policy;

  • Once the policy is moving backwards continuously, which is easier: correcting itself to run the “standard way”, or learning to move forward while lying on its back? It chose the latter.

The idea is interesting, but not what the researchers expected.

Another failure case comes from the Reacher task we mentioned earlier.

In this task, the initial random weights tend to output strongly positive or strongly negative actions, that is, most actions output the maximum or minimum acceleration. The problem is that the linked arm can easily spin itself up to high speed: apply maximum force at every joint and it spins out of control. Once the robot gets into this state, it is hard to leave the current policy; to do so, it needs exploration steps that actually stop the spinning. That is theoretically possible, but the robot in the GIF never managed it.

This is the classic exploration-exploitation dilemma, a perennial problem in reinforcement learning: your data comes from your current policy. Explore too much and you drown in junk data you can’t learn anything useful from; exploit too much and you burn in behaviors that aren’t optimal.

The field has several intuitive, simple ideas for addressing this: intrinsic motivation, curiosity-driven exploration, count-based exploration, and so on. Many of these ideas were first proposed in the 1980s or earlier, and some have been revisited with deep learning models, but none works reliably in every environment. I keep hoping for a more general exploration technique, and I believe the field will find a better answer in the next few years.
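
Count-based exploration, for example, just adds a bonus that decays with how often a state has been visited. A minimal tabular sketch, using the common 1/sqrt(N) bonus rather than any particular paper’s formulation:

```python
import math
from collections import defaultdict

class CountBonus:
    """Adds an exploration bonus of beta / sqrt(N(s)) to the environment reward."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def augmented_reward(self, state, env_reward):
        self.counts[state] += 1
        return env_reward + self.beta / math.sqrt(self.counts[state])

bonus = CountBonus()
print(bonus.augmented_reward("s0", 0.0))  # first visit: large bonus
print(bonus.augmented_reward("s0", 0.0))  # repeat visit: smaller bonus
```

The hard part, of course, is that in high-dimensional continuous state spaces “counting visits” is itself a research problem, which is why these methods don’t transfer to every environment.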

I used to think of reinforcement learning as a malicious adversary that deliberately misinterprets your reward and then actively searches for the laziest possible local optimum. It sounds ridiculous, but it’s actually a rather apt description.

When Deep RL works, it may just be overfitting to weird patterns in the environment

Reinforcement learning is popular because it is the only area of machine learning where it’s acceptable to train on the test set.

One of the nice things about reinforcement learning is that if you want to do better in a given environment, you can overfit like crazy. The downside is that the resulting system only works for that specific task; if you want it to generalize to other environments, sorry, you can’t count on it.

DQN solves a lot of Atari games, but it does so by pouring all of its learning into a single goal, getting a great score in one game, so the final model doesn’t transfer to other games. You can fine-tune a trained model to adapt it to a new game (see the paper Progressive Neural Networks), but there is no guarantee it transfers, and people generally don’t expect it to, because pretrained RL features have not proved anywhere near as reusable as pretrained ImageNet features.

Of course, some people will object. Indeed, in principle, a model trained over a widely distributed set of environments should avoid these problems. Navigation is one example: you can randomly sample goal locations and generalize over them with universal value functions. I find this work very promising, and I’ll give more examples of it later, but I still don’t think Deep RL’s generalization is strong enough to handle a diverse set of tasks. Its “perception” has gotten much sharper, but it is not yet at the level of an “ImageNet for control”. OpenAI Universe tried to take on that challenge, but it still has a lot of limitations.

Until highly generalizable Deep RL arrives, we are stuck with policies that generalize very narrowly. A paper I contributed to, Can Deep RL Solve Erdos-Selfridge-Spencer Games?, provides a suitable case. We studied a two-player toy game with a closed-form solution for optimal play. In the first experiment, we kept player 1 fixed and trained player 2 with a reinforcement learning algorithm, effectively treating player 1 as part of the environment. By playing player 2 against player 1, we ended up with a near-optimal player 2. But when we flipped the setup and trained player 1 against it, we found its performance got worse and worse, because it had only ever played against the optimal player 2 and had never encountered the non-optimal situations.

Marc Lanctot et al.’s paper A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning (NIPS 2017) shows a similar result. In the picture below, two agents are playing laser tag, trained with multi-agent reinforcement learning. To test generality, the researchers ran the training with five different random seeds. Here is the result of fixing one player and training the other with reinforcement learning:

You can see player 1 and player 2 approaching each other and shooting. The researchers then pitted the player 1 from this experiment against the player 2 from a different run. By the end of training, players that should in theory have learned the full repertoire of tactics ended up looking like this:

This seems to be a recurring feature of multi-agent reinforcement learning: when agents train against each other, they co-evolve. An agent becomes good against its particular opponent, but its performance degrades against an unseen one. The two GIFs above use the same learning algorithm and the same hyperparameters; the only difference is the random seed, and the divergent behavior comes entirely from the randomness of the initial conditions.

That said, competitive self-play environments also show some seemingly contradictory results. OpenAI’s blog post Competitive Self-Play describes their progress in this area, and self-play is also an important part of AlphaGo and AlphaZero. My guess is that if the agents learn at the same pace, they can keep challenging each other and accelerate each other’s evolution, but if one learns much faster, it exploits the weaker player and overfits to it. This may sound easy to avoid, but once you move from a symmetric game to a general multi-agent setting, you see how hard it is to keep the learning rates matched.

Deep RL is unstable and the results are difficult to reproduce

Hyperparameters affect the behavior of the learning system, are present in almost every machine learning algorithm, and are usually set by hand or by random search. Supervised learning is stable: fixed dataset, ground-truth targets. Change the hyperparameters a little and overall performance doesn’t change that much. There are good and bad hyperparameters, but thanks to years of accumulated experience, there are now plenty of clues during training that tell you whether your hyperparameters are reasonable. Based on those clues, we can tell whether we’re off track, and whether to keep training or go back to the drawing board.

However, Deep RL is still unstable, which has become a bottleneck of research.

When I first joined Google Brain, one of the first things I did was reproduce the algorithm from the NAF paper. I figured I was proficient with Theano (which translates easily to TensorFlow), had plenty of Deep RL experience, and the author of the paper was also at Google Brain, so I could ask him questions at any time. With all of that in my favor, I thought 2-3 weeks should be enough.

But as it turned out, it took me six weeks to reproduce their results. The main culprit was a few software bugs, but the point is: why did it take so long?

For this we can start with the simplest task in the OpenAI Gym: Pendulum. A pendulum is anchored at a point and swings when torque is applied to it. The input state is three-dimensional (the pendulum’s position and velocity), the action space is one-dimensional (the torque applied to the pendulum), and the goal is to balance the pendulum perfectly upright. The simple way to do this is to reward the pendulum for being close to vertical (inverted): the smaller the angle from vertical, the higher the reward. So the reward function is concave.
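
The reward being described might look like the sketch below. It mirrors the shape of the Gym Pendulum cost; the coefficients are the commonly cited ones and are not guaranteed to match every Gym version, so treat this as an illustration rather than the environment’s actual code:

```python
import numpy as np

def pendulum_reward(theta, theta_dot, torque):
    """Reward is highest when the pendulum is upright and still.

    theta is the angle from vertical (0 = perfectly inverted); the quadratic
    penalties on velocity and torque discourage wild swinging and large forces.
    """
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

print(pendulum_reward(0.0, 0.0, 0.0))    # upright and still: best possible (0)
print(pendulum_reward(np.pi, 0.0, 0.0))  # hanging straight down: strongly negative
```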

A nearly successful policy: although the pendulum doesn’t stand fully upright, it outputs the torque needed to counteract gravity

Below is a performance plot I got after fixing all the bugs. Each line is one of 10 independent runs, using the same hyperparameters but different random seeds.

You can see that only 7 of the 10 runs worked. In fact, a 30% failure rate is quite normal. Let’s look at a result from the paper Variational Information Maximizing Exploration. The environment is HalfCheetah with sparse rewards, with episode reward on the Y-axis and timestep on the X-axis; the algorithm used is TRPO.

The dark curve is the median performance over 10 random seeds, and the shaded region spans the 25th to 75th percentile. To be clear, this chart is a good argument in favor of VIME. But the 25th-percentile line stays very close to zero reward, which means roughly 25% of the runs fail, purely because of the random seed.
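
Curves like these are usually produced by running many seeds and plotting the median with a percentile band. A minimal sketch of that aggregation, using random placeholder data in place of the paper’s results:

```python
import numpy as np

# returns[i, t] = episode reward of seed i at evaluation step t
# (placeholder data standing in for 10 training runs).
rng = np.random.default_rng(0)
returns = rng.normal(size=(10, 100)).cumsum(axis=1)

median = np.median(returns, axis=0)              # the dark curve
p25, p75 = np.percentile(returns, [25, 75], axis=0)  # the shaded band
print(median[-1], p25[-1], p75[-1])
```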

Supervised learning is stable, with few exceptions. If my supervised learning code failed to beat random chance 30% of the time, I’d be quite sure there was a bug in the code. But with reinforcement learning, I can’t tell whether the failures come from a bug, from bad hyperparameters, or just from bad luck.

The picture above is from the article “Why is Machine Learning So Hard?”. Its core argument is that machine learning adds many dimensions to the space of failure cases, and those dimensions multiply the ways things can break. Deep RL adds a further dimension, randomness, and the only way to deal with randomness is to throw more experiments at the problem to drown out the noise.

As mentioned earlier, Deep RL algorithms already suffer from low sample efficiency and unstable training, and this random dimension adds insult to injury, drastically slowing down the rate at which you can produce results. Maybe a single run only needs a million steps, but multiply that by five random seeds, and then by the hyperparameter sweep, and the compute needed just to test a hypothesis explodes.
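
To make that blow-up concrete: a single run of $10^6$ steps, repeated over 5 seeds and, say, 10 hyperparameter settings (an illustrative number, not one from any particular paper), already requires

```latex
10^{6}\ \text{steps} \times 5\ \text{seeds} \times 10\ \text{settings} = 5 \times 10^{7}\ \text{environment steps}
```

just to evaluate one idea.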

Andrej Karpathy said this when he was still at OpenAI:

Here’s a consolation. When I was working through a set of reinforcement learning problems, it took me about six weeks, roughly 50% of my time, to get a from-scratch policy gradient implementation working again. And that’s with me having done this for a long time, having a GPU cluster to work with, and having a network of experienced friends I could see every day.

I also found that the CNN design wisdom I’d learned from supervised learning didn’t seem to help much in reinforcement learning, because the bottleneck is credit assignment and supervision bitrate, not representational power. Your ResNets, batchnorms, and very deep networks have no power here.

In supervised learning, if we want to achieve something, even if we end up doing it badly, we usually still learn something non-random. Reinforcement learning doesn’t work that way: if the reward function is designed wrong, or the hyperparameters are tuned badly, the result can be worse than random. And because it’s reinforcement learning, even when everything is perfect, we still have a 30% failure rate.

In short, your failures are less because neural networks are hard and more because Deep RL is hard.

If a random seed is like the canary in a coal mine and it alone can make this much difference, imagine what happens when the code actually has a bug… The good news, of course, is that we don’t have to speculate on our own. Someone has done the work for us — Deep Reinforcement Learning That Matters. Their conclusions:

  • Multiplying the reward by a constant can cause significant differences in performance;

  • Five random seeds (the standard commonly used in papers) may not be enough, since with careful selection you can get non-overlapping confidence intervals;

  • Even with all the same hyperparameters and algorithms, different implementations can have different performance when solving the same task.

My take is that reinforcement learning is very sensitive to both initialization and the dynamics of training, because all of the data is collected online and the only metric we can monitor, the reward, is a single scalar. Policies that happen to work well early get reinforced more quickly than policies that don’t, and if a good policy fails to produce good examples in time, reinforcement learning hastily concludes that everything it does is bad.

Applications of Deep RL

There’s no denying that Deep RL looks like a very cool field right now, especially in the press. Imagine a single model that learns from raw pixels alone, without per-game tuning, or think of AlphaGo and AlphaZero. Isn’t that exciting?

However, apart from these successful cases, it is difficult to find other real-world applications for Deep RL.

I have also thought about how Deep RL could be used in real life and in production, and found it very hard to deploy. In the end, I found only two projects that looked promising: one reducing the power consumption of data centers, and the other the recently announced AutoML Vision. Both are Google projects, and OpenAI had previously proposed something similar to the latter.

As far as I know, Audi is also exploring reinforcement learning; they demonstrated a self-driving car at NIPS that reportedly uses Deep RL. There are also Deep RL-based text summarization models, chatbots, and ad-serving systems, but when it comes to commercial deployment, even if companies have done it, they are keeping quiet about it.

So Deep RL is still a narrow, if hot, research area. You might guess that the big companies are using it everywhere, but as someone working in the industry, I think that’s unlikely.

Looking to the future

There is an old saying in academia that every researcher must learn how to hate the field they study. The joke being that most researchers do this out of passion and interest, and never quite get tired of it.

That is probably the biggest lesson I’ve taken from studying reinforcement learning. Although this is only my personal opinion, I think we should push reinforcement learning into a wider range of areas, even onto problems where it doesn’t seem to have an obvious application, and study it more thoroughly.

Here are some conditions that I believe make a problem a better fit for Deep RL:

  • It is easy to generate nearly unbounded amounts of experience;

  • The problem can be reduced to a simpler form;

  • There is a way to introduce self-play into the learning process;

  • There is a clean way to define a learnable, ungameable reward;

  • If the reward has to be shaped, it should at least be rich.

Here’s my list of some reasonable guesses about future research trends, in the hope that Deep RL will surprise us more in the future.

  • Local optima may be good enough. We’ve been chasing global optimality, but isn’t that a bit arrogant? After all, human evolution itself has only ever moved along a few directions. Maybe we’ll find that local optima are enough, and we don’t need to blindly pursue the global optimum;

  • When code can’t solve the problem, hardware might. I’m sure some people attribute the achievements of artificial intelligence to breakthroughs in hardware. While I don’t think hardware solves everything, I have to admit it plays an important role. The faster the machines, the less we need to worry about sample efficiency and the easier it is to explore;

  • Add more learning signal. Sparse rewards are hard to learn from because there is so little information to work with;

  • Model-based learning can unlock sample efficiency. In principle, a good model can solve a whole range of problems, as AlphaGo shows, so adding a model-based component may be worth trying;

  • Use reinforcement learning as a fine-tuning step. The first AlphaGo paper started with supervised learning and then applied RL fine-tuning on top. This is a nice recipe, because it uses a faster but less powerful method to speed up the initial learning;

  • Rewards can be learned. If designing rewards is so hard, maybe we can have the system learn the reward itself; there is a lot of work on imitation learning and inverse reinforcement learning, and maybe that route works too;

  • Use transfer learning to improve efficiency. Transfer learning means using knowledge accumulated on previous tasks to learn new ones faster. This is definitely a trend;

  • Good priors can greatly shorten learning time. This overlaps with the previous point. One view is that transfer learning is about using past experience to build a good foundation for learning new tasks. RL algorithms are designed to apply to any Markov decision process, which is arguably the root of all evil. If we accept that our solutions only need to work well on a small class of environments, we should be able to exploit shared structure to solve them efficiently. Pieter Abbeel has said in talks that Deep RL only needs to solve tasks in the real world, and I agree. So we could learn a real-world prior from that shared structure, letting Deep RL pick up real tasks quickly at the cost of being worse on artificial ones;

  • Harder environments may, paradoxically, be easier. This is a lesson from BAIR (Berkeley AI Research) and from the DeepMind parkour work: if we make the environment more varied and the task more complex, the learning problem can actually become easier, because the policy cannot overfit to any single setting. Compare ImageNet: models trained on ImageNet generalize better than models trained on CIFAR-100. So maybe we don’t need a fully general reinforcement learning system; a broad starting point might be enough.

Original article: www.alexirpan.com/2018/02/14/rl-hard.html