Translator | Liu Zhiyong
Editor | Natalie
AI Frontline introduction: Agents that learn from experience are usually based on reinforcement learning. Reinforcement learning algorithms can generally be divided into two types: model-free algorithms, which learn policies or value functions, and model-based algorithms, which learn a dynamics model. Although model-free deep reinforcement learning algorithms can learn a wide range of robotic skills, they have very high sample complexity, typically requiring millions of samples to achieve good performance, and they can usually learn only one task at a time. This high sample complexity and inflexibility keeps them from being widely used to learn locomotion skills in the real world. Model-based reinforcement learning algorithms are generally considered to be more sample efficient.

However, to achieve good sample efficiency, these traditional model-based algorithms use either relatively simple function approximators, which fail to generalize well to complex tasks, or probabilistic dynamics models. How can this conundrum be solved? Take a look at the blog post "TDM: From Model-Free to Model-Based Deep Reinforcement Learning" below, and you may come away with a clearer picture.

You decide to ride your bike from your UC Berkeley residence to the Golden Gate Bridge. It’s a fantastic 20-mile ride, but there’s a problem: you’ve never ridden a bike! Even worse, you’re new to the Bay Area, and all you have is a map. So how do you get started?

Let’s first think about how people learn to ride a bike.

One strategy is to do a lot of learning and planning: read books on how to ride a bike, study physics and anatomy. Plan out all the muscle movements you will make in response to each disturbance. This may seem like an elegant approach, but anyone who has learned to ride a bike knows that this strategy is doomed to failure. There’s only one way to learn to ride a bike: trial and error. Some tasks, like riding a bicycle, are simply too complicated to plan in advance in your head.

Trial and error is a common way to solve problems and gain knowledge. It can be seen as one of the simplest approaches to problem solving, in contrast to approaches based on insight and theoretical reasoning. In trial and error, one possible solution is selected and applied to the problem at hand. If it fails, another possible solution is selected and tried, and so on. The whole process ends when one of the attempted solutions produces the correct result.

Once you learn to ride a bike, how will you get to the Golden Gate Bridge? You could repeat the trial-and-error strategy: take a few random turns and see if you end up at the Golden Gate Bridge. Unfortunately, this strategy would take a very long time. For problems like this, planning is a much faster strategy, and it requires substantially less real-world experience and trial and error. In reinforcement learning terms, it is more sample-efficient.

Left: Some skills that can be learned by trial and error. Right: Other times, it’s better to plan ahead.

Simple as it is, this thought experiment highlights some important aspects of human intelligence. For some tasks we use trial and error, and for others we use planning. A similar phenomenon arises in reinforcement learning: empirical results show that some tasks are better suited to model-free (trial-and-error) approaches, while others are better suited to model-based (planning) approaches.

The cycling analogy also underscores that the two systems are not completely independent. In particular, to say that learning to ride a bike is purely trial and error is an oversimplification. In fact, when you learn to ride by trial and error, you also do a bit of planning. Perhaps your initial plan is just "don’t fall." As you improve, you make more ambitious plans, such as "ride forward two meters without falling." Eventually, you become skilled enough that you can start planning in very abstract terms ("ride to the end of this road"), to the point where all that remains is planning and you no longer need to worry about the details of riding. We see a gradual transition from a model-free (trial-and-error) strategy to a model-based (planning) strategy. If we can develop artificial intelligence algorithms (and in particular reinforcement learning algorithms) that model this behaviour, we may obtain an algorithm that both performs well (by using trial and error early on) and is sample-efficient (by later switching to a planning approach to achieve more abstract goals).

This post introduces temporal difference models (TDMs), a family of reinforcement learning algorithms that capture this smooth transition between model-free and model-based reinforcement learning. Before talking about TDMs, let’s first review how a typical model-based reinforcement learning algorithm works.

Note: Temporal difference learning, which combines dynamic programming and the Monte Carlo method, is a core idea of reinforcement learning. The Monte Carlo method simulates (or experiences) a whole sequence and, once the sequence ends, estimates the value of each state from the returns observed in that sequence. Temporal difference learning simulates (or runs through) a sequence one step (or a few steps) at a time and updates the value of a state from the reward and the estimated value of the new state, before the sequence has finished. The Monte Carlo method can be viewed as temporal difference learning with the maximum number of steps.
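To make the note above concrete, here is a minimal Python sketch (not from the original post) contrasting a Monte Carlo value update with a one-step temporal difference (TD(0)) update for a tabular value function. The toy 5-state chain and all numbers are illustrative assumptions.

```python
import numpy as np

def monte_carlo_update(V, episode, alpha=0.1, gamma=0.99):
    """Update state values from a completed episode of (state, reward) pairs.

    The return G is accumulated backwards from the end of the episode, so the
    update can only be applied once the episode has finished.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One-step TD update: bootstrap from the current estimate of the next state.

    Unlike Monte Carlo, this can be applied after every single transition.
    """
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
    return V

# Toy usage on a 5-state chain (hypothetical numbers).
V = np.zeros(5)
episode = [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.0), (4, 1.0)]
V = monte_carlo_update(V, episode)
for (s, r), (s_next, _) in zip(episode[:-1], episode[1:]):
    V = td0_update(V, s, r, s_next)
print(V)
```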

Model-based reinforcement learning

In reinforcement learning, we have some state space $\mathcal{S}$ and action space $\mathcal{A}$. If at time $t$ we are in state $s_t \in \mathcal{S}$ and take action $a_t \in \mathcal{A}$, we transition to a new state $s_{t+1} = f(s_t, a_t)$ according to a dynamics model $f$. Our goal is to maximize the sum of rewards over the visited states. Model-based reinforcement learning algorithms assume that the dynamics model $f$ is given (or learned). Given this dynamics model, there are a variety of model-based algorithms. For this article, we consider methods that perform the following optimization to choose a sequence of actions and states that maximize the reward:

$$\max_{a_1, \dots, a_T,\; s_1, \dots, s_T} \; \sum_{t=1}^{T} r(s_t, a_t) \quad \text{such that} \quad s_{t+1} = f(s_t, a_t) \;\; \forall t.$$

The optimization says to choose a sequence of states and actions that maximize rewards, while ensuring that the trajectory is feasible.
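The optimization above can be approximated in many ways. One simple, widely used approach is random shooting over the dynamics model: sample many candidate action sequences, roll each through the model, and keep the best. The sketch below (Python/NumPy, an illustration rather than the method from the paper) assumes hypothetical `dynamics_model` and `reward_fn` callables.

```python
import numpy as np

def plan_random_shooting(state, dynamics_model, reward_fn, horizon=20,
                         n_candidates=1000, action_dim=2, action_scale=1.0):
    """Approximate the model-based optimization by sampling action sequences.

    dynamics_model(s, a) -> next state and reward_fn(s, a) -> scalar reward are
    assumed to be given (or learned). The feasibility constraint
    s_{t+1} = f(s_t, a_t) is satisfied by construction, because states are
    obtained by rolling the model forward.
    """
    best_return, best_actions = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-action_scale, action_scale,
                                    size=(horizon, action_dim))
        s, total_reward = state, 0.0
        for a in actions:
            total_reward += reward_fn(s, a)
            s = dynamics_model(s, a)
        if total_reward > best_return:
            best_return, best_actions = total_reward, actions
    return best_actions, best_return

# Toy usage with hypothetical linear dynamics and a goal-reaching reward.
goal = np.array([1.0, 1.0])
f = lambda s, a: s + 0.1 * a
r = lambda s, a: -np.linalg.norm(s - goal)
actions, ret = plan_random_shooting(np.zeros(2), f, r)
```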

In this context, feasible means that each state transition is valid. For example, in the figure below, if you start in state $s_t$ and take action $a_t$, only the top $s_{t+1}$ results in a feasible transition.

It would be a lot easier to plan a trip to the Golden Gate Bridge if you could defy physics. However, the constraint in the model-based optimization problem ensures that only trajectories like the top row can be output. The bottom two trajectories may have high reward, but they are not feasible.

In our bike problem, optimization might plan a ride from Berkeley (top right) to the Golden Gate Bridge (center left), which looks like this:

An example of a plan (states and actions) output by the optimization problem.

While the plan sounds nice conceptually, it is unrealistic. Model-based approaches use a model $f(s, a)$ to predict the state at the very next time step, which in robotics typically corresponds to a tenth or a hundredth of a second. So a more realistic depiction of the resulting plan might look like this:

A more realistic plan.

If we think about how we plan in everyday life, we realize that our plans are far more temporally abstract. Instead of predicting where the bicycle will be one tenth of a second from now, we make longer-term plans like, "I’ll ride to the end of this road." Moreover, it is only once we have learned how to ride a bike that we can make these temporally abstract plans at all. As discussed earlier, we need methods that (1) learn by trial and error and (2) provide a mechanism for gradually increasing the level of abstraction at which we plan. To this end, we introduce temporal difference models.

Temporal difference models

A temporal difference model (TDM), which we write as $Q(s, a, s_g, \tau)$, is a function that, given a state $s \in \mathcal{S}$, action $a \in \mathcal{A}$, and goal state $s_g \in \mathcal{S}$, predicts how close the agent can get to the goal within $\tau$ time steps. Intuitively, a TDM answers the question: "If I try to ride to San Francisco in 30 minutes, how close will I get?" For robotics, a natural way to measure closeness is Euclidean distance.

Note: The Euclidean distance, also called the Euclidean metric, is the "ordinary" (i.e., straight-line) distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to it as the Pythagorean metric.

A TDM predicts how close you will be to your goal (the Golden Gate Bridge) after a fixed amount of time. After 30 minutes of cycling, you may only get as far as the gray cyclist pictured above. In that case, the gray line segment represents the distance that the TDM should predict.
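As a concrete illustration of what a TDM takes in and predicts, here is a hypothetical sketch (PyTorch) of the Euclidean-distance closeness measure and a goal- and horizon-conditioned Q network. The architecture and dimensions are assumptions for illustration, not the ones used in the paper.

```python
import torch
import torch.nn as nn

def goal_distance(state, goal):
    """Euclidean distance between (batches of) achieved states and goals."""
    return torch.norm(state - goal, dim=-1)

class TDMNetwork(nn.Module):
    """Q(s, a, s_g, tau): predicts (negative) distance to the goal after tau steps.

    Illustrative architecture: state, action, goal, and horizon are simply
    concatenated and passed through an MLP.
    """
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + state_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, goal, tau):
        x = torch.cat([state, action, goal, tau], dim=-1)
        return self.net(x)

# Toy usage (hypothetical dimensions).
q = TDMNetwork(state_dim=4, action_dim=2)
s, a, g = torch.randn(8, 4), torch.randn(8, 2), torch.randn(8, 4)
tau = torch.full((8, 1), 5.0)
print(goal_distance(s, g).shape)   # torch.Size([8])
print(q(s, a, g, tau).shape)       # torch.Size([8, 1])
```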

For those familiar with reinforcement learning, it turns out that a TDM can be viewed as a goal-conditioned Q function in a finite-horizon MDP. Because a TDM is just another Q function, we can train it with model-free (trial-and-error) algorithms. We train the TDM with deep deterministic policy gradient (DDPG) and retroactively relabel the goal and time horizon to improve the sample efficiency of the learning algorithm. In theory, any Q-learning algorithm could be used to train a TDM, but we found this approach effective; readers are welcome to consult the paper for more details.
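The relabeling trick mentioned above can be sketched as follows. This is an illustration in the spirit of hindsight relabeling, not the exact scheme from the paper: when sampling a transition from the replay buffer, the goal is replaced with a state actually reached later in the same trajectory and the horizon is resampled, so every transition provides training signal for many (goal, horizon) pairs.

```python
import numpy as np

def relabel_transition(trajectory, t, max_tau, rng=np.random):
    """Sample a relabeled (s, a, s', goal, tau, reward) tuple from a trajectory.

    trajectory is a list of (state, action, next_state) tuples. The new goal is
    a state visited later in the same trajectory, and tau is drawn uniformly;
    both choices are illustrative assumptions.
    """
    state, action, next_state = trajectory[t]
    # Pick a future index and use the state reached there as the new goal.
    future_t = rng.randint(t, len(trajectory))
    new_goal = trajectory[future_t][2]
    new_tau = rng.randint(0, max_tau + 1)    # resampled planning horizon
    # Reward: negative distance to the relabeled goal (in the finite-horizon
    # TDM formulation this signal applies when the horizon runs out).
    reward = -np.linalg.norm(next_state - new_goal)
    return state, action, next_state, new_goal, new_tau, reward

# Toy usage on a random 10-step trajectory of 3-dimensional states.
traj = [(np.random.randn(3), np.random.randn(2), np.random.randn(3))
        for _ in range(10)]
print(relabel_transition(traj, t=2, max_tau=15))
```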

AI Frontline note: The deep deterministic policy gradient (DDPG) algorithm, proposed by Lillicrap et al., extends the Q-learning ideas of DQN to the deterministic policy gradient (DPG) method. It is built on the actor-critic (AC) framework and can be used to solve deep reinforcement learning problems with continuous action spaces; a minimal sketch of one DDPG update step is given after the reference link below.

See Continuous Control with Deep Reinforcement Learning

https://arxiv.org/abs/1509.02971
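For readers unfamiliar with DDPG, the following is a minimal sketch of one actor-critic update step in PyTorch. It shows generic DDPG, not the TDM-specific variant from the paper; the network interfaces and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, polyak=0.005):
    """One DDPG gradient step on a batch of (s, a, r, s', done) tensors.

    actor(s) -> action and critic(s, a) -> Q-value are assumed to be
    nn.Module instances with those (hypothetical) call signatures.
    """
    s, a, r, s_next, done = batch

    # Critic: regress Q(s, a) toward the bootstrapped Bellman target.
    with torch.no_grad():
        target_q = r + gamma * (1 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, pi(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of the target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - polyak).add_(polyak * p.data)
```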

Planning with TDMs

Once we have trained a TDM, how do we use it for planning? It turns out that we can plan with the following optimization:

$$\max_{a_0, a_K, \dots, a_T,\; s_K, s_{2K}, \dots, s_T} \; \sum_{t = K, 2K, \dots, T} r(s_t, a_t) \quad \text{such that} \quad Q(s_t, a_t, s_{t+K}, K) = 0 \;\; \forall t \in \{0, K, \dots, T-K\},$$

where $s_0$ denotes the current state.

The intuition is similar to the model-based formulation: choose a sequence of actions and states that maximize rewards. A key difference is that we only plan every $K$ time steps, rather than every single time step. The constraint $Q(s_t, a_t, s_{t+K}, K) = 0$ enforces the feasibility of the trajectory. Visually, rather than explicitly planning over all the intermediate steps and actions like this:

we can plan directly over every $K$ time steps, as shown in the figure below:

As $K$ increases, we get temporally more and more abstract plans. In between every $K$ steps, we use a model-free approach to take actions, which lets the model-free policy "abstract away" the details of how the goal is actually reached. For the bicycle problem, with a large enough value of $K$, the optimization might produce a plan like this:

A model-based planner can be used to choose temporally abstract goals. A model-free algorithm can be used to reach those goals.

One thing to note is that this formulation can only optimize the reward every $K$ steps. However, many tasks only care about a small number of states, such as the final state (e.g., "reach the Golden Gate Bridge"), so it still captures a variety of interesting tasks.
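To make the planning formulation above more concrete, here is a hypothetical sketch of optimizing over intermediate goal states spaced $K$ steps apart, with the trained TDM used as a feasibility constraint. The paper's actual implementation differs; in particular, relaxing the hard constraint $Q = 0$ into a quadratic penalty, and folding the action into `tdm_q`, are assumptions made only for this illustration.

```python
import numpy as np
from scipy.optimize import minimize

def plan_with_tdm(tdm_q, reward_fn, start_state, state_dim, K, num_segments,
                  penalty=100.0):
    """Optimize a sequence of subgoal states spaced K steps apart.

    tdm_q(s, s_g, K) is assumed to return the predicted distance shortfall to
    s_g after K steps (zero when reachable); the hard constraint Q = 0 is
    relaxed into a quadratic penalty on the objective.
    """
    def objective(flat_states):
        states = flat_states.reshape(num_segments, state_dim)
        total, prev = 0.0, start_state
        for s in states:
            total -= reward_fn(s)                        # maximize reward
            total += penalty * tdm_q(prev, s, K) ** 2    # feasibility penalty
            prev = s
        return total

    x0 = np.tile(start_state, num_segments)
    result = minimize(objective, x0, method="L-BFGS-B")
    return result.x.reshape(num_segments, state_dim)

# Toy usage with a hypothetical "TDM" based on straight-line distance.
fake_tdm = lambda s, g, k: max(np.linalg.norm(g - s) - 0.5 * k, 0.0)
goal = np.array([5.0, 5.0])
plan = plan_with_tdm(fake_tdm, lambda s: -np.linalg.norm(s - goal),
                     np.zeros(2), state_dim=2, K=4, num_segments=3)
print(plan)
```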

Related work

We are not the first to investigate the connection between model-based and model-free reinforcement learning. Parr’08[¹] and Boyan’99[²] are particularly related, although they focus mainly on tabular and linear function approximators. The idea of training a goal-conditioned Q function has also been explored in Sutton’11[³] and Schaul’15[⁴], for robot navigation and Atari games. Finally, the relabelling scheme we use is inspired by Andrychowicz’17[⁵].

Experiments

We tested TDMs on five simulated continuous-control tasks and one real-world robotics task. One of the simulated tasks involves training a robot arm to push a cylinder to a target position. Below is an example of the final TDM policy, along with the associated learning curves:

An animation of the TDM policy performing the task could not be embedded here; it can be viewed at: http://bair.berkeley.edu/static/blog/tdm/pusher-video-small.gif.

Learning curves. The blue line is TDM (lower is better).

The learning curves above show the final distance to the goal versus the number of environment samples (lower is better). Our simulation controls the robot at 20 Hz, which means 1,000 steps correspond to 50 seconds in the real world. The dynamics of this environment are relatively easy to learn, which means a model-based approach should do well. As expected, the model-based approach (purple curve) learns quickly, in about 3,000 steps, or 2.5 minutes of real-world time, and performs well. The TDM approach (blue curve) also learns quickly, in about 2,000 steps, or roughly 1.7 minutes. The model-free DDPG baseline (without TDMs) eventually solves the task, but requires many more training samples. One reason the TDM approach learns so quickly is that it is effectively a model-based method in disguise.
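As a quick sanity check on the numbers above (a trivial calculation, included only to make the conversion explicit): at a 20 Hz control frequency, environment steps translate to wall-clock time as follows.

```python
CONTROL_HZ = 20  # simulated control frequency quoted in the post

def steps_to_minutes(steps, hz=CONTROL_HZ):
    """Convert environment steps to real-world minutes at the given frequency."""
    return steps / hz / 60.0

print(steps_to_minutes(1000))  # ~0.83 min, i.e. 50 seconds
print(steps_to_minutes(3000))  # 2.5 minutes (model-based)
print(steps_to_minutes(2000))  # ~1.7 minutes (TDM)
```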

The model-free approach looks much better when we move to locomotion tasks, which have substantially harder dynamics. One of the locomotion tasks involves training a quadruped robot to move to a target position. The resulting TDM policy is shown below, along with the corresponding learning curves:

The TDM policy for a locomotion task.

Learning curves. The blue line is TDM (lower is better).

Just as we use trial and error rather than planning to master riding a bicycle, we expect model-free methods to perform better than model-based methods on these locomotion tasks. That is exactly what we see in the learning curves above: the model-based method plateaus in performance. The model-free DDPG method learns more slowly, but ultimately outperforms the model-based approach. TDM learns quickly and achieves the best final performance. There are more experiments in the paper, including training a real-world 7-DoF Sawyer arm to reach target positions. We encourage readers to check them out!

Future directions

Temporal difference models provide a formalism and a practical algorithm for interpolating between model-free and model-based control. But much work remains to be done. First, the derivation assumes that the environment and the policy are deterministic. In practice, most environments are stochastic. Even if they were deterministic, there are compelling reasons to use a stochastic policy in practice (see the blog post Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning[⁶]). Extending TDMs to this setting would help move them toward more realistic environments. Another idea is to combine TDMs with model-based planning optimization algorithms other than the one used in this paper. Eventually, we hope to apply TDMs to even more challenging tasks, such as robotic locomotion, manipulation and, of course, cycling to the Golden Gate Bridge.

This paper has been accepted by ICLR 2018. For more information about TDM, check out the following links:

  • ArXiv Preprint https://arxiv.org/abs/1802.09081

  • Open source code https://github.com/vitchyr/rlkit

We call it a temporal difference model because we train $Q$ with temporal difference learning and use $Q$ as a model.

References:

[1] https://users.cs.duke.edu/~parr/icml08.pdf

[2] https://pdfs.semanticscholar.org/61d4/897dbf7ced83a0eb830a8de0dd64abb58ebd.pdf

[3] http://www.incompleteideas.net/papers/horde-aamas-11.pdf

[4] http://proceedings.mlr.press/v37/schaul15.pdf

[5] https://arxiv.org/abs/1707.01495

[6] http://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/

Original link:

TDM: From Model-Free to Model-Based Deep Reinforcement Learning

http://bair.berkeley.edu/blog/2018/04/26/tdm/