In the previous article, we discussed the problem of policy evaluation in the model-free setting, focusing on the prediction and control problems of the Monte Carlo (MC) sampling method. This time, we introduce another method: the temporal difference (TD) method.

1. The Temporal Difference (TD) Method

The MC sampling method cannot be applied without a complete state sequence. When a complete state sequence is not available, the temporal difference (TD) method can be used instead.

1. Introduction to TD

In the Monte Carlo sampling method, the return of a state is calculated as:


G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T

In the temporal difference method, however, we do not have a complete state sequence, only a partial one. How, then, can we approximate the return of a state? Recall the Bellman equation:


v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]

We can use R_{t+1} + \gamma v_\pi(S_{t+1}) as an approximate replacement for G_t, which gives the update:


\color{red}{v(S_t) \leftarrow v(S_t) + \alpha(R_{t+1} + \gamma v(S_{t+1}) - v(S_t))}
  • R_{t+1} + \gamma v(S_{t+1}) is generally called the TD target.

  • R_{t+1} + \gamma v(S_{t+1}) - v(S_t) is called the TD error.

  • The process of replacing G_t with the TD target is called bootstrapping.

So in this case, we only need two consecutive states and the corresponding reward to attempt to solve the reinforcement learning problem.
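
To make the TD(0) update concrete, here is a minimal prediction sketch in Python. The `env` object with `reset()`/`step()` returning `(next_state, reward, done)` and the `policy` function are assumptions for illustration, not part of the original text.

```python
from collections import defaultdict

def td0_prediction(env, policy, alpha=0.1, gamma=0.9, num_episodes=1000):
    """Estimate v_pi with the update v(S_t) <- v(S_t) + alpha*(R_{t+1} + gamma*v(S_{t+1}) - v(S_t))."""
    V = defaultdict(float)  # state-value estimates, default 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The TD target uses only the immediate reward and the bootstrapped next-state value
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```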

2. n-step Temporal Difference

Above, we used R_{t+1} + \gamma v_\pi(S_{t+1}) to approximate G_t, that is, we looked one step ahead. Can we look two steps ahead? Of course we can; in that case G_t is approximated by:


G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})

and the corresponding update of the value function v(S_t) is:


v(S_t) \leftarrow v(S_t) + \alpha(R_{t+1} + \gamma R_{t+2} + \gamma^2 v(S_{t+2}) - v(S_t))

From two steps to three steps, and then to n steps, we can generalize: the n-step temporal difference return G_t^{(n)} is expressed as follows:


G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

Looking two steps ahead gives the algorithm called TD(2); three steps ahead gives TD(3). As n grows larger and approaches infinity, that is, as it approaches using a complete state sequence, n-step temporal difference becomes equivalent to the Monte Carlo method. In particular, the one-step-ahead algorithm is called TD(0).
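
The following small sketch shows how the n-step return G_t^{(n)} could be computed from stored rewards; the variable names (`rewards`, `states`, `V`) are assumptions for illustration, not from the original.

```python
def n_step_return(rewards, V, states, t, n, gamma):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    rewards[k] holds R_{k+1}; states[k] holds S_k. If the episode ends before
    step t + n, the bootstrap term is dropped and this reduces to the MC return.
    """
    T = len(rewards)            # episode length
    steps = min(n, T - t)       # how many real rewards are available
    G = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:               # bootstrap only if S_{t+n} exists before termination
        G += gamma**n * V[states[t + n]]
    return G

# Update: v(S_t) <- v(S_t) + alpha * (G_t^(n) - v(S_t))
# V[states[t]] += alpha * (n_step_return(rewards, V, states, t, n, gamma) - V[states[t]])
```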

3. TD Summary

TD vs. MC:

  • TD can learn before the final outcome is known, or even when there is no final outcome, such as in continuing environments, while MC must wait until the final result is available. The temporal difference method can therefore update state-value estimates more quickly and flexibly.
  • When updating a state value, TD uses the TD target, which is based on the immediate reward and the estimated value of the next state rather than on the return actually obtained at the end of the state sequence; the result is a biased estimate of the current state's value. MC uses the actual return to update the state value, which is an unbiased estimate of the state value under the policy. In this respect MC is better than TD.
  • Although the value obtained by TD is a biased estimate, its variance is lower than that of MC. TD is sensitive to the initial values, but is generally more efficient than MC.

2. Using TD to Solve the Control Problem

As we have seen, TD has many advantages over MC: for example, it has lower variance and can learn from incomplete sequences. So we can use TD instead of MC in the policy control loop, and indeed the current mainstream reinforcement learning methods are based on TD. Here we introduce the two most commonly used algorithms, Sarsa and Q-learning.

1. Sarsa

For an action sequence as shown below:

In each iteration, we select an action A_t in the current state S_t using the ϵ-greedy method, then enter the next state S_{t+1} and receive the reward R_{t+1}. In the new state S_{t+1} we again select an action A_{t+1} using the ϵ-greedy method, and use it to update the value function as follows:


Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]
  • Note: the action A_{t+1} chosen here is the action that will actually be performed next; this is the biggest difference from the Q-learning algorithm.

  • TD target: \delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})

  • The value is updated at every non-terminal state S_t.

  • Each update uses five pieces of data, ⟨S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}⟩, which is the origin of the name Sarsa.

The Sarsa algorithm flow is as follows
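
As a complement to the algorithm flow, here is a minimal tabular Sarsa sketch in Python, assuming the same generic `env.reset()`/`env.step()` interface as in the TD(0) sketch above; the helper names are illustrative, not from the original.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def sarsa(env, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1, num_episodes=1000):
    Q = defaultdict(float)  # action-value table, default 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # A_{t+1} is chosen by the SAME eps-greedy policy and WILL be executed next
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            td_target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```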

n-step Sarsa

The Sarsa algorithm above updates after every single step. In fact, analogous to n-step TD, we can take several steps forward before updating; this is called n-step Sarsa:


Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(q_t^{(n)} - Q(S_t, A_t))

where q_t^{(n)} is:


q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})
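
A short sketch of the n-step target q_t^{(n)}, mirroring the n-step state-value return shown earlier (the variable names are assumptions for illustration):

```python
def n_step_sarsa_return(rewards, Q, states, actions, t, n, gamma):
    """q_t^(n) = R_{t+1} + ... + gamma^{n-1}*R_{t+n} + gamma^n * Q(S_{t+n}, A_{t+n})."""
    T = len(rewards)
    steps = min(n, T - t)
    G = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:  # bootstrap from the action value actually selected at step t+n
        G += gamma**n * Q[(states[t + n], actions[t + n])]
    return G
```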

2. On-policy and off-policy

The Sarsa algorithm above is an on-policy algorithm. As can be seen from the update formula, Sarsa explores with the ϵ-greedy method, and the action used to update the value is also selected with ϵ-greedy. In other words, exploration and value updates use the same policy π; this is called on-policy learning.

Another important approach is off-policy learning, that is, learning with different policies. In off-policy learning algorithms we have two different policies. One policy π is used to obtain the optimal value function (for example, using the greedy method); we call it the target policy. The other policy μ is used to generate trajectories and is more exploratory (for example, the ϵ-greedy method); we call it the behavior policy.
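
As a small sketch of these two roles, the behavior policy explores with ϵ-greedy while the target policy is purely greedy; the function names below are illustrative assumptions, not from the original.

```python
import random

def behavior_policy(Q, state, n_actions, epsilon=0.1):
    """mu: eps-greedy, used to generate trajectories (exploratory)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def target_policy(Q, state, n_actions):
    """pi: greedy, defines the value we are actually learning (exploitative)."""
    return max(range(n_actions), key=lambda a: Q[(state, a)])
```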

Reinforcement learning is largely a matter of exploration and exploitation: how to explore the environment to collect trajectories, and how to exploit the collected trajectories. Off-policy learning separates exploration and optimization into two parts: during optimization we only pursue maximization, without mixing in ϵ-exploration as on-policy methods do. The advantage of off-policy learning is therefore that it can better approach the globally optimal solution.

As shown in the figure above, this is a good metaphor for off-policy learning. The pirate is adventurous: he keeps the trajectories (experience) gained while braving the wind and waves (the environment), and can later use these lessons to learn how to sail better. For on-policy learning, the policy changes after every update, so previously generated interaction data is no longer useful; it must be discarded and new interaction data generated. Off-policy learning has an advantage here: because the target policy and the behavior policy are separated, previous data can be reused. Q-learning is an off-policy algorithm.

3. Q-Learning

For Q-learning, we use the ϵ-greedy method to select new actions, exactly as in Sarsa. However, when updating the value function, Q-learning uses the greedy method rather than Sarsa's ϵ-greedy method. This is the essential difference between Sarsa and Q-learning.

First, based on the state S_t, we select the action A_t with the ϵ-greedy method, perform A_t, receive the reward R_{t+1}, and enter the state S_{t+1}. For Q-learning, instead of selecting the next action A' with the ϵ-greedy method, it selects A' greedily based on the state S_{t+1}: A' is the action a that maximizes Q(S_{t+1}, a), and this value is used to update the value function. In the figure above, this corresponds to choosing, among the three black-circle actions at the bottom, the one that maximizes Q(S', a) as A'. Note: the selected action A' only participates in the value-function update and is not actually executed.

  • For the behavior policy μ, the ϵ-greedy policy is used
  • For the target policy π, the greedy policy is used

\pi(S_{t+1}) = \arg\max_{a'} \; Q(S_{t+1}, a')

Therefore, the Q-learning target is:


R_{t+1} + \gamma Q(S_{t+1}, A') = R_{t+1} + \gamma Q(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')) = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')

The update formula of Q-learning is as follows:


Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]

Q-Learning Algorithm
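
As a complement to the algorithm figure, here is a minimal tabular Q-learning sketch under the same assumed `env` interface as the Sarsa sketch above; the names are illustrative, not from the original.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1, num_episodes=1000):
    Q = defaultdict(float)  # action-value table, default 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy mu: eps-greedy action that IS executed
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy pi: greedy max over Q(S_{t+1}, a), used only in the update, never executed
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            td_target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```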

Sarsa vs. Q-Learning

As the two classic control algorithms of the temporal difference method, each has its own characteristics.

Q-learning learns the optimal policy directly, whereas Sarsa learns the optimal policy while still exploring. As a result, when learning the optimal policy with Sarsa, to guarantee convergence we need a schedule that gradually decreases the ϵ-greedy hyperparameter ϵ during the iterations. Q-learning does not have this problem.

On the other hand, because Q-learning learns the optimal policy directly, and that policy depends on the data generated during training, it is strongly affected by the sample data and by the variance of the training data, which can even affect the convergence of the Q function. Deep Q-learning, the deep reinforcement learning version of Q-learning, also suffers from this problem.

During learning, Sarsa encourages exploration while converging, so the learning process is smoother and less aggressive, whereas Q-learning directly selects the optimal action and is more aggressive than Sarsa.

For example, in the Cliff Walking problem shown below, the Sarsa algorithm takes the safe blue route, while the more aggressive Q-learning algorithm takes the optimal red route. (The grey area is the cliff.)

Implementations of Sarsa and Q-learning can be found in the next article, Reinforcement Learning: Sarsa and Q-learning algorithms in action.

3. Comparison of DP, MC and TD

1. Whether Bootstrapping and Sampling Are Used

|    | Bootstrapping | Sampling |
|----|---------------|----------|
| DP | Yes           | No       |
| MC | No            | Yes      |
| TD | Yes           | Yes      |

2. DP and TD

3. Intuitive understanding of the three algorithms

  • DP

  • MC

  • TD

4. The relationship between the three algorithms

The figure above shows a cross-section of the space of reinforcement learning methods, highlighting two important dimensions: the depth and width of updates.

These two dimensions relate to the type of update operation used to improve the value function. The horizontal dimension indicates whether the method uses sample updates (based on a sampled sequence) or expected updates (based on the distribution of possible sequences). Expected updates require a probability distribution model, while sample updates require only a sample model.

The vertical dimension corresponds to the depth of the updates, that is, the degree of bootstrapping, ranging from one-step temporal difference updates to Monte Carlo updates. The middle region includes methods based on n-step updates (for example, λ-updates based on eligibility traces).

As can be seen, dynamic programming sits in the upper right of the space, because it performs one-step expected updates. The lower right corner is the extreme case of expected updates, where each update traverses all possible cases until the terminal state is reached (or, for continuing tasks, until discounting makes later rewards negligible); this corresponds to exhaustive search.

Source: Sutton, Reinforcement Learning: An Introduction, 2nd Edition.

This concludes the basic theory of traditional reinforcement learning. The next article covers the code for the Sarsa and Q-learning algorithms. Of course, this blog series is fairly basic; for a more in-depth treatment, see Sutton's Reinforcement Learning book.

In recent years, with the rise of neural networks, reinforcement learning based on neural networks has flourished, and many new, interesting, and more powerful algorithms have emerged. The following articles will therefore cover neural networks combined with reinforcement learning, also known as deep reinforcement learning (DRL). As a beginner myself, I will do my best to explain everything clearly, so please stay tuned!