
This article is the fourth part of the introduction-to-reinforcement-learning series. It introduces two very common temporal difference (TD) algorithms in reinforcement learning: Q-learning and Sarsa.

Temporal difference (TD)

First of all, what is temporal difference learning?

Temporal difference learning can learn policies directly from the experience of interacting with the environment, without building a model of the environment's dynamics; this is what is usually called a model-free method. The dynamic programming approach introduced earlier in this series is, by contrast, a model-based method and requires knowledge of the environment's dynamics.

Temporal difference learning uses experience to solve the prediction problem. In other words, the state-value function V_π is updated from sample experience collected by following the given policy.
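Concretely, after observing a transition from state s to s' with reward r while following the policy, the standard tabular TD(0) update for the value function has the same shape as the Q-value update rules used later in this article:

$$V(s) \leftarrow V(s) + \alpha\left[r + \gamma V(s') - V(s)\right]$$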

This article introduces two classic temporal difference learning methods: Q-learning and Sarsa.

Q-Learning

Q-learning is an off-policy algorithm: it can learn from both current data and past data.

How to learn?

The core of Q-learning is a Q table that is updated iteratively; all value updates are carried out on this table. Because the table stores historical data, Q-learning can learn not only from current data but also from past experience.

For example

Let's say there are three states, and each state has three available actions. Let's say action a1 gives a reward of +1, action a2 gives a reward of -1, and action a3 gives a reward of 0. The Q table can then be represented as a matrix.

The first step is to initialize the Q table so that every Q value (action value) starts at zero.
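As a concrete illustration (a minimal sketch; the 3×3 shape simply matches the toy example above), the Q table can be held in a NumPy array initialized to zeros:

```python
import numpy as np

n_states, n_actions = 3, 3           # s1..s3 and a1..a3 from the toy example
Q = np.zeros((n_states, n_actions))  # every action value Q(s, a) starts at 0
print(Q)
```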

Step 1: Make a decision.

Let's say the initial state is s1. An action is selected according to the ε-greedy strategy. Since all initial Q values are 0, the action is chosen at random. Assume action a1 is selected, the agent moves to s2, and the reward is +1.


Q-learning updates the Q table using the following rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$

Step 2: Update the Q value.

We assume the learning rate α is 1 and the discount factor γ is 0.8. The value of Q(s1, a1) is therefore updated to 0 + 1 × (1 + 0.8 × 0 − 0) = 1, and the Q table is updated accordingly.
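The same one-step update can be checked in code (the numbers are simply the ones assumed above):

```python
alpha, gamma = 1.0, 0.8  # learning rate and discount factor assumed above
r = 1.0                  # reward for taking a1 in s1
q_s1_a1 = 0.0            # current Q(s1, a1)
max_q_s2 = 0.0           # max over Q(s2, a'); the table is still all zeros

q_s1_a1 += alpha * (r + gamma * max_q_s2 - q_s1_a1)
print(q_s1_a1)  # 1.0, matching the arithmetic above
```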

This completes one update. Note that Q-learning is a one-step update method.

Each subsequent step proceeds in the same way; iterating this decision-and-update process is Q-learning.

Algorithm process

The pseudocode is as follows:
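Alongside the pseudocode, a minimal Python sketch of tabular Q-learning under common assumptions (a Gym-style environment whose `env.reset()` returns an integer state and whose `env.step(a)` returns `(next_state, reward, done, info)`) might look like this:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Minimal tabular Q-learning sketch (not the exact pseudocode from the post)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # decision: epsilon-greedy on the current Q table
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # update: off-policy target uses the max over the next state's actions
            target = r + gamma * np.max(Q[s_next]) * (0.0 if done else 1.0)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```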

How to make decisions?

For action selection, the ε-greedy approach picks, with probability 1 − ε, the action with the highest value in the current state, and with probability ε picks an action at random.
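Written out as a small helper (a sketch; `Q` is the table indexed by state, and `epsilon` is the exploration probability):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the highest-value action."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])  # explore
    return int(np.argmax(Q[s]))               # exploit: argmax_a Q(s, a)
```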

How to update the Q value?

When updating the Q table, taking action a produces a reward r and a next state s'; for the next state s', the update uses the maximum Q value over its actions.

It is worth noting that, because of this max operation, Q-learning is not well suited to continuous-action problems.

Sarsa

Sarsa is an on-policy algorithm: it can only learn from current data.

How to learn?

The Sarsa algorithm is similar to Q-learning: Sarsa also maintains a Q table, and its action selection (decision making) is the same as in Q-learning. The difference lies in the Q value update. Q-learning applies a max operation over the next state's actions, which is greedy, while Sarsa is more conservative: the update of Q(s, a) is based on the Q(s', a') of the action actually taken next.

Both algorithms use ε-greedy to select actions. However, Sarsa selects the next action with ε-greedy before updating and is guaranteed to actually execute that action, so the action used in the update is the action that will be taken next. As a result, what Sarsa learns at each step is not necessarily the most rewarding choice. Q-learning, on the other hand, takes the max over the next actions, so each step learns from the largest estimated return at that moment; but because of the randomness of ε-greedy, the agent may abandon the greedy choice and act randomly, so the action it actually takes next is not necessarily the one it learned from.

Example: comparison with Q-learning

Let’s assume that the Q table is set as follows:

Suppose the current state is s1, action a1 is selected and the agent moves to the next state s2, and then action a3 is selected in s2 according to the ε-greedy policy.

Then, when updating the Q value, Sarsa must consider the (s', a') of the next moment, namely (s2, a3). According to the formula:


$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[r + \gamma Q(s',a') - Q(s,a)\right]$$

the value of Q(s1, a1) should be updated to 2 + 1 × (1 + 0.8 × 2 − 2) = 2.6.

If Q-learning were used instead, the max operation would pick (s2, a1), whose Q value is 4, and the update would then be 2 + 1 × (1 + 0.8 × 4 − 2) = 4.2.
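The same two updates in code (α = 1, γ = 0.8, using the Q values assumed in this example):

```python
alpha, gamma = 1.0, 0.8
r = 1.0
q_s1_a1 = 2.0    # current Q(s1, a1) in the assumed table
q_s2_a3 = 2.0    # Q(s2, a3): the action Sarsa actually takes next
max_q_s2 = 4.0   # max over Q(s2, a'), i.e. Q(s2, a1), used by Q-learning

sarsa_value      = q_s1_a1 + alpha * (r + gamma * q_s2_a3  - q_s1_a1)  # 2.6
q_learning_value = q_s1_a1 + alpha * (r + gamma * max_q_s2 - q_s1_a1)  # 4.2
print(sarsa_value, q_learning_value)
```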

Algorithm process

The pseudocode is as follows:
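A corresponding minimal Python sketch of tabular Sarsa, under the same Gym-style assumptions as the Q-learning sketch above:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Minimal tabular Sarsa sketch (not the exact pseudocode from the post)."""
    def choose(Q, s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = choose(Q, s)  # pick the first action before entering the loop
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = choose(Q, s_next)  # the action that WILL actually be taken
            # update: on-policy target uses Q(s', a') of the chosen next action
            target = r + gamma * Q[s_next, a_next] * (0.0 if done else 1.0)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next  # commit to executing a_next
    return Q
```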

Because Sarsa's update takes into account the values of consecutive time steps, it is better suited to such continuous problems.

Summary

Q-learning is an off-policy algorithm and Sarsa is an on-policy algorithm. The algorithmic difference between the two lies in how the Q value is updated: Q-learning uses the maximum Q value of the next moment in its update, while Sarsa uses the Q value of the action actually chosen at the next moment.
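In code, the only line that differs between the two sketches above is the TD target (illustrative numbers; `Q_s2` holds hypothetical values for the three actions in s2, with a3 being the action ε-greedy actually picked):

```python
import numpy as np

gamma, r = 0.8, 1.0
Q_s2 = np.array([4.0, 1.0, 2.0])  # hypothetical Q(s2, a1..a3)
a_next = 2                        # index of a3, the action actually chosen in s2

target_q_learning = r + gamma * Q_s2.max()    # 4.2: best next action, whatever is taken next
target_sarsa      = r + gamma * Q_s2[a_next]  # 2.6: the action that will actually be executed
```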
