This article is the second part of the introduction to reinforcement learning series. It mainly introduces the Markov decision process (MDP), a very important theoretical framework in reinforcement learning.

MDP: Markov Decision Process

MDP stands for Markov Decision Process. An MDP is the mathematical formalization of the reinforcement learning problem, so this section marks the start of the theoretical part of the series.

Basic concept

What is reinforcement learning?

A few concepts need to be made clear first. The first is the agent: an agent is a machine that can learn and make decisions. Everything outside the agent that it interacts with is called the environment. At each moment, the agent observes the state of the environment and selects an action; the environment responds to that action, transitions to a new state at the next moment, and produces a reward that is returned to the agent. This is one step of agent-environment interaction, illustrated in the diagram below.

Reinforcement learning studies this interactive learning between the agent and the environment. The agent's learning signal is the reward returned by the environment, and the RL objective is to maximize the expected cumulative sum of rewards. It is a form of active learning without supervision, and rewards are also the basis for evaluating action choices.
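To make the interaction loop above concrete, here is a minimal sketch in Python. The `ToyEnv` and `RandomAgent` classes are hypothetical toys invented for illustration, not part of any library.

```python
import random

class ToyEnv:
    """A hypothetical two-state environment: reaching state 1 yields reward 1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment reacts to the action: it moves to a new state
        # (the chosen one with probability 0.8) and emits a reward.
        self.state = action if random.random() < 0.8 else 1 - action
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

class RandomAgent:
    """A hypothetical agent that picks actions uniformly at random."""
    def act(self, state):
        return random.choice([0, 1])

env, agent = ToyEnv(), RandomAgent()
state = env.reset()
for t in range(5):
    action = agent.act(state)          # the agent selects A_t based on S_t
    state, reward = env.step(action)   # the environment returns S_{t+1} and R_{t+1}
    print(f"t={t}  action={action}  next_state={state}  reward={reward}")
```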

MDP

MDPs are the foundation of reinforcement learning and the theoretical framework of RL. In an MDP we consider states $S$, actions $A$, and rewards $R$. Specifically, at time $t$ the agent observes some feature representation $s_t$ of the environment state and selects an action $a_t$; at the next time step it receives the consequence of that action, namely the reward $R_{t+1}$, and enters the next state $s_{t+1}$. When the state, action, and reward sets $(S, A, R)$ of an MDP each contain a finite number of elements, the MDP is called a finite MDP. The formalized sequence is as follows:


$$(s_0, a_0, r_0, \ldots, s_t, a_t, r_t, \ldots)$$

The four-argument dynamics function:


$$p(s', r \mid s, a) = P(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)$$
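As a rough illustration, the dynamics function can be stored as a lookup table for a tiny finite MDP. The states, actions, and probabilities below are hypothetical numbers chosen by hand; the only property being demonstrated is that, for every $(s, a)$ pair, the probabilities over $(s', r)$ outcomes sum to 1.

```python
# p maps (s, a) to a distribution over (s', r) outcomes; all entries are made up.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 1.0): 0.9, ("s0", 0.0): 0.1},
    ("s1", "a0"): {("s1", 1.0): 1.0},
    ("s1", "a1"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
}

# For every state-action pair, the probabilities over (s', r) must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-12
print("every p(., . | s, a) is a valid probability distribution")
```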

Here’s a summary:


$$\begin{aligned}
&\text{Process}\\
&\quad (s_0, s_1, s_2, \ldots, s_t, \ldots) \;\text{ with }\; P(s_t \mid s_{t-1}, \ldots, s_0)\\
&\text{Markov Process}\\
&\quad (s_0, s_1, s_2, \ldots, s_t, \ldots) \;\text{ with }\; P(s_t \mid s_{t-1}, \ldots, s_0) = P(s_t \mid s_{t-1})\\
&\text{Markov Reward Process}\\
&\quad (s_0, r_0, s_1, r_1, s_2, r_2, \ldots, s_t, r_t, \ldots) \;\text{ with }\; P(s_t \mid s_{t-1}, \ldots, s_0) = P(s_t \mid s_{t-1})\\
&\text{Markov Decision Process}\\
&\quad (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_t, a_t, r_t, \ldots) \;\text{ with }\; P(s_t \mid s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_t \mid s_{t-1}, a_{t-1})
\end{aligned}$$

The probability equation above shows that in an MDP, the distribution of each possible value of $S_t$ and $R_t$ depends only on the immediately preceding state and action $S_{t-1}, A_{t-1}$, and not on any earlier states or actions. A state with this property is said to be Markov.

For continuing tasks, the return $G_t$ is defined as the discounted sum of future rewards:


$$\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned}$$

$\gamma$ is called the discount factor, with $0 \le \gamma < 1$.

In fact, the total return $G_t$ is still finite: as long as the rewards are bounded and $\gamma < 1$, $G_t$ converges. For example, when the reward is a constant 1, it can be written as


$$G_t = \sum_{k=0}^{\infty} \gamma^k = \dfrac{1}{1-\gamma}$$
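A small numeric check of these formulas, assuming a discount factor of 0.9 (an arbitrary choice for illustration): the truncated discounted sum of constant rewards approaches $1/(1-\gamma)$, and the recursive form gives the same value.

```python
gamma = 0.9  # discount factor, assumed to satisfy 0 <= gamma < 1

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite list of future rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# With a constant reward of 1, the truncated return approaches 1 / (1 - gamma).
rewards = [1.0] * 1000
print(discounted_return(rewards, gamma))  # ~10.0
print(1.0 / (1.0 - gamma))                # 10.0

# The recursive form G_t = R_{t+1} + gamma * G_{t+1}, evaluated backwards,
# gives the same value.
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g
print(g)
```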

State value function and action value function

A policy is a mapping from states to probabilities of selecting each action; it is the stochastic rule by which the agent chooses actions, and it is denoted $\pi$. A policy is always conditioned on the state: $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$. In other words, $\pi(\cdot \mid s)$ is a probability distribution over actions for the given state, while $\pi(a \mid s)$ is one specific probability value.
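A stochastic policy can be stored as a simple table. The states and actions below are hypothetical names used only for illustration.

```python
import random

# pi maps each state to a probability distribution over actions;
# the states "s0"/"s1" and actions "left"/"right" are hypothetical.
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi["s0"]["right"])        # pi(a="right" | s="s0"): one specific probability
print(sample_action(pi, "s0"))  # an action drawn from the distribution pi(. | s0)
```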

The value of a state differs under different policies, so we use $v_{\pi}(s)$ to denote the value of state $s$ under policy $\pi$, and $q_{\pi}(s, a)$ to denote the value of taking action $a$ in state $s$ under policy $\pi$. $v_{\pi}$ is called the state value function of policy $\pi$, and $q_{\pi}$ is called the action value function of policy $\pi$. Under different policies, the probability of taking the same action may differ.

For every state $s$, given a policy $\pi$, the value of the state is the expected return the agent obtains by starting from $s$ and selecting actions according to $\pi$. The expectation is needed because, starting from $s$, different actions lead to different successor states $s'$ and different rewards, so we must average over all possibilities.


$$v_{\pi}(s) = E_{\pi}[\,G_t \mid S_t = s\,] = E_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big]$$

Similarly, for the action value function $q_{\pi}$: given the policy $\pi$, after taking action $a$ in state $s$, the environment may produce different rewards and successor states, so we take the expectation over all possible subsequent trajectories:


$$q_{\pi}(s, a) = E_{\pi}[\,G_t \mid S_t = s, A_t = a\,] = E_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\Big]$$
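Because both value functions are defined as expectations over returns, they can be approximated by sampling episodes and averaging discounted returns. Below is a rough Monte Carlo sketch for $v_{\pi}(s_0)$ on a tiny made-up two-state environment; the dynamics, the uniform-random policy, and the truncation horizon are all hypothetical choices for illustration.

```python
import random

gamma = 0.9
actions = ["stay", "move"]
# A uniform-random policy over the two actions in both states.
pi = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 0.5, "move": 0.5}}

def step(s, a):
    """Hypothetical dynamics: 'move' flips the state; landing in s1 pays reward 1."""
    s_next = {"s0": "s1", "s1": "s0"}[s] if a == "move" else s
    return s_next, (1.0 if s_next == "s1" else 0.0)

def sample_return(s, horizon=200):
    """One sampled (truncated) discounted return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(actions, weights=[pi[s][b] for b in actions])[0]
        s, r = step(s, a)
        g += discount * r
        discount *= gamma
    return g

# Averaging many sampled returns approximates v_pi(s0).
estimate = sum(sample_return("s0") for _ in range(2000)) / 2000
print("Monte Carlo estimate of v_pi(s0):", estimate)
```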

To understand this intuitively: $v_{\pi}$ is the expected return starting from a state, and $q_{\pi}$ is the expected return starting from a state-action pair. The transition probability $p$ is fixed by the environment, so the only thing we can optimize is the policy.

In fact, it is not difficult to see that $v_{\pi}$ and $q_{\pi}$ are related as follows:


$$v_{\pi}(s) = E_{a \sim \pi(\cdot \mid s)}[\,q_{\pi}(s, a)\,] = \sum_a \pi(a \mid s)\, q_{\pi}(s, a)$$

Reading the formula: the value of a state is the expectation, under the current policy's action probabilities, of the values of the possible actions in that state. The value of an action, in turn, is the expected immediate reward plus the discounted value of the successor state.
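As a quick sanity check of the relation above, here is a tiny numeric example; the policy probabilities and action values are hypothetical numbers chosen by hand for one fixed state.

```python
# Hypothetical numbers for one fixed state s: a policy pi(a|s) and action values q_pi(s, a).
pi_s = {"left": 0.2, "right": 0.8}
q_s = {"left": 1.0, "right": 3.0}

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a): the expectation of q under the policy.
v_s = sum(pi_s[a] * q_s[a] for a in pi_s)
print(v_s)  # 0.2 * 1.0 + 0.8 * 3.0 = 2.6
```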

For $q_{\pi}(s, a)$ we have:


$$q_{\pi}(s, a) = E[\,R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s, A_t = a\,] = \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_{\pi}(s')\,]$$

Further, for any policy $\pi$ and any state $s$, we can relate the value of state $s$ to the values of its possible successor states $s'$:


$$\begin{aligned}
v_{\pi}(s) &= E_{\pi}[\,R_{t+1} + \gamma G_{t+1} \mid S_t = s\,] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a)\,\big[\,r + \gamma E_{\pi}[\,G_{t+1} \mid S_{t+1} = s'\,]\,\big] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_{\pi}(s')\,]
\end{aligned}$$

This equation is known as the Bellman equation. Since $p$ is a probability distribution, the inner sum is in fact an expectation, namely $q_{\pi}$.
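One standard way to exploit the Bellman equation is iterative policy evaluation: apply it repeatedly as an update rule until the values stop changing. Below is a minimal sketch on a tiny hypothetical MDP; the dynamics table `p` and the uniform-random policy `pi` are made up for illustration.

```python
gamma = 0.9
states = ["s0", "s1"]
actions = ["a0", "a1"]
# Dynamics p[(s, a)] = {(s', r): probability}; all entries are made up.
p = {
    ("s0", "a0"): {("s0", 0.0): 1.0},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 1.0): 1.0},
}
# A uniform-random policy.
pi = {s: {a: 0.5 for a in actions} for s in states}

v = {s: 0.0 for s in states}
for _ in range(10000):
    delta = 0.0
    for s in states:
        # Bellman equation used as an update:
        # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
        new_v = sum(
            pi[s][a] * prob * (r + gamma * v[s2])
            for a in actions
            for (s2, r), prob in p[(s, a)].items()
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break
print(v)  # approximate v_pi for each state under the random policy
```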

Bellman optimality equation

How do we derive that?

We know that the Bellman equation is, in fact, the recursive form of the state value function:


$$v_{\pi}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_{\pi}(s')\,]$$

How do we find the optimal policy?

In fact, there may be more than one optimal policy $\pi_*$, but they all share the same optimal state value function and optimal action value function, so we have:


$$v_*(s) = \max_{\pi} v_{\pi}(s), \qquad q_*(s, a) = \max_{\pi} q_{\pi}(s, a)$$

Under an optimal policy, the value of each state must equal the expected return of the best action in that state (since the policy is already determined), which gives:


$$\begin{aligned}
v_*(s) &= \max_{a \in A(s)} q_{\pi_*}(s, a) \\
&= \max_a E_{\pi_*}[\,G_t \mid S_t = s, A_t = a\,] \\
&= \max_a E_{\pi_*}[\,R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\,] \\
&= \max_a E[\,R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\,] \\
&= \max_a \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_*(s')\,]
\end{aligned}$$

This is the Bellman optimality equation for the state value function.
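Turning the Bellman optimality equation into an update rule gives value iteration. Below is a minimal sketch on the same kind of tiny hypothetical MDP as before; the dynamics table `p` is made up, and for this toy problem the best action always yields reward 1, so $v_* = 1/(1-\gamma) = 10$ in both states.

```python
gamma = 0.9
states = ["s0", "s1"]
actions = ["a0", "a1"]
# The same hypothetical dynamics table as in the policy-evaluation sketch.
p = {
    ("s0", "a0"): {("s0", 0.0): 1.0},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 1.0): 1.0},
}

v = {s: 0.0 for s in states}
for _ in range(10000):
    delta = 0.0
    for s in states:
        # Bellman optimality equation used as an update:
        # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
        best = max(
            sum(prob * (r + gamma * v[s2]) for (s2, r), prob in p[(s, a)].items())
            for a in actions
        )
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < 1e-10:
        break
print(v)  # both values approach 1 / (1 - gamma) = 10 for this toy MDP
```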

For the action value function, the Bellman optimality equation is:


$$\begin{aligned}
q_*(s, a) &= E\big[\,R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a\,\big] \\
&= \sum_{s', r} p(s', r \mid s, a)\,\big[\,r + \gamma \max_{a'} q_*(s', a')\,\big]
\end{aligned}$$
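Once $v_*$ or $q_*$ is known, an optimal policy can be read off greedily. The sketch below computes $q_*$ from the $v_*$ values found by the value-iteration sketch above (both equal to 10 for that hypothetical MDP) and then extracts the greedy policy.

```python
gamma = 0.9
# The same hypothetical dynamics table, and the v_* values it yields (10, 10).
p = {
    ("s0", "a0"): {("s0", 0.0): 1.0},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 1.0): 1.0},
}
v_star = {"s0": 10.0, "s1": 10.0}

# q_*(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v_*(s')]
q_star = {
    (s, a): sum(prob * (r + gamma * v_star[s2]) for (s2, r), prob in outcomes.items())
    for (s, a), outcomes in p.items()
}
# An optimal policy simply acts greedily with respect to q_*.
greedy = {s: max(("a0", "a1"), key=lambda a, s=s: q_star[(s, a)]) for s in ("s0", "s1")}
print(q_star)  # "a1" is worth 1 + 0.9 * 10 = 10, while "a0" is worth 0 + 0.9 * 10 = 9
print(greedy)  # {'s0': 'a1', 's1': 'a1'}
```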
