1. Introduction to reinforcement learning

Reinforcement learning tasks are usually described by a Markov Decision Process (MDP). Specifically, the agent lives in an environment, and each state is its perception of the current environment. The agent can only affect the environment through actions: when it performs an action, the environment transitions to another state with a certain probability, and at the same time feeds back a reward based on the underlying reward function. In general, reinforcement learning consists of four elements: states, actions, transition probabilities, and a reward function.

As shown in the figure above, when the agent performs a task, it interacts with the environment: each action produces a new state, and the environment gives back a reward. Agent and environment keep interacting in this loop, generating more and more data. A reinforcement learning algorithm interacts with the environment through a sequence of actions chosen by its policy, uses the newly generated data to improve that policy, and after several iterations the agent learns the policy needed to complete the task.
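As a minimal sketch of this interaction loop (the environment here is a made-up stand-in, not any particular library or task):

```python
import random

class ToyEnv:
    """A hypothetical two-state environment used only to illustrate the loop."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Toy dynamics: the action nudges the chance of switching state.
        if random.random() < 0.5 + 0.3 * action:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0
        done = random.random() < 0.1          # episode ends with small probability
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])               # a (random) policy chooses an action
    next_state, reward, done = env.step(action)  # environment returns new state and reward
    state = next_state                           # the agent would use (s, a, r, s') to improve its policy
```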

2. Markov Process

A state is said to have the Markov property if it contains all the information about the history that is relevant for predicting the future: given the current state, the past can be discarded, because the current state alone determines the distribution over the future. Formally:


$$P(S_{t+1}|S_t) = P(S_{t+1}|S_1, S_2, \cdots, S_t)$$

A Markov process is also called a Markov chain. It is a memoryless random process that can be represented by a tuple $\langle S, P \rangle$:

  • $S$ is a finite set of states
    $$S = \{s_1, s_2, s_3, \cdots, s_t\}$$
  • $P$ is the state transition probability matrix
    $$P(S_{t+1} = s'|S_t = s)$$
    where $s'$ denotes the state at the next moment and $s$ denotes the current state.

For state $S_1$, there is a probability of 0.1 of remaining in $S_1$, a probability of 0.2 of moving to $S_2$, and a probability of 0.7 of moving to $S_4$.

We can express this using a matrix:


$$P = \begin{pmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{pmatrix}$$
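A minimal sketch of such a transition matrix in code, where the first row encodes the probabilities quoted above for $S_1$ and the remaining rows are made-up placeholders chosen only so that each row sums to 1:

```python
import numpy as np

# Rows: current state, columns: next state.
P = np.array([
    [0.1, 0.2, 0.0, 0.7],   # S1 (from the text): stay 0.1, to S2 0.2, to S4 0.7
    [0.5, 0.0, 0.5, 0.0],   # S2 (placeholder values)
    [0.0, 0.3, 0.3, 0.4],   # S3 (placeholder values)
    [0.0, 0.0, 0.0, 1.0],   # S4 (placeholder values)
])
assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution

# Sample a short chain starting from S1.
rng = np.random.default_rng(0)
state, chain = 0, [0]
for _ in range(5):
    state = rng.choice(len(P), p=P[state])
    chain.append(state)
print(chain)
```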

3. Markov Reward Process

3.1. Concept introduction

A Markov reward process adds a reward function and a discount factor to the Markov chain and is represented by the tuple $\langle S, P, R, \gamma \rangle$:

  • $R$: the expected reward received at the next time step $(t+1)$ given that the state at time $t$ is $S_t = s$:

$$R_s = E[R_{t+1}|S_t=s]$$
  • $G_t$: the return, i.e. the discounted sum of all rewards on a Markov reward chain starting at time $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1}R_T$$

  • $\gamma$: the discount factor ($\gamma \in [0, 1]$). It is used because:

    • 1. it avoids infinite returns when the state sequence contains loops;
    • 2. predictions about the far future are not always accurate, so distant rewards should be taken with a grain of salt;
    • clearly, the closer $\gamma$ is to 1, the more weight is placed on long-term rewards.
  • $V(s)$: the state value function, i.e. the expected return $G_t$ of the Markov chain starting from state $s$:

$$v(s) = E[G_t|S_t = s]$$

Example: For the following states, we set the reward for entering $S_1$ to 5, the reward for entering $S_7$ to 10, and the reward for all other states to 0, i.e. $R = [5, 0, 0, 0, 0, 0, 10]$, with discount factor $\gamma = 0.5$. For the following two Markov chains, the returns (verified numerically in the sketch after this list) are:

  • $S_4, S_5, S_6, S_7$: $0 + 0.5 \times 0 + 0.5^2 \times 0 + 0.5^3 \times 10 = 1.25$
  • $S_4, S_3, S_2, S_1$: $0 + 0.5 \times 0 + 0.5^2 \times 0 + 0.5^3 \times 5 = 0.625$
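A minimal numerical check of these two returns, assuming the reward-on-entry sequences written out above:

```python
def discounted_return(rewards, gamma):
    """G_t = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a list of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

gamma = 0.5
print(discounted_return([0, 0, 0, 10], gamma))  # S4 -> S5 -> S6 -> S7: 1.25
print(discounted_return([0, 0, 0, 5], gamma))   # S4 -> S3 -> S2 -> S1: 0.625
```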

3.2. Bellman Equation


$$\begin{aligned} v(s) &= E[G_t|S_t = s] \\ &= E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots|S_t = s] \\ &= E[R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \cdots)|S_t=s] \\ &= E[R_{t+1} + \gamma G_{t+1}|S_t = s] \\ &= E[R_{t+1} + \gamma v(S_{t+1})|S_t = s] \\ &= \underbrace{E[R_{t+1}|S_t = s]}_{\text{current reward}} + \underbrace{\gamma E[v(S_{t+1})|S_t = s]}_{\text{expected value of the next state}} \end{aligned}$$

Using the Bellman equation, the state value $V(s)$ can be expressed as:


$$V(s) = \underbrace{R(s)}_{\text{Immediate reward}} + \underbrace{\gamma \sum_{s' \in S}P(s'|s)V(s')}_{\text{Discounted sum of future rewards}}$$

Here $S$ is the set of all states, and $s'$ ranges over the possible states at the next moment.

The Bellman equation shows that the value function $V(s)$ consists of two parts: the expected immediate reward $R(s)$, and the expected value of the state at the next moment, i.e. the sum over possible next states of their values weighted by the corresponding transition probabilities, multiplied by the discount factor. As shown in the figure below, the value function of state $s_1$ is: $V(s_1) = R(s_1) + \gamma\,(0.1 \times V(s_1) + 0.2 \times V(s_2) + 0.7 \times V(s_4))$.
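Because the Bellman equation for a Markov reward process is linear, the values of all states can be obtained in closed form as $V = (I - \gamma P)^{-1} R$. A minimal sketch, reusing the hypothetical 4-state transition matrix from the earlier snippet together with made-up rewards (the example in the figure has seven states):

```python
import numpy as np

gamma = 0.5
P = np.array([
    [0.1, 0.2, 0.0, 0.7],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.3, 0.3, 0.4],
    [0.0, 0.0, 0.0, 1.0],
])
R = np.array([5.0, 0.0, 0.0, 10.0])   # placeholder per-state rewards

# Bellman equation in matrix form: V = R + gamma * P @ V  =>  (I - gamma*P) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```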

4. Markov Decision Process

A Markov decision process adds a decision layer, i.e. a set of actions, on top of the Markov reward process. It can be represented by the tuple $\langle S, A, P, R, \gamma \rangle$, where $P$ and $R$ now depend on the specific action $a$, unlike in the Markov reward process, where they depend only on the state.

  • $A$ represents a finite set of actions

  • $S$ represents a finite set of states

  • $P^a$ is the dynamics / transition model for each action:

    $$P(s_{t+1} = s'|s_t = s, a_t = a) = P[S_{t+1}=s'|S_t = s, A_t = a]$$

  • $R$ is the reward function: $R(s_t = s, a_t = a) = E[R_{t+1}|s_t = s, a_t = a]$ (a concrete array sketch of $P^a$ and $R$ follows this list)
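As a concrete (and entirely made-up) illustration of how $P^a$ and $R$ can be stored, here is a minimal sketch with two states and two actions:

```python
import numpy as np

n_states, n_actions = 2, 2

# P[s, a, s'] = probability of moving to s' when taking action a in state s
P = np.array([
    [[0.8, 0.2],    # state 0, action 0
     [0.1, 0.9]],   # state 0, action 1
    [[0.5, 0.5],    # state 1, action 0
     [0.0, 1.0]],   # state 1, action 1
])
assert np.allclose(P.sum(axis=2), 1.0)   # each (s, a) row is a distribution over s'

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
```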

4.1 Policy

A policy $\pi$ is a distribution over actions given states; its element $\pi(a|s)$ is the probability of taking action $a$ in state $s$:

$$\pi(a|s) = P(A_t=a|S_t=s)$$

Here are some things to note:

  • A policy fully defines the agent's behaviour: it covers all actions and their probabilities in every state
  • A given policy is stationary, i.e. independent of time
  • A policy depends only on the current state, not on historical information, although the agent may update its policy over time

Under a policy $\pi$, the Markov decision process reduces to a Markov reward process whose transition probability and reward function satisfy the following equations, which can be understood by referring to the figure below:


$$\text{State transition probability:} \quad P^\pi(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s, a)$$

$$\text{Reward function:} \quad R^\pi(s) = \sum_{a \in A} \pi(a|s) R(s, a)$$

The state transition probability can be described as follows: under policy $\pi$, the probability of moving from state $s$ to state $s'$ is the sum, over all actions available in $s$, of the probability of taking each action times the probability that the action leads from $s$ to $s'$. Refer to the figure below.

The reward function can be described as follows: the reward obtained in state $s$ under policy $\pi$ is the sum, over all actions available in $s$, of the probability of taking each action times the immediate reward produced by that action.
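Given arrays of this form and a policy array `pi[s, a]`, $P^\pi$ and $R^\pi$ are just probability-weighted sums over actions. A minimal sketch, reusing the made-up two-state toy from above with a uniform policy:

```python
import numpy as np

n_states, n_actions = 2, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])      # P[s, a, s'], same toy values as above
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                     # R[s, a]

pi = np.full((n_states, n_actions), 0.5)       # uniform policy: pi(a|s) = 0.5

# P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a);  R^pi(s) = sum_a pi(a|s) R(s,a)
P_pi = np.einsum('sa,sax->sx', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)
print(P_pi)   # each row still sums to 1
print(R_pi)
```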

A policy can be understood as a set of action guidelines that describes the agent's behaviour in a standardized way. Having defined policies, we need a way to judge how good they are, which leads us to policy-based value functions.

State Value Function based on Policy

  • $v_\pi(s)$ represents the expected return obtained by following the current policy $\pi$ starting from state $s$:

$$v_{\pi}(s) = E_{\pi}[G_t|S_t = s]$$

$G_t$ is defined as in the Markov reward process. The state value tells us how good a state is, but to choose the action that leads to a good state we also need to judge how good each action is, i.e. to measure the value of actions.

Policy-based Action Value Function

  • $q_\pi(s, a)$: the expected return obtained by taking action $a$ in state $s$ and following policy $\pi$ thereafter:

$$q_{\pi}(s, a) = E_{\pi}[G_t|S_t=s, A_t=a]$$
  • It can be derived from the Bellman equation (see the derivation of $v(s)$ in the Markov reward process):

$$q_{\pi}(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \cdot v_{\pi}(s')$$

The value of taking a certain action in a certain state consists of two parts: the immediate reward obtained on leaving the state, and the discounted values of all possible next states weighted by their transition probabilities. Refer to the figure below.

  • From the definitions of the state value function and the action value function, the relationship between them can be obtained:

$$v_{\pi}(s) = \sum_{a \in A}\pi(a|s) \cdot q_{\pi}(s,a)$$

As we know, a policy describes the probability of taking each action in each state, and the state value is the expected return obtained when following that policy. Therefore, the value of state $s$ is the sum, over all actions possible under the policy, of the probability of taking each action times that action's value. Refer to the figure below.

  • The Bellman Expectation Equation can be obtained by combining the above two formulas

$$v_{\pi}(s) = \sum_{a \in A}\pi(a|s)\left(R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \cdot v_{\pi}(s')\right)$$

$$q_{\pi}(s,a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \cdot \sum_{a' \in A}\pi(a'|s') \cdot q_{\pi}(s',a')$$
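Under a fixed policy these equations are linear in $v_\pi$, so on a small, fully known MDP they can be solved directly and $q_\pi$ recovered afterwards. A minimal sketch on the same made-up two-state toy (not the author's example):

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])      # P[s, a, s'] (toy values)
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                     # R[s, a]
pi = np.full((2, 2), 0.5)
gamma = 0.9

# Reduce to an MRP under pi, then solve v_pi = R_pi + gamma * P_pi @ v_pi
P_pi = np.einsum('sa,sax->sx', pi, P)
R_pi = np.einsum('sa,sa->s', pi, R)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v_pi(s')
q_pi = R + gamma * np.einsum('sax,x->sa', P, v_pi)

# Consistency check: v_pi(s) = sum_a pi(a|s) q_pi(s,a)
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
print(v_pi, q_pi)
```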

5. Optimal value function

Solving a reinforcement learning problem means finding an optimal policy that lets the agent obtain consistently higher returns than under any other policy while interacting with the environment; this optimal policy is denoted $\pi_*$. Once $\pi_*$ is found, the reinforcement learning problem is solved. In general it is difficult to find the optimal policy directly, but a better policy can be determined by comparing the merits of several different policies, i.e. a locally optimal solution.

We generally compare policies through their corresponding value functions; in other words, the search for a better policy can be carried out by looking for a better value function. The optimal state value function is defined as the largest state value function over all policies:


$$v_*(s) = \max_\pi v_\pi(s)$$

Similarly, the optimal action value function is defined as the largest action value function over all policies:


$$q_*(s, a) = \max_\pi q_\pi (s,a)$$

We can obtain the optimal policy by maximizing the optimal action value function:


$$\pi_*(a|s) = \begin{cases} 1, & \text{if } a = \arg\max_{a \in A} q_*(s,a) \\ 0, & \text{otherwise} \end{cases}$$

Bellman Optimality Equation

As long as we find the maximum state value function or action value function, the corresponding policy $\pi_*$ is the solution to our reinforcement learning problem. Using the relationship between the state value function and the action value function, we can also get:


$$v_*(s) = \max_a q_*(s,a)$$

At the optimum, the value of a state equals the value of the best action that can be taken in that state.


$$q_*(s,a) = R(s, a) + \gamma \sum_{s' \in S} P_{ss'}^a \cdot v_*(s')$$

Combine these two equations and you have the Bellman Optimality Equation


$$v_*(s) = \max_a \left(R(s, a) + \gamma \sum_{s' \in S} P_{ss'}^a \cdot v_*(s')\right)$$

$$q_*(s,a) = R(s, a) + \gamma \sum_{s' \in S} P_{ss'}^a \cdot \max_{a'}q_*(s', a')$$
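Because of the max these equations are no longer linear, but they can be solved by repeatedly applying the optimality backup; this is the idea behind the Value Iteration algorithm mentioned in the summary. A minimal sketch on the same made-up toy MDP:

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])      # P[s, a, s'] (toy values)
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                     # R[s, a]
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v(s');  v(s) = max_a q(s,a)
    q = R + gamma * np.einsum('sax,x->sa', P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

greedy_policy = q.argmax(axis=1)   # pi_*(s): pick the action that maximizes q_*(s, a)
print(v, greedy_policy)
```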

6. MDP example

Here is an example to help you understand the MDP process

The example models a student as an MDP. The circle at the lower left is the starting point, and the square is the terminal state. The available actions are Study, Pub, Facebook, Quit, and Sleep, and the immediate reward $R$ for each state-action pair is marked in the figure. Our goal is to find the optimal action value function or state value function, and from it the optimal policy.

For convenience, we assume the discount factor $\gamma = 1$ and $\pi(a|s) = 0.5$ for every action.

For the terminal square, since it has no next state and no actions, its state value is 0. For the other four states we denote their values by $v_1, v_2, v_3, v_4$, corresponding to the circles at the upper left, lower left, lower middle, and lower right respectively. Based on


$$v_{\pi}(s) = \sum_{a \in A}\pi(a|s)\left(R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \cdot v_{\pi}(s')\right)$$

we can write down the value functions of all the states, obtaining the following equations:

  • For $v_1$: $v_1 = 0.5 \times (-1 + v_1) + 0.5 \times (0 + v_2)$
  • For $v_2$: $v_2 = 0.5 \times (-1 + v_1) + 0.5 \times (-2 + v_3)$
  • For $v_3$: $v_3 = 0.5 \times (0 + 0) + 0.5 \times (-2 + v_4)$
  • For $v_4$: $v_4 = 0.5 \times (10 + 0) + 0.5 \times (1 + 0.2 v_2 + 0.4 v_3 + 0.4 v_4)$

Solving this system gives $v_1 = -2.3,\ v_2 = -1.3,\ v_3 = 2.7,\ v_4 = 7.4$ (to one decimal place), which are the state values of the four states; a short numerical check follows.
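As a quick check, the four equations can be rearranged into a linear system $A v = b$ and handed to a solver; the sketch below reproduces the values quoted above (up to rounding):

```python
import numpy as np

# Move the unknowns to the left-hand side of each of the four equations:
#   0.5*v1 - 0.5*v2                    = -0.5
#  -0.5*v1 +     v2 - 0.5*v3           = -1.5
#                      v3 - 0.5*v4     = -1.0
#           -0.1*v2 - 0.2*v3 + 0.8*v4  =  5.5
A = np.array([
    [ 0.5, -0.5,  0.0,  0.0],
    [-0.5,  1.0, -0.5,  0.0],
    [ 0.0,  0.0,  1.0, -0.5],
    [ 0.0, -0.1, -0.2,  0.8],
])
b = np.array([-0.5, -1.5, -1.0, 5.5])

v = np.linalg.solve(A, b)
print(np.round(v, 1))   # approximately [-2.3, -1.3, 2.7, 7.4]
```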

We fixed the policy $\pi(a|s)$ above and computed the state value of each state, but this is not necessarily the optimal value function. So how do we find the optimal value function? Because this state machine is small, it is relatively easy to obtain the optimal state value function $v_*(s)$ or action value function $q_*(s,a)$ directly.

Take the action value function $q_*(s,a)$ as an example. Following the formula


$$q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'}q_*(s',a')$$

we can write down and solve the corresponding equations to obtain all the $q_*(s,a)$ values. Then, using $v_*(s) = \max_a q_*(s,a)$, we can find all the $v_*(s)$. The final result is shown in the figure below.

7. Summary of MDP

Although simple MDP problems can be solved directly with these equations, complex problems cannot, so we need other methods to solve them.

The next article will use dynamic programming to solve the MDP problem, covering two algorithms: Policy Iteration and Value Iteration. Stay tuned.

This article is the start of my reinforcement learning series. If you found it useful, please like, bookmark, and follow; more posts are on the way. Thanks again!
