What is reinforcement learning

References: David Silver's Reinforcement Learning course and project presentations; Zhou Bolei's Reinforcement Learning course; and the Chinese notes and practical materials for David Silver's open course on Reinforcement Learning.

Reinforcement learning studies how agents learn from their interactions with the environment by receiving feedback on success and failure, reward and punishment.

There is an important prerequisite: when an agent interacts with the environment, the environment must continually provide feedback, i.e. reinforcement or reward information, so that the agent can learn which behaviors earn positive rewards and which earn negative rewards, and adjust its behavior strategy accordingly.

This way of learning is very similar to how animals build conditioned responses based on reinforcement: they recognize pain and hunger as negative rewards, and pleasure and food as positive rewards, and act accordingly.

In reinforcement learning, "reward" is the general term covering both positive and negative rewards.

In some environments, rewards occur infrequently and may arrive late.

For example, in board games such as Go, chess, and Tic-Tac-Toe, the reward only appears at the end of the game. This is called the delayed credit assignment problem, and it is an important problem in reinforcement learning.

Reinforcement learning is one of the three main branches of machine learning, alongside supervised learning and unsupervised learning.

Characteristics of reinforcement learning:

  1. No supervised data, only a reward signal
  2. The reward signal is not necessarily immediate; it may be delayed, sometimes by a long time.
  3. Time (sequence) is an important factor
  4. Current behavior affects subsequent received data

Agent-environment interaction model

A basic interaction cycle between the environment and the agent: the agent perceives the current state of the environment, then decides on and executes an appropriate action; the environment responds by updating its state and generating a reward.

A reward is a scalar numerical signal that reflects how desirable the result of performing an action is.
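
To make the interaction cycle concrete, here is a minimal sketch in Python. The `Environment` and `Agent` classes, their method names, and the toy dynamics are all illustrative assumptions, not part of any particular library:

```python
# Minimal sketch of the observe -> act -> reward loop (toy dynamics, made up for illustration).
class Environment:
    def reset(self):
        return 0                                    # initial state observed by the agent

    def step(self, action):
        next_state = (action + 1) % 3               # toy state update
        reward = 1.0 if action == 2 else 0.0        # toy reward signal
        done = next_state == 0                      # toy termination condition
        return next_state, reward, done


class Agent:
    def act(self, state):
        return (state + 1) % 3                      # toy fixed behavior strategy


env, agent = Environment(), Agent()
state, total_reward = env.reset(), 0.0
while True:
    action = agent.act(state)                       # agent decides based on the perceived state
    state, reward, done = env.step(action)          # environment updates its state and emits a reward
    total_reward += reward
    if done:
        break
```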

Some basic concepts in reinforcement learning

  • Reward

    $R_t$ is a scalar feedback signal that reflects how well the individual is doing at time $t$. The individual's job is to maximize cumulative reward.

    Reinforcement learning is based on the “reward hypothesis” : all problem-solving goals can be described as maximizing cumulative rewards.

  • Sequential Decision Making

    • Goal: Select a set of behaviors to maximize overall future rewards
    • These behaviors can be a long-term sequence
    • Rewards can be, and often are, delayed
    • Sometimes it’s better to sacrifice immediate (short-term) rewards for more long-term rewards
  • Agent & Environment

    Reinforcement learning problems can be described from the perspectives of both the individual and the environment.

    At time t, the individual can:

    1. Receive an observation of the environment, $O_t$,
    2. Execute an action, $A_t$,
    3. Receive a reward signal from the environment.

    The environment can:

    1. Receive the individual's action $A_t$,
    2. Update environmental information while allowing the individual to get the next observation,
    3. Give the individual a reward signal.

  • History and State

    • history

      History is a sequence of observations, actions, and rewards: $H_t = O_1, R_1, A_1, \dots, O_{t-1}, R_{t-1}, A_{t-1}, O_t, R_t, A_t$

    • state

      A state is all the existing information used to determine the future. It is a function of the history: $S_t = f(H_t)$

    • Environment state

      The environment state is the environment's private representation, including all the data the environment uses to determine the next observation/reward.

      It is often not completely visible to the individual, that is, the individual sometimes does not know all the details of the state of the environment.

      Even when the state of the environment is completely visible to the individual, this information may contain irrelevant information.

    • The individual state

      Is the internal representation of the individual, including all the information that the individual can use to determine future actions.

      The individual state is the information that reinforcement learning algorithms can use; it can be a function of the history: $S^a_t = f(H_t)$

    • Information state

      The information state includes all useful information from the history; it is also known as the Markov state.

  • Markov Property

    A state $S_t$ is Markov if and only if: $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$

    The next state depends only on the current state, not on the earlier history.

  • Fully Observable environment

    Individuals can directly observe the state of the environment. In this case:

    Individual observation of the environment = individual state = environmental state

    Formally, this problem is a Markov Decision Process (MDP).

  • Partially Observable Environments

    The individual indirectly observes the environment

    Eg: A poker player can only see their own cards and other cards that have been played, but not the state of the entire environment (including the opponent’s cards).

    In this case:

    Individual state ≠ environment state

    Formally, this problem is a Partially Observable Markov Decision Process (POMDP).

    The individual must construct its own representation of the state, for example by remembering the complete history: $S^a_t = H_t$

    This approach is primitive and naive. There are other options, for example:

    1. Beliefs of the environment state: although the individual does not know the true environment state, it can use its existing experience (data) to represent its current individual state as a probability distribution over the environment states it knows of: $S^a_t = (P[S^e_t = s^1], \dots, P[S^e_t = s^n])$

    2. Recurrent neural network: feed the previous individual state and the current observation into a recurrent neural network (RNN) to obtain the current individual state, without using probabilities: $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$ (a small sketch of this recurrent update follows)
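
As a concrete illustration of the second option, here is a minimal numpy sketch of a recurrent update of the individual state. The dimensions, the weight matrices `W_s` and `W_o`, and the tanh nonlinearity are arbitrary assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, obs_dim = 8, 4
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))   # recurrent weights (assumed)
W_o = rng.normal(scale=0.1, size=(obs_dim, state_dim))     # observation weights (assumed)

def update_agent_state(prev_state, observation):
    """One recurrent update: S^a_t = sigma(S^a_{t-1} W_s + O_t W_o)."""
    return np.tanh(prev_state @ W_s + observation @ W_o)

s = np.zeros(state_dim)                   # initial individual state
for _ in range(5):
    o = rng.normal(size=obs_dim)          # stand-in for the observation O_t
    s = update_agent_state(s, o)
```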

Main components of a reinforcement learning individual

Individuals in reinforcement learning can consist of one or more of the following three components:

  • Strategy Policy

    A strategy is the mechanism that determines the individual's behavior. It is a mapping from state to behavior, and it can be deterministic or stochastic.

  • Value Function

    It is a prediction of future reward that evaluates how good or bad the current state is. Faced with two different states, an individual can use the value function to estimate the difference between the final rewards the two states may obtain, and use this to guide the choice of behavior, that is, the formulation of a strategy. A value function is always defined with respect to a specific strategy, and the value of the same state differs under different strategies. The value function under a strategy is expressed by the following formula: $v_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$

  • Model Model

    A model is the individual's representation of the environment, reflecting how the individual thinks the environment works. The individual hopes the model can simulate the interaction between the environment and the individual.

    The model should address at least two problems. One is the state transition probability, i.e., predicting the probability of each possible next state: $P^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$

    The other is predicting the possible immediate reward: $R^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$

    A model is not a required component of an individual, and in many reinforcement learning algorithms the individual does not attempt to build one (a toy sketch of all three components follows).
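
The three components can be written down concretely. The following toy sketch (states, actions, and numbers are invented) shows one way to represent a policy, a value function, and a model in Python:

```python
from typing import Dict, Tuple

State, Action = str, str

# Policy: for each state, a probability distribution over behaviors, pi(a|s).
policy: Dict[State, Dict[Action, float]] = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 1.0},
}

# Value function: a prediction of future reward for each state (under some policy).
value: Dict[State, float] = {"s0": 0.9, "s1": 0.0}

# Model: the individual's estimate of the environment's dynamics,
# transition probabilities P(s'|s,a) and expected immediate rewards R(s,a).
transition_model: Dict[Tuple[State, Action], Dict[State, float]] = {
    ("s0", "right"): {"s1": 1.0},
    ("s0", "left"): {"s0": 1.0},
}
reward_model: Dict[Tuple[State, Action], float] = {
    ("s0", "right"): 1.0,
    ("s0", "left"): 0.0,
}
```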

Classification of reinforcement learning individuals

Individuals can have a combination of tools to solve reinforcement learning problems

  • Solve the problem by establishing an estimate of the value of the state
  • Solve the problem by directly establishing an estimate of the strategy.

These are the tools in an individual's toolbox. According to which of these "tools" they contain, individuals can be divided into the following three categories:

  1. Value Based (value function only):

    Such an individual has a value estimation function for states but no explicit policy function; the policy is obtained indirectly from the value function.

  2. Policy Based (policy only):

    In such an individual, behavior is generated directly by the policy function, and the individual does not maintain a value estimate for each state.

  3. Actor-Critic:

    The individual has both a value function and a policy function, and the two work together to solve the problem.

In addition, individuals can be divided into two categories according to whether they build a model of environmental dynamics when solving reinforcement learning problems:

  1. Model-free individuals:

    Such individuals do not try to understand how the environment works; they focus only on value functions and/or policy functions.

  2. Model-based individuals:

    Individuals attempt to create a model that describes how the environment works to guide the updating of value or policy functions.

Learning & Planning

  • Learning: the environment is initially unknown; the individual does not know how the environment works and gradually improves its behavior strategy by interacting with it.
  • Planning: how the environment works is known, or nearly known, to the individual; rather than actually interacting with the environment, the individual improves its behavior strategy by performing calculations on the model it has built.

A common approach to reinforcement learning problems is to first learn how the environment works, that is, to learn a model, and then use that model for planning.

Exploration and Exploitation

Reinforcement learning resembles trial-and-error learning: the individual needs to discover a good strategy from its interactions with the environment without losing too much reward along the way. Exploration and exploitation are two aspects the individual must balance when making decisions.

An analogy is choosing a restaurant:

"Exploration" means you are interested in trying new restaurants, perhaps going to one you have never been to before;

"Exploitation" means you pick a favorite from the restaurants you have already eaten at, rather than trying one you have never visited.

These two approaches are usually contradictory, but both are very important for solving reinforcement learning problems.
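
One simple way to balance the two (not the only one) is ε-greedy action selection: with a small probability explore a random action, otherwise exploit the best-looking one. A minimal sketch, where `q_values` is an assumed list of estimated action values:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon try a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploitation
```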

Prediction & Control

In reinforcement learning, we often need to solve the prediction problem first, and then solve the control problem on that basis.

  • Prediction: Given a strategy, evaluate the future. It can be regarded as the process of solving the value function under a given strategy.
  • Control: Find a good strategy to maximize future rewards.

eg:

Prediction: Now given the rewards from A to A ‘and from B to B’, how to know the value of each position under the strategy of “move in 4 directions at random”.

Control: What is the optimal value function under all possible strategies under the same conditions? What is the optimal strategy?

Markov Decision Processes

A Markov decision process (MDP) formally describes a fully observable environment, i.e., the observed state completely characterizes everything needed for decision making. Almost all reinforcement learning problems can be formulated as MDPs.

Markov Process

Markov Property

A state is Markov if it contains all relevant information from the history: once the current state is known, the history is no longer needed, and the current state alone determines the future.

The Markov property can be described by the state transition probability: $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$. The state transition matrix defines the transition probabilities between all pairs of states:

$$P = \begin{pmatrix} p_{11} & \dots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{n1} & \dots & p_{nn} \end{pmatrix}$$

where n is the number of states and the elements of each row of the matrix sum to 1.

Markov Process

A Markov process, also called a Markov chain, is a memoryless random process.

It can be represented by a tuple <S, P>, where S is a finite set of states and P is the state transition probability matrix.

eg:

The circles represent the states the student can be in; the square Sleep is a terminal state; arrows represent transitions between states, and the number on each arrow is the probability of that transition.

When the student is in the first class (Class1), he/she has a 50% chance of attending the second class (Class2), and also a 50% chance of not paying attention to the lecture and browsing Facebook instead (FB).

While in the Facebook state, he/she has a 90% chance of continuing to browse at the next time step and a 10% chance of returning to the lecture.

When the student reaches the second class (Class2), there is an 80% chance of continuing to the third class (Class3) and a 20% chance of finding the course too difficult and quitting (Sleep).

When the student is in the third class (Class3), he/she has a 60% chance of passing the exam and then leaving the course (with probability 1), and a 40% chance of having to go look up reference material (the Pub state in the diagram); after that, depending on how well he/she understood the material, there is a 20%, 40%, and 40% chance of returning to the first, second, and third class respectively to continue studying.

A student Markov chain episode starts in state Class1 and ends in Sleep. According to the state transition diagram there are many possible trajectories, each of which is called a Sample Episode.

The following four Episodes are possible:

  • C1 – C2 – C3 – Pass – Sleep
  • C1 – FB – FB – C1 – C2 – Sleep
  • C1 – C2 – C3 – Pub – C2 – C3 – Pass – Sleep
  • C1 – FB – FB – C1 – C2 – C3 – Pub – C1 – FB – FB – FB – C1 – C2 – C3 – Pub – C2 – Sleep

The state transition matrix of the student Markov process is shown as follows:
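
Since the matrix itself appears as a figure, here is a reconstruction from the probabilities described above (the row order and exact layout are my own choice), together with episode sampling:

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3  Pass  Pub   FB  Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # Class1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # Class2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # Class3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (terminal, self-loop)
])
assert np.allclose(P.sum(axis=1), 1.0)      # each row of the transition matrix sums to 1

def sample_episode(rng, start="C1"):
    """Sample one episode from Class1 until the terminal Sleep state."""
    s, episode = states.index(start), [start]
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode(np.random.default_rng(0)))
```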

Markov Reward Process

A Markov reward process adds a reward R and a discount factor γ to a Markov process. It can be represented by a tuple <S, P, R, γ>.

R is a reward function: the reward of state s is the expectation of the reward obtained at the next time step (t+1) given that the state at time t is s:

$R_s = E[R_{t+1} \mid S_t = s]$

Discount Factor: $\gamma \in [0, 1]$

The figure below shows an example of a Markov reward process, which adds a reward to each state of the Markov process above. Since no calculation involving the discount factor is needed here, the figure does not specify a value for γ.

Harvest Return

Definition: the harvest $G_t$ (also translated as "gain" or "return") is the discounted sum of all rewards along a Markov reward chain starting from time t. The formula is:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

When γ = 0, the agent is short-sighted: it only cares about maximizing the immediate reward. The closer γ gets to 1, the more far-sighted the agent is, because rewards further in the future carry more weight.

Even over an infinite horizon, as long as γ < 1 and the rewards $R_t$ are bounded, $G_t$ is finite. In some cases, when it does not cause $G_t$ to become infinite, γ = 1 may be used.
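
Computing the harvest of a sampled chain is a short backward recursion over the rewards. A minimal sketch; the example rewards assume -2 for each class and +10 for Pass, as in the standard student example the figures here refer to:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..., accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Rewards along the chain C1 - C2 - C3 - Pass - Sleep (assumed values), gamma = 1/2:
print(discounted_return([-2, -2, -2, 10], gamma=0.5))   # -2 - 1 - 0.5 + 1.25 = -2.25
```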

Value Function

A value function gives the long-term value of a state or behavior.

Definition: the value function of a state in a Markov reward process is the expected harvest of the Markov chain starting from that state: $v(s) = E[G_t \mid S_t = s]$

Note: value can be attached to a state alone, to a behavior taken in a certain state, or, in some special cases, to a behavior alone. Throughout the open video course, unless specified otherwise, the convention is that "state value function" or simply "value function" describes the value of a state, while "behavior value function" describes the value of performing a certain behavior in a certain state. Strictly speaking, the behavior value function is short for the value function of a "state-behavior pair".

Illustrate harvest and value calculations

For the convenience of calculation, the sample graph of “Student Markov Reward Process” is expressed in the form of the following table.

The second row of the table gives the immediate reward of each state; the numbers elsewhere are state transition probabilities, i.e., the probability of moving from the state of the row to the state of the column:

Consider the following four Markov chains. With γ = 1/2, the harvest of state $S_1$ at t = 1 (where $S_1 = C_1$) is calculated as follows:

It can also be understood from the above table that harvest is for a certain state in a Markov chain.

When γ = 0, the value of each state equals its immediate reward.

When γ ≠ 0, the value of each state must be computed. The figure below shows the value of each state when γ is 0, 0.9, and 1, respectively.

The number inside each status circle indicates the value of the status, while R=-2 outside the circle indicates the immediate reward for the status.

Derivation of value function

  • Bellman equation

Try the definition formula for value first and see what you get:

In the last line of the derivation, $G_{t+1}$ becomes $v(S_{t+1})$.

This holds because the expectation of the return equals the expectation of its conditional expectation (the law of iterated expectations). The resulting Bellman equation is:

$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$

From the equation we can see that $v(s)$ consists of two parts:

One is the immediate reward expectation of that state, which is equal to the immediate reward, because by the definition of the immediate reward, it is independent of the next state;

The other is the value expectation of the state at the next moment, which can be obtained according to the probability distribution of the state at the next moment.

If s' denotes any possible successor state of s, the Bellman equation can be written as follows:

$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s')$

  • Interpretation of equations

The value of state C3 can be calculated from the values of the Pub and Pass states and the corresponding transition probabilities:

$4.3 = -2 + 1.0 \times (0.6 \times 10 + 0.4 \times 0.8)$

  • Matrix form and solution of Bellman equation

Written in matrix form, the Bellman equation is easier to grasp:

$v = R + \gamma P v$

Since the Bellman equation is a linear system of equations, it can in principle be solved directly:

$v = (I - \gamma P)^{-1} R$

In practice the computational complexity is $O(n^3)$, so the direct solution is only suitable for small problems.
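
A minimal sketch of the direct solution for a small MRP. The 2-state `P` and `R` below are placeholders; the student-chain matrix above with its reward vector could be plugged in instead:

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Solve the Bellman equation v = R + gamma * P v exactly: v = (I - gamma*P)^(-1) R."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

P = np.array([[0.0, 1.0],
              [0.5, 0.5]])
R = np.array([1.0, -1.0])
print(solve_mrp(P, R, gamma=0.9))   # exact state values, O(n^3) cost
```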

Large-scale problems usually use iterative methods. Common iterative methods are:

  • Dynamic Programming
  • Monte Carlo Evaluation
  • Temporal-Difference learning

Markov Decision Process

Compared with a Markov reward process, a Markov decision process adds a finite set of actions A.

It is a tuple $\langle S, A, P, R, \gamma \rangle$. It looks very similar to a Markov reward process, but here P and R are conditioned on a specific action a,

whereas in a Markov reward process they correspond only to the state. The specific mathematical expressions are:

$P^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$

$R^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$

  • Example — Student MDP

The following figure shows a possible MDP state transition diagram.

The red text in the figure represents the action taken, not the previous state name.

Strategy Policy

A policy π is a distribution over actions given states: $\pi(a \mid s)$ denotes the probability of taking a possible behavior a in state s, i.e., $\pi(a \mid s) = P[A_t = a \mid S_t = s]$. A policy completely defines an individual's behavior; that is, it defines the possible behaviors of the individual in every state and their probabilities.

A policy depends only on the current state, not on the history. A given policy is also stationary, i.e., independent of time, although the individual can update its policy over time.

Given an MDP $M = \langle S, A, P, R, \gamma \rangle$ and a policy π:

  • The state sequence $S_1, S_2, S_3, \dots$ is a Markov process $\langle S, P^\pi \rangle$

  • The state and reward sequence $S_1, R_2, S_2, R_3, S_3, \dots$ is a Markov reward process $\langle S, P^\pi, R^\pi, \gamma \rangle$

    In this reward process the following two equations hold (a small sketch appears after this list).

    The transition probability under the policy is:

    $P^\pi_{ss'} = \sum_{a \in A} \pi(a \mid s) P^a_{ss'}$

    In words: the probability of moving from state s to state s' is the sum, over the actions available under the current policy, of the probability of taking each action times the probability that the action moves the state from s to s'.

    The reward function is expressed as follows:

    $R^\pi_s = \sum_{a \in A} \pi(a \mid s) R^a_s$

    In words: the immediate reward obtained in state s while following the policy is the sum of the rewards of all possible actions under that policy, each weighted by the probability of the action occurring.
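
As a sketch of the two identities (array shapes and toy numbers are my own assumptions): given the action-conditioned `P_a[a, s, s']` and `R_a[a, s]` of an MDP and a policy `pi[s, a]`, the induced MRP quantities are weighted sums over actions:

```python
import numpy as np

def mdp_to_mrp(P_a, R_a, pi):
    """Collapse an MDP plus a policy into an MRP: P^pi and R^pi."""
    P_pi = np.einsum("sa,ast->st", pi, P_a)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
    R_pi = np.einsum("sa,as->s", pi, R_a)     # R^pi_s    = sum_a pi(a|s) R^a_s
    return P_pi, R_pi

# Toy 2-state, 2-action MDP and a uniform random policy
P_a = np.array([[[1.0, 0.0], [0.0, 1.0]],     # transitions under action 0
                [[0.0, 1.0], [1.0, 0.0]]])    # transitions under action 1
R_a = np.array([[0.0, 1.0],                   # rewards for action 0 in s0, s1
                [1.0, 0.0]])                  # rewards for action 1 in s0, s1
pi = np.full((2, 2), 0.5)
P_pi, R_pi = mdp_to_mrp(P_a, R_a, pi)
```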

Value function based on strategy π

$v_\pi(s)$ is defined as the state value function under policy π in an MDP.

It represents the expected harvest obtained by following the current policy starting from state s; equivalently, it measures the value of being in state s for an individual executing policy π. Mathematically:

$v_\pi(s) = E_\pi[G_t \mid S_t = s]$

Note: Strategies are used to describe the probability of performing different behaviors in different states.

$q_\pi(s, a)$, the behavior value function, is defined as the expected harvest obtained by taking a specific behavior a in the current state s while executing policy π.

The behavior value function is generally tied to a specific state; a more precise name is the state-behavior pair value function. The formula of the behavior value function is:

$q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$

Bellman Expectation Equation

The state value function and behavior value function under an MDP, like the value function of an MRP, can be expressed in terms of the value function at the next time step. The specific equations are:

$v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

$q_\pi(s, a) = E_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$

  • The relationship between $v_\pi(s)$ and $q_\pi(s, a)$

    In the image above, the larger hollow circles represent states, and the smaller black solid circles represent actions themselves,

    When policy π is followed, the value of state s is the sum, over all possible actions in that state, of the probability of each action under the policy times that action's value:

    $v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)$

    Similarly, the behavior value function can be expressed in terms of the state value function.

    It shows that the value of taking an action in a certain state consists of two parts:

    • one is the immediate reward obtained on leaving the state,
    • the other is the sum of the values of all possible successor states, each weighted by its transition probability:

    $q_\pi(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')$

    Combining the two gives:

    $v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_\pi(s') \right)$

You can also get the following result:

$q_\pi(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a')$

The figure below shows how the state value of the red hollow-circle state is calculated under the random policy, where all possible actions are chosen with equal probability.

  • Matrix form of the Bellman expectation equation

    Under a fixed policy the Bellman expectation equation has the same matrix form as in the MRP case: $v_\pi = R^\pi + \gamma P^\pi v_\pi$, with direct solution $v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$.

Optimal value functions and optimal policies

  • Optimal value function

    The optimal state value function $v_*(s)$ is the maximum, over the state value functions generated by all policies, of the value of state s:

    $v_*(s) = \max_\pi v_\pi(s)$

    Similarly, the optimal behavior value function $q_*(s, a)$ is the maximum, over the behavior value functions generated by all policies, of the value of the state-behavior pair $\langle s, a \rangle$:

    $q_*(s, a) = \max_\pi q_\pi(s, a)$

  • The optimal strategy

    Policy π is at least as good as policy π' when, for every state s, the value of following π is not less than the value of following π':

    $\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s),\ \forall s$

    Theorem For any MDP, the following holds:

    • There is an optimal strategy that is better than or at least equal to any other strategy;
    • All optimal strategies have the same optimal value function;
    • All optimal strategies have the same behavior value function.
  • Find the optimal strategy

    The optimal policy can be found by maximizing the optimal behavior value function (a sketch of this extraction step follows):

    $\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in A}{\arg\max}\ q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$
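
A minimal sketch of this extraction, with `q_star` an assumed table of optimal action values of shape (number of states, number of actions):

```python
import numpy as np

q_star = np.array([[1.0, 2.0],    # made-up q_* values
                   [0.5, 0.1],
                   [3.0, 3.0]])
pi_star = q_star.argmax(axis=1)   # pi_*(s): index of the maximizing action in each state
print(pi_star)                    # [1 0 0]
```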

Bellman Optimality Equation

The optimal state value $v_*(s)$ equals the largest of the behavior values of all actions that can be taken from that state:

$v_*(s) = \max_a q_*(s, a)$

For $q_*$, the optimal value of taking a certain behavior in a certain state s consists of two parts:

  • Part of it is the immediate reward for leaving state S,

  • The other part is the sum of the optimal state values of all reachable successor states s', weighted by their transition probabilities:

$q_*(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s')$

Combining the two, for $v_*$ we have:

$v_*(s) = \max_a \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s') \right)$

For $q_*$, we have:

$q_*(s, a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \max_{a'} q_*(s', a')$

  • Solve Bellman’s optimal equation

The Bellman optimality equation is nonlinear and has no general closed-form solution. It is solved by iterative methods such as value iteration, policy iteration, Q-learning, and Sarsa.

Dynamic programming seeks optimal strategies

Introduction

Dynamic programming is a method for solving complex problems: it decomposes a complex problem into sub-problems and obtains the solution to the whole problem by solving the sub-problems.

When solving subproblems, the results often need to be stored and used to solve subsequent complex problems.

Dynamic programming is usually considered when the problem has the following characteristics:

  • The optimal solution of a complex problem consists of the optimal solution of several small problems. The optimal solution of the complex problem can be obtained by searching for the optimal solution of the sub-problem.
  • Subproblems are repeated in complex problems so that the solutions of subproblems can be stored and reused.

A Markov decision process (MDP) has both of the above properties: the Bellman equation recursively decomposes the problem into sub-problems, and the value function stores the solutions of sub-problems so they can be reused. Therefore, dynamic programming can be used to solve MDPs.

We use dynamic programming algorithms to solve a class of problems called planning. "Planning" refers to finding the optimal strategy when the whole MDP is understood, i.e., when the model structure is known: the state and behavior spaces, the transition matrix, the rewards, and so on. Such problems are not typical reinforcement learning problems; planning can be used for both prediction and control.

The mathematical description is as follows:

**Prediction:** given an MDP $\langle S, A, P, R, \gamma \rangle$ and a policy π, or given an MRP $\langle S, P^\pi, R^\pi, \gamma \rangle$, output the value function $v_\pi$ under the current policy π.

**Control:** given an MDP $\langle S, A, P, R, \gamma \rangle$, determine the optimal value function $v_*$ and the optimal policy $\pi_*$.

Iterative Policy Evaluation

Theory

Problem: evaluate a given policy π; this is the "prediction" problem.

**Solution:** iteratively apply the Bellman expectation equation as a backup.

Specific methods:

Synchronous backup: at each iteration k+1, the value of every state s is calculated and updated using $v_k(s')$, where s' is a successor state of s.

By repeated iteration this method converges to $v_\pi$.

It is also possible to back up asynchronously, i.e., within an iteration, update a state's value using successor-state values that have already been updated in the current sweep.

The formula is:

$v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s) \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}\, v_k(s') \right)$

That is, in one iteration the new value of state s equals the immediate reward in that state plus the values of the possible successor states s' from the previous iteration, weighted by their probabilities.

The matrix form of the formula is: $v^{k+1} = R^\pi + \gamma P^\pi v^k$

example

Grid World

Known:

State space S: as shown in the figure. $s_1$ to $s_{14}$ are non-terminal states; the terminal state $s_T$ is shown as the two gray squares (it occupies two positions in the grid);

Behavior space A: {n, e, s, w}; from any non-terminal state the four moves north, east, south, and west are available;

**Transition probability P:** any behavior that would leave the grid world leaves the position unchanged; otherwise the individual moves with probability 1 to the state the behavior points to;

**Immediate reward R:** any transition between non-terminal states yields an immediate reward of -1; entering the terminal state yields an immediate reward of 0;

Discount factor γ: 1;

**Current policy π:** the individual follows a uniform random policy; in any non-terminal state it takes each of the four moves with equal probability, i.e., $\pi(n \mid \cdot) = \pi(e \mid \cdot) = \pi(s \mid \cdot) = \pi(w \mid \cdot) = 1/4$.

**Problem:** evaluate the given policy in this grid world.

This problem is equivalent to solving the value function of (state) in the grid world under a given strategy, that is, solving the value of each state in the grid world under a given strategy.

Iterative solution (iterative strategy evaluation)

State values converge after the 153rd iteration
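
A minimal sketch of synchronous iterative policy evaluation on this grid world, under the assumptions given above (4x4 grid, the two gray terminal cells taken to be opposite corners, reward -1 per move, γ = 1, uniform random policy):

```python
import numpy as np

N = 4
terminals = {(0, 0), (N - 1, N - 1)}            # assumed positions of the two gray squares
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # n, s, w, e

def step(cell, a):
    """Deterministic move; attempting to leave the grid leaves the position unchanged."""
    r, c = cell[0] + a[0], cell[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else cell

v = np.zeros((N, N))
for k in range(1000):
    new_v = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue
            # v_{k+1}(s) = sum_a pi(a|s) * ( -1 + gamma * v_k(s') ), with gamma = 1
            new_v[r, c] = sum(0.25 * (-1.0 + v[step((r, c), a)]) for a in actions)
    delta = np.max(np.abs(new_v - v))
    v = new_v
    if delta < 1e-4:
        break

print(np.round(v, 1))   # converged state values under the random policy
```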

  • How to improve the policy

Through the grid world example we obtain a method for optimizing the policy, in two steps:

First we iteratively update the value function under a given strategy:

Then, act greedily with respect to the current value function, i.e., at each state choose the behavior that leads to the highest-valued successor state:

In the grid world, value iteration under the given policy happens to converge to the optimal policy, but finding the optimal policy after a single round of value calculation combined with policy improvement is not the general case. Usually the evaluate-and-improve cycle must be repeated many times. However, this method always converges to the optimal policy $\pi_*$.

Policy Iteration

The value function v is iterated under the current policy, and the policy is then greedily updated according to the new values. Repeating this many times eventually yields the optimal policy $\pi_*$ and the optimal state value function $v_*$.

Greed refers to taking only those actions that maximize the value of the state.

  • Policy improvement
  1. Consider a deterministic policy: $a = \pi(s)$

  2. Improve the policy by acting greedily: $\pi'(s) = \underset{a \in A}{\arg\max}\ q_\pi(s, a)$

  3. Update the value of the new behavior value function q

    After one round of iteration, there will be a new strategy. According to the new strategy, a new behavior value function Q will be obtained

  4. Iterate until the value function converges to the optimum

Sometimes there is no need to keep iterating all the way to the optimal value function; conditions can be set to terminate the iteration early, such as a threshold ε on the squared difference between the value functions of two consecutive iterations, a fixed number of iterations, or updating the policy after every single iteration. A sketch of the full policy-iteration loop follows.
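
A minimal sketch of policy iteration for a small tabular MDP given as arrays `P[a, s, s']` and `R[a, s]` (the shapes, the fixed number of evaluation sweeps, and the toy MDP at the bottom are my own assumptions):

```python
import numpy as np

def policy_iteration(P, R, gamma, n_eval_sweeps=100):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation: sweep the Bellman expectation equation under pi
        v = np.zeros(n_states)
        for _ in range(n_eval_sweeps):
            v = np.array([R[pi[s], s] + gamma * P[pi[s], s] @ v for s in range(n_states)])
        # 2. Policy improvement: act greedily with respect to q(s, a) = R + gamma * P v
        q = R + gamma * (P @ v)                   # shape (n_actions, n_states)
        new_pi = q.argmax(axis=0)
        if np.array_equal(new_pi, pi):            # policy stable -> done
            return pi, v
        pi = new_pi

# Toy 2-state, 2-action MDP
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 2.0]])
print(policy_iteration(P, R, gamma=0.9))
```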

Value iteration

Value iteration is also derived from the Bellman optimality equation, but it turns the Bellman optimality equation into an update rule and applies it repeatedly; no explicit policy takes part in the iteration.

  1. Iterate to find the optimal value function: repeatedly apply the Bellman optimality equation (i.e., maximize over the q function) to obtain the optimal $v_*$
  2. Optimal policy extraction: rebuild the q function from $v_*$ and take the argmax; the maximizing action in each state is the chosen policy (see the sketch below)
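
A minimal, self-contained sketch of these two steps on the same grid world as before (same assumptions: 4x4 grid, terminal corner cells, reward -1 per move, γ = 1):

```python
import numpy as np

N = 4
terminals = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # n, s, w, e

def step(cell, a):
    r, c = cell[0] + a[0], cell[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else cell

# 1. Iterate the Bellman optimality equation until v stops changing
v = np.zeros((N, N))
for k in range(1000):
    new_v = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue
            # v_{k+1}(s) = max_a ( -1 + v_k(s') )   (deterministic moves, gamma = 1)
            new_v[r, c] = max(-1.0 + v[step((r, c), a)] for a in actions)
    if np.allclose(new_v, v):
        break
    v = new_v

# 2. Policy extraction: in each state pick the action whose backup achieves the maximum
policy = {
    (r, c): max(actions, key=lambda a: -1.0 + v[step((r, c), a)])
    for r in range(N) for c in range(N) if (r, c) not in terminals
}
```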

Compare Policy Iteration with Value Iteration

Both approaches are designed to solve the MDP control problem.

Policy Iteration consists of Policy Evaluation and Policy Improvement; the two steps alternate, which is more intuitive.

Value Iteration solves directly and iteratively using the Bellman optimality equation. At the end there is a policy extraction step, which obtains the final policy from the action values.

If it is a Prediction problem, it can be solved by iterating the Bellman expectation equation directly.

If it is a Control problem, one can either combine the Bellman expectation equation with iterative policy improvement (policy iteration), or carry out value iteration directly using the Bellman optimality equation.