Markov process-Markov reward process-Markov decision process

Overview

Markov Decision Processes (MDPs) are a mathematical formalization of reinforcement learning problems.

  • The environment is required to be fully observable.

Markov property

“The future is independent of the past given the present: once the current state is known, all information from the past can be thrown away.”

Definition: the state $S_t$ at time $t$ is called a Markov state, i.e., it satisfies the Markov property, if the following equation holds:

$$
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, S_2, \ldots, S_t]
$$
Note:

  • The state contains all relevant historical information, i.e., all previous information is summarized in the current state (it can replace all the previous states).
  • Therefore, the environment is required to be fully observable (if it is only partially observed, some state information is missing).
  • Whether the Markov property is satisfied depends closely on how the state is defined.

Example:

  • Playing chess
  • Tetris

With a Markov state:

  • a state transition matrix can be defined;
  • time can be ignored, focusing only on the present moment.



State transition matrix

State transition probability

The state transition probability is the probability of moving from a Markov state $s$ to a successor state $s'$. It is a conditional probability distribution given the current state:

$$
P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]
$$

State transition matrix

If the state space is discrete (finite):

  • the current states index the rows,
  • the successor states index the columns,

which gives the state transition matrix

$$
\mathcal{P} =
\begin{bmatrix}
P_{11} & \cdots & P_{1n} \\
\vdots & \ddots & \vdots \\
P_{n1} & \cdots & P_{nn}
\end{bmatrix}
$$

  • $n$ is the number of states;
  • each row sums to 1 (see the sketch below).
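
As a concrete sketch (the states and probabilities below are made up purely for illustration), such a matrix can be represented as a NumPy array whose rows index the current state and whose columns index the successor state:

```python
import numpy as np

# Hypothetical 3-state Markov chain; the numbers are illustrative only.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.80, 0.15, 0.05],  # transitions out of "sunny"
    [0.30, 0.40, 0.30],  # transitions out of "cloudy"
    [0.10, 0.30, 0.60],  # transitions out of "rainy"
])

# Each row is a distribution over successor states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# P[i, j] = P(S_{t+1} = states[j] | S_t = states[i])
print(P[0, 2])  # probability of moving from "sunny" to "rainy"
```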

State transition function

If the number of states is very large, or infinite (a continuous state space), it is more appropriate to use the **functional form** of the transition probability given at the top of this section:

$$
\mathcal{P}(s' \mid s) = \mathbb{P}[S_{t+1} = s' \mid S_t = s]
$$

  • In this case, $\sum_{s'} \mathcal{P}(s' \mid s) = 1$ (an integral over $s'$ in the continuous case; see the sketch below).
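
A minimal sketch of the functional form, assuming (purely for illustration) Gaussian transition dynamics around the current state:

```python
import numpy as np

def transition_density(s_next: float, s: float, sigma: float = 1.0) -> float:
    """Illustrative transition density p(s' | s): a Gaussian centred on the current state."""
    return np.exp(-(s_next - s) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Numerical check that the density integrates to ~1 over successor states.
grid = np.linspace(-10.0, 10.0, 10001)
dx = grid[1] - grid[0]
print(sum(transition_density(x, s=0.0) for x in grid) * dx)  # ~1.0
```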

Markov process

Definition

A Markov process (MP) is a memoryless random process, that is, a sequence of Markov states.

A Markov process can be defined by a 2-tuple $\langle \mathcal{S}, \mathcal{P} \rangle$:

  • $\mathcal{S}$: the set of states
  • $\mathcal{P}$: the state transition matrix

It is usually assumed that $\mathcal{P}$ exists and is stationary; when $\mathcal{P}$ is non-stationary, methods such as online learning and fast learning are adopted instead.

An example of a Markov process

  • A Markov process can terminate in two ways:
    • the stopping time is reached, or
    • a terminal state is reached.

Episode

Definition: in reinforcement learning, an episode is the sequence of states from an initial state to a terminal state.
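
A minimal sketch of sampling an episode from a Markov process, assuming a small hand-made chain in which state 2 is an absorbing terminal state and a step limit plays the role of the stopping time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain: states 0 and 1 are ordinary, state 2 is terminal (absorbing).
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],
])
TERMINAL = 2

def sample_episode(start: int, max_steps: int = 100) -> list:
    """Sample a state sequence from the initial state until termination."""
    episode = [start]
    s = start
    for _ in range(max_steps):        # termination by reaching the step limit ("stopping time")
        if s == TERMINAL:             # termination by reaching the terminal state
            break
        s = int(rng.choice(len(P), p=P[s]))
        episode.append(s)
    return episode

print(sample_episode(start=0))
```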



Markov reward process

Definition

A Markov reward process is obtained from a Markov process by assigning reward values to the transitions between states.

A Markov Reward Process (MRP) consists of a 4-tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:

  • $\mathcal{S}$: the set of states
  • $\mathcal{P}$: the state transition matrix
  • $\mathcal{R}$: the reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$, describing the reward obtained in state $s$
  • $\gamma$: the discount factor, $\gamma \in [0, 1]$

Return value

  • Reward: the evaluation of a single state
  • Return: the evaluation of an entire episode

The return $G_t$ is the cumulative discounted reward starting at time $t$:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$
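
A short sketch of computing the discounted return from a sampled reward sequence (the rewards and the value of $\gamma$ are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for the rewards after time t."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards observed after time t in one sampled episode.
print(discounted_return([1.0, 0.0, 2.0, -1.0], gamma=0.9))
```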
Value functions in MRPs

Why a value function? The return is computed from a single episode, so it varies greatly from sample to sample and is indexed by the time $t$; the value function instead characterizes the state $s$ itself, as an expectation over returns.

The value function of an MRP is defined as follows:

$$
v(s) = \mathbb{E}[G_t \mid S_t = s]
$$

The value function here is a function of the state $s$, so it is called the state value function, also known as the V-function.
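
Because the value function is an expectation over returns, it can be approximated by Monte Carlo sampling of episodes. A minimal sketch on a made-up 3-state MRP (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up MRP: state 2 is terminal; R[s] is the reward collected while in state s.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
GAMMA, TERMINAL = 0.9, 2

def mc_value_estimate(s0: int, n_episodes: int = 5000) -> float:
    """Estimate v(s0) = E[G_t | S_t = s0] by averaging the returns of sampled episodes."""
    returns = []
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        while s != TERMINAL:
            g += discount * R[s]              # reward associated with the current state
            discount *= GAMMA
            s = int(rng.choice(3, p=P[s]))    # sample the successor state
        returns.append(g)
    return float(np.mean(returns))

print(mc_value_estimate(0))
```

For this toy chain the estimate should agree closely with the direct solution sketched later in the section.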

Bellman equation in MRPs (important)

$$
v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]
$$
The value function of the current state consists of two parts:

  • first term: the immediate reward;
  • second term: the value function of the successor state, multiplied by the discount factor.

Since there may be multiple successor states, if the transition matrix is known, then

$$
v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}\, v(s')
$$

The matrix-vector form is:

$$
v = \mathcal{R} + \gamma \mathcal{P} v
$$

This is essentially a linear system that can be solved directly:

$$
v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}
$$
The direct solution only applies to small MRPs:

  • the computational complexity is $O(n^3)$ in the number of states $n$;
  • it requires $\mathcal{P}$ and $\mathcal{R}$ to be known (see the sketch below).
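
A minimal sketch of the direct solution on the same made-up 3-state MRP as above, using NumPy to solve the linear system $(I - \gamma \mathcal{P})\, v = \mathcal{R}$:

```python
import numpy as np

# Made-up 3-state MRP; state 2 is terminal.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```

The $O(n^3)$ cost mentioned above is the cost of solving this dense linear system.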

Markov decision process

In both MP and MRP, we act as observers to observe state transitions and calculate returns. For an RL problem, we would prefer to change the state transition process to maximize the return value. Therefore, Markov Decision Processes (MDPs) are obtained by introducing decisions into MRP.

Definition

A Markov decision process (MDP) consists of a 5-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:

  • $\mathcal{S}$: the set of states
  • $\mathcal{A}$: the set of actions
  • $\mathcal{P}$: the state transition matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • $\mathcal{R}$: the reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, representing the reward for taking action $a$ in state $s$
  • $\gamma$: the discount factor

Policy

In MDPs, a policy $\pi$ is a probability distribution over actions given a state:

$$
\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]
$$

  • The policy is stationary (time-invariant): it depends only on $s$, not on the time $t$.
  • Finding a good policy is the ultimate goal of an RL problem.
  • If the distribution is one-hot, the policy is deterministic; otherwise it is stochastic.

The relationship between MDPs and MRPs

If an MDP is given a fixed policy $\pi$, it degenerates into an MRP with transition matrix $\mathcal{P}^\pi_{ss'} = \sum_{a} \pi(a \mid s)\, \mathcal{P}^a_{ss'}$ and reward function $\mathcal{R}^\pi_s = \sum_{a} \pi(a \mid s)\, \mathcal{R}^a_s$, as sketched below.
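
A short sketch of this degeneration with made-up numbers: the per-action transition matrices and rewards are averaged under the policy to produce the MRP quantities $\mathcal{P}^\pi$ and $\mathcal{R}^\pi$:

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions (all numbers are illustrative).
# P[a] is the transition matrix under action a, R[a] the reward vector under action a.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]],   # action 0
    [[0.2, 0.7, 0.1], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([
    [0.0, 1.0, 0.0],    # rewards for action 0 in states 0, 1, 2
    [2.0, -1.0, 0.0],   # rewards for action 1 in states 0, 1, 2
])

# Stochastic policy: pi[s, a] = probability of choosing action a in state s.
pi = np.array([[0.5, 0.5],
               [0.9, 0.1],
               [1.0, 0.0]])

# Induced MRP: P_pi[s, s'] = sum_a pi(a|s) P^a[s, s'],  R_pi[s] = sum_a pi(a|s) R^a[s]
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)

assert np.allclose(P_pi.sum(axis=1), 1.0)   # rows of the induced matrix are still distributions
print(P_pi)
print(R_pi)
```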

Value functions in MDPs

  1. State value function (V-function)

    • Definition: the expected return obtained by starting from state $s$ and then following policy $\pi$:

$$
v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
$$

  2. State-action value function (Q-function)

    • Definition: the expected return obtained by starting from state $s$, taking action $a$, and thereafter following policy $\pi$:

$$
q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
$$

      Action $a$ does not have to come from policy $\pi$; the policy is only used for action selection after action $a$ has been taken.

Bellman expectation equation

Similar to the MRP case, the value functions in MDPs can be decomposed into two parts: the immediate reward and the discounted value function of the successor state.
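
Written out, this decomposition gives the Bellman expectation equations (standard forms, with the relation between the two value functions included for completeness):

$$
v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right],
\qquad
q_\pi(s, a) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]
$$

$$
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)
$$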