Author | Nathan Lambert · Source | Towards Data Science

Understanding value iteration and policy iteration.

This post focuses on understanding the basic MDP (briefly reviewed here) and applying it to fundamental reinforcement learning methods. The methods I will focus on are “value iteration” and “policy iteration”. These two methods underlie Q-value iteration, which leads directly to Q-learning.

You can read some of my previous posts (intentionally written to stand on their own):

  1. What is the Markov decision process? (towardsdatascience.com/what-is-a-m…).
  2. Reinforcement learning linear algebra (towardsdatascience.com/the-hidden-…).

Q-learning sparked the wave of deep reinforcement learning we are in today, and it is an important part of the curriculum for any student of reinforcement learning.

Review of Markov decision processes

Markov decision processes (MDPs) are the stochastic models underpinning reinforcement learning (RL). You can skip this section if you are already familiar with them, but I’ve added a few explanations.

Definitions

  • The set of states $s \in S$ and the set of actions $a \in A$. The states and actions are the collection of all possible positions and moves of an agent. In advanced reinforcement learning, the states and actions become continuous, which requires rethinking our algorithms.

  • The transition function T(s, a, s’). Given the current state and an action, T determines how frequently each next state occurs. In reinforcement learning we do not have access to this function, so these methods attempt to learn it approximately or implicitly from sampled data.

  • The reward function R(s, a, s’). This function tells you how much reward each step yields. In reinforcement learning we do not have access to this function either, so we learn from sampled values r, which lets the algorithm explore the environment and then exploit the optimal trajectories.

  • The discount factor gamma (γ, in the range [0,1]), which weights the value of the next step against future rewards. In reinforcement learning, gamma controls the convergence of most learning algorithms and of Bellman-based optimization.

  • An initial state s0, and possibly a terminal state.

The important values

An MDP has two important quantities: the value of a state and the Q-value of a chance node (a state-action pair). A * on any value in an MDP or in RL denotes the optimal quantity.

  • Value of a state: the value of a state is the optimal recursive sum of discounted rewards starting from that state.

  • Q-value of a state-action pair: the Q-value is the optimal sum of discounted rewards associated with that state-action pair.

The optimal value is coupled to the optimal action-conditioned Q-value, and the value and Q-value update rules are very similar (a transition-weighted sum of rewards and discounted values). [Figure: Top: coupling of value and Q-value; Middle: recursion of the Q-value; Bottom: iteration of the value. Reference: inst.eecs.berkeley.edu/~cs188/sp20…]
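
In standard notation, the three relations the figure refers to are the coupling of the value and Q-value, the Q-value recursion, and the value iteration update:

$$V^*(s) = \max_a Q^*(s, a)$$

$$Q^*(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_k(s')\right]$$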

Leading up to reinforcement learning

Value iteration

We learn the values of all the states, and then we can act according to them. Value iteration learns the values of the states directly from the Bellman update. Under some non-restrictive conditions, the Bellman update is guaranteed to converge to the optimal values.

Learning a policy may be more direct than learning values. Learning values can take nearly infinite time to converge to the numerical precision of a 64-bit float (consider a moving average toward a constant: at every iteration after the initial estimate of zero, it keeps adding a smaller and smaller non-zero number).
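
As a concrete illustration, here is a minimal tabular value iteration sketch in Python. The representation of the MDP used here (a `T[s][a]` list of `(probability, next_state)` pairs and a reward function `R(s, a, s_next)`) is an assumption made for the example, not something fixed by the post.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration via repeated Bellman updates.

    Assumed (hypothetical) interfaces:
      T[s][a] -> list of (probability, next_state) pairs
      R(s, a, s_next) -> scalar reward
    """
    V = {s: 0.0 for s in states}  # initialize every state value to zero
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: best expected discounted return over actions
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # the values have (numerically) stopped changing
            return V
```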

Policy iteration

Learn the policy coupled to the values. Policy iteration incrementally looks at the current values and extracts a policy. Since the action space is finite, we hope it converges faster than value iteration. Conceptually, the last change to the actions will happen well before the small rolling-average updates to the values are finished. Policy iteration has two steps.

The first, policy extraction, is how you go from a value function to a policy: take the action that maximizes the expected value.

The second step is policy evaluation. Policy evaluation takes the policy and iterates the values conditioned on that policy. The samples are always tied to the policy, but we only need to run the iterative algorithm for far fewer steps to extract the relevant action information.

Like value iteration, policy iteration is guaranteed to converge for most reasonable MDPs because of the underlying Bellman update.
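
A matching sketch of policy iteration, under the same assumed `T` and `R` interfaces as the value iteration example above:

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-6):
    """Policy iteration: alternate policy evaluation and policy extraction."""
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the values under the fixed current policy
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T[s][a])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_tol:
                break
        # Policy extraction: act greedily with respect to the evaluated values
        stable = True
        for s in states:
            best_a = max(
                actions,
                key=lambda a: sum(
                    p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T[s][a]
                ),
            )
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:  # the policy stopped changing, so we are done
            return policy, V
```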

Q-value iteration

The problem with learning the optimal values is that it can be hard to extract a policy from them. The argmax operator is distinctly nonlinear and difficult to optimize over, so Q-value iteration takes a step towards direct policy extraction. The optimal policy at each state is simply the maximum Q-value at that state.

The reason most instruction starts with value iteration is that it slots into the Bellman update a little more naturally. Q-value iteration requires substituting the two key MDP value relations into each other. Having done so, we are one step away from Q-learning, which we will get to shortly.

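Substituting the Q-value recursion into the iterative scheme gives the standard Q-value iteration update:

$$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma \max_{a'} Q_k(s', a')\right]$$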

What is happening in these iterative algorithms?

Let’s make sure we understand all the terms. Essentially, each update is a sum of two terms (one of which may be selected by a max over actions). Let’s take them in turn and discuss how they relate to the MDP.

The first term is the sum over next states of the product T(s, a, s’)R(s, a, s’). This term represents the likely value of a given state and transition. T, the transition function, determines how likely it is to obtain a given reward from a transition (recall that a tuple (s, a, s’) says that taking action a moves the agent from state s to state s’). This does things like balance a rare state with a high reward against a frequently visited state with a low reward.

The next term is what makes these algorithms “Bellman”. It is the transition-weighted value (V or Q) from the previous step of the iterative algorithm. This term pulls in value information from neighboring states so that we can understand longer-horizon transitions. Think of it as the place where the recursive update happens, while the first term is a prior weighting determined by the environment.

Convergence conditions

All of these iterative algorithms are said to “converge to the optimal values or policy under certain conditions”. The conditions are:

  1. Total coverage of the state space. The condition is that all (state, action, next_state) tuples are reached under the policy being used. If they are not, some of the information in the MDP is lost, and those values may stay at their initial values.

  2. A discount factor γ < 1. Otherwise, loops in the MDP let the summed values diverge to infinity.

Thankfully, in practice, these conditions are easy to meet. Most exploration is epsilon-greedy, which always leaves a chance of a random action (so any action remains possible), and a discount factor below one generally leads to better behavior. Ultimately, these algorithms work in many settings, so they are definitely worth a try.
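
For reference, a minimal epsilon-greedy action selector; the Q-table keyed by `(state, action)` pairs is an assumed representation for the example.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon, explore a random action; otherwise exploit
    the current Q estimates. Every action keeps a non-zero probability of
    being tried, which is what gives the coverage condition a chance to hold."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit
```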

Reinforcement learning

How do we turn what we have seen into a reinforcement learning problem? We need to use samples in place of the true T(s, a, s’) and R(s, a, s’) functions.

Sample-based learning – how to solve a hidden MDP

The only difference between the iterative methods for MDPs and the basic methods for solving reinforcement learning problems is that RL samples come from the underlying transition and reward functions of the MDP rather than having them built into the update rules. There are two things we need to do: replace T(s, a, s’) and replace R(s, a, s’).

First, let’s approximate the transition function as the average action-conditioned transition over the observed tuples, as sketched below. Any values we have not seen are initialized randomly. This is the simplest form of model-based reinforcement learning (my area of research).
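
A minimal counting-based sketch of that idea; the `experience` list of observed `(s, a, s_next)` tuples is an assumption made for the example.

```python
from collections import defaultdict

def estimate_transitions(experience):
    """Estimate T(s, a, s') as the empirical frequency of observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in experience:
        counts[(s, a)][s_next] += 1
    T_hat = {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        T_hat[(s, a)] = {s2: n / total for s2, n in nexts.items()}
    return T_hat  # T_hat[(s, a)][s'] ~ empirical transition probability
```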

Next, all that is left is how to use the rewards. We actually receive a reward at every step, so we can get away with using it directly (the method averages toward the correct value over many samples). Consider the Q-value iteration update approximated with sampled rewards, as shown below.
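
In its standard sampled form, with a single observed transition (s, a, r, s’) and learning rate $\alpha$, the update is:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$$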

The equation above is Q-learning. We start with a table Q(s, a) filled with random values, then collect interactions with the world and tune alpha. Alpha is a learning rate, so we lower it when we think the algorithm is converging.

The result is that Q-learning looks remarkably like Q-value iteration; we just run the algorithm with an incomplete view of the world.
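
A single tabular Q-learning step, as a sketch (again assuming a Q-table keyed by `(state, action)` pairs):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward the sampled target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```

Decaying alpha (and the exploration rate) over time is what lets this running average settle onto the optimal Q-values.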

The Q-learning used in robotics and games replaces the giant table over all state-action pairs with a neural network that approximates it over a much more complex feature space.

The original link: towardsdatascience.com/fundamental…
