Original link: tecdat.cn/?p=11105

Original source: Tuo Duan Data Tribe public account

 

In reinforcement learning, we are interested in finding a policy that maximizes the reward the agent collects. Assuming that the environment can be modeled as a Markov decision process (MDP), we can apply dynamic programming to solve the reinforcement learning problem.

In this article, I introduce three dynamic programming algorithms that can be used in the context of an MDP. To make the concepts easier to understand, I implement the algorithms in a grid world, a popular toy example in reinforcement learning.

Before turning to the application, I want to quickly provide the theoretical background that is needed for the work in the grid world.

Key reinforcement learning terms for MDPs

The following sections explain the key terms of reinforcement learning, namely:

  • Policy: which action the agent should perform in each state
  • State value function: the expected value of future rewards for each state
  • Action value function: the expected value of future rewards when performing a particular action in a particular state
  • Transition probability: the probability of moving from one state to another
  • Reward function: the reward the agent receives when transitioning between states

 

State value function

Given a policy π, the state value function V^π(s) maps each state s to the expected return that the agent can obtain when starting in that state:

V^π(s) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]

Here, s_t denotes the state at time t. The parameter γ ∈ [0,1] is called the discount factor; it determines how strongly future rewards are weighted.

Action value function

Given a policy π, the action value function Q^π(s, a) determines the expected return of performing action a in state s:

Q^π(s, a) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]

 

Transition probability

By performing action a in state s, the agent may transition to state s'. The probability of this transition occurring is denoted by P^a_{ss'}.

Reward function

The reward function R^a_{ss'} specifies the reward received when the agent transitions from state s to state s' via action a.

A demonstration of three basic MDP algorithms in Gridworld

In this article, you will learn how to apply three algorithms to an MDP in a grid world:

  1. Policy evaluation: given a policy π, what is the value function associated with π?
  2. Policy iteration: given a policy π, how can we find the optimal policy π*?
  3. Value iteration: how can we find the optimal policy π* from scratch?

In GridWorld, the agent's goal is to reach a specified location in the grid. The agent can move north, east, south, or west; these actions are represented by the set {N, E, S, W}. Note that the agent always knows its state (that is, its position in the grid).

There are walls in the grid that the agent cannot pass through.

Basic Gridworld implementation

I’ve implemented GridWorld in an object-oriented fashion. The following sections describe how I designed the code for the map and policy entities.

Gridworld map

To implement GridWorld, the first thing I need is a class that represents the map. I defined the following format to represent the individual grid cells:

  • # represents a wall
  • X represents the goal
  • a whitespace character represents an empty cell

With these symbols, the following map is constructed:

###################
#                X#
#   ###########   #
#   #         #   #
#   # ####### #   #
#   # #     # #   #
#     #  #  #     #
#        #        #
#                 #
###             ###
#                 #
###################

  

 

I implemented MapParser, which generates a Map object. The Map object controls access to the cells of the grid world. Individual Cell classes define the behavior of specific cell types, such as empty cells, walls, and the goal cell. Each cell can be identified by its row and column index.
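As a rough sketch of this design (the exact class layout is an assumption on my part; only the parser interface is shown in this article), the entities could look like this:

class Cell:
    # base class for a single grid cell, identified by its row and column index
    def __init__(self, row, col, index):
        self.row, self.col, self.index = row, col, index
    def getCoords(self):
        return (self.row, self.col)
    def getIndex(self):
        return self.index
    def isGoal(self):
        return False

class WallCell(Cell):
    pass  # a cell the agent cannot pass through

class GoalCell(Cell):
    def isGoal(self):
        return True

class Map:
    # controls access to the grid cells
    def __init__(self, cells, width):
        self.cells, self.width = cells, width
    def getCells(self):
        return self.cells
    def getWidth(self):
        return self.width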

With this setting, it is easy to load a map:

parser = MapParser()
gridMap = parser.parseMap("../data/map01.grid")

Gridworld policy

For reinforcement learning, we need to be able to work with a policy π(s, a). In GridWorld, each state s represents the agent's position. The actions move the agent in one of the four compass directions. We map the policy onto the map using the following notation:

  • N represents the action GO_NORTH
  • E represents the action GO_EAST
  • S represents the action GO_SOUTH
  • W represents the action GO_WEST

Unknown symbols are mapped to a NONE action so that we obtain a complete policy, as sketched below.
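A minimal sketch of this symbol-to-action mapping is shown here; the Actions enumeration and the symbolToAction helper are assumptions for illustration rather than the parser's actual interface:

from enum import Enum

class Actions(Enum):
    GO_NORTH = "N"
    GO_EAST = "E"
    GO_SOUTH = "S"
    GO_WEST = "W"
    NONE = " "

ALL_ACTIONS = [Actions.GO_NORTH, Actions.GO_EAST, Actions.GO_SOUTH, Actions.GO_WEST]

def symbolToAction(symbol):
    # unknown symbols (walls, the goal, whitespace) map to NONE
    for action in ALL_ACTIONS:
        if action.value == symbol:
            return action
    return Actions.NONE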

Using these definitions, I defined the following policy:

###################
#EESWWWWWWWWWWWWWX#
#EES###########SWN#
#EES#EEEEEEEES#SWN#
#EES#N#######S#SWN#
#EES#N#SSWWW#S#SWN#
#EEEEN#SS#NN#SWSWN#
#EENSSSSS#NNWWWWWN#
#EENWWWWEEEEEEEEEN#
###NNNNNNNNNNNNN###
#EENWWWWWWWWWWWWWW#
###################

Note that the policy file preserves the walls and the goal cell to improve readability. The policy was designed with two goals in mind:

  1. The agent should be able to reach the goal. For a policy without this property, policy evaluation will not give a reasonable result because the reward at the goal is never obtained.
  2. The policy should not be optimal. This means that in some states the agent does not take the shortest path to the goal. Such a policy lets us observe the effect of algorithms that try to improve on the initial policy.

To load the policy, I implemented a policy parser that stores the policy as a policy object. Using these objects, we can load the initial policy like this:

policyParser = PolicyParser()
policy = policyParser.parsePolicy("../data/map01.policy")

Policy objects model π(s, a) through the following function:

def pi(self, cell, action):
    if len(self.policy) == 0:
        # no policy: try all actions (for value iteration)
        return 1

    if self.policyActionForCell(cell) == action:
        # policy allows this action
        return  1
    else:
        # policy forbids this action
        return 0

Preparation for reinforcement learning

To prepare for the implementation of reinforcement learning algorithms, we still need to provide transition and reward functions.

The transition function

To define the transition function P^a_{ss'}, we first need to distinguish between legal and illegal actions. In GridWorld, an action can be illegal in two ways:

  1. If the action would take the agent off the grid
  2. If the action would move the agent into a wall

This gives us the first rule for the transition function:

1. When an action is illegal, the agent should remain in its previous state.

In addition, we must require:

2. When an action is illegal, transitions to states other than its previous state should be forbidden.

Of course, the state transition must also be consistent with the selected action. Since every action moves the agent by only one cell, a successor state s' must place the agent in a cell adjacent to that of state s:

3. Only allow transitions through actions that would lead the agent to an adjacent cell.

For this rule, we assume a predicate adj(s, s') that indicates whether the transition from s to s' involves adjacent cells.
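As a minimal sketch, and assuming that getCoords() returns a (row, column) pair as in the plotting code shown later, the predicate could be implemented as follows:

def adj(cellA, cellB):
    # two cells are adjacent if they differ by exactly one step
    # in one of the four compass directions (Manhattan distance 1)
    rowA, colA = cellA.getCoords()
    rowB, colB = cellB.getCoords()
    return abs(rowA - rowB) + abs(colA - colB) == 1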

Finally, once the goal state s* has been reached, we do not want the agent to leave it again. To account for this, we introduce a final rule:

4. Don't transition from the goal cell.

Based on these four rules, we can define the transition function as follows:

P^a_{ss'} = 1   if a is illegal in s and s' = s
P^a_{ss'} = 0   if a is illegal in s and s' ≠ s
P^a_{ss'} = 0   if s = s* (no transitions out of the goal)
P^a_{ss'} = 1   if a is legal in s, s ≠ s*, and performing a in s leads to the adjacent cell s'
P^a_{ss'} = 0   otherwise

The provided Python implementation, getTransitionProbability, is not quite as explicit as the mathematical formulation:

def getTransitionProbability(self, oldState, newState, action, gridWorld):
    proposedCell = gridWorld.proposeMove(action)
    if proposedCell is None:
        # Rule 1 and 2: illegal move
        return transitionProbabilityForIllegalMoves(oldState, newState)
    if oldState.isGoal():
        # Rule 4: stay at goal
        return 0
    if proposedCell != newState:
        # Rule 3: move not possible
        return 0
    else:
        # Rule 3: move possible
        return 1

Note that proposeMove simulates the successful execution of an action and returns the agent's new grid cell.

Reward function

In GridWorld, we want to find the shortest path to the terminal state. Since we maximize the reward, the reward for reaching the goal state s* should be higher than the reward for reaching any other state. For GridWorld, we will use a simple reward function.
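A simple choice that is consistent with the value function derived below (V^π(s) = -d(s, s*) + 1 with γ = 1) is a reward of 0 for entering the goal and -1 for every other transition, i.e. R^a_{ss'} = 0 if s' = s* and R^a_{ss'} = -1 otherwise. In code, this corresponds to a small helper of the following form (the name getReward is an assumption; the sketches further below reuse it):

def getReward(newCell):
    # the reward depends only on the next state: 0 when entering the goal, -1 otherwise
    return 0 if newCell.isGoal() else -1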

 

 

Policy evaluation

The goal of the policy evaluation algorithm is to evaluate a policy π(s, a), that is, to determine the value V(s) for all states s. The algorithm is based on the Bellman equation:

V_{k+1}(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

For iteration k+1, the equation computes the value of state s in terms of the following quantities:

  • π(s, a): the probability of selecting action a in state s
  • P^a_{ss'}: the probability of transitioning from state s to state s' using action a
  • R^a_{ss'}: the expected reward for transitioning from state s to state s' using action a
  • γ: the discount factor
  • V^π_k(s'): the value of state s' at step k, given the policy π

To better understand the equation, let's consider each of its terms in the context of GridWorld:

  • π(s, a): Since we are in a deterministic environment, the policy specifies exactly one action a, for which π(s, a) = 1, while all other actions a' have π(s, a') = 0. Multiplying by π(s, a) therefore selects only the action specified by the policy.
  • Σ_{s'}: The sum runs over all states s' that can be reached from the current state s. In GridWorld, we only need to consider the neighboring cells and the current cell itself, i.e. s' ∈ {x | adj(x, s) ∨ x = s}.
  • P^a_{ss'}: This is the probability of transitioning from state s to s' via action a.
  • R^a_{ss'}: This is the reward for transitioning from s to s' via a. Note that in GridWorld, the reward is determined only by the next state s'.
  • γ: The discount factor moderates the influence of expected future rewards.
  • V_k(s'): The expected reward of the proposed state s'. This term is why policy evaluation is a dynamic programming algorithm: we update the current values using previously computed values.

We will use γ = 1 because we are in an episodic setting in which learning stops once the goal state is reached. Thus, the value function represents the length of the shortest path to the goal cell. More precisely, let d(s, s*) denote the length of the shortest path from state s to the goal. Then, for s ≠ s*, V^π(s) = -d(s, s*) + 1; for example, a cell that is five steps away from the goal has a value of -4.

For policy evaluation, we typically have to sweep over the state space multiple times. Each sweep requires the value function from the previous iteration. The difference between the new and the old value function is commonly used as the termination criterion for the algorithm:

def findConvergedCells(self, V_old, V_new, theta = 0.01):
    diff = abs(V_old - V_new)
    return np.where(diff < theta)[0]

This function determines the indices of the grid cells whose values changed by less than θ. When the values of all states have converged to stable values, we can stop. Since this is not always the case (for example, if the policy specifies actions that never lead to the goal, or if the transition probabilities or rewards are misconfigured), we also specify a maximum number of iterations.

When the stopping condition is reached, evaluatePolicy returns the latest state value function:

def evaluatePolicy(self, gridWorld, gamma = 1):
    if len(self.policy) != len(gridWorld.getCells()):
        # sanity check whether policy matches dimension of gridWorld
        raise Exception("Policy dimension doesn't fit gridworld dimension.")
    maxIterations = 500
    V_old = None
    V_new = initValues(gridWorld) # sets value of 0 for viable cells
    iter = 0
    convergedCellIndices = np.zeros(0)
    while len(convergedCellIndices) != len(V_new):
        V_old = V_new
        iter += 1
        V_new = self.evaluatePolicySweep(gridWorld, V_old, gamma, convergedCellIndices)
        convergedCellIndices = self.findConvergedCells(V_old, V_new)
        if iter > maxIterations:
            break
    print("Policy evaluation converged after iteration: " + str(iter))
    return V_new

The evaluatePolicySweep function performs a single sweep of policy evaluation: it iterates over all cells in the grid and determines the new value of each state.

Note that the ignoreCellIndices parameter holds the indices of the cells whose value did not change in the previous sweep. These cells are skipped in subsequent iterations to improve performance. This is fine for our GridWorld example because we only want to find the shortest path: once the value of a state stops changing, it has already reached its final value.

The value of an individual state is computed with the evaluatePolicyForState function. At its core, this function implements the Bellman equation discussed above. An important idea behind this function is that, when evaluating the value of state s, we do not want to iterate over all states s'. This is why a state generator produces only the successor states that can actually occur (that is, those whose transition probability is greater than zero).
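The following is only a sketch of what evaluatePolicyForState could look like; the exact signature, the stateGenerator helper, and the ALL_ACTIONS and getReward names from the earlier sketches are assumptions and may differ from the actual implementation:

def evaluatePolicyForState(self, gridWorld, V_old, cell, gamma):
    # Bellman update for a single state s:
    #   policy evaluation: V(s) = sum_a pi(s,a) * sum_s' P^a_{ss'} * (R^a_{ss'} + gamma * V_old(s'))
    #   value iteration:   V(s) = max_a          sum_s' P^a_{ss'} * (R^a_{ss'} + gamma * V_old(s'))
    usePolicy = len(self.policy) != 0
    V = 0 if usePolicy else float("-inf")
    for action in ALL_ACTIONS:
        piValue = self.pi(cell, action)  # 1 or 0 (see pi above)
        if piValue == 0:
            continue
        expected = 0
        # the state generator yields only cells with a non-zero transition probability
        for newCell in self.stateGenerator(cell, action, gridWorld):
            P = self.getTransitionProbability(cell, newCell, action, gridWorld)
            R = getReward(newCell)
            expected += P * (R + gamma * V_old[newCell.getIndex()])
        if usePolicy:
            V += piValue * expected   # weight by the policy (policy evaluation)
        else:
            V = max(V, expected)      # maximum over actions (value iteration)
    return V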

Policy evaluation results

With a working implementation, we can find the policy's state value function by executing the following command.
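Assuming the policy and gridMap objects loaded earlier, the call could look like this (the exact entry point is an assumption on my part):

V = policy.evaluatePolicy(gridMap)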

To plot the value function together with the policy, we can use pyplot from matplotlib after reshaping the one-dimensional array that represents the map into a two-dimensional array:

def drawValueFunction(V, gridWorld, policy):
    fig, ax = plt.subplots()
    im = ax.imshow(np.reshape(V, (-1, gridWorld.getWidth())))
    for cell in gridWorld.getCells():
        p = cell.getCoords()
        i = cell.getIndex()
        if not cell.isGoal():
            text = ax.text(p[1], p[0], str(policy[i]),
                       ha="center", va="center", color="w")
        if cell.isGoal():
            text = ax.text(p[1], p[0], "X",
                       ha="center", va="center", color="w")
    plt.show()

Using this function, we can visualize the policy’s state value function:

 

For non-goal cells, the plot is annotated with the action specified by the policy. The goal in the upper-right cell is marked with an X.

The values of the cells are indicated by color: the worst states (lowest value) are shown in purple, bad states in blue, intermediate states in teal, good states in green, and the best states (highest value) in yellow.

Looking at these values, we can see that they match the actions specified by the policy. For example, the value of the state directly west of the goal is very low because the action in that state (GO_WEST) leads to a long detour. The cell directly south of the goal has a high value because its action (GO_NORTH) leads directly to the goal.

Note that the performance of evaluatePolicy is critical for the following work, since we will call it many times. For the example computation, the function needed 61 iterations, which translates to roughly half a second on my laptop. Note that policy evaluation needs fewer iterations for policies that are closer to the optimal policy, because the values propagate faster.

It is nice to be able to determine the state value function: we can now quantify how good a proposed policy is. However, we have not yet solved the problem of finding the optimal policy. This is where policy iteration comes into play.

Policy iteration

Now that we can compute the state value function, we should be able to improve an existing policy. A simple approach is the greedy algorithm, which iterates over all cells of the grid and, for each state, selects the action that maximizes the expected reward according to the value function.

The improved, greedy policy is defined as:

π'(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

The improvePolicy function determines the value function of the policy, and then calls findGreedyPolicy to identify the best action for each state.
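A minimal sketch of how these two functions could be implemented is given below. It relies on several assumptions: the ALL_ACTIONS list and getReward helper from the earlier sketches, a setActionForCell setter on the Policy object, and the ability to query getTransitionProbability for arbitrary pairs of cells; the real code may be organized differently:

import copy

def findGreedyPolicy(values, policy, gridWorld, gamma = 1):
    # for every state, choose the action that maximizes
    # sum_s' P^a_{ss'} * (R^a_{ss'} + gamma * V(s'))
    greedyPolicy = copy.deepcopy(policy)
    for cell in gridWorld.getCells():
        if cell.isGoal():
            continue  # no action needed at the goal (wall cells could be skipped as well)
        bestAction, bestValue = None, float("-inf")
        for action in ALL_ACTIONS:
            expected = 0
            for newCell in gridWorld.getCells():
                P = policy.getTransitionProbability(cell, newCell, action, gridWorld)
                if P == 0:
                    continue
                expected += P * (getReward(newCell) + gamma * values[newCell.getIndex()])
            if expected > bestValue:
                bestAction, bestValue = action, expected
        greedyPolicy.setActionForCell(cell, bestAction)  # assumed setter
    return greedyPolicy

def improvePolicy(policy, gridWorld):
    # evaluate the current policy, then act greedily with respect to its value function
    values = policy.evaluatePolicy(gridWorld)
    return findGreedyPolicy(values, policy, gridWorld)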

What findGreedyPolicy does is construct an improved version of the given policy by considering every cell and selecting the action that maximizes the expected reward. For example, after applying improvePolicy once and re-evaluating the policy, we obtain the following result:

 

All the cells next to the target now give us a high return compared to the original value function, because the operation has been optimized. However, we can see that these improvements are only partial. So how do we get optimal policy?

The idea of strategy iteration algorithm is that we can find the optimal strategy by iteratively evaluating the state value function of the new strategy, and improve the strategy by using greedy algorithm until the optimal is reached:

def policyIteration(policy, gridWorld):
    lastPolicy = copy.deepcopy(policy)
    lastPolicy.resetValues() # reset values to force re-evaluation of policy
    improvedPolicy = None
    while True:
        improvedPolicy = improvePolicy(lastPolicy, gridWorld)
        if improvedPolicy == lastPolicy:
            break
        improvedPolicy.resetValues() # to force re-evaluation of values on next run
        lastPolicy = improvedPolicy
    return(improvedPolicy)

The result of policy iteration

Running the algorithm on GridWorld finds the optimal solution after 20 iterations, which takes about 4.5 seconds on my laptop. The termination after 20 iterations is not surprising: the GridWorld map is 19 cells wide, so we need 19 iterations to optimize the values along the horizontal corridor. One additional iteration is then required to detect that the policy no longer changes, so that the algorithm terminates.

A good tool for understanding policy iterations is to visualize each iteration:

 

The following figure shows the optimal value function constructed using policy iteration:

 

A visual check shows that the value function is correct, since the corresponding policy selects the shortest path for every cell in the grid.

Value iteration

With the tools we have explored so far, a new question arises: why do we need to provide an initial policy at all? The idea of the value iteration algorithm is that we can compute the optimal value function without a policy. Instead of letting a policy π dictate which actions are selected, we select the actions that maximize the expected reward:

V_{k+1}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Because the value iteration update is very similar to a policy evaluation update, I implemented value iteration within the evaluatePolicyForState method defined earlier.

Whenever no policy is available, this function performs the value iteration update. In this case, len(self.policy) is zero, so pi returns 1 for every action and V is determined as the maximum expected reward over all actions.

Therefore, we do not have to write much additional code to implement value iteration. We simply need to call evaluatePolicySweep iteratively on a Policy object that does not know any policy, until the procedure yields the optimal value function. Then, to determine the corresponding policy, we simply call the findGreedyPolicy function defined earlier.
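Putting this together, a driver for value iteration could look like the following sketch. It assumes a Policy object constructed without any actions and reuses initValues, evaluatePolicySweep, findConvergedCells, and findGreedyPolicy from above; the actual code may be structured differently:

import numpy as np

def valueIteration(policy, gridWorld, gamma = 1, maxIterations = 500):
    # 'policy' is assumed to be a Policy object without prescribed actions, so that
    # evaluatePolicyForState takes the maximum over all actions (the value iteration update)
    V_new = initValues(gridWorld)
    convergedCellIndices = np.zeros(0)
    for _ in range(maxIterations):
        V_old = V_new
        V_new = policy.evaluatePolicySweep(gridWorld, V_old, gamma, convergedCellIndices)
        convergedCellIndices = policy.findConvergedCells(V_old, V_new)
        if len(convergedCellIndices) == len(V_new):
            break
    # derive the optimal policy by acting greedily with respect to the optimal values
    return findGreedyPolicy(V_new, policy, gridWorld, gamma)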

Result of value iteration

 

During value iteration, the values (high: yellow, low: dark) spread from the goal state (the X in the upper right) to the other states:

 

Summary

We have seen how reinforcement learning can be applied to an MDP. Our working assumption was that the agent has complete knowledge of the environment, which is modeled as an ideal MDP. Based on this, we were able to use dynamic programming to solve three problems. First, we used policy evaluation to determine the state value function of a given policy. Next, we applied the policy iteration algorithm to improve an existing policy. Third, we applied value iteration to find an optimal policy from scratch.