1. Example of Q-learning

Let’s say we have a building whose rooms are laid out like this:


If each room is represented as a node and the connectivity between rooms as edges, we get the graph shown in the figure below:


This is the map of the rooms. We first put the agent (a robot) at an arbitrary position and let it wander around on its own; reaching room 5 means it has succeeded in getting out. To encode this goal, we assign a weight to every edge: an edge that leads directly to node 5 gets 100, and every other edge gets 0, so the weighted graph looks like this:
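Since the original figure is an image, here is a rough Python sketch of this weighted graph. The exact connectivity is an assumption taken from the mnemstudio path-finding tutorial referenced at the end (rooms 0-4, with node 5 being the outside/goal):

# Assumed room connectivity (the original figure is not reproduced here):
# rooms 0-4 plus the goal node 5, connected by undirected doors.
edges = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (5, 5)]

# Weight of each directed move: 100 if it leads directly into node 5, otherwise 0.
weight = {}
for a, b in edges:
    weight[(a, b)] = 100 if b == 5 else 0
    weight[(b, a)] = 100 if a == 5 else 0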


In Q-learning, the two most important concepts are “state” and “action”. A state is a node in the graph, such as node 2 or node 3, while an action is the move from one node to another.

First, we generate a reward matrix R, in which -1 means the move is not allowed (no edge), 0 means the move is allowed, and 100 means the move arrives directly at the destination:


At the same time, we create a table Q of the same shape as R, representing what the agent has learned so far. It is initialized to all zeros, and each entry represents the discounted present value of the total reward that can be obtained by moving from one state to another.
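Where the original post shows R and Q as images, a minimal NumPy sketch of the two tables (with the entries of R assumed from the room layout above and from the full code in section 2) looks like this:

import numpy as np

# Reward matrix R: rows are the current state, columns the next state.
# -1 = no edge (move not allowed), 0 = move allowed, 100 = move arrives at the goal.
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])

# Q table of the same shape, initialized to all zeros.
Q = np.zeros((6, 6), dtype=np.float32)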



The values in the Q table are updated according to the following formula:

Q(s, a) = R(s, a) + λ · max over a′ of Q(s′, a′)

In the above formula, s is the current state, a is the current action, s′ is the next state reached by taking action a, a′ ranges over the actions available in s′, and λ is the greed (discount) factor, with 0 ≤ λ < 1.
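As a minimal sketch of this update rule in Python (the helper name update_q and its signature are mine, not from the original code):

import numpy as np

def update_q(Q, R, state, action, lam=0.8):
    # Q(s, a) = R(s, a) + lambda * max_a' Q(s', a')
    # In this example, the action "go to room a" always lands in room a,
    # so the next state s' is simply the action index itself.
    next_state = action
    Q[state, action] = R[state, action] + lam * Q[next_state].max()
    return Q

With the R and Q tables sketched above, calling update_q(Q, R, 1, 5) reproduces the Q(1, 5) = 100 update that is worked through by hand below.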

Therefore, the learning steps of Q-learning can be summarized as follows:

1. Set the greed factor λ and the reward matrix R.
2. Initialize the Q table to all zeros.
3. For each episode: select a random initial state; while the goal state has not been reached, choose one of the allowed actions at random, update Q(s, a) with the formula above, and move to the chosen next state.
4. Repeat until the Q table converges.

After iterating until convergence, we can choose our path out of the rooms according to the learned Q table. To see a practical example, first set λ = 0.8 and initialize the reward matrix R and the Q table as follows:




Select a state at random, such as 1, and look at the row of table R corresponding to state 1: from 1 we can reach 3 or 5. Randomly, we select 5 and apply the transition equation:

Q(1, 5) = R(1, 5) + 0.8 · max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 · 0 = 100

Thus, the Q table becomes (only the entry Q(1, 5) = 100 is non-zero):


Thus, the goal state is reached and this attempt is over. We then select another random state, for example 3; the next states reachable from 3 are 1, 2, and 4 (the entries in row 3 of R that are not -1). Randomly, we select 1 and update according to the algorithm:

Q(3, 1) = R(3, 1) + 0.8 · max[Q(1, 3), Q(1, 5)] = 0 + 0.8 · max[0, 100] = 80

In this way, the Q table now also contains Q(3, 1) = 80:


After continued iteration, our Q table converges to the following:


We can transfer the numbers in the Q table to our original diagram:


After obtaining the Q table, we can choose our path according to a simple rule: starting from the initial state, at every step take the action with the largest Q value in the current row, and stop when state 5 is reached.


For example, suppose our initial state is 2. According to table Q, we choose action 2-3. When we reach state 3, we can go to 1, 2, or 4, and according to table Q, going to 1 yields the maximum value, so we choose action 3-1. Then, in state 1, action 1-5 has the maximum value, so the path out is 2-3-1-5.
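A minimal sketch of this path-extraction rule (again, the helper name greedy_path is mine):

import numpy as np

def greedy_path(Q, start, goal=5, max_steps=10):
    # From each state, move to the neighbour with the largest Q value until the goal is reached.
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        state = int(np.argmax(Q[state]))
        path.append(state)
    return path

Once the Q table has converged, greedy_path(Q, 2) yields 2-3-1-5 as in the walk-through above (np.argmax breaks ties between equally good rooms by taking the lowest index).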

2. Code implementation

The above example can be implemented with the code below. It is very simple, so the complete code is pasted directly:

import numpy as np
import random

# Reward matrix: -1 = no edge, 0 = edge exists, 100 = edge leading directly to the goal (room 5)
r = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])

q = np.zeros([6, 6], dtype=np.float32)  # Q table, same shape as R, initialized to zeros
gamma = 0.8  # greed factor

# Training: perform random single-step updates
step = 0
while step < 1000:
    state = random.randint(0, 5)
    if state != 5:
        # All actions allowed from the current state (entries of R that are not -1)
        next_state_list = []
        for i in range(6):
            if r[state, i] != -1:
                next_state_list.append(i)
        # Pick one allowed action at random and apply the update formula
        next_state = next_state_list[random.randint(0, len(next_state_list) - 1)]
        qval = r[state, next_state] + gamma * max(q[next_state])
        q[state, next_state] = qval
    step += 1

print(q)

# Verification: walk greedily according to the learned Q table
for i in range(10):
    print("Validation round {}".format(i + 1))
    state = random.randint(0, 5)
    print("The robot starts in room {}.".format(state))
    count = 0
    while state != 5 and count < 20:  # count guards against an endless walk
        q_max = q[state].max()
        q_max_action = []
        for action in range(6):
            if q[state, action] == q_max:
                q_max_action.append(action)
        next_state = q_max_action[random.randint(0, len(q_max_action) - 1)]
        print("The robot goes to " + str(next_state) + ".")
        state = next_state
        count += 1

The output of the code is:




References:
mnemstudio.org/path-findin…
www.zhihu.com/question/26…
zhuanlan.zhihu.com/p/29213893