Simple Q-learning | Xiao Ming's one-dimensional world (1) · Simple Q-learning | Xiao Ming's one-dimensional world (2)

## One-dimensional world of acceleration

In this world, Xiao Ming can only control his own acceleration, and only three operations on it are available: increase by 1, decrease by 1, or keep it unchanged. So the action space is $\{u_1 = -1, u_2 = 0, u_3 = 1\}$.

Note: to avoid confusion with the acceleration symbol $a$, the action symbol is written as $u$.

Xiao Ming now has position, velocity, and acceleration information, so the state is three-dimensional: $s_t = <x_t, v_t, a_t>$, where $x_t$ is Xiao Ming's position at time $t$, $v_t$ is his velocity at time $t$, and $a_t$ is his acceleration at time $t$. The acceleration space is also discrete; without loss of generality, it is set to $\{a_1 = -2, a_2 = -1, a_3 = 0, a_4 = 1, a_5 = 2\}$.

By the combination principle, Xiao Ming has a total of $21 \times 7 \times 5 = 735$ states: $S = \{s_1 = <x_1, v_1, a_1>,\ s_2 = <x_2, v_1, a_1>,\ \ldots,\ s_{735} = <x_{21}, v_7, a_5>\}$.

To speed up convergence, a dense reward function is used here: $r(s) = -|x| - |v| - |a|$. The reward is greatest when Xiao Ming is at the centre with zero velocity and zero acceleration.
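Written as code, this is a one-liner (a minimal sketch; the helper name `reward` is mine, not from the original article):

```python
def reward(x, v, a):
    # dense reward: being closer to the centre, slower, and with smaller acceleration is better
    return -abs(x) - abs(v) - abs(a)

print(reward(0, 0, 0))      # 0, the maximum reward (goal reached)
print(reward(-10, -3, -2))  # -15, the worst starting state used later
```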

$Q_{table}$ is a $735 \times 3$ matrix.
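Each state has to be mapped to one of the 735 rows. Here is a small sketch of that mapping (the helper name `state_index` is mine; the arithmetic matches the indexing expression used in the training loop below):

```python
def state_index(x, v, a):
    # x in [-10, 10] (21 values), v in [-3, 3] (7 values), a in [-2, 2] (5 values);
    # rows are ordered by acceleration, then velocity, then position: 5 * 7 * 21 = 735 rows
    return ((a + 2) * 7 + (v + 3)) * 21 + x + 10

assert state_index(-10, -3, -2) == 0    # first row
assert state_index(10, 3, 2) == 734     # last row
```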

  • Training
```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def model_update(x, v, a, u):
    a = a + u
    if a < -2:          # clip acceleration to [-2, 2]
        a = -2
    if a > 2:
        a = 2
    v = v + a
    if v < -3:          # clip velocity to [-3, 3]
        v = -3
    if v > 3:
        v = 3
    x = x + v
    if x < -10:         # clip position to [-10, 10]
        x = -10
    if x > 10:
        x = 10
    return x, v, a

xt = np.random.randint(-10, 11)     # random initial position in [-10, 10]
vt = np.random.randint(-3, 4)       # random initial velocity in [-3, 3]
at = np.random.randint(-2, 3)       # random initial acceleration in [-2, 2]

Q_table = np.zeros((735, 3))        # initialize Q to zero
for i in range(5000000):
    u = np.random.randint(0, 3) - 1                 # random action in {-1, 0, 1}
    xt1, vt1, at1 = model_update(xt, vt, at, u)
    r = -abs(xt1) - abs(vt1) - abs(at1)             # dense reward
    # update Q for the visited state-action pair
    Q_table[((at+2)*7+(vt+3))*21+xt+10, u+1] = r + 0.9*np.max(Q_table[((at1+2)*7+(vt1+3))*21+xt1+10])
    xt = xt1
    vt = vt1
    at = at1
```
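A note on the update rule: the line marked "update Q" overwrites the Q-value directly with the one-step Bellman target $r + 0.9 \max_u Q(s', u)$, which works here because the dynamics and reward are deterministic. For comparison, a minimal sketch of the more common tabular Q-learning update with a learning rate (the value `alpha = 0.1` is my assumption; the variable names reuse those from the training loop above):

```python
alpha = 0.1    # learning rate (assumed value, not from the original)
gamma = 0.9    # discount factor, same as in the loop above

s  = ((at + 2) * 7 + (vt + 3)) * 21 + xt + 10       # index of the current state
s1 = ((at1 + 2) * 7 + (vt1 + 3)) * 21 + xt1 + 10    # index of the next state
td_target = r + gamma * np.max(Q_table[s1])          # one-step Bellman target
Q_table[s, u + 1] += alpha * (td_target - Q_table[s, u + 1])   # move Q towards the target
```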
  • Using the strategy

The initial state is at the far left with minimum velocity and minimum acceleration, i.e. $s_0 = <-10, -3, -2>$.

```python
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
    
plt.ion()

xt = -10
vt = -3
at = -2
x = np.arange(-10, 11)      # the 1-D world: positions -10 .. 10
y = np.zeros(21)
for i in range(100):
    # greedy action: choose the u with the largest Q-value for the current state
    u = np.argmax(Q_table[((at+2)*7+(vt+3))*21+xt+10])-1
    xt1, vt1, at1= model_update(xt, vt, at, u)
    print(xt, vt, at, u , xt1, vt1, at1)
    xt = xt1
    vt = vt1
    at = at1
    plt.clf()
    plt.plot(x, y, 'b')
    plt.plot(xt,[0], 'or')
    plt.pause(0.1)
    if is_ipython:
        display.clear_output(wait=True)
        display.display(plt.gcf())
```

Steps $(x_t, v_t, a_t, u_t, x_{t+1}, v_{t+1}, a_{t+1})$:

1. (-10, -3, -2, 1, -10, -3, -1)
2. (-10, -3, -1, 1, -10, -3, 0)
3. (-10, -3, 0, 1, -10, -2, 1)
4. (-10, -2, 1, 1, -10, 0, 2)
5. (-10, 0, 2, -1, -9, 1, 1)
6. (-9, 1, 1, 0, -7, 2, 1)
7. (-7, 2, 1, -1, -5, 2, 0)
8. (-5, 2, 0, 0, -3, 2, 0)
9. (-3, 2, 0, 0, -1, 2, 0)
10. (-1, 2, 0, -1, 0, 1, -1)
11. (0, 1, -1, 0, 0, 0, -1)
12. (0, 0, -1, 1, 0, 0, 0)
13. (0, 0, 0, 0, 0, 0, 0)
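As a quick sanity check (reusing `model_update` from the training code above), step 5 of the trajectory can be reproduced directly:

```python
# step 5: from state <-10, 0, 2> with action u = -1 we should reach <-9, 1, 1>
print(model_update(-10, 0, 2, -1))   # (-9, 1, 1)
```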

Animation: the green dot represents Xiao Ming.

The initial state in this test is the worst possible one, so the trajectory takes a few extra steps. If we instead start from the leftmost position with zero initial velocity and zero initial acceleration, the number of steps needed to reach the middle satisfies: acceleration world < velocity world < position world. This of course depends on the ranges chosen for velocity and acceleration. In general, the acceleration world is more flexible and responsive than the velocity world, and in the velocity world Xiao Ming reacts faster than in the position world, where he plods along one silly step at a time.

## Epilogue

This is the end of Xiao Ming's one-dimensional world series: from the one-dimensional position world, to the one-dimensional velocity world, to the one-dimensional acceleration world. The worlds go from easy to hard, the number of states from few to many, and the number of training steps from few to many. All of this is based on Q-learning with a Q-table; if the Q-table is replaced by a neural network with stronger representational power, we can do more complex and interesting things.
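As a rough pointer in that direction (not part of the original series), here is a minimal sketch of a Q-network that could take the place of the Q-table, using PyTorch: the network maps the 3-dimensional state <x, v, a> to one Q-value per action.

```python
import torch
import torch.nn as nn

# a minimal Q-network: 3-dimensional state in, one Q-value per action out
q_net = nn.Sequential(
    nn.Linear(3, 64),
    nn.ReLU(),
    nn.Linear(64, 3),
)

state = torch.tensor([[-10.0, -3.0, -2.0]])    # s0 from the example above
q_values = q_net(state)                        # shape (1, 3): Q-values for u = -1, 0, 1
u = q_values.argmax(dim=1).item() - 1          # greedy action
```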