Introduction to Painless Reinforcement Learning

This series covers the basic ideas of reinforcement learning, the MDP framework, several basic learning algorithms, and some simple worked examples.

As a very important branch of machine learning, reinforcement learning has achieved some surprising results in recent years, which has drawn more and more people into the field. Compared with classic supervised and unsupervised learning, the material of reinforcement learning is not easy, and there are fewer small, fully worked examples to learn from. This series gives readers a brief introduction to the basics of reinforcement learning, with one small example running through all of it.

  • Introduction to Painless Reinforcement Learning: Basic Concepts
  • Introduction to Painless Reinforcement Learning: Formalizing Reinforcement Learning
  • Introduction to Painless Reinforcement Learning: Policy Iteration
  • Introduction to Painless Reinforcement Learning: Value Iteration
  • Introduction to Painless Reinforcement Learning: Generalized Iteration
  • Introduction to Painless Reinforcement Learning: The Monte Carlo Method

7. Temporal Difference Methods

7.1 The variance problem of the Monte Carlo method

In the previous section we introduced the Monte Carlo method, which solves the model-free snake chess (Snakes and Ladders) problem quite well. It still has some imperfections, though, which show up in several ways. Let's look at one of them: the variance of the estimate.

As we saw earlier, the Monte Carlo method runs a series of simulated games to collect return samples from a number of episodes, and then averages those returns to estimate the value function. Theoretically there is nothing wrong with this: its foundation is the law of large numbers from probability theory. The law of large numbers, however, only guarantees convergence; it says nothing about the speed of convergence. Clearly, if the variance of the sampled sequence is large, it will take longer for the average to converge; if the variance is small, convergence will be correspondingly faster.
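To make this point concrete, here is a minimal sketch, not taken from the snake chess code, that compares the running averages of two made-up reward streams with the same mean but different variances; all numbers are purely illustrative.

import numpy as np

# Two reward streams with the same mean (50) but very different spread.
np.random.seed(0)
n = 2000
low_var = np.random.normal(loc=50.0, scale=1.0, size=n)    # small spread
high_var = np.random.normal(loc=50.0, scale=20.0, size=n)  # large spread

steps = np.arange(1, n + 1)
running_mean_low = np.cumsum(low_var) / steps
running_mean_high = np.cumsum(high_var) / steps

# After the same number of samples, the high-variance running average is
# typically much farther from the true mean (50) than the low-variance one.
print('after 100 samples : low-var mean = %.2f, high-var mean = %.2f'
      % (running_mean_low[99], running_mean_high[99]))
print('after 2000 samples: low-var mean = %.2f, high-var mean = %.2f'
      % (running_mean_low[-1], running_mean_high[-1]))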

With this in mind, let's check whether the Monte Carlo method from the previous section has a variance problem. Using the method from the previous section, we record every return collected for state=50 and action=1 and plot all of the values as a histogram, shown in Figure 7-1.

Figure 7-1 Histogram of the returns collected by the every-visit Monte Carlo method

As we can see, the spread of the returns is still quite large, so the estimate does not converge easily.

Of course, the large spread also has other causes. One of them is how samples are counted within an episode. The approach used in the previous section is called "every-visit": every occurrence of a state-action pair in the episode takes part in the calculation, and this complicates the statistics. In snake chess, a position can be passed through twice because of a ladder. In the example of position "50", we did not distinguish between the first and the second visit to "50"; we simply added both returns into the same statistic. Since those two returns are generally different, counting them this way increases the variance and makes convergence harder.

So what can be done? One solution is to replace the "every-visit" method with the "first-visit" method. "First-visit" means that only the first occurrence of a state in an episode is counted, and later occurrences of the same state are ignored. Conceptually, this removes the extra variance caused by the same state appearing several times in one episode. The small sketch below shows the difference between the two counting rules on a hand-made episode; the results of applying "first-visit" to snake chess are shown in Figure 7-2.
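The following minimal sketch illustrates the two counting rules on a single episode. The helper collect_returns, the trajectory, and the rewards are invented purely for illustration and are not part of the book's Snake environment; state 50 appears twice, as it might when a ladder carries the piece back over it.

def collect_returns(episode, first_visit=False, gamma=1.0):
    """episode: list of (state, reward received on that step).
    Returns a dict mapping state -> list of sampled returns."""
    # accumulate the return G backwards, then restore forward order
    g = 0.0
    samples = []
    for state, reward in reversed(episode):
        g = reward + gamma * g
        samples.append((state, g))
    samples.reverse()

    returns = {}
    counted = set()
    for state, g in samples:
        if first_visit and state in counted:
            continue              # first-visit: skip repeated occurrences
        counted.add(state)
        returns.setdefault(state, []).append(g)
    return returns

episode = [(47, -1), (50, -1), (49, -1), (50, -1), (53, 100)]   # made-up trajectory
print(collect_returns(episode))                     # every-visit: 50 -> [97, 99]
print(collect_returns(episode, first_visit=True))   # first-visit: 50 -> [97]

With every-visit, state 50 contributes two different returns to the same average; with first-visit, only the return from its first occurrence is kept.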

Figure 7-2 Histogram of the returns collected by the first-visit Monte Carlo method

Judging from the results, first-visit brings a slight improvement, but not a significant one. Even though each state is now counted at most once per episode, its first visit still happens under very different circumstances in different episodes, so the collected returns still differ widely.

Since the Monte Carlo method leaves something to be desired here, let's look at the next method: Temporal Difference.

7.2 The Temporal Difference family

The Temporal Difference method is a combination of the Monte Carlo method and dynamic programming. In its overall structure it resembles the Monte Carlo method: it also solves the problem by simulating episodes. In its core idea it uses the Bellman equation, the classic formula of reinforcement learning, to update its own estimates iteratively.

The Bellman equation for the state-action value function introduced earlier has the form:

$$q(s,a) = \sum_{s'} p(s' \mid s, a)\left[ R(s') + \sum_{a'} \pi(a' \mid s')\, q(s', a') \right]$$
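As a reference point, here is a minimal sketch of this backup written as code, assuming the full model is known. The array names p, r, pi and q are hypothetical and are not the variable names used in the book's agents.

import numpy as np

def bellman_backup(s, a, p, r, pi, q):
    # q(s,a) = sum_{s'} p(s'|s,a) * [ R(s') + sum_{a'} pi(a'|s') q(s',a') ]
    total = 0.0
    for s_next in range(p.shape[2]):
        follow_on = np.dot(pi[s_next], q[s_next])   # expected value of following pi from s'
        total += p[s, a, s_next] * (r[s_next] + follow_on)
    return total

# toy usage with random numbers, only to show the shapes involved
n_s, n_a = 4, 2
p = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))  # p[s, a, :] sums to 1
r = np.random.randn(n_s)                                 # reward for landing on s'
pi = np.full((n_s, n_a), 1.0 / n_a)                      # a uniform random policy
q = np.zeros((n_s, n_a))                                 # current value estimates
print(bellman_backup(0, 1, p, r, pi, q))

This is exactly the dynamic-programming style of computation: it needs the complete transition model p(s'|s,a), which is precisely what the model-free methods in this chapter try to avoid.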

Then, using Monte Carlo-style sampling, we can turn the formula into:

$$q(s,a) = \frac{1}{N} \sum_{i=1}^{N} \left[ R(s'_i) + q(s'_i, a'_i) \right]$$

In practice, this formula can also be rewritten in the incremental form used for the Monte Carlo method earlier:

$$q_t(s,a) = q_{t-1}(s,a) + \frac{1}{N}\left[ R(s') + q(s', a') - q_{t-1}(s,a) \right]$$
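The incremental form is nothing more than a running average of the sampled targets R(s') + q(s', a'). A tiny sketch with made-up numbers shows that the two forms give the same answer:

targets = [3.0, 5.0, 4.0, 6.0]   # hypothetical sampled values of R(s') + q(s', a')

q_incremental, n = 0.0, 0
for target in targets:
    n += 1
    q_incremental += (target - q_incremental) / n   # q_t = q_{t-1} + (target - q_{t-1}) / n

q_batch = sum(targets) / len(targets)
print('incremental = %s, batch = %s' % (q_incremental, q_batch))   # both are 4.5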

From the incremental formula above we can see how this method relates to both Monte Carlo and dynamic programming. To implement it, we first build the algorithm known as "SARSA". The name may look a little strange, but it simply lists the five pieces of information the method uses: S (the current state), A (the current action), R (the reward obtained in simulation), S (the next state reached in simulation), and A (the next action chosen in simulation).
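To make the five letters explicit, here is a generic, textbook-style one-step SARSA update. The names q, alpha and gamma are hypothetical; the book's own implementation, which uses a running average instead of a fixed step size, follows in the next section.

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    target = r + gamma * q[s_next][a_next]   # R plus the estimate at (S', A')
    q[s][a] += alpha * (target - q[s][a])    # move Q(S, A) toward the target
    return q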

7.3 On-policy: SARSA

Next, we implement this algorithm. Its implementation is actually fairly simple; here we only show the evaluation part:

for i in range(episode_num):
    env.start()
    state = env.pos
    prev_act = -1
    while True:
        # choose an action for the current state according to the current policy
        act = self.policy_act(state)
        # take one step in the environment; state now holds the next state
        reward, state = env.action(act)
        if prev_act != -1:
            # target: the observed reward plus the current estimate
            # (a finished episode, marked by state == -1, contributes 0)
            return_val = reward + (0 if state == -1 else self.value_q[state][act])
            # incremental-average update of the previously recorded pair
            self.value_n[prev_state][prev_act] += 1
            self.value_q[prev_state][prev_act] += \
                (return_val - self.value_q[prev_state][prev_act]) / \
                self.value_n[prev_state][prev_act]

        # remember this step for the next update
        prev_act = act
        prev_state = state

        if state == -1:   # the episode is over
            break

How well does this code work? Let's also write a small test.

def td_demo():
    np.random.seed(0)
    env = Snake(10, [3, 6])
    # agent trained with the temporal-difference method
    agent = MonteCarloAgent(100, 2, env)
    with timer('Timer Temporal Difference Iter'):
        agent.td_opt()
    print 'return_pi={}'.format(eval(env, agent))
    print agent.policy
    # policy iteration with the full model, for comparison
    agent2 = TableAgent(env.state_transition_table(), env.reward_table())
    with timer('Timer PolicyIter'):
        agent2.policy_iteration()
    print 'return_pi={}'.format(eval(env, agent2))
    print agent2.policy

To test the effect, we increased the number of iterations of the SARSA algorithm to 2000; the final result is:

Timer Temporal Difference Iter COST:3.8632440567
return_pi=80
[0 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0]
policy evaluation proceed 94 iters.
policy evaluation proceed 62 iters.
policy evaluation proceed 46 iters.
Iter 3 rounds converge
Timer PolicyIter COST:0.330899000168
return_pi=84
[0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0]

As the results show, the SARSA algorithm does not perform well enough here; of course, we have not yet added any tricks to improve it, but its result does look worse than Monte Carlo's. So how does it actually compare to Monte Carlo? Let's return to the variance problem discussed above and look at the corresponding histogram for the SARSA algorithm:

Figure 7-3 Histogram of the returns collected by the TD method

As we can see, the values produced by this method are more stable, and their spread is smaller than before.

So why do TD-family methods control variance better? The answer lies in the update formula. The Monte Carlo method uses the exact return, so its value estimate is more accurate, but that return depends on an entire sequence of steps, which fluctuates a lot, so its variance is large. The TD method only looks at a single step and uses its previous estimate for everything beyond it. Before the system as a whole has converged, that estimate is biased, but because only one step is actually sampled, the update target fluctuates much less, and the variance drops accordingly.
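To see the difference in spread concretely, here is a minimal sketch on a made-up one-dimensional random walk (not the snake chess environment). For the same start state, the Monte Carlo target is the full episode return, while the TD target is one step of reward plus a fixed, deliberately rough estimate of the next state.

import numpy as np

np.random.seed(0)

def run_episode(start=0, goal=10):
    """Random walk: step +1 or -1 (floor at 0), reward -1 per step until the goal."""
    path, state, ret = [start], start, 0.0
    while state != goal:
        state = max(0, state + np.random.choice([-1, 1]))
        ret -= 1.0
        path.append(state)
    return path, ret

# a fixed, rough value estimate: v(s) = -(distance to the goal)
v = {s: -(10 - s) for s in range(11)}

mc_targets, td_targets = [], []
for _ in range(2000):
    path, ret = run_episode()
    mc_targets.append(ret)                 # Monte Carlo: the full return
    td_targets.append(-1.0 + v[path[1]])   # TD: one reward + estimate of the next state

print('MC target std: %.2f' % np.std(mc_targets))
print('TD target std: %.2f' % np.std(td_targets))

The TD targets can only take a couple of values, because there are only two possible next states, while the Monte Carlo returns inherit the full randomness of the episode length.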

So we find that the Monte Carlo method and the TD method represent two extremes: one accepts a large variance in pursuit of an unbiased estimate, and the other accepts some bias in order to reduce the variance. This brings us back to a classic problem in machine learning: the trade-off between bias and variance.

So, what other algorithms are there besides SARSA? In the next part of this series we will look at another TD algorithm.

About the author

Feng Chao graduated from the University of Chinese Academy of Sciences and is the director of visual research on the Ape Tutoring research team, and one of the people in charge of its Little Ape photo search product. In 2017 he wrote the book "Deep Learning Easy: Core Algorithms and Visual Practice", which introduces the basic structures of deep learning, the details of model optimization and parameter tuning, and applications in the visual field in a light and humorous style. Since 2016 he has run the Zhihu column "Painless Machine Learning", publishing articles on machine learning and deep learning that have been well received and reprinted by many media outlets, and he has taken part in community technology-sharing events many times.