• Deep Reinforcement Learning with Double Q-Learning

The problem solved?

The Q-learning algorithm has the problem of overestimating action values (because its update target contains a maximization over the action value function). Does such overestimation affect the algorithm's performance? Can we avoid the overestimation problem?

Background

If a constant were added uniformly to all action values, it would hardly matter. The problem is that the overestimation is not uniform: once some action values are overestimated, exploration and exploitation techniques tend to favor those actions, other actions are barely selected at all, and this harms policy learning. What you end up with is a suboptimal policy.

The method adopted?

DQN uses the same value function to select and to evaluate actions, so the authors decouple the two here; the specific formulas are as follows:

Suppose there are two networks: an online network with parameters $\theta_t$ and a target network with parameters $\theta_t^-$. One is used to select the action, i.e. it determines the greedy policy; the other evaluates the value of that action. For ease of comparison with the DQN algorithm, the DQN target is written first:

$$Y_t^{\text{DQN}} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)$$

The target of Double DQN (Double Q-learning with the target network as the second value function) is as follows:

$$Y_t^{\text{DoubleDQN}} \equiv R_{t+1} + \gamma\, Q\big(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t);\ \theta_t^-\big)$$

The main difference is whether action selection and action evaluation in the target use the same network.
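
To make the difference concrete, here is a minimal NumPy sketch of the two target computations (a sketch only: the array names such as `q_next_online` and `q_next_target`, the batch layout, and the omission of terminal-state masking are illustrative assumptions, not the paper's code):

```python
import numpy as np

def dqn_target(rewards, q_next_target, gamma=0.99):
    """DQN target: the target network both selects and evaluates the
    next action, which is the source of the overestimation bias."""
    # q_next_target: (batch, n_actions) values of s_{t+1} under the target network
    return rewards + gamma * q_next_target.max(axis=1)

def double_dqn_target(rewards, q_next_online, q_next_target, gamma=0.99):
    """Double DQN target: the online network selects the greedy action,
    the target network evaluates it."""
    best_actions = q_next_online.argmax(axis=1)          # selection (online network)
    batch_idx = np.arange(len(rewards))
    evaluated = q_next_target[batch_idx, best_actions]   # evaluation (target network)
    return rewards + gamma * evaluated

# Tiny usage example with made-up numbers
rewards = np.array([1.0, 0.0])
q_next_online = np.array([[0.2, 0.9], [0.5, 0.4]])
q_next_target = np.array([[0.3, 0.1], [0.6, 0.2]])
print(dqn_target(rewards, q_next_target))                        # [1.297, 0.594]
print(double_dqn_target(rewards, q_next_online, q_next_target))  # [1.099, 0.594]
```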

The result?

In the experiment, the authors fit polynomials to the true values at sampled states. The original text reads: "The estimate is a d-degree polynomial that is fit to the true values at sampled states, where d = 6 (top and middle rows) or d = 9 (bottom row)." In the figure below, the first and second rows are compared to analyze how universal the overestimation problem is, while the second and third rows analyze the relationship between overestimation and the capacity of the approximating function.

The authors design this environment so that the optimal action value depends only on the current state. In the top row the true value function is $Q_*(s,a) = \sin(s)$, while in the middle and bottom rows it is $Q_*(s,a) = 2\exp(-s^2)$. The left column of the figure shows the approximation of the state action value function; the green dots are the sampled states used during the experiment.

The fit at the sampled states is very good, but the approximation of the value function as a whole is not ideal; in particular, the error to the left of the sampled states is large.

The authors then take the maximum over the estimated action values and compare it with the true maximum; the right-most column shows most clearly that Double DQN alleviates the overestimation problem. Details are as follows:

The above experiment also points out a problem: increasing the flexibility of the approximating function usually gives a better fit at the known data points but larger errors at unknown data points.
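
As a rough illustration of this point, the sketch below (a simplified setup, not the paper's exact experiment: each "action" is a degree-6 polynomial fit to the true value $\sin(s)$ on its own random subset of the sampled states) shows that taking the maximum over the fitted estimates tends to overestimate the true value away from the samples:

```python
import numpy as np

rng = np.random.default_rng(0)
states = np.linspace(-6, 6, 200)        # evaluation grid
true_q = np.sin(states)                 # assumed true value, Q*(s, a) = sin(s)
sample_states = np.arange(-6, 7, 1.0)   # states at which the true values are sampled

n_actions, degree = 10, 6
estimates = []
for _ in range(n_actions):
    # each action's estimate is fit on a different random subset of the samples
    subset = rng.choice(sample_states, size=9, replace=False)
    coeffs = np.polyfit(subset, np.sin(subset), degree)
    estimates.append(np.polyval(coeffs, states))
estimates = np.stack(estimates)         # shape: (n_actions, n_states)

bias = estimates.max(axis=0) - true_q   # error of max_a Q(s, a) against the true value
print("mean bias of the max estimate:", bias.mean())   # typically clearly positive
```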

The above shows that overestimation exists. Does overestimation affect learning the optimal policy?

In fact, it does. The experimental results are as follows:

As can be seen from the bottom two plots in the figure above, as the value estimates (and hence the overestimation) grow, the score decreases; overestimation therefore actually damages the scoring performance of the algorithm.

Published information? Author information?

A 2016 paper from the DeepMind team, published at the AAAI Conference on Artificial Intelligence. The first author is Hado van Hasselt, a research scientist at Google DeepMind who previously worked with Rich Sutton; the co-authors are Arthur Guez and David Silver.

  • Profile: hadovanhasselt.com/about/

Theorems and proofs

Theorem 1

  Theorem 1 statement: Given a state $s$, suppose all its true optimal action values equal the optimal state value, i.e. $Q_*(s,a) = V_*(s)$ for every action $a$. Let $Q_t(s,a)$ be arbitrary action value estimates for this state that are unbiased as a whole, $\sum_a \big(Q_t(s,a) - V_*(s)\big) = 0$, but not all exactly correct, so that $\frac{1}{m}\sum_a \big(Q_t(s,a) - V_*(s)\big)^2 = C$ for some $C > 0$, where $m \ge 2$ is the number of actions available in the current state. Under the above conditions, the following inequality holds:

$$\max_a Q_t(s,a) \ge V_*(s) + \sqrt{\frac{C}{m-1}}$$

The above theorem actually states that even if the action value estimates are correct on average, i.e. $\sum_a \big(Q_t(s,a) - V_*(s)\big) = 0$, any disturbance in the individual estimates will still cause the maximum to be an overestimate, deviating from the true optimal value function.

The figure below shows that this lower bound on the overestimation decreases as the number of available actions increases.

Theorem 1 proof:

Define the error for each action $a$ as $\epsilon_a = Q_t(s,a) - V_*(s)$. Suppose, for contradiction, that there exists a setting of $\{\epsilon_a\}$ with $\max_a \epsilon_a < \sqrt{\frac{C}{m-1}}$. Let $\{\epsilon_i^+\}$ be the set of positive errors, of size $n$, and $\{\epsilon_j^-\}$ the set of strictly negative errors, of size $m-n$ (so that $\{\epsilon_a\} = \{\epsilon_i^+\} \cup \{\epsilon_j^-\}$). If $n = m$, then $\sum_a \epsilon_a = 0$ would force $\epsilon_a = 0$ for all $a$, which contradicts $C > 0$; therefore $n \le m-1$. It follows that $\sum_{i=1}^{n} \epsilon_i^+ \le n \max_i \epsilon_i^+ < n\sqrt{\frac{C}{m-1}}$; using $\sum_a \epsilon_a = 0$ we get $\sum_{j=1}^{m-n} |\epsilon_j^-| = \sum_{i=1}^{n} \epsilon_i^+ < n\sqrt{\frac{C}{m-1}}$, which also means $\max_j |\epsilon_j^-| < n\sqrt{\frac{C}{m-1}}$. Thus, the following bound can be obtained:

$$\sum_{j=1}^{m-n} (\epsilon_j^-)^2 \le \sum_{j=1}^{m-n} |\epsilon_j^-| \cdot \max_j |\epsilon_j^-| < n^2 \frac{C}{m-1}$$

Now we can combine these relations to compute an upper bound on the sum of squares of all $\epsilon_a$:

$$\sum_{a=1}^{m} \epsilon_a^2 = \sum_{i=1}^{n} (\epsilon_i^+)^2 + \sum_{j=1}^{m-n} (\epsilon_j^-)^2 < n\frac{C}{m-1} + n^2\frac{C}{m-1} = \frac{n(n+1)}{m-1}\,C \le mC$$

This contradicts the assumption that $\sum_a \epsilon_a^2 = mC$, so every setting of $\{\epsilon_a\}$ that satisfies the constraints must have $\max_a \epsilon_a \ge \sqrt{\frac{C}{m-1}}$. To check that this lower bound is tight, we can set $\epsilon_a = \sqrt{\frac{C}{m-1}}$ for $a = 1, \dots, m-1$ and $\epsilon_m = -\sqrt{(m-1)C}$. It can be verified that $\sum_a \epsilon_a = 0$ and $\sum_a \epsilon_a^2 = mC$.
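
A quick numerical sanity check of the tight configuration from the proof (a sketch; the values of `m` and `C` below are arbitrary examples):

```python
import numpy as np

m, C = 5, 2.0                       # number of actions, average squared error
eps = np.full(m, np.sqrt(C / (m - 1)))
eps[-1] = -np.sqrt((m - 1) * C)     # the tight configuration from the proof

print(np.isclose(eps.sum(), 0.0))                      # unbiased on average
print(np.isclose((eps ** 2).mean(), C))                # average squared error equals C
print(np.isclose(eps.max(), np.sqrt(C / (m - 1))))     # max error attains the lower bound
```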

Theorem 2

Theorem 2 statement:

Given a state $s$ in which all the true optimal action values are equal, $Q_*(s,a) = V_*(s)$. Assume the estimation errors $Q_t(s,a) - V_*(s)$ are independent and uniformly distributed in $[-1, 1]$. Then:

$$\mathbb{E}\Big[\max_a Q_t(s,a) - V_*(s)\Big] = \frac{m-1}{m+1}$$

Theorem 2 proof:

Define $\epsilon_a = Q_t(s,a) - V_*(s)$; this is a uniform random variable on $[-1, 1]$. The probability that $\max_a \epsilon_a \le x$ for some $x$ equals the probability that $\epsilon_a \le x$ for all $a$ simultaneously. Since the errors of the estimator are independent, we can derive:

$$P\Big(\max_a \epsilon_a \le x\Big) = \prod_{a=1}^{m} P(\epsilon_a \le x)$$

The function $P(\epsilon_a \le x)$ is the cumulative distribution function (CDF) of $\epsilon_a$, which here is simply defined as:

$$P(\epsilon_a \le x) = \begin{cases} 0 & \text{if } x \le -1 \\ \frac{1+x}{2} & \text{if } x \in (-1, 1) \\ 1 & \text{if } x \ge 1 \end{cases}$$

This means:

$$P\Big(\max_a \epsilon_a \le x\Big) = \prod_{a=1}^{m} P(\epsilon_a \le x) = \begin{cases} 0 & \text{if } x \le -1 \\ \left(\frac{1+x}{2}\right)^m & \text{if } x \in (-1, 1) \\ 1 & \text{if } x \ge 1 \end{cases}$$

Given a random variable $X$, its expectation can be written in integral form as follows:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx$$

where $f_X$ is the probability density function of the variable, defined as the derivative of its CDF: $f_X(x) = \frac{d}{dx} F_X(x)$. So for $x \in [-1, 1]$ we have $f_{\max}(x) = \frac{m}{2}\left(\frac{1+x}{2}\right)^{m-1}$, and the integral is then evaluated as follows:

$$\mathbb{E}\Big[\max_a \epsilon_a\Big] = \int_{-1}^{1} x\, f_{\max}(x)\, dx = \left[\left(\frac{x+1}{2}\right)^m \frac{mx-1}{m+1}\right]_{-1}^{1} = \frac{m-1}{m+1}$$
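
The closed-form result is easy to check with a short Monte Carlo simulation (a sketch; the choices of $m$ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for m in (2, 4, 10):
    errors = rng.uniform(-1.0, 1.0, size=(200_000, m))  # independent U(-1, 1) errors
    empirical = errors.max(axis=1).mean()                # estimate of E[max_a eps_a]
    print(m, empirical, (m - 1) / (m + 1))               # the two should roughly agree
```
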
References

Earlier work attributed the overestimation problem to insufficiently flexible value function approximation:

  • S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

or to noise:

  • H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613–2621, 2010.
  • H. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011.

My WeChat account is MultiAgent1024. I mainly study deep learning, machine game playing, reinforcement learning, and related topics. I look forward to your attention, and welcome to learn, exchange, and make progress together!