• Deep Recurrent Q-learning for Partially Observable MDPs

The paper first appeared in 2015, but the latest revision is from 2017, and everything below is based on the 2017 version.

  • Link to the paper: arxiv.org/abs/1507.06…

What problem does it solve?

As the title suggests, the authors address the Partially Observable Markov Decision Process (POMDP) setting, in which the agent cannot observe the complete state of the environment.

The work is essentially a revamp of DQN into a Deep Recurrent Q-Network (DRQN). As the old saying goes, every effect has its cause: DQN normally takes only 4 stacked frames of image data as input, which severs the sequential state into short windows. It is therefore hard to account for the influence of states from long ago on the present, yet some problems require exactly that long-term context, as the sketch below illustrates.
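To make the limitation concrete, here is a minimal sketch (my own illustration, not code from the paper) of the usual 4-frame stacking: anything that happened more than four frames ago simply never reaches the network.

```python
from collections import deque

import numpy as np

FRAME_STACK = 4  # DQN's usual window; older frames are forgotten entirely

frames = deque(maxlen=FRAME_STACK)

def reset_stack(first_frame):
    """At episode start, fill the window with copies of the first frame."""
    frames.clear()
    for _ in range(FRAME_STACK):
        frames.append(first_frame)
    return np.stack(frames, axis=0)      # shape: (4, 84, 84)

def step_stack(new_frame):
    """Push the newest frame; the oldest one silently drops off the end."""
    frames.append(new_frame)
    return np.stack(frames, axis=0)
```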

What method does it use?

The authors combine the Long Short-Term Memory (LSTM) network, proposed by Hochreiter and Schmidhuber in 1997, with DQN to handle this partial observability.

The network structure is as follows: DRQN keeps DQN's three convolutional layers but replaces the first fully connected layer with a 512-unit LSTM; a final linear layer outputs one Q-value per action.
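A minimal PyTorch sketch of that architecture (the layer sizes follow the paper; the class and variable names are my own):

```python
import torch.nn as nn

class DRQN(nn.Module):
    """DQN's convolutional trunk followed by an LSTM in place of the first FC layer."""

    def __init__(self, num_actions, hidden_size=512):
        super().__init__()
        # Input is a single 84x84 frame per timestep, not a stack of 4 frames.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64 * 7 * 7, hidden_size=hidden_size,
                            batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, obs, hidden=None):
        # obs: (batch, time, 1, 84, 84)
        b, t = obs.shape[:2]
        feats = self.conv(obs.reshape(b * t, 1, 84, 84)).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)   # the hidden state carries memory
        return self.q_head(out), hidden          # Q-values: (batch, time, num_actions)
```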

Since the network contains an LSTM, the authors consider two update schemes: Bootstrapped Sequential Updates and Bootstrapped Random Updates.

  • Bootstrapped Sequential Updates: an entire episode is replayed from beginning to end, and the LSTM hidden state is carried forward through the whole sequence.
  • Bootstrapped Random Updates: a random segment within an episode is selected, the update runs for a fixed number of timesteps, and the hidden state is zeroed at the start of each update.

The difference between the two schemes is whether the hidden state is cleared before an update. Updating over whole episodes lets the LSTM learn from longer histories, while random segments better match DQN's random sampling of experience. The experimental results of the two methods are very similar; the paper adopts random sampling, which is expected to generalize better. Both schemes are sketched below.
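Here is a rough sketch of how each scheme prepares one training pass; `replay` (a list of episodes), `UNROLL_LENGTH`, and `train_on_sequence` are placeholder names of my own, not from the paper.

```python
import random

UNROLL_LENGTH = 10  # timesteps unrolled per random update (illustrative value)

def sequential_update(drqn, replay, train_on_sequence):
    """Bootstrapped Sequential Updates: replay a whole episode, carrying the
    LSTM hidden state from its first timestep to its last."""
    episode = random.choice(replay)
    hidden = None                             # zero state only at episode start
    train_on_sequence(drqn, episode, hidden)  # hidden evolves over the full episode

def random_update(drqn, replay, train_on_sequence):
    """Bootstrapped Random Updates: sample a random point inside an episode,
    unroll for a fixed number of steps, and zero the hidden state every time."""
    episode = random.choice(replay)
    start = random.randrange(max(1, len(episode) - UNROLL_LENGTH + 1))
    segment = episode[start:start + UNROLL_LENGTH]
    hidden = None                             # cleared at the start of each update
    train_on_sequence(drqn, segment, hidden)
```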

What are the results?

Partially Observable Environment: at each timestep, the game screen is completely obscured with probability 0.5 (the "flickering" setting). Here the author gives two results, one best case and one worst case.
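A minimal sketch of this observation mask (my own illustration, not the paper's code; `frame` is assumed to be a NumPy array holding the preprocessed screen):

```python
import numpy as np

OBSCURE_PROB = 0.5  # probability of blanking the screen at each timestep

def flicker(frame):
    """With probability 0.5, return a blank (all-zero) frame instead of the
    real screen, so the true state is only intermittently observable."""
    if np.random.random() < OBSCURE_PROB:
        return np.zeros_like(frame)
    return frame
```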

The author also raises a question: can a reinforcement learning algorithm trained under the fully observable MDP setting generalize directly to a POMDP? The experimental results are as follows:

The results show that DRQN degrades far more gracefully than DQN as observations are removed. In other words, the LSTM not only lets the network handle POMDPs but also makes it more robust than DQN when observation quality deteriorates.
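As a rough sketch of this kind of evaluation, one could take a policy trained on the fully observable game and measure its score while the screen is hidden with increasing probability; `policy.act`, `policy.reset_hidden`, `env.reset`, and `env.step` are hypothetical helpers, not the paper's code.

```python
import numpy as np

def evaluate_under_flicker(policy, env, obscure_probs=(0.0, 0.25, 0.5, 0.75),
                           episodes=10):
    """Average return of a policy as the screen is obscured more often."""
    scores = {}
    for p in obscure_probs:
        returns = []
        for _ in range(episodes):
            frame, done, total = env.reset(), False, 0.0
            policy.reset_hidden()                 # clear any recurrent memory
            while not done:
                obs = np.zeros_like(frame) if np.random.random() < p else frame
                frame, reward, done = env.step(policy.act(obs))
                total += reward
            returns.append(total)
        scores[p] = float(np.mean(returns))
    return scores  # a robust agent's score should fall off more gracefully
```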

Publication and author information?

This article was published at AAAI in 2015. Matthew Hausknecht, who received his PhD from the University of Texas at Austin, is a senior researcher at Microsoft.

Related work

The author is not the first to pursue this idea. As early as 2007 (reference 1), Wierstra et al. used LSTMs under the partially observable Markov decision framework, but with a policy-gradient method; DRQN instead trains the LSTM jointly with a convolutional neural network, avoiding manual feature extraction.

Reference 2: in 2001, Bakker showed on a cart-pole task that an LSTM handles POMDPs better than a standard RNN.

References

  1. Wierstra, D.; Foerster, A.; Peters, J.; and Schmidhuber, J. 2007. Solving deep memory POMDPs with recurrent policy gradients.
  2. Bakker, B. 2001. Reinforcement learning with long short-term memory. In NIPS, 1475–1482.

My WeChat account is MultiAgent1024. I mainly study deep learning, game AI, reinforcement learning, and related topics. I look forward to your attention, and welcome you to learn and make progress together!