Introduction: How long has it been since you last read an academic paper? Most high-level papers are written in English. Here I have translated the 2013 deep reinforcement learning paper from Google DeepMind that is widely regarded as the milestone of reinforcement learning + deep learning; its follow-up results were published in the top journal Nature in 2015. The idea of the paper is that CNN (convolutional neural network) + Q-learning (a temporal-difference, delayed-reward iterative method from reinforcement learning) = DQN, and that this lets a neural network learn to play Atari mini-games (video games of the 1970s/1980s whose control rules are very simple and straightforward).

The whole idea is extremely simple, and by 2020 it is really just an entry-level exercise for reinforcement learning beginners (perhaps just a few lines of code), since there are already plenty of mature deep learning frameworks, reinforcement learning frameworks, and test environment packages.

Today, let's leave the code aside and try to feel the charm of the paper itself. And put yourself in their shoes: if you were a tech enthusiast in 2013, how shocked and delighted would you have been to see this?

Article: Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with Deep Reinforcement Learning[J]. Computer Science, 2013. DeepMind link: https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning

Use deep reinforcement learning to control Atari games

Prerequisite knowledge

Many programmers may not know much about reinforcement learning algorithms, which are, in essence, automatic control techniques. We can think of reinforcement learning as a black box: we feed information about the environment into the black box (where you are in the game, where the enemies are, the surrounding terrain, etc.), and the black box immediately outputs what you should do (shoot, move 1 meter to the left, move 1 meter to the right, etc.). Reinforcement learning builds on this framework, and there is a lot worth studying: how do we efficiently train the function inside the black box with a small amount of data (reinforcement learning data comes from interaction with the environment, so data is scarce)? How do multiple black boxes communicate and cooperate?

I simply abstracted the parameter iteration of reinforcement learning into the following framework.

for ep in range(epoch):
    # After the environment is initialized we get information about the
    # current environment (obs) and whether the episode is already over,
    # e.g. whether the game has ended (done).
    obs, done = env.reset()
    # The agent interacts with the environment for n_step steps, then
    # updates its parameters once with the collected data, so we keep a
    # step_ counter.
    step_ = 0
    while not done:
        if step_ == n_step:
            # buffer stores historical data; how learn() uses it is the
            # mathematical heart of the problem, with many research
            # results such as TRPO and PPO.
            agent.learn(buffer)
            step_ = 0
        step_ += 1
        # The agent chooses an action based on the current observation.
        action = agent.act(obs)
        # The environment advances to the next frame according to the
        # agent's action.
        obs, reward, done, _ = env.step(action)
        # Store the collected data in the buffer.
        buffer.add(obs, reward, done)

The reward is the value the environment gives the reinforcement learner based on the current situation. For example, if the agent plays badly and dies in a shooter game, the reward is negative, and through algorithm iteration it learns that "I did a bad job earlier; this action doesn't work." Conversely, if the agent plays well and gets many kills, the reward may be positive, and through iteration it learns that "I did a good job in this round."

In this article, we will meet DQN, arguably the first "useful" and "transcendent" general-purpose deep reinforcement learning algorithm. The translation below is my own, done by hand; corrections are welcome. Here we go!

Abstract

We present the first deep reinforcement learning model that can successfully learn control policies directly from high-dimensional sensory input. (Piper's Egg Nest note: note the word "directly"; the image is fed straight into the neural network, as if the agent looked at the screen with human eyes and imitated the whole scene of playing the game.) The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. (Piper's Egg Nest note: the output is the value of each action; if you want optimal control, choose the action with the highest value.) We apply our method to seven Atari 2600 games in the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

1 Introduction

Learning to control agents directly from high-dimensional sensory inputs like vision or speech is one of the long-standing challenges of reinforcement learning. Most successful reinforcement learning applications in this domain rely on hand-crafted features combined with linear value functions or policy representations. In short, the performance of such systems depends heavily on the quality of the feature design. (Piper's Egg Nest note: hand-crafted feature extraction means a different design for every problem, which is not very "intelligent".)

Recent advances in deep learning have made it possible to extract high-level features directly from raw sensory data, leading to breakthroughs in computer vision and speech recognition. These methods employ a range of neural network architectures, including convolutional neural networks, multilayer perceptrons, restricted Boltzmann machines and recurrent neural networks, and have exploited both supervised and unsupervised learning. Following this line of thinking, it is natural to ask whether similar techniques could connect reinforcement learning with sensory data.

However, combining reinforcement learning with deep learning poses several challenges. First, most successful deep learning applications to date have required large amounts of hand-labelled training data. Reinforcement learning algorithms, on the other hand, must learn from a scalar reward signal that is often sparse, noisy and delayed; the delay between an action and its consequences can be thousands of steps long. With such delays it is difficult to associate inputs with targets in the way supervised learning does. Another issue is that most deep learning algorithms assume the data samples are independent, whereas in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in reinforcement learning the data distribution changes as the algorithm learns new behaviours, which conflicts with deep learning methods that assume a fixed underlying distribution.

This paper demonstrates that a convolutional neural network can overcome these challenges and learn successful control policies from raw video data in complex reinforcement learning environments. The network is trained with a variant of Q-learning, using stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism that randomly samples previous transitions, thereby smoothing the training distribution over many past behaviours. (Piper's Egg Nest note: almost all deep reinforcement learning methods today use an experience pool; otherwise data is wasted or under-utilized.)

We test our approach on a large number of Atari 2600 games in the Arcade Learning Environment. The Atari 2600 is a challenging reinforcement learning testbed, presenting agents with high-dimensional visual input (210×160 RGB video at 60 Hz) and a diverse, interesting set of tasks that are difficult even for human players. Our goal is to create a single neural network agent that can successfully learn to play as many of the games as possible. We do not provide the network with any game-specific information or hand-designed features, nor is it given access to the internal state of the emulator. The only information available for learning is the video input, the reward, the terminal signal and the set of possible actions — the same information available to a human player. In addition, the network architecture and all hyperparameters used for training are kept constant across the games. So far the network has outperformed all previous reinforcement learning algorithms on six of the seven games we have tried, and surpassed an expert human player on three of them. Figure 1 shows screenshots of five of the games used for training.

2 Background

We consider tasks in which an agent interacts with an environment $\xi$, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action $a_t$ from the set of legal game actions $A = \{1, \ldots, K\}$. The action is passed to the emulator, which modifies its internal state and the game score. In general the environment $\xi$ may be stochastic. The emulator's internal state is not observed by the agent; instead it observes an image $x_t \in \mathbb{R}^d$ from the emulator, a vector of raw pixel values representing the current screen. It is worth noting that, in general, the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.

Since the agent only observes the image of the current screen, the task is partially observed and many emulator states are perceptually aliased: it is impossible to fully understand the current situation from the current screen $x_t$ alone. We therefore consider sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and learn game strategies that depend on these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for Markov decision processes simply by using the complete sequence $s_t$ as the state representation at time $t$.

The agent interacts with the environment and selects actions in a way that maximises future rewards. We make the standard assumption that future rewards are discounted by a factor $\gamma$ per time-step, and define the discounted return at time $t$ as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the time-step at which the game terminates. We define the optimal action-value function $Q^\ast(s,a)$ as the maximum expected return achievable under any policy, after observing some sequence $s$ and then taking some action $a$: $Q^\ast(s,a) = \max_\pi E\left[R_t \mid s_t = s, a_t = a, \pi\right]$, where $\pi$ is a policy mapping sequences to actions (or to distributions over actions).
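To make the discounted return concrete, here is a minimal Python sketch of the $R_t$ defined above; the function name and the example rewards are mine, not the paper's.

def discounted_return(rewards, gamma=0.99):
    # rewards holds r_t, r_{t+1}, ..., r_T for the remainder of one episode.
    R = 0.0
    # Iterating backwards accumulates gamma^{t'-t} * r_{t'} without explicit powers.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example: a single reward of 1 arriving two steps in the future.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.99))  # 0.9801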

The optimal action-value function obeys an important identity known as the Bellman equation. The intuition is as follows: if the optimal value $Q^\ast(s',a')$ of the next sequence $s'$ were known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ that maximises the expected value of $r + \gamma Q^\ast(s',a')$,


$$Q^\ast(s,a) = E_{s'\sim\xi}\left[\,r + \gamma \max_{a'} Q^\ast(s',a') \,\middle|\, s, a\,\right]$$

The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, $Q_{i+1}(s,a) = E\left[\,r + \gamma \max_{a'} Q_i(s',a') \,\middle|\, s, a\,\right]$. (Piper's Egg Nest note: what I want to show is that the iterative form of reinforcement learning is set up this way because it has a solid mathematical basis proving that it works.) Such value-iteration algorithms converge to the optimal action-value function, $Q_i \rightarrow Q^\ast$ as $i \rightarrow \infty$. In practice this basic approach is completely impractical, because the action-value function is estimated separately for each sequence, without any generalisation. Instead, it is common to use a function approximator to estimate the action-value function, $Q(s,a;\theta) \approx Q^\ast(s,a)$. In the reinforcement learning community this is typically a linear function approximator, but sometimes a non-linear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights $\theta$ as a Q-network. A Q-network can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,


$$L_i(\theta_i) = E_{s,a\sim\rho(\cdot)}\left[\left(y_i - Q(s,a;\theta_i)\right)^2\right]$$

where $y_i = E_{s'\sim\xi}\left[\,r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \,\middle|\, s, a\,\right]$ is the target for iteration $i$, and $\rho(s,a)$ is a probability distribution over sequences $s$ and actions $a$, which we refer to as the behaviour distribution. The parameters from the previous iteration, $\theta_{i-1}$, are held fixed while the loss function $L_i(\theta_i)$ is optimised. Note that the targets depend on the network weights; this is in contrast to the targets used in supervised learning, which are fixed before learning begins. Differentiating the loss function with respect to the weights, we arrive at the following gradient,


$$\nabla_{\theta_i} L_i(\theta_i) = E_{s,a\sim\rho(\cdot);\, s'\sim\xi}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right]$$

Rather than computing the full expectation in the gradient above, it is often computationally convenient to optimise the loss function by stochastic gradient descent. If the weights are updated after every time-step, and the expectations are replaced by single samples from the behaviour distribution $\rho$ and the emulator $\xi$, we arrive at the familiar Q-learning algorithm. (Piper's Egg Nest note: this formula is not too hard to understand if you are familiar with Q-learning; in recent years researchers have come up with more efficient iterative formulations, see TRPO, PPO2, etc.)
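As an illustration, here is a minimal PyTorch sketch of that single-sample Q-learning update; the tiny MLP stands in for the Q-network, and all names, sizes and the SGD learning rate are my own illustrative choices, not the paper's code.

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def q_learning_step(s, a, r, s_next, done):
    # Target y = r + gamma * max_a' Q(s', a'); treated as a constant, hence no_grad.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_net(s_next).max()
    q_sa = q_net(s)[a]           # Q(s, a; theta_i)
    loss = (y - q_sa) ** 2       # squared TD error from the loss L_i above
    optimizer.zero_grad()
    loss.backward()              # gradient of the loss w.r.t. theta_i
    optimizer.step()
    return loss.item()

# Example with dummy data for a single transition.
s, s_next = torch.randn(obs_dim), torch.randn(obs_dim)
q_learning_step(s, a=1, r=1.0, s_next=s_next, done=0.0)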

It is worth noting that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator $\xi$, without explicitly constructing an estimate of $\xi$. It is also off-policy: it learns about the greedy strategy $a = \max_a Q(s,a;\theta)$ while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is chosen by an $\epsilon$-greedy strategy that follows the greedy strategy with probability $1-\epsilon$ and selects a random action with probability $\epsilon$. (Piper's Egg Nest note: this is also one of the basic ideas of reinforcement learning. In plain words, we can't just try one thing, get a taste of it, and then commit to that direction. At first we should be reckless, try everything, and find the best direction to learn in; later on we become more conservative and explore less.)
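A minimal sketch of $\epsilon$-greedy action selection, assuming a q_values array with one entry per action (the function name is illustrative, not the paper's code).

import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon explore: pick a uniformly random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action with the highest estimated value.
    return int(np.argmax(q_values))

# Example: usually picks action 2, but explores 10% of the time.
print(epsilon_greedy(np.array([0.1, 0.5, 0.9]), epsilon=0.1))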

3 Related work

Perhaps the best-known success story of reinforcement learning is TD-Gammon, a backgammon-playing program that learned entirely by reinforcement learning and self-play, and reached a superhuman level of play. TD-Gammon used a model-free reinforcement learning algorithm similar to Q-learning, and approximated the value function with a multilayer perceptron containing one hidden layer (in fact TD-Gammon approximated the state value function $V(s)$ rather than the action-value function $Q(s,a)$, and learned on-policy from self-play). (Piper's Egg Nest note: this is the literature review section of the paper; you can see that many of the ideas in this paper are not original, but stand on the shoulders of giants.)

However, early attempts to follow up on TD-Gammon by applying the same method to chess, Go and checkers were less successful. This led to a widespread belief that TD-Gammon was a special case that only works in backgammon, perhaps because the randomness of the dice rolls helps explore the state space and makes the value function particularly smooth.

Furthermore, it was shown that combining model-free reinforcement learning algorithms such as Q-learning with non-linear function approximators, or using off-policy learning, could cause the Q-network to diverge. As a result, the majority of subsequent reinforcement learning research focused on linear function approximators with better convergence guarantees.

More recently, there has been a revival of interest in combining deep learning with reinforcement learning. Deep neural networks have been used to estimate the environment $\xi$, restricted Boltzmann machines have been used to estimate the value function or the policy, and the divergence issues of Q-learning have been partially addressed by gradient temporal-difference methods. These methods are proven to converge when a non-linear function approximator is used to evaluate a fixed policy, or when a linear function approximator is used for control with a restricted Q-learning-style iteration. However, these methods have not yet been extended to non-linear control.

Perhaps the most similar prior work to our own approach is neural fitted Q-learning (NFQ). NFQ optimises the sequence of loss functions in Equation 2, using the RPROP algorithm to update the parameters of the Q-network. However, it uses a batch update whose computational cost per iteration is proportional to the size of the data set, whereas we use stochastic gradient updates that have a low constant cost per iteration and scale to large data sets. NFQ has also been successfully applied to simple real-world control tasks using purely visual input, by first learning a low-dimensional representation of the task with deep autoencoders and then applying NFQ to that representation. In contrast, our approach applies reinforcement learning end-to-end, directly from the visual inputs; as a result it can learn features that are directly relevant to discriminating action values. Q-learning has also previously been combined with experience replay and a simple neural network, but again starting from a low-dimensional state input rather than raw images.

The use of the Atari 2600 emulator as a reinforcement learning platform was introduced by [Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013], who applied standard reinforcement learning algorithms with linear function approximation and generic visual features. Results were subsequently improved by using a larger number of features and by randomly hashing the features into a lower-dimensional space. The HyperNEAT evolutionary architecture has also been applied to the Atari platform, where it was used to train (separately, for each game) a neural network representing a strategy for that game. When trained repeatedly on deterministic sequences generated by fixed game settings, these strategies were able to exploit design flaws in several Atari games.

4 Deep reinforcement learning

Recent breakthroughs in computer vision and speech recognition have relied on training deep neural networks efficiently on very large data sets. The most successful approaches train directly on the raw inputs, using lightweight updates based on stochastic gradient descent. Given enough data, deep neural networks often learn better representations than hand-crafted features. These successes motivate our approach to reinforcement learning: our goal is to connect a reinforcement learning algorithm to a deep neural network that takes RGB images as direct input and processes the training data efficiently with stochastic gradient updates.

Tesauro's TD-Gammon architecture provided a starting point for our work. That architecture updates the parameters of the network estimating the value function directly from on-policy samples of experience, $s_t, a_t, r_t, s_{t+1}, a_{t+1}$, drawn from the algorithm's interactions with the environment (or, in backgammon, from self-play). Since this approach was already able to outperform the best human backgammon players 20 years ago, it is natural to wonder whether two decades of hardware improvements, coupled with modern deep neural network architectures and scalable RL algorithms, might produce significant progress.

In contrast to TD-Gammon and similar online approaches, we use a technique known as experience replay. We store the agent's experience at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data set $D = e_1, \ldots, e_N$, pooling many episodes of experience into a replay memory. In the inner loop of the algorithm, we draw samples of experience $e \sim D$ at random from the replay memory and apply Q-learning updates, or minibatch updates. After performing experience replay, the agent selects and executes an action according to an $\epsilon$-greedy policy. Since using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed-length representation of histories produced by a function $\phi$. The full algorithm, which we call deep Q-learning, is presented in Algorithm 1.
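A minimal sketch of such a replay memory, assuming plain Python with a fixed capacity N (the class and method names are mine, not the paper's code).

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once full,
        # matching the "last N transitions" behaviour described in the text.
        self.memory = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks up the correlation between
        # consecutive transitions.
        return random.sample(self.memory, batch_size)

# Usage: buffer = ReplayBuffer(capacity=1_000_000); buffer.add(...); batch = buffer.sample(32)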

This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which makes the data more efficient. Second, learning directly from consecutive samples is inefficient, because the samples are highly correlated; randomising the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data samples that the parameters are trained on. For example, if the maximising action is to move left, the training samples will be dominated by samples from the left-hand side; if the maximising action then switches to the right, the training distribution will also switch. It is easy to see how unwanted feedback loops may arise in this way, with the parameters getting stuck in a poor local minimum or even diverging catastrophically. By using experience replay, the behaviour distribution is averaged over many of its previous states, which smooths learning and avoids oscillation or divergence of the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different from those used to generate the samples), which motivates the choice of Q-learning. (Piper's Egg Nest note: off-policy means we do not immediately train on the data we have just produced; instead, even after our habits have improved a little, we still train on data generated by older habits.)

In practice, our algorithm only stores the experience tuples of the last N steps in the replay memory, and samples uniformly at random from D when performing updates. This approach is limited in some respects, since the memory does not differentiate important transitions and, because of its finite size N, always overwrites the oldest transitions with the most recent ones. Similarly, uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasise the transitions from which we can learn the most, similar to prioritized sweeping.

4.1 Preprocessing and model architecture

Working directly with raw Atari frames, which are 210×160 pixel images with a 128-colour palette, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality. The raw frames are first converted from their RGB representation to grayscale and down-sampled to a 110×84 image; the final input is an 84×84 crop of this image that roughly captures the playing area. We need the final cropping stage only because we use the 2D convolutions from [Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106-1114, 2012], which expect square inputs. For the experiments in this paper, the function $\phi$ from Algorithm 1 applies this preprocessing to the last four frames of a history and stacks them to produce the input to the $Q$-function.
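A minimal sketch of this preprocessing using NumPy and Pillow; the exact crop offset is my assumption (the paper only says the 84×84 crop roughly captures the playing area), and the function names are illustrative.

import numpy as np
from PIL import Image

def preprocess(frame_rgb):
    # frame_rgb: uint8 array of shape (210, 160, 3), one raw Atari frame.
    gray = frame_rgb.mean(axis=2).astype(np.uint8)    # RGB -> grayscale
    small = Image.fromarray(gray).resize((84, 110))   # down-sample to 110x84 (height x width)
    small = np.asarray(small)
    return small[18:102, :]                           # crop an 84x84 playing region (offset assumed)

def phi(last_four_frames):
    # Preprocess the last 4 frames of a history and stack them into an 84x84x4 input.
    return np.stack([preprocess(f) for f in last_four_frames], axis=-1)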

There are several possible ways of parameterizing Q with a neural network. Since Q maps history-action pairs to scalar estimates of their value, the history and the action could both be used as inputs to the network, and some previous approaches have done so. The main drawback of that architecture is that a separate forward pass is required to compute the value of each action, so the cost scales with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the network; each output corresponds to the predicted value of a single action for the input state. The main advantage of this architecture is the ability to compute the values of all possible actions in a given state with a single forward pass through the network.

We now describe the exact architecture used for all of the Atari games. The input to the neural network consists of an 84×84×4 image produced by $\phi$. The first hidden layer convolves 16 filters of size 8×8 with stride 4 over the input image and applies a rectifier nonlinearity as the activation function. The second hidden layer convolves 32 filters of size 4×4 with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully connected and consists of 256 rectifier units. The output layer is a fully connected linear layer with one output per valid action; the number of valid actions varies between 4 and 18 depending on the game. We refer to convolutional networks trained with our approach as Deep Q-Networks (DQN).
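A minimal PyTorch sketch of this architecture; the layer sizes follow the text, while the class name and the way the layers are grouped are my own choices, not the paper's code.

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # fully connected hidden layer
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one linear output per action
        )

    def forward(self, x):
        # x: float tensor of shape (batch, 4, 84, 84); returns (batch, n_actions) Q-values.
        return self.head(self.features(x))

q_net = DQN(n_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])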

5 Experiments

So far, we have performed experiments on seven popular Atari games: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, and Space Invaders. We use the same network architecture, learning algorithm and hyperparameter settings across all seven games; our approach is robust enough to work on a variety of games without needing game-specific information. While we evaluate our agents on the real, unmodified games, we make one change to the reward structure of the games, and only during training. Since the scale of scores varies greatly from game to game, we fix all positive rewards to 1, all negative rewards to −1, and leave 0 rewards unchanged. Clipping the rewards in this way limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent, since it cannot differentiate between rewards of different magnitude.
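The reward clipping described above amounts to taking the sign of the raw score change; a one-line sketch (the function name is mine).

import numpy as np

def clip_reward(raw_reward):
    # Maps any positive score change to 1, any negative one to -1, and keeps 0.
    return float(np.sign(raw_reward))

print(clip_reward(250.0), clip_reward(-75.0), clip_reward(0.0))  # 1.0 -1.0 0.0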

In these experiments, we used the RMSProp optimisation algorithm with minibatches of size 32. The behaviour policy during training was $\epsilon$-greedy, with $\epsilon$ annealed linearly from 1 to 0.1 over the first one million steps and fixed at 0.1 thereafter. We trained for a total of 10 million steps and used a replay memory of one million most recent transitions.
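A minimal sketch of that linear annealing schedule for $\epsilon$ (the function name is illustrative, not from the paper).

def epsilon_at(step, eps_start=1.0, eps_final=0.1, anneal_steps=1_000_000):
    if step >= anneal_steps:
        return eps_final
    # Linearly interpolate between eps_start and eps_final.
    frac = step / anneal_steps
    return eps_start + frac * (eps_final - eps_start)

print(epsilon_at(0), epsilon_at(500_000), epsilon_at(2_000_000))  # 1.0 0.55 0.1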

Following previous approaches to playing Atari games, we also use a simple frame-skipping technique. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last chosen action is repeated on the skipped frames. This technique exploits the fact that running the emulator forward for one step requires much less computation than having the agent select an action, so it allows the agent to play roughly k times more of the game in the same computing time. We set k=4 for all games except Space Invaders, where we noticed that k=4 makes the lasers invisible because of the period at which they blink. We used k=3 to make the lasers visible, and this is the only hyperparameter that differs between the games.
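A minimal sketch of such a frame-skipping wrapper, assuming a gym-style env whose step(action) returns (obs, reward, done, info); the wrapper class is illustrative, not the paper's code.

class FrameSkip:
    def __init__(self, env, k=4):
        self.env = env
        self.k = k

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # Repeat the chosen action on the k skipped frames, summing the rewards.
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.k):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info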

5.1 Training and stability

In supervised learning, one can easily track the performance of a model during training by evaluating it on the training and test sets. In reinforcement learning, however, accurately evaluating an agent during training is challenging. Following [Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013], our evaluation metric is the total reward the agent collects in an episode, averaged over a number of games, computed periodically during training. The average total reward is a very noisy metric, because a small change in the weights of a policy can lead to a large change in the distribution of states it visits. The two leftmost plots in Figure 2 show how the average total reward evolves during training on Seaquest and Breakout. Both averaged reward curves are indeed very noisy, giving the impression that the learning algorithm may not be making steady progress. A more stable metric is the policy's estimated action-value function $Q$, which estimates how much discounted reward the agent can obtain by following its policy from any given state. We collect a fixed set of states by running a random policy before training starts and, as training proceeds, track the average of the maximum predicted value over those states. As the two rightmost plots in Figure 2 show, the average predicted value increases much more smoothly than the average total reward; the same holds for the other five games. In addition to these relatively smooth value curves, we did not experience any divergence issues in any of our experiments. This suggests that, despite the lack of theoretical convergence guarantees, our method can stably train a large neural network using a reinforcement learning signal and stochastic gradient descent.
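A minimal sketch of that "average maximum predicted Q" metric, assuming a q_net like the PyTorch sketch above and a fixed batch of held-out states collected with a random policy (names are illustrative).

import torch

def average_max_q(q_net, held_out_states):
    # held_out_states: float tensor of shape (num_states, 4, 84, 84).
    with torch.no_grad():
        q_values = q_net(held_out_states)              # (num_states, n_actions)
        return q_values.max(dim=1).values.mean().item()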

5.2 Value function visualization

Figure 3 shows a visualization of the learned value function on the game Seaquest. The figure shows that the estimated value jumps when an enemy appears on the left of the screen (point A). The agent then fires a torpedo at the enemy, and the estimated value peaks (point B). Finally, the value falls sharply once the enemy disappears (point C). Figure 3 demonstrates that our method is able to learn how the value function evolves for a reasonably complex sequence of events.

5.3 Main evaluation

We compare our results with the best performing methods from the reinforcement learning literature. The method labelled Sarsa used the Sarsa algorithm to learn linear policies on several different hand-engineered feature sets for the Atari task, and we report the best performing feature set. Contingency used the same basic approach as Sarsa but augmented the feature sets with a learned representation of the parts of the screen under the agent's control. Note that both of these methods incorporate significant prior knowledge about the visual problem by using background preprocessing and treating each of the 128 colours as a separate channel. Since many Atari games use one distinct colour for each type of object, treating each colour as a separate channel is similar to producing a separate binary map encoding the presence of each object type. In contrast, our agents only receive the raw RGB screenshots as input and must learn to detect objects on their own.

In addition to the learned agents, we also report the scores of a human expert and of a policy that selects actions uniformly at random. The human performance is the median reward achieved after around two hours of playing each game. Note that our reported human scores are much higher than those in Bellemare et al. For our trained agents, we follow the evaluation strategy used in Bellemare et al. and report the average score obtained by running an $\epsilon$-greedy policy with $\epsilon=0.05$ for a fixed number of steps. The first five rows of Table 1 show the per-game average scores on all games. Our approach (labelled DQN) outperforms the other learning methods by a substantial margin on all seven games, despite incorporating almost no prior knowledge about the inputs.

We also include a comparison to the evolutionary policy search approach from [Matthew Hausknecht, Risto Miikkulainen, and Peter Stone. A neuro-evolution approach to general Atari game playing. 2013], for which we report two sets of results. HNeat Best reflects the results obtained with a hand-engineered detector of object positions and object types. HNeat Pixel reflects the results obtained by classifying objects using eight colour channels.

These approaches rely heavily on finding a deterministic sequence that successfully exploits a design flaw, and strategies learned this way are unlikely to generalise to random perturbations; we therefore report only their results on the highest-scoring single episode. In contrast, our algorithm is evaluated on $\epsilon$-greedy control sequences and must therefore generalise across a wide variety of possible situations. In the end we find that, with the exception of Space Invaders, our algorithm achieves the best results on both the best single-episode scores (row 8) and the average results (row 4). (When you're on a roll, you don't have to claim you did great on every single experiment; DQN simply wasn't the best at Space Invaders.)

Finally, our algorithm outperformed a human expert on Breakout, Enduro and Pong, and came close to human performance on Beam Rider. On Q*bert, Seaquest and Space Invaders it is far worse than humans; these games are more challenging because they require the network to find a strategy that extends over long time scales.

6 Conclusion

This paper introduced a new deep learning model for reinforcement learning and demonstrated its ability to master difficult control policies for Atari 2600 video games using only raw pixels as input. We also presented a variant of online Q-learning that combines stochastic minibatch updates with an experience replay memory to lighten the training of deep networks for reinforcement learning. Without adjusting any hyperparameters, our method produced state-of-the-art results on six of the seven games it was tested on.

Piper’s Egg Nest: An interesting and juicy original piece on technology