This article was first published on: Walker AI

Q-learning is an off-policy temporal-difference learning method. DQN approximates the value function in Q-learning with a neural network and adds improvements for practical problems. DDPG can be regarded as an extension of DQN to continuous action spaces. In this article, DQN and DDPG are compared and analyzed starting from their definitions, in order to better understand the differences and connections between them.

This article first introduces the common concepts involved in DQN and DDPG, then works through the DQN algorithm, then moves on to DDPG, and finally summarizes the differences and connections between the two. It is divided into three parts:

(1) Introduction to related concepts

(2) Algorithm analysis of DQN

(3) Algorithm analysis of DDPG

1. Introduction to related concepts

DQN and DDPG deal with different problems: DQN handles discrete-action problems, while DDPG extends it to continuous-action problems. So first we need to understand the difference between continuous and discrete actions, and how each is implemented in engineering.

1.1 Discrete Action

Put simply, a discrete action is one that can be enumerated as a class, such as up, down, fire, cease-fire, and so on. In practice we represent them with a classification-style activation function on the output layer, such as softmax:

As shown in the figure above, after the input x passes through an arbitrary neural network, the last layer applies the softmax activation function, dividing the network output into N action classes. In this way the network outputs discrete actions.
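For illustration, here is a minimal sketch in PyTorch (the class name, layer sizes, and action count are hypothetical, not from the original text) of a network whose last layer uses softmax so that the output can be read as a choice among N discrete actions:

```python
import torch
import torch.nn as nn

# Minimal sketch: a network whose last layer uses softmax to turn the
# output into a probability distribution over N discrete actions.
class DiscretePolicy(nn.Module):
    def __init__(self, state_dim=4, hidden_dim=64, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, x):
        logits = self.net(x)
        return torch.softmax(logits, dim=-1)  # probabilities over the N actions

policy = DiscretePolicy()
probs = policy(torch.randn(1, 4))      # e.g. [[0.2, 0.3, 0.1, 0.4]]
action = torch.argmax(probs, dim=-1)   # pick one discrete action
```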

1.2 Continuous action

Whereas discrete actions can be enumerated, a continuous action is a continuous value, such as a distance, an angle, or a force, expressed as an exact number. Continuous actions cannot be enumerated into classes, so in practice we represent them with a regression-style activation function on the output layer, such as tanh:

As shown in the figure above, after the input x passes through an arbitrary neural network, the last layer applies the tanh activation function so that the network output is a value within an interval. In this way the network outputs continuous actions.
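Correspondingly, here is a minimal sketch (again with hypothetical names and sizes) of a network whose last layer uses tanh, so the output is a bounded continuous value that can be rescaled to the actual action range:

```python
import torch
import torch.nn as nn

# Minimal sketch: the last layer uses tanh so the output falls in [-1, 1];
# it is then scaled to the real action range [-max_action, max_action].
class ContinuousPolicy(nn.Module):
    def __init__(self, state_dim=3, hidden_dim=64, action_dim=1, max_action=2.0):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),                        # squashes the output into [-1, 1]
        )

    def forward(self, x):
        return self.max_action * self.net(x)  # continuous value in the action range

policy = ContinuousPolicy()
action = policy(torch.randn(1, 3))            # e.g. tensor([[1.37]])
```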

2. DQN

2.1 Problems faced by DQN

DQN approximates the value function in Q-learning with a neural network and adds improvements for practical problems. But we cannot make a naive substitution, for example by simply defining a classification neural network:

then defining a loss function based on the Q-learning update, such as

$$Q(s, a) \leftarrow Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)$$

and optimizing it directly. Such an approach will not work.
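For contrast, the tabular form of this update rule, which is what DQN replaces with a neural-network approximation, can be sketched as follows (the table size and sample values here are hypothetical):

```python
import numpy as np

# Tiny tabular Q-learning update, shown only for contrast with DQN.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    td_target = r + gamma * np.max(Q[s_next])    # r + γ max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])     # move Q(s, a) toward the target

q_update(s=0, a=1, r=1.0, s_next=2)
```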

In practical engineering, implementing DQN runs into many difficulties, the most obvious of which are:

(1) Low sample utilization rate

(2) The values obtained during training are unstable

Problem (1) is common to sequential decision making: because consecutive steps in a trajectory are strongly correlated, a long trajectory can only be counted as one sample, resulting in low sample utilization.

Problem (2) arises because the Q value output by the network is used to select the action, the selected action interacts with the environment to produce a new state, and that state is fed back into the Q network for training. The network's own outputs thus keep serving as the targets for its parameter updates, which makes the Q network's output unstable.

2.2 Solutions

In the face of the above two problems, DQN respectively adopts the following two solutions:

(1) Experience Replay, that is, building a replay buffer to break the correlation in the data; the experience pool is a data set composed of the agent's most recent experiences (a minimal sketch is given after this list).

(2) Freezing Target Networks, that is, fixing the parameters of the target network for a period of time (or for a fixed number of steps) to stabilize the learning target.
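As referenced in item (1), a minimal replay-buffer sketch (the class and method names are hypothetical):

```python
import random
from collections import deque

# Fixed-length pool of (s, a, r, s', done) tuples; training batches are
# sampled at random, breaking the correlation between consecutive transitions.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```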

The functional structure of the entire DQN can be expressed as follows:

First, the data consisting of action, state, reward, and done flag is fed into the Q network, which outputs a predicted value Q_predict. An action is then selected according to this value and sent to the environment for interaction, producing the new state s', which is in turn fed back for training.

At the same time, the result of each interaction with the environment is stored in a fixed-length experience pool. Every C steps, a Target_Q network with the same structure and parameters is copied from the Q network to stabilize the output target. The Target_Q network takes data sampled from the experience pool and computes

$$Q_{target} = r+\gamma \max_{a^{\prime}} Q_{target}\left(s^{\prime}, a^{\prime}\right)$$

where $Q_{target}\left(s^{\prime}, a^{\prime}\right)$ is the output value of the Target_Q network.

The loss function of the whole DQN is simply the mean squared error between the two values Q_predict and Q_target.
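Under these definitions, the target computation and the MSE loss can be sketched as follows (q_net, target_q_net, and the batch tensors are hypothetical names; actions are assumed to be integer indices and done is a 0/1 float tensor):

```python
import torch
import torch.nn.functional as F

# Sketch of the DQN loss: q_net and target_q_net have identical structure,
# and (s, a, r, s_next, done) is a batch sampled from the replay buffer.
def dqn_loss(q_net, target_q_net, s, a, r, s_next, done, gamma=0.99):
    # Q(s, a) for the actions actually taken
    q_predict = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # The frozen target network provides a fixed label
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values
    return F.mse_loss(q_predict, q_target)   # mean squared error of the two values
```

Every C steps the Q network's parameters would be copied into the target network, for example with `target_q_net.load_state_dict(q_net.state_dict())`, which is the Freezing Target Networks step described above.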

The detailed algorithm flow is as follows:

3. DDPG

Having understood the DQN algorithm, DDPG is easy to follow. Essentially the idea behind DDPG has not changed; only the application has. Compared with DQN, DDPG mainly solves the problem of predicting continuous actions. From the introduction above, we can see that the implementation difference between continuous and discrete actions lies only in the choice of the final activation function. Therefore, DDPG makes a few improvements on top of the algorithm it inherits from DQN.

Let us go straight to the algorithm structure diagram:

Comparing this with the algorithm structure diagram of DQN, it is easy to see that DDPG adds a Policy network and a Policy_target network on top of DQN to output a continuous value, and that continuous value is the continuous action. The rest of the idea is much the same as DQN.

The difference is that although the final loss function still computes the mean squared error between Q_predict and Q_target, these two values are now obtained from the outputs of the Policy network and the Policy_target network respectively. Therefore, the loss functions of the two policy networks need to be embedded into Q_predict and Q_target, as shown in the figure above.
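A hedged sketch of how the two policy networks are embedded into Q_predict and Q_target (critic, actor, and their target copies are hypothetical names; the critic is assumed to take a state-action pair and output a single Q value):

```python
import torch
import torch.nn.functional as F

# Sketch: the critic's target uses the action proposed by the target policy,
# and the actor is trained to maximize the critic's Q value.
def ddpg_losses(critic, critic_target, actor, actor_target,
                s, a, r, s_next, done, gamma=0.99):
    # Critic: Q_target is built from the target policy's action at s'
    with torch.no_grad():
        a_next = actor_target(s_next)
        q_target = r + gamma * (1 - done) * critic_target(s_next, a_next).squeeze(1)
    q_predict = critic(s, a).squeeze(1)
    critic_loss = F.mse_loss(q_predict, q_target)   # same MSE form as in DQN

    # Actor: push the policy toward actions the critic rates highly
    actor_loss = -critic(s, actor(s)).mean()
    return critic_loss, actor_loss
```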

Starting from DQN and slightly changing its algorithm, a more detailed DDPG algorithm flow can be obtained:

Having covered how DDPG extends DQN, let us talk about what it inherits. Clearly, DDPG inherits Experience Replay and Freezing Target Networks to solve the same problems.

4. Summary

This article has analyzed the DQN and DDPG algorithms from a comparative perspective, and it can be seen that:

(1) Both use Experience Replay and Freezing Target Networks to address the problems of low sample utilization and unstable target values.

(2) The algorithm structures of DDPG and DQN are very similar and follow the same overall process; DDPG simply adds the Policy-series networks and their operations on top of DQN.

(3) The loss functions of the two are essentially the same, except that DDPG adds policy networks to output continuous action values, so the policy networks' loss must be embedded into the original MSE.

In summary, this article concludes that DDPG is essentially an extension of DQN to continuous actions. The comparison also shows that DDPG and DQN are highly similar as algorithms, not merely, as the DDPG paper states, a derivative of the DPG algorithm.

This concludes our comparative look at the DQN and DDPG algorithms. In the next article we will start from the code and implement both algorithms, so stay tuned!

5. References

[1] Xipeng Qiu, Neural Networks and Deep Learning (NNDL)

[2] Lillicrap et al., Continuous control with deep reinforcement learning

[3] Mnih et al., Playing Atari with Deep Reinforcement Learning


PS: For more technical content, follow the public account [xingzhe_ai] and come discuss with Walker AI!