The last article introduced the actor-critic algorithm in detail, together with a hands-on implementation. This article introduces the DDPG algorithm. DDPG stands for Deep Deterministic Policy Gradient, where PG is the Policy Gradient we introduced earlier; its derivation was discussed in Reinforcement Learning 10. So what is a deterministic policy gradient?

One. Deterministic policy

A deterministic policy stands in contrast to a stochastic policy. With a stochastic policy, the neural network outputs a distribution over actions, and at every step the action must be sampled from that distribution. For high-dimensional continuous action spaces, sampling actions from the distribution at every step consumes a great deal of computation.

Likewise, DQN only applies to low-dimensional, discrete action problems. To apply it to continuous actions, DQN would have to compute the Q value of every possible action, and the number of candidate actions grows exponentially with the number of degrees of freedom, requiring an enormous sample size and amount of computation. A deterministic policy is introduced to simplify this problem.

With a stochastic policy, under the same policy and in the same state, the action taken follows a probability distribution, i.e., it is not certain. A deterministic policy is much simpler: under the same policy and in the same state, the action is uniquely determined:


$$a_t = \mu(s|\theta^\mu)$$
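To make the contrast concrete, here is a minimal NumPy sketch (purely illustrative, not part of the original implementation): a stochastic policy returns a distribution that must be sampled at every step, while a deterministic policy maps the same state to the same action.

```python
import numpy as np

def stochastic_policy(state, theta):
    """Toy stochastic policy: output the mean/std of a Gaussian and sample from it."""
    mean = theta["w_mu"] @ state          # hypothetical linear "network"
    std = np.exp(theta["log_std"])        # learned log standard deviation
    return np.random.normal(mean, std)    # the action must be sampled every step

def deterministic_policy(state, theta):
    """Toy deterministic policy: the same state always maps to the same action."""
    return theta["w_mu"] @ state          # a_t = mu(s | theta), no sampling needed
```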

Two. DDPG

The first thing to note is that, by name, DDPG sounds like a policy gradient (PG) algorithm, but it is actually closer to DQN. Put differently, DDPG is an algorithm that uses the Actor-Critic architecture to solve the problem that DQN cannot handle continuous action control. Why is that?

1. From Q-Learning to DQN

Let's first recall the flow of the Q-learning algorithm; Q-learning was introduced in detail in Reinforcement Learning 4 — Temporal Difference Control Algorithms. We know that, starting from state $S_t$, we select action $A_t$ with the $\epsilon$-greedy method and execute it, entering state $S_{t+1}$ and receiving reward $R_t$.



(Figure: the Q-learning update over a transition $<s, a, r, s'>$, with the candidate actions $a'$ shown as black circles at the bottom.)


$$A' = \max_{a'} Q(S', a')$$

That is, the action $a'$ that maximizes $Q(S_{t+1}, a)$ is selected and used to update the value function. In the figure above, this corresponds to choosing, among the three black-circle actions at the bottom, the one that maximizes $Q(S', a')$ as $a'$.
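For reference, the Q-learning update described above can be written compactly as (with learning rate $\alpha$):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_t + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$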

Since Q-learning uses a Q table to store all the action values of every state, it cannot cope with continuous state spaces or very large numbers of states. We therefore turn to function approximation and replace the Q table with a neural network, while the rest of the procedure stays unchanged; this gives the DQN algorithm. DQN has been introduced in detail in Reinforcement Learning 7 — DQN Algorithm; the following is a brief recap of the DQN procedure:

As can be seen, DQN replaces the Q table with a neural network. The loss function is the gap between the network's current output and the target, and the network parameters are updated by differentiating this loss.
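Written out, the loss just described is the squared gap between the TD target and the current network output; using $w'$ for the target network parameters:

$$L(w) = \left(r + \gamma \max_{a'}\hat{q}(s', a', w') - \hat{q}(s, a, w)\right)^2$$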

2. From DQN to DDPG

In DQN, the target value $r + \gamma \max_{a'} \hat{q}(s', a', w)$ can be computed because the Q value of every action can be evaluated (which also yields the action that maximizes the Q value). However, when the action space is continuous, we cannot enumerate all possibilities, and that is why DQN cannot handle continuous action control tasks.

So how can this be solved? We know that DQN uses a neural network to solve the continuous state space problem that Q-learning cannot handle. Can we likewise use a neural network in place of $\max_a Q(s,a)$? Of course we can, and this is exactly what DDPG does, replacing the maximization with a function:


$$\max_a Q(s,a) \approx Q(s, \mu(s|\theta))$$

Here $\theta$ denotes the parameters of the policy network. Relating this to the AC algorithm discussed before, the policy network plays the role of the Actor and the value network from DQN plays the role of the Critic (note that the Critic estimates Q values here, not V values, which should not be confused with the AC algorithm). The Actor is not updated with the reward-weighted gradient of the PG algorithm; instead, the Critic network supplies the gradient of the Q value with respect to the action, which is used to update the Actor.
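Concretely, this leads to the deterministic policy gradient: the Critic's gradient with respect to the action is chained through the Actor's parameters, which is what the Actor update code later in this article implements:

$$\nabla_\theta J \approx \frac{1}{m}\sum_{i=1}^m \nabla_a Q(s_i, a, w)\big|_{a=\mu(s_i|\theta)}\, \nabla_\theta \mu(s_i|\theta)$$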

Since DDPG descends from DQN, it naturally also has experience replay and double networks. A target network is added on the Critic side to compute the target Q value, and since computing that target also requires selecting an action, and we already have an Actor policy network, an Actor target network is added in the same way. The DDPG algorithm therefore consists of four networks: the Actor current network, the Actor target network, the Critic current network, and the Critic target network. The two Actor networks share the same structure, as do the two Critic networks. The roles of the four networks are as follows:

  • Actor current network: responsible for the iterative update of the policy network parameters $\theta$ and for selecting the current action $A$ according to the current state $S$, which is used to interact with the environment and generate $S'$ and $R$.
  • Actor target network: responsible for selecting the optimal next action $A'$ based on the next state $S'$ sampled from the experience replay pool; its network parameters $\theta'$ are periodically copied from $\theta$.
  • Critic current network: responsible for the iterative update of the value network parameters $w$ and for computing the current Q value $Q(S, A, w)$. The target Q value is $y_i = R + \gamma Q'(S', A', w')$.
  • Critic target network: responsible for computing the $Q'(S', A', w')$ part of the target Q value; its network parameters $w'$ are periodically copied from $w$.

Note: the Critic network has two inputs, the state and the action, which are fed into the Critic together.

It is worth noting that the way DDPG copies the current networks to the target networks differs from the DQN described earlier. In DQN, the parameters of the current Q network are copied directly to the target Q network, i.e. $w' = w$. This kind of update is a hard update; the counterpart is a soft update. DDPG uses soft updates, meaning each parameter is only moved a small step at a time:


$$w' \leftarrow \tau w + (1-\tau)w'$$

$$\theta' \leftarrow \tau \theta + (1-\tau)\theta'$$

$\tau$ is the update coefficient and is usually small. Meanwhile, to add some randomness to the learning process and explore potentially better policies, noise is added to the action via an Ornstein-Uhlenbeck process (OU process), and the final output action $A$ is:


$$A = \pi_\theta(s) + OU_{noise}$$
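For reference, a minimal sketch of an Ornstein-Uhlenbeck noise generator might look like the following (the parameter values are common defaults, not taken from this article's code, which as shown later actually uses Gaussian noise):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that drifts back toward mu."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.ones(action_dim) * mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

# usage: action = actor_output + ou_noise.sample()
```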

For the loss functions, the Critic part is similar to DQN and uses the mean squared error:


$$J(w) = \frac{1}{m}\sum_{i=1}^m \left(y_i - Q(S_i, A_i, w)\right)^2$$

As for the loss function of the Actor, the definition in the original paper is fairly involved, so we take a simpler view here. As we know, the Actor's job is to output an action $A$ that obtains the largest possible Q value when fed into the Critic. The Actor's loss can therefore be understood simply as: the larger the Q value returned, the smaller the loss, and the smaller the Q value returned, the larger the loss. So we just negate the Q value returned by the current Critic network:


$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m Q(s_i, a_i, w)$$

The complete process of DDPG algorithm is as follows:

Three. Code implementation

The code uses the Pendulum-v0 continuous environment and is implemented with TensorFlow 2.0 and TensorLayer.

1. Build a network

Actor network

```python
def get_actor(input_state_shape):
    # state input
    input_layer = tl.layers.Input(input_state_shape)
    # two hidden layers
    layer = tl.layers.Dense(n_units=64, act=tf.nn.relu)(input_layer)
    layer = tl.layers.Dense(n_units=64, act=tf.nn.relu)(layer)
    # tanh squashes the output into [-1.0, 1.0]
    layer = tl.layers.Dense(n_units=action_dim, act=tf.nn.tanh)(layer)
    # scale the action to the range of the environment
    layer = tl.layers.Lambda(lambda x: action_range * x)(layer)
    return tl.models.Model(inputs=input_layer, outputs=layer)
```

`action_range = env.action_space.high` is used to specify the range of actions in the continuous environment.

If the Actor outputs actions outside this range, the program will raise an exception, so a tanh activation is used at the end of the network to map the output into [-1.0, 1.0]. A Lambda layer is then used to scale the actions to the appropriate range.
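For completeness, the quantities used above are typically read from the environment roughly as follows (a sketch assuming the Pendulum-v0 environment used in this article; `state_dim` is a name introduced here for illustration):

```python
import gym

env = gym.make('Pendulum-v0')
state_dim = env.observation_space.shape[0]   # 3 for Pendulum-v0
action_dim = env.action_space.shape[0]       # 1 for Pendulum-v0
action_range = env.action_space.high         # upper bound of the action, 2.0 for Pendulum-v0
# these quantities are then fed into get_actor() / get_critic() below
```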

Critic network

```python
def get_critic(input_state_shape, input_action_shape):
    # the state and the action are both inputs to the Critic
    state_input = tl.layers.Input(input_state_shape)
    action_input = tl.layers.Input(input_action_shape)
    # concatenate state and action along the feature dimension
    layer = tl.layers.Concat(1)([state_input, action_input])
    layer = tl.layers.Dense(n_units=64, act=tf.nn.relu)(layer)
    layer = tl.layers.Dense(n_units=64, act=tf.nn.relu)(layer)
    # single output: the estimated Q(s, a)
    layer = tl.layers.Dense(n_units=1, name='C_out')(layer)
    return tl.models.Model(inputs=[state_input, action_input], outputs=layer)
```

In DDPG both the state and the action are fed into the Critic network to estimate $Q(s,a)$. So we define two input layers, concatenate them, and list both of them as inputs in the model definition.
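Putting the two builders together, the four networks described earlier could be created and synchronized roughly as follows (a sketch; `hard_copy` and `soft_update` are illustrative helper names, not necessarily those used in the original implementation):

```python
import tensorlayer as tl

# current and target networks for both Actor and Critic
actor = get_actor([None, state_dim])
actor_target = get_actor([None, state_dim])
critic = get_critic([None, state_dim], [None, action_dim])
critic_target = get_critic([None, state_dim], [None, action_dim])
actor.train()
critic.train()
actor_target.eval()    # target networks are only used for inference
critic_target.eval()

def hard_copy(src, dst):
    """Initialize a target network with the current network's weights."""
    for w_src, w_dst in zip(src.trainable_weights, dst.trainable_weights):
        w_dst.assign(w_src)

def soft_update(src, dst, tau=0.01):
    """Soft update: w' <- tau * w + (1 - tau) * w'."""
    for w_src, w_dst in zip(src.trainable_weights, dst.trainable_weights):
        w_dst.assign(tau * w_src + (1 - tau) * w_dst)

hard_copy(actor, actor_target)
hard_copy(critic, critic_target)
```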

2. Main process

```python
for episode in range(TRAIN_EPISODES):
    state = env.reset()
    for step in range(MAX_STEPS):
        if RENDER: env.render()
        # Add exploration noise
        action = agent.get_action(state)
        state_, reward, done, info = env.step(action)
        agent.store_transition(state, action, reward, state_)

        if agent.pointer > MEMORY_CAPACITY:
            agent.learn()

        state = state_
        if done: break
```

As you can see, the DDPG flow is basically the same as DQN: reset the state, select an action, interact with the environment to obtain S' and R, and store the data. Once enough data has been collected, a batch is sampled and the network parameters are updated; then S is updated and the next step begins.
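The `store_transition()` call and the `agent.pointer` check above suggest a fixed-size buffer that overwrites old data once full; a minimal sketch of such a replay memory (with a hypothetical flat row layout, not the original code) could look like this:

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size memory storing flattened (s, a, r, s') rows; old rows are overwritten."""
    def __init__(self, capacity, state_dim, action_dim):
        self.capacity = capacity
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.data = np.zeros((capacity, state_dim * 2 + action_dim + 1), dtype=np.float32)
        self.pointer = 0   # total number of transitions stored so far

    def store_transition(self, s, a, r, s_):
        row = np.hstack([s, a, [r], s_])
        self.data[self.pointer % self.capacity] = row
        self.pointer += 1

    def sample(self, batch_size):
        idx = np.random.randint(0, min(self.pointer, self.capacity), size=batch_size)
        batch = self.data[idx]
        states = batch[:, :self.state_dim]
        actions = batch[:, self.state_dim:self.state_dim + self.action_dim]
        rewards = batch[:, self.state_dim + self.action_dim:self.state_dim + self.action_dim + 1]
        states_ = batch[:, -self.state_dim:]
        return states, actions, rewards, states_
```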

Here’s a look at the get_action() function:

```python
def get_action(self, s, greedy=False):
    # forward pass through the Actor to get the deterministic action
    a = self.actor(np.array([s], dtype=np.float32))[0]
    if greedy:
        return a
    # exploration: sample around the Actor's output and clip to the valid range
    return np.clip(
        np.random.normal(a, self.var), -self.action_range, self.action_range)
```

The get_action() function selects an action that is then used to interact with the environment. To explore the environment better, we add noise to the action during training. The original DDPG authors recommended adding time-correlated OU noise, but more recent results show that Gaussian noise performs just as well; because it is simpler, it is now more commonly used.

Here we use the latter and add Gaussian noise to the action: the Actor's output `a` serves as the mean of a normal distribution, and `self.var` is used as its scale (standard deviation) to construct the distribution, from which an action is then sampled. Since a normal distribution places most of its probability mass near the mean, this provides a degree of exploration, and we can control the amount of exploration by adjusting the size of `var`.

At test time, there is no need to explore when selecting an action; the Actor should directly output the action it believes has the maximum Q value. The greedy parameter in get_action() is used to switch between these two cases.
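A test loop would therefore call get_action() with greedy=True, along the lines of this sketch (TEST_EPISODES is an assumed constant, defined like TRAIN_EPISODES in the training loop):

```python
for episode in range(TEST_EPISODES):
    state = env.reset()
    episode_reward = 0
    for step in range(MAX_STEPS):
        env.render()
        # no exploration noise at test time: take the Actor's output directly
        state, reward, done, info = env.step(agent.get_action(state, greedy=True))
        episode_reward += reward
        if done:
            break
    print('Episode {}: reward = {:.2f}'.format(episode, episode_reward))
```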

3. Network update

Critic update

As shown in the figure above, the Critic update is the same as in DQN and uses the TD error: the target is constructed from the target networks, the MSE loss against the Q output of the current network is computed, and the network parameters are updated.

```python
with tf.GradientTape() as tape:
    # the target networks provide the next action and its Q value
    actions_ = self.actor_target(states_)
    q_ = self.critic_target([states_, actions_])
    # TD target: y = r + gamma * Q'(s', a')
    target = rewards + GAMMA * q_
    # Q value predicted by the current Critic
    q_pred = self.critic([states, actions])
    td_error = tf.losses.mean_squared_error(target, q_pred)
critic_grads = tape.gradient(td_error, self.critic.trainable_weights)
self.critic_opt.apply_gradients(zip(critic_grads, self.critic.trainable_weights))
```

Actor update

```python
with tf.GradientTape() as tape:
    actions = self.actor(states)
    q = self.critic([states, actions])
    actor_loss = -tf.reduce_mean(q)  # maximize the q
actor_grads = tape.gradient(actor_loss, self.actor.trainable_weights)
self.actor_opt.apply_gradients(zip(actor_grads, self.actor.trainable_weights))
```

For the Actor, DDPG effectively performs gradient ascent: the Actor's job is to output an action that obtains the largest possible Q value when fed into the Critic network. Since this is the opposite direction to gradient descent, a minus sign is placed in front of the loss function.

The full TensorFlow 2.0 implementation of the DDPG algorithm is available; if it helps you, please give it a star. Thank you very much.

Four. Summary

DDPG builds on the DQN algorithm and, borrowing the AC architecture, introduces an Actor network to solve continuous control problems. It can be regarded as DQN's improved counterpart for continuous problems.

The next article will introduce the improved version of DDPG, the TD3 algorithm.