1. Define the computation graph
1. Define the input and output of the evaluation network
The input to the evaluation network is the state at a given time step, and its output is the Q value (q_eval) of every action that can be selected in that state.
The network's loss is the mean squared error between the target Q value (q_target) and the current Q value (q_eval).
The current Q value q_eval is the output of the evaluation network, while the target Q value has to be computed with the Q-learning update rule, so it is first defined as an input (placeholder) of the computation graph:
```python
self.s = tf.placeholder(dtype="float", shape=[None, 80, 80, 4], name='s')
self.q_target = tf.placeholder(dtype="float", shape=[None, self.n_actions], name='q_target')
```
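The shape [None, 80, 80, 4] suggests that each state is a stack of the four most recent preprocessed 80x80 frames, which is the standard setup for image-based DQN. The original post does not show this preprocessing, so the NumPy sketch below (with hypothetical helper names initial_state / next_state) only illustrates how such a state could be maintained:

```python
import numpy as np

def initial_state(frame):
    """frame: a preprocessed 80x80 grayscale image; repeat it 4 times -> shape (80, 80, 4)."""
    return np.stack([frame] * 4, axis=2)

def next_state(state, new_frame):
    """Drop the oldest frame and append the newest one along the channel axis."""
    return np.append(state[:, :, 1:], new_frame.reshape(80, 80, 1), axis=2)
```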
Our evaluation network is defined as follows:
```python
with tf.variable_scope('eval_net'):
    with tf.variable_scope('output'):
        self.q_eval = self._define_cnn_net(self.s,
                                           c_names=['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES])
    with tf.variable_scope('loss'):
        # Loss: mean squared difference between the target Q value and the predicted Q value
        self.cost = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
        tf.summary.scalar("loss", self.cost)  # monitor this variable in TensorBoard
    with tf.variable_scope('train'):
        self.trainStep = tf.train.AdamOptimizer(self.learn_rate).minimize(self.cost)
```
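The helper self._define_cnn_net is not shown in this section. Below is only a minimal sketch of what it might look like, assuming a small two-convolution network; the actual architecture in the original code may differ. The important detail is that every variable is created with collections=c_names, which is what allows tf.get_collection('eval_net_params') / tf.get_collection('target_net_params') to retrieve the two parameter sets later on.

```python
def _define_cnn_net(self, s, c_names):
    # Hypothetical architecture: conv(8x8x32, stride 4) -> conv(4x4x64, stride 2) -> fully connected.
    w_init = tf.truncated_normal_initializer(stddev=0.01)
    b_init = tf.constant_initializer(0.01)
    with tf.variable_scope('conv1'):
        w1 = tf.get_variable('w1', [8, 8, 4, 32], initializer=w_init, collections=c_names)
        b1 = tf.get_variable('b1', [32], initializer=b_init, collections=c_names)
        conv1 = tf.nn.relu(tf.nn.conv2d(s, w1, strides=[1, 4, 4, 1], padding='SAME') + b1)      # [None, 20, 20, 32]
    with tf.variable_scope('conv2'):
        w2 = tf.get_variable('w2', [4, 4, 32, 64], initializer=w_init, collections=c_names)
        b2 = tf.get_variable('b2', [64], initializer=b_init, collections=c_names)
        conv2 = tf.nn.relu(tf.nn.conv2d(conv1, w2, strides=[1, 2, 2, 1], padding='SAME') + b2)  # [None, 10, 10, 64]
    with tf.variable_scope('fc'):
        flat = tf.reshape(conv2, [-1, 10 * 10 * 64])
        w3 = tf.get_variable('w3', [10 * 10 * 64, self.n_actions], initializer=w_init, collections=c_names)
        b3 = tf.get_variable('b3', [self.n_actions], initializer=b_init, collections=c_names)
        return tf.matmul(flat, w3) + b3  # one Q value per action
```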
2. Define the input and output of the target network
Because Nature DQN uses a double-network mechanism, which reduces the correlation between the Q estimate and its learning target and makes Q-learning more stable, we need to define a target network.
It has the same structure as the evaluation network, but its parameters are not updated by backpropagation, so no loss or training operation needs to be defined for it.
Our target network is defined as follows:
```python
with tf.variable_scope('target_net'):
    with tf.variable_scope('output'):
        self.q_next = self._define_cnn_net(self.s,
                                           c_names=['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES])
```
3. Define parameter update operations for the two networks
Since the parameters of the target network are periodically copied from the evaluation network, we also need to define this parameter replacement operation in the computation graph:
```python
t_params = tf.get_collection('target_net_params')
e_params = tf.get_collection('eval_net_params')
self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]
```
2. Define the learning mechanism
The learning mechanism of the algorithm consists of three steps:
1. Sample a batch from the experience replay pool
2. Compute the target Q values from the sampled batch with the one-step Q-learning formula
3. Feed the target Q values and states into the evaluation network, train it, and periodically update the target network
1. Sample a batch from the experience replay pool
The batch-sampling code is tied to how the experience replay pool is defined; in our case it looks like this:
```python
state_batch = [data[0] for data in minibatch]              # each state has shape [80, 80, 4]
action_batch = [np.argmax(data[1]) for data in minibatch]  # actions are stored one-hot, e.g. up = [1, 0], down = [0, 1]
reward_batch = [data[2] for data in minibatch]
nextState_batch = [data[3] for data in minibatch]          # each next state has shape [80, 80, 4]
terminal_batch = [data[4] for data in minibatch]
```
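The replay pool itself is not shown in this section. The fragments below are only a sketch that matches the tuple layout indexed above (state, one-hot action, reward, next state, terminal flag); the names replay_memory, memory_size and the local variable names are assumptions:

```python
import random
from collections import deque

# Hypothetical replay pool (created in __init__): a bounded deque of transition tuples.
self.replay_memory = deque(maxlen=self.memory_size)

# Storing one transition; the order must match the data[0] .. data[4] indexing above.
self.replay_memory.append((state, action_one_hot, reward, nextState, terminal))

# Drawing the batch that the sampling code above unpacks:
minibatch = random.sample(self.replay_memory, self.batch_size)
```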
2. Compute the target Q values from the sampled batch with the one-step Q-learning formula
(1) Use the target network to obtain the Q values of the successor (next) states:
```python
q_next_batch = self.q_next.eval(feed_dict={self.s: nextState_batch})  # .eval() assumes a default session is active
```
(2) Compute the target Q values for the batch according to the one-step Q-learning formula:
```python
q_eval_batch = self.sess.run(self.q_eval, {self.s: state_batch})
q_target_batch = q_eval_batch.copy()  # same matrix structure as the current Q values; only the entry of the chosen action is overwritten
for i in range(0, self.batch_size):
    terminal = terminal_batch[i]
    if terminal:
        q_target_batch[i, action_batch[i]] = reward_batch[i]  # target Q = current reward
    else:
        # target Q = current reward + discount factor * max successor Q (from the target network)
        q_target_batch[i, action_batch[i]] = reward_batch[i] + self.gamma * np.max(q_next_batch[i])
```
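For reference, the one-step Q-learning target that this loop builds for each sample $i$ of the batch is

$$
y_i =
\begin{cases}
r_i, & \text{if } s'_i \text{ is terminal} \\
r_i + \gamma \max_{a'} Q_{\text{target}}(s'_i, a'), & \text{otherwise,}
\end{cases}
$$

where $Q_{\text{target}}$ is the target network (q_next above) and $\gamma$ is the discount factor self.gamma.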
3. Feed the target Q values and states into the evaluation network, train it, and periodically update the target network
(1) Train and update the evaluation network: the loss computation and backpropagation are performed directly by running the computation graph:
```python
_, cost = self.sess.run([self.trainStep, self.cost],
                        feed_dict={self.s: state_batch,
                                   self.q_target: q_target_batch})
```
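Because the loss was registered with tf.summary.scalar("loss", ...) when the evaluation network was defined, it can also be written to TensorBoard at this point. The original post does not show this wiring, so the snippet below is only a sketch; self.merged, self.writer and the './logs' directory are assumptions:

```python
# Build once, right after the graph is defined:
self.merged = tf.summary.merge_all()
self.writer = tf.summary.FileWriter('./logs', self.sess.graph)

# Then fetch the merged summary together with the training step:
summary, _, cost = self.sess.run([self.merged, self.trainStep, self.cost],
                                 feed_dict={self.s: state_batch,
                                            self.q_target: q_target_batch})
self.writer.add_summary(summary, self.learn_step_counter)
```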
(2) Periodically update the target network: run the parameter replacement operation defined in the computation graph (hard update):
```python
if self.learn_step_counter % self.replace_target_iter == 0:
    self.sess.run(self.replace_target_op)
    print('\ntarget_net parameters updated\n')
```