Introduction to the algorithm

In addition to DQN's action-value approach, there is the Policy Gradient method, which learns the policy directly. With it there is no need to keep a table of all the Q values. For tasks with many or continuous actions, it is more efficient to select actions directly from an action distribution produced by a [Parameterized Policy].
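
As a bare-bones illustration of "selecting actions from a parameterized policy" (the state size, action count, and linear scoring function below are made up for the example), the policy can simply be a softmax over a parameterized score of the state:

    import numpy as np

    # hypothetical sizes, for illustration only
    n_features, n_actions = 4, 2
    theta = np.random.randn(n_features, n_actions) * 0.01   # policy parameters

    def policy(state):
        """pi(a | s, theta): softmax over a linear score of the state."""
        logits = state @ theta
        exps = np.exp(logits - logits.max())                 # numerically stable softmax
        return exps / exps.sum()

    state = np.random.randn(n_features)
    action = np.random.choice(n_actions, p=policy(state))    # sample an action; no Q table needed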

Algorithm principle

Reinforcement Learning: An Introduction, Chapter 13: Policy Gradient Methods

This post basically implements the [REINFORCE algorithm] from the book.

Since REINFORCE uses the full return, it is a Monte Carlo algorithm and only applies to episodic tasks.

This is a good place to give the pseudo-code; it differs from the version in the book only in small details:

Here are the differences between Policy Gradient (REINFORCE) and DQN:

  • Instead of using Q values, it uses a distribution over actions and selects the action to take directly from that distribution
  • As the two pseudo-code listings above show, an episode is only updated after it ends, and the update uses the episode's total return (see the sketch after this list)
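
To make the episodic nature concrete, here is a rough sketch of the training loop (the PolicyGradient agent with choose_action, store_transition, and learn is assumed to be the class built in the sections below; CartPole and the older 4-tuple Gym step API are used only as an example):

    import gym

    env = gym.make('CartPole-v0')
    RL = PolicyGradient(n_actions=env.action_space.n,            # assumed agent class, built below
                        n_features=env.observation_space.shape[0])

    for episode in range(3000):
        observation = env.reset()
        while True:
            action = RL.choose_action(observation)               # sample from pi(a|s)
            observation_, reward, done, info = env.step(action)
            RL.store_transition(observation, action, reward)     # only record; no update yet
            if done:
                vt = RL.learn()                                   # one update per episode, using the full return
                break
            observation = observation_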

Implementation

build_net

The following is the structure diagram of the neural network

Differences from DQN:

  • Fully connected (FC) layers
  • A softmax layer that outputs action probabilities (a sketch of the network follows this list)
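
Here is a minimal build_net sketch covering those two points, written with TF1-compat layers; the hidden-layer size (10), the Adam learning rate (0.01), and all variable names are my own assumptions, not necessarily those of the original code:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    def build_net(n_features, n_actions):
        # placeholders: observations, the actions actually taken, and the per-step return vt
        tf_obs = tf.placeholder(tf.float32, [None, n_features], name="observations")
        tf_acts = tf.placeholder(tf.int32, [None], name="actions_num")
        tf_vt = tf.placeholder(tf.float32, [None], name="actions_value")

        # fully connected hidden layer
        layer = tf.layers.dense(tf_obs, units=10, activation=tf.nn.tanh, name="fc1")
        # fully connected output layer: one logit per action
        all_act = tf.layers.dense(layer, units=n_actions, activation=None, name="fc2")
        # softmax layer: logits -> action probabilities
        all_act_prob = tf.nn.softmax(all_act, name="act_prob")

        # loss and training op (discussed in detail below)
        neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=tf_acts)
        loss = tf.reduce_mean(neg_log_prob * tf_vt)
        train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
        return tf_obs, tf_acts, tf_vt, all_act_prob, train_op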

The function tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts) operates on all_act in two steps (a small equivalence check follows the list):

  1. Softmax

  2. Computing the cross-entropy
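
As a small check (the logits and label values below are made up for the example), the fused op gives the same number as doing the softmax first and then the cross-entropy against a one-hot label:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    logits = tf.constant([[2.0, 1.0, 0.1]])   # stand-in for all_act
    labels = tf.constant([0])                 # the action actually taken

    # fused: softmax + cross-entropy in one op
    fused = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)

    # manual: step 1 softmax, step 2 cross-entropy against the one-hot label
    probs = tf.nn.softmax(logits)
    manual = tf.reduce_sum(-tf.log(probs) * tf.one_hot(labels, 3), axis=1)

    with tf.Session() as sess:
        print(sess.run(fused))    # ~0.417 for these logits
        print(sess.run(manual))   # same value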

Understanding cross entropy

tf.one_hot(self.tf_acts, self.n_actions)

one_hot introduction

The tf.one_hot() function converts input indices into one-hot vectors: each index becomes a vector of the given depth with a 1 at that position and 0 everywhere else, so the stacked vectors can be read as (degenerate) probability distributions. One-hot encoding is commonly used for the class labels that the final FC layer of a classification network is trained against.
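
For example (the action indices below are made up), with n_actions = 3 each action index becomes one row of a one-hot matrix:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    acts = tf.constant([0, 2, 1])     # actions taken at three time steps
    one_hot = tf.one_hot(acts, 3)     # depth = n_actions = 3

    with tf.Session() as sess:
        print(sess.run(one_hot))
        # [[1. 0. 0.]
        #  [0. 0. 1.]
        #  [0. 1. 0.]]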

♥ About the loss:

    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts)
    # Softmax and cross-entropy are performed in one op: Loss = -sum(y * log(y_hat))
    # In supervised-classification terms, logits is the prediction and labels is the ground truth;
    # the cross-entropy loss nudges the network parameters toward the true label, and neg_log_prob is that error.
    # Expanded form of the line above, broken down to make it easier to understand:
    # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob) * tf.one_hot(self.tf_acts, self.n_actions), axis=1)
    loss = tf.reduce_mean(neg_log_prob * self.tf_vt)  # vt = total return G_t (reward + discounted future rewards)
    # To check that the chosen action really deserves its "label", the cross-entropy is multiplied by vt,
    # which tells us whether the gradient computed from the cross-entropy is trustworthy.
    # If vt is small or negative, that gradient points the wrong way and the parameters should move in the opposite direction;
    # if vt is positive and large, vt reinforces the cross-entropy gradient and we descend further in that direction.
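
vt itself is computed outside the network. A common way to get it from the rewards stored during the episode is the sketch below (the function name, the discount factors 0.99/0.9, and the normalization step are assumptions; normalizing is optional but reduces gradient variance):

    import numpy as np

    def discount_and_norm_rewards(episode_rewards, gamma=0.99):
        """Turn per-step rewards into the discounted return G_t for every step."""
        discounted = np.zeros_like(episode_rewards, dtype=np.float64)
        running_add = 0.0
        for t in reversed(range(len(episode_rewards))):
            running_add = running_add * gamma + episode_rewards[t]   # G_t = r_t + gamma * G_{t+1}
            discounted[t] = running_add
        # normalize to reduce the variance of the gradient
        discounted -= discounted.mean()
        discounted /= (discounted.std() + 1e-8)
        return discounted

    print(discount_and_norm_rewards([1.0, 1.0, 1.0], gamma=0.9))  # G = [2.71, 1.9, 1.0] before the normalization step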

choose_action

action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())

p: The probabilities associated with each entry in a. If not given, the sample assumes a uniform distribution over all entries in a.

Examples are as follows:


    import numpy as np
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    x = tf.constant([[100., 88., 98.]])
    tf.global_variables_initializer()   # returns an init op; not actually needed here (no variables)
    with tf.Session() as sess:
        prob_weights = sess.run(tf.nn.softmax(x))

    print(prob_weights.shape[1])   # 3  (3 columns)
    print(prob_weights.shape[0])   # 1  (1 row)
    print(prob_weights)            # [[8.8079226e-01 5.4117750e-06 1.1920227e-01]]
    print(prob_weights.ravel())    # [8.8079226e-01 5.4117750e-06 1.1920227e-01]
    # .ravel() flattens the array into a one-dimensional array

    action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())
    print(action)                  # 0, 1, or 2: the action index
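
Putting the pieces together, choose_action can look roughly like the sketch below (self.sess, self.all_act_prob, and self.tf_obs are assumed to be the session and tensors created by build_net above):

    import numpy as np

    def choose_action(self, observation):
        # run the softmax output to get the probability of every action for this state
        prob_weights = self.sess.run(self.all_act_prob,
                                     feed_dict={self.tf_obs: observation[np.newaxis, :]})
        # sample an action index according to those probabilities
        action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())
        return action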