In the previous article, we derived the Policy Gradient formula:


$$\nabla_\theta J(\theta) \approx \frac{1}{m}\sum_{i=1}^m R(\tau_i)\; \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t^i|s_t^i)$$

where $R(\tau_i)$ denotes the sum of all rewards along the $i$-th trajectory.

This formula was obtained through Monte Carlo (MC) sampling. Trajectories sampled by MC give an unbiased estimate, but because it is sampling, the return of each trajectory fluctuates a lot, which results in high variance. There are two ways to reduce the variance: 1. Use temporal causality. 2. Introduce a baseline.
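To make the estimator concrete, here is a minimal sketch (not part of the original article's code) of how the plain MC estimator could be written as a surrogate loss in TensorFlow 2. Here logp is assumed to be an (m, T) tensor of $\log \pi_\theta(a_t^i|s_t^i)$ values produced by the policy network and returns a length-m vector of trajectory returns $R(\tau_i)$, so that autodiff of this loss yields exactly the gradient above:

import tensorflow as tf

def vanilla_pg_loss(logp, returns):
    # logp:    (m, T) tensor of log pi_theta(a_t^i | s_t^i)
    # returns: (m,)   tensor of trajectory returns R(tau_i)
    per_traj_logp = tf.reduce_sum(logp, axis=1)      # sum_t log pi for each trajectory
    # minimizing this loss performs gradient ascent on J(theta)
    return -tf.reduce_mean(returns * per_traj_logp)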

Reducing the variance

1. Use temporal causality

Using temporal causality eliminates many unnecessary terms. Start from the full-trajectory form of the gradient:


$$\nabla_\theta E_\tau[R] = E_\tau \left[\left(\sum_{t=0}^{T-1}r_t\right) \left( \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\right) \right]$$

The gradient contribution of a single reward $r_{t'}$ can be expressed as:


$$\nabla_\theta E_\tau[r_{t'}] = E_\tau\left[r_{t'}\sum_{t=0}^{t'}\nabla_\theta \log \pi_\theta(a_t|s_t)\right]$$

Then sum the contributions of all time steps along a trajectory:


$$\begin{aligned}\nabla_\theta J(\theta) = \nabla_\theta E_{\tau\sim\pi_\theta}[R] &= E_\tau\left[\sum_{t'=0}^{T-1}r_{t'}\sum_{t=0}^{t'}\nabla_\theta \log \pi_\theta(a_t|s_t)\right] \\ &= E_\tau\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{\color{red}t'=t}^{T-1}r_{t'} \right] \\ &= E_\tau \left[\sum_{t=0}^{T-1}G_t\cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]\end{aligned}$$

where $G_t = \sum_{t'=t}^{T-1} r_{t'}$ denotes the cumulative reward (the return) collected from step $t$ of the trajectory onward.

If the formula above is hard to follow, think of it this way: the present moment cannot affect what has already happened in the past. That is temporal causality. Likewise, along a trajectory, the policy's choice at time $t'$ cannot affect the rewards earned before time $t'$. So we only need to add up the rewards obtained after $t'$, regardless of what happened before $t'$. The Policy Gradient estimator can therefore be written in the following form:


$$\nabla_\theta E[R] \approx \frac{1}{m}\sum_{i=1}^m\sum_{t=0}^{T-1}G_t\cdot \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)$$
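As a sanity check on the definition of $G_t$, the reward-to-go of one trajectory can be computed with a single backward pass over its rewards. A minimal undiscounted numpy sketch (the helper name rewards_to_go is ours, not from the article's code):

import numpy as np

def rewards_to_go(rewards):
    # G_t = sum_{t'=t}^{T-1} r_{t'}, computed backwards in O(T)
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

# e.g. rewards_to_go([1.0, 1.0, 1.0]) -> array([3., 2., 1.])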

From the above we obtain the classic Policy Gradient algorithm, REINFORCE:

Williams (1992), "Simple statistical gradient-following algorithms for connectionist reinforcement learning" — the paper that introduces the REINFORCE algorithm.

2. Introduce a Baseline

For a sampled trajectory, its return $G_t$ has high variance. We can reduce the variance by subtracting a value, the baseline, from $G_t$. It is easy to prove that introducing a baseline reduces the variance without changing the overall expectation, which makes training more stable.


$$\nabla_\theta E_{\tau\sim\pi_\theta}[R] = E_\tau \left[\sum_{t=0}^{T-1} {\color{red}(G_t - b(s_t))} \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$
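The reason the expectation is unchanged is that the baseline term itself has zero expectation, because $b(s_t)$ does not depend on the action:

$$E_{a_t\sim\pi_\theta}\big[b(s_t)\nabla_\theta \log \pi_\theta(a_t|s_t)\big] = b(s_t)\sum_{a}\pi_\theta(a|s_t)\frac{\nabla_\theta \pi_\theta(a|s_t)}{\pi_\theta(a|s_t)} = b(s_t)\,\nabla_\theta \sum_{a}\pi_\theta(a|s_t) = b(s_t)\,\nabla_\theta 1 = 0$$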

One choice is to use the expected return as the baseline, i.e. subtract the average return from $G_t$:


$$b(s_t) = E[r_t+r_{t+1}+\ldots+r_{T-1}]$$

The baseline can also be fitted with parameters, written $b_w(s_t)$, in which case the parameters $w$ and $\theta$ are optimized simultaneously.
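A minimal sketch of such a learned baseline, assuming a small value network value_net (a name of ours, not from the article's code) whose parameters $w$ are fitted to $G_t$ by regression while the policy is updated with the weight $G_t - b_w(s_t)$:

import tensorflow as tf

# hypothetical value network b_w(s_t); its weights w are trained alongside theta
value_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

def baseline_terms(states, returns):
    # states: (N, state_dim) tensor, returns: (N,) tensor of G_t values
    b = tf.squeeze(value_net(states), axis=-1)        # b_w(s_t)
    advantage = tf.stop_gradient(returns - b)         # weight for the policy gradient term
    value_loss = tf.reduce_mean((returns - b) ** 2)   # regression loss for w
    return advantage, value_loss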

REINFORCE algorithm

The sample code uses the discrete CartPole-v1 environment. First, let's look at the overall flow of the algorithm.

1. Overall process

First we build the policy network model and initialize its parameters $\theta$, then use the model to collect data. The collected data is used to update the network parameters $\theta$, which gives us a new policy network; we then use the new policy network to interact with the environment and collect new data, update the policy network again, and repeat until a good model has been trained. Note that each batch of collected data can only be used once and must then be discarded, because the policy network changes with every update of $\theta$, so data collected with the old network cannot be used to fit the new network parameters.

The process is shown below. While interacting with the environment, we store the relevant data of each step in order to compute the return $G_t$ later.
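The loop refers to a few names that are defined elsewhere in the full program (env, agent, TRAIN_EPISODES, MAX_STEPS, RENDER). A minimal setup sketch, assuming the classic gym API that the loop uses (reset() returning the state and step() returning four values) and a hypothetical PolicyGradientAgent wrapper around the policy network:

import gym

env = gym.make('CartPole-v1')            # discrete action space: push left / push right
agent = PolicyGradientAgent(             # hypothetical agent class wrapping the policy network
    state_dim=env.observation_space.shape[0],
    action_dim=env.action_space.n)

TRAIN_EPISODES = 200                     # number of training episodes
MAX_STEPS = 500                          # step limit per episode
RENDER = False                           # whether to visualize the environment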

for episode in range(TRAIN_EPISODES):
    state = env.reset()
    episode_reward = 0
    for step in range(MAX_STEPS):  # in one episode
        if RENDER:
            env.render()
        action = agent.get_action(state)                # sample an action from the current policy
        next_state, reward, done, _ = env.step(action)  # interact with the environment
        agent.store_transition(state, action, reward)   # store (s, a, r) to compute G_t later
        state = next_state
        episode_reward += reward
        if done:
            break
    agent.learn()  # update the policy network once per episode, then discard the data
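get_action and store_transition are not shown above. A minimal sketch of what these agent methods might look like, assuming the policy network self.model outputs logits over the discrete actions and the buffers are plain Python lists:

import numpy as np
import tensorflow as tf

# these methods would live inside the agent class
def get_action(self, state):
    # sample an action from the current policy pi_theta(a|s)
    logits = self.model(np.asarray(state, dtype=np.float32)[np.newaxis, :])
    probs = tf.nn.softmax(logits).numpy().ravel()
    return np.random.choice(len(probs), p=probs)

def store_transition(self, state, action, reward):
    # store one (s, a, r) step of the current episode
    self.state_buffer.append(state)
    self.action_buffer.append(action)
    self.reward_buffer.append(float(reward))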

2. Calculate rewards

    def _discount_and_norm_rewards(self):
        # discount episode rewards: work backwards so that
        # discounted_reward_buffer[t] = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        discounted_reward_buffer = np.zeros_like(self.reward_buffer)
        running_add = 0
        for t in reversed(range(0, len(self.reward_buffer))):
            running_add = running_add * self.gamma + self.reward_buffer[t]
            discounted_reward_buffer[t] = running_add

        # normalize episode rewards to zero mean and unit standard deviation
        discounted_reward_buffer -= np.mean(discounted_reward_buffer)
        discounted_reward_buffer /= np.std(discounted_reward_buffer)
        return discounted_reward_buffer

This function has two parts: computing the G values and normalizing them. The discounted_reward_buffer computed here is the (discounted) reward each action collects from its step until the end of the episode, which is $G_t$ in the formula. Notice that we work backwards from the last step and fill the return of each step into the buffer. The computed returns are then normalized, which makes training more effective.
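As a quick worked example (gamma = 0.9 here is just an illustrative value), three rewards of 1 give:

import numpy as np

rewards = np.array([1.0, 1.0, 1.0])
gamma = 0.9

G = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = running * gamma + rewards[t]
    G[t] = running
# G == [2.71, 1.9, 1.0]

G = (G - G.mean()) / G.std()   # zero mean, unit standard deviation
# G is approximately [1.20, 0.04, -1.25]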

3. Gradient update

We know that each batch of collected data is used to update the network parameters $\theta$. But how exactly are the parameters updated?

We can view it as a supervised-learning classification problem, as shown in the figure below. The environment state is fed into the policy network, whose output is a distribution over three actions: left, right, and fire. On the right is the label. The loss function is the cross entropy between the network's output and the label; minimizing this cross entropy means the new network parameters increase (or decrease) the probability that the corresponding action is chosen.


$$H = -\sum_{i=1}^{3}\hat{y}_i \log y_i$$

$$\text{Maximize:}\quad \log y_i = \log P(\text{left}\mid s)$$

$$\theta \leftarrow \theta + \eta\,\nabla \log P(\text{left}\mid s)$$

For each (state, action) pair we collect, we can treat the state as the training data and the action as its label, and then minimize the cross entropy between them, as the code below shows. Notice that in the REINFORCE algorithm the cross entropy is multiplied by $G_t$ (discounted_reward), which means the size of the update is scaled by $G_t$: if $G_t$ is large, the probability of the corresponding action is increased a lot; if the $G_t$ obtained by an action is negative, the probability of that action is reduced accordingly. This is a weighted gradient update. TensorLayer has a built-in function for this, cross_entropy_reward_loss, which can be used directly, as shown in the commented-out lines of the code.

def learn(self):
    # compute the normalized discounted returns G_t for the stored episode
    discounted_reward = self._discount_and_norm_rewards()
    with tf.GradientTape() as tape:
        _logits = self.model(np.vstack(self.state_buffer))
        # cross entropy between the policy output and the taken actions,
        # i.e. -log pi_theta(a_t|s_t)
        neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=_logits, labels=np.array(self.action_buffer))
        # weight each step by its return G_t
        loss = tf.reduce_mean(neg_log_prob * discounted_reward)
        # equivalent TensorLayer helper:
        # loss = tl.rein.cross_entropy_reward_loss(
        #     logits=_logits, actions=np.array(self.action_buffer), rewards=discounted_reward)
    grad = tape.gradient(loss, self.model.trainable_weights)
    self.optimizer.apply_gradients(zip(grad, self.model.trainable_weights))
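Since the data is on-policy and can only be used once (as noted in the overall process), the buffers are usually cleared at the end of learn(); assuming plain list buffers, that is simply:

    # at the end of learn(): discard the on-policy data that was just used
    self.state_buffer, self.action_buffer, self.reward_buffer = [], [], []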

For a better understanding of this part, you can watch teacher Li Hongyi's video, which explains it very clearly. If the REINFORCE code here helps you, please give it a star, thank you…

Beyond REINFORCE

The policy gradient gives us a way to tackle reinforcement learning problems, but the Monte Carlo policy gradient (REINFORCE) algorithm above is not perfect. Because it relies on MC sampling to collect data, we have to wait until the end of each episode before performing an update, which makes MC relatively slow. Can TD be used instead? Of course it can; that is the Actor-Critic algorithm we will introduce next.