This article was first published by Walker AI.

Introduction

Reinforcement learning agents can be divided, according to what they learn, into value-based methods, policy-based methods and actor-critic methods. Q-learning, Sarsa and DQN, which we introduced before, are all value-based. Although value-based reinforcement learning has been widely applied in many fields, its limitations are obvious. First, these algorithms basically handle discrete actions and build a simple Q table; they struggle with continuous actions, let alone with building a huge Q table. Second, methods that approximate the Q values with a value function handle a continuous state space more efficiently, but the actions are still discrete and the action-selection policy does not change: it is usually a deterministic policy. Some practical problems, however, require an optimal policy that is not deterministic but stochastic, such as rock-paper-scissors: if you always throw according to a deterministic policy, you will lose as soon as your opponent spots the pattern. We therefore need a new family of methods to address these problems, such as the policy gradient method.

1. The Monte Carlo method

Before we move on to Policy Gradient (PG), let's take a look at the Monte Carlo algorithm. First, a quick story:

In the world of Naruto, Naruto, still a genin, takes a C-rank mission from the Konoha mission centre to improve his skills. While carrying out the mission he is suddenly trapped in an illusion by a masked man. Naruto cannot get out of the maze-like dreamland (the exit is a gate of light, which you can picture as a doorway with light shining through it). Although the dreamland has many exits, only the nearest one actually lets you leave through its gate of light; the other exits also have gates of light, but you cannot pass through them. Perhaps Naruto is also good at mathematics: he uses his shadow clone jutsu to split off many clones (assume there are enough of them; don't ask whether Naruto has enough chakra, in an emergency he can borrow some from the Nine-Tails inside him) and then tells each clone what to do:

Note: all the clones start from the same starting point, and every step they take earns them a chakra boost or costs them chakra (a reward, positive or negative). At every fork, each clone chooses among the branches with equal probability. Ignore the reward discount factor.

  1. Each of you must find an exit, near or far, and along the way record on your scroll every step of the path you take and how much chakra you gain or lose;
  2. When you reach an exit, record the total chakra collected along that path, then return to the starting point along the original path;
  3. Average the total rewards all of you received and report it back to me as the value of that starting point.
  4. Then move to the next candidate starting point and repeat steps 1 to 3.

After obtaining the values of all the junctions, Naruto simply picks, at each junction, the branch with the highest value, and finally walks out of the illusion successfully.

The above story maps onto the Monte Carlo algorithm as follows:

The Monte Carlo algorithm is a sampling-based method: given a policy $\pi$, let the agent interact with the environment to obtain many trajectories. Each trajectory has a corresponding return; if we average the returns of these trajectories, we obtain the value of the corresponding state under that policy. This sentence can be broken down to match the story point by point (a minimal code sketch follows the list):

  1. Monte Carlo is based on sampling. (In the story, there are enough shadow clones of Naruto.)
  2. You need to be given a policy $\pi$.
  3. The agent interacts with the environment and obtains many trajectories. (Each clone searching for an exit in the illusion corresponds to one trajectory.)
  4. Each trajectory has a return. (The total reward collected by each clone.)
  5. Average the returns of the trajectories to get the value of the corresponding state. (Naruto averages the total rewards of all the clones.)
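Here is a minimal Python sketch of this idea. The `env` and `policy` objects are hypothetical placeholders (an interface with `env.reset(state)`, `env.step(action)` and `policy(state) -> action`), not the gym environment used later in this article; the point is only that the state value is the average of sampled returns.

```python
import numpy as np

def run_episode(env, policy, start_state, gamma=1.0):
    """Roll out one clone: follow `policy` from `start_state` until an exit is reached.

    `env` and `policy` are hypothetical placeholders, assumed to expose
    env.reset(state)/env.step(action) and policy(state) -> action.
    """
    state = env.reset(start_state)
    done, total, discount = False, 0.0, 1.0
    while not done:
        state, reward, done, _ = env.step(policy(state))
        total += discount * reward          # chakra gained or lost on this step
        discount *= gamma                   # the story ignores discounting (gamma = 1)
    return total

def mc_state_value(env, policy, start_state, n_clones=10000):
    """V(start_state) under `policy`: the average return over many sampled trajectories."""
    return np.mean([run_episode(env, policy, start_state) for _ in range(n_clones)])
```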

2. Policy gradient algorithm

2.1 Introduction

In reinforcement learning there are three components: the actor, the environment and the reward function. The environment and the reward function are not under our control; they are given before learning starts. Inside the actor there is a policy, which determines what the actor does: given an input from outside, the policy outputs the action the actor should take now. The only thing we can do is adjust the actor's policy so that the actor obtains as much reward as possible.

When we combine deep learning with reinforcement learning, the policy $\pi$ is a network with parameters $\theta$. Take the illusion example above: the input is the fork where the current clone stands; assuming it can go up, down or left, the output of the policy network is the probability of each of the three possible actions. The actor then decides which action to take according to this probability distribution, and different distributions lead to different actions. Simply put, the policy network outputs a probability distribution, and the actor samples from that distribution to decide which action to actually take.
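As a minimal illustration of this sampling step (in PyTorch, the same framework as the code section below; the layer sizes and the toy 2-dimensional "fork position" state are assumptions made just for this sketch):

```python
import torch
import torch.nn as nn

# Toy policy network: a 2-dimensional fork position in, probabilities of 3 actions (up, down, left) out.
policy_net = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(),
    nn.Linear(16, 3), nn.Softmax(dim=-1))

state = torch.tensor([[0.5, -1.0]])              # a dummy fork position
probs = policy_net(state)                        # e.g. tensor([[0.31, 0.42, 0.27]])
dist = torch.distributions.Categorical(probs)    # the output distribution over the 3 actions
action = dist.sample()                           # sample an action; not necessarily the argmax
```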

In fact, PG is an algorithm that combines Monte Carlo sampling with a neural network. Unlike Q-learning and DQN, PG no longer outputs Q values; instead it directly outputs, as continuous values, the probability of each action available in the current state.

When we explained value-based methods before, we used value function approximation to turn the Q-table update problem into a function-fitting problem: similar states produce similar output actions, as shown in the formula below, where the function $f$ approaches the optimal Q values by updating its parameters $w$.


$$Q(s, a) \approx f(s, a, w)$$
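As a minimal sketch of this function-fitting view (not the DQN code from the earlier articles, just an illustration with assumed layer sizes), $f(s, a, w)$ can be a small network whose weights play the role of $w$ and whose output layer gives one approximate Q value per discrete action:

```python
import torch
import torch.nn as nn

# Illustrative Q-network: an 8-dimensional state in, one Q value for each of 4 discrete actions out.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

q_values = q_net(torch.randn(1, 8))   # shape [1, 4]; updating the weights w fits f(s, a, w) to Q(s, a)
```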

In the PG algorithm the policy is a probability distribution and cannot be tabulated and iterated directly, so we take a similar approach and convert it into a parameterised function, as shown in the following equation. Our aim is to approximate the policy with a function parameterised by $\theta$, and to approach the optimal policy by updating the parameters $\theta$.


$$\pi(a \mid s) \approx \pi_{\theta}(s, a)=P(a \mid s, \theta)$$

Now that we have a policy function, our goal is of course to optimise it. But how do we judge whether the optimised policy is better or worse? You might guess that we need an objective function: we push the objective toward its maximum, and its main job is to measure how good the policy is.

2.2 Algorithm Content

First of all, we can think of the environment as a function. The function starts by emitting a state. If that state is the screen of a game, the actor sees screen $s_1$ and selects action $a_1$. The environment takes $a_1$ as input and spits out $s_2$, the new game screen. The actor sees the new screen and takes a new action; the environment looks at $a_2$ and spits out $s_3$. This process continues until the environment decides to stop.

In one game, the actor keeps interacting with the environment until the episode ends. Following the description above, we obtain a trajectory, written $\tau$, where $s$ denotes a state and $a$ an action: $s_1, a_1$ means the actor chose action 1 in state 1, and so on, as in the following formula.


$$\tau=\left\{s_{1}, a_{1}, s_{2}, a_{2}, \cdots, s_{t}, a_{t}\right\}$$

Then, assuming the current actor's policy network has parameters $\theta$, we can compute the probability of this trajectory occurring. It depends on two parts, the behaviour of the environment and the behaviour of the agent, as shown in the following equation, which gives the probability of a trajectory.


$$\begin{aligned} p_{\theta}(\tau) &=p\left(s_{1}\right) p_{\theta}\left(a_{1} \mid s_{1}\right) p\left(s_{2} \mid s_{1}, a_{1}\right) p_{\theta}\left(a_{2} \mid s_{2}\right) p\left(s_{3} \mid s_{2}, a_{2}\right) \cdots \\ &=p\left(s_{1}\right) \prod_{t=1}^{T} p_{\theta}\left(a_{t} \mid s_{t}\right) p\left(s_{t+1} \mid s_{t}, a_{t}\right) \end{aligned}$$

The behaviour of the environment, that is, what the parameters or rules inside the environment function look like, is not something we can control; we have already written it as $p\left(s_{t+1} \mid s_{t}, a_{t}\right)$. The term $p_{\theta}\left(a_{t} \mid s_{t}\right)$ belongs to the agent: given a state $s_t$, which action $a_t$ the actor takes depends on the actor's parameters $\theta$, so this part is under the actor's own control. Different trajectories therefore occur with different probabilities, depending on the actor's behaviour.
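A minimal sketch of this decomposition, reusing an illustrative PyTorch policy network like the one above: given a recorded trajectory, only the policy terms $p_{\theta}(a_t \mid s_t)$ can be evaluated and differentiated; the environment terms $p(s_{t+1} \mid s_t, a_t)$ are unknown and, as shown further below, they drop out of the gradient anyway.

```python
import torch

def policy_log_prob(policy_net, states, actions):
    """Sum of log p_theta(a_t | s_t) over one trajectory.

    states: float tensor [T, state_dim]; actions: long tensor [T].
    The log p(s_1) and log p(s_{t+1} | s_t, a_t) terms belong to the environment
    and are deliberately left out: we can neither evaluate nor control them.
    """
    probs = policy_net(states)                       # [T, n_actions], each row sums to 1
    dist = torch.distributions.Categorical(probs)
    return dist.log_prob(actions).sum()              # sum over the T time steps
```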

Besides the environment and the actor there is also the reward function. Give it $s_1$ and $a_1$, and it tells you that you get $r_1$; give it $s_2$ and $a_2$, and it tells you that you get $r_2$. Adding up all the $r$'s gives $R(\tau)$, the reward of a particular trajectory $\tau$.

In a certain game, we’re going to get RRR. By adjusting the parameters θ\thetaθ inside the actor, the greater the value of RRR is, the better, which is the optimization goal of PG algorithm. But the reward isn’t just a scalar, the reward RRR is a random variable, because it’s random what the actor will do given the same state. Given the same observation, what kind of action the environment will take and what kind of observation it will produce is also stochastic, so RRR is a random variable. Then we can calculate the expected value of RθR_{\theta}Rθ given a set of parameters θ\theta. The expected value is given by the following formula.


$$\bar{R}_{\theta}=\sum_{\tau} R(\tau) p_{\theta}(\tau)=E_{\tau \sim p_{\theta}(\tau)}[R(\tau)]$$

We would have to enumerate all possible trajectories $\tau$, each with its probability and its total reward $R(\tau)$. Equivalently, we can sample trajectories $\tau$ from the distribution $p_{\theta}(\tau)$ and compute the expectation of $R(\tau)$: this is the expected reward. All we have to do is maximise this expected reward.

So how do we maximise the expected reward? Since the goal is maximisation, we can update the parameters by gradient ascent. Taking the gradient of $\bar{R}_{\theta}$, only $p_{\theta}(\tau)$ depends on $\theta$. The full policy gradient derivation is shown below.

For $\nabla p_{\theta}(\tau)$ we use the identity $\nabla f(x)=f(x) \nabla \log f(x)$, which gives $\nabla p_{\theta}(\tau)=p_{\theta}(\tau) \nabla \log p_{\theta}(\tau)$. The identity $\nabla f(x)=f(x) \nabla \log f(x)$ can be treated as a fixed formula worth remembering.

As shown in the following equation, in the sum over $\tau$ both $R(\tau)$ and $\nabla \log p_{\theta}(\tau)$ are weighted by $p_{\theta}(\tau)$; since they are weighted by $p_{\theta}(\tau)$, the sum can be written as an expectation: sample $\tau$ from the distribution $p_{\theta}(\tau)$ and compute $R(\tau) \nabla \log p_{\theta}(\tau)$; summing over all possible $\tau$ gives exactly that expected value.


$$\begin{aligned} \nabla \bar{R}_{\theta} &=\sum_{\tau} R(\tau) \nabla p_{\theta}(\tau) \\ &=\sum_{\tau} R(\tau) p_{\theta}(\tau) \frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)} \\ &=\sum_{\tau} R(\tau) p_{\theta}(\tau) \nabla \log p_{\theta}(\tau) \\ &=E_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right] \end{aligned}$$

The expectation cannot be computed exactly in practice, so we approximate it by sampling: sample $N$ trajectories $\tau$, compute the value for each one, and add them up to estimate the gradient. With the gradient we can update the parameters and hence the agent, as follows.


$$\begin{aligned} E_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right] & \approx \frac{1}{N} \sum_{n=1}^{N} R\left(\tau^{n}\right) \nabla \log p_{\theta}\left(\tau^{n}\right) \\ &=\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}} R\left(\tau^{n}\right) \nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right) \end{aligned}$$
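A minimal PyTorch sketch of this sampled estimate (the trajectory storage format and the `policy_net` are illustrative assumptions): weight each trajectory's summed action log-probabilities by its total return, average over the $N$ trajectories, and minimise the negative so that autograd performs gradient ascent on $\bar{R}_{\theta}$.

```python
import torch

def reinforce_loss(policy_net, trajectories):
    """trajectories: list of (states [T_n, state_dim], actions [T_n], total_return) tuples.

    Returns the negative of (1/N) * sum_n R(tau^n) * sum_t log pi_theta(a_t^n | s_t^n),
    so minimising it is gradient ascent on the expected reward.
    """
    terms = []
    for states, actions, total_return in trajectories:
        dist = torch.distributions.Categorical(policy_net(states))
        log_probs = dist.log_prob(actions).sum()       # sum_t log pi_theta(a_t | s_t)
        terms.append(-total_return * log_probs)        # weighted by R(tau^n)
    return torch.stack(terms).mean()                   # average over the N trajectories

# Typical use: loss = reinforce_loss(policy_net, batch); optimizer.zero_grad(); loss.backward(); optimizer.step()
```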

The concrete computation of $\nabla \log p_{\theta}(\tau)$ is shown below.


$$\begin{aligned} \nabla \log p_{\theta}(\tau) &=\nabla\left(\log p\left(s_{1}\right)+\sum_{t=1}^{T} \log p_{\theta}\left(a_{t} \mid s_{t}\right)+\sum_{t=1}^{T} \log p\left(s_{t+1} \mid s_{t}, a_{t}\right)\right) \\ &=\nabla \log p\left(s_{1}\right)+\nabla \sum_{t=1}^{T} \log p_{\theta}\left(a_{t} \mid s_{t}\right)+\nabla \sum_{t=1}^{T} \log p\left(s_{t+1} \mid s_{t}, a_{t}\right) \\ &=\nabla \sum_{t=1}^{T} \log p_{\theta}\left(a_{t} \mid s_{t}\right) \\ &=\sum_{t=1}^{T} \nabla \log p_{\theta}\left(a_{t} \mid s_{t}\right) \end{aligned}$$

Note that $p\left(s_{1}\right)$ and $p\left(s_{t+1} \mid s_{t}, a_{t}\right)$ come from the environment, while $p_{\theta}\left(a_{t} \mid s_{t}\right)$ comes from the agent. Since $p\left(s_{1}\right)$ and $p\left(s_{t+1} \mid s_{t}, a_{t}\right)$ are determined by the environment, they do not depend on $\theta$, so $\nabla \log p\left(s_{1}\right)=0$ and $\nabla \sum_{t=1}^{T} \log p\left(s_{t+1} \mid s_{t}, a_{t}\right)=0$.

We can understand the resulting formula intuitively: in the sampled data we observe that in state $s_t$ the actor performed action $a_t$, where $s_t$ and $a_t$ form a state-action pair inside the whole trajectory $\tau$. Suppose performing $a_t$ in $s_t$ led to a trajectory with a positive reward; then we should increase the probability of taking $a_t$ in $s_t$. Conversely, if taking $a_t$ in $s_t$ led to a negative trajectory reward, we should decrease that probability.

To compute the formula above, we first need to collect many $(s, a)$ pairs and know how much reward these pairs earn when interacting with the environment. We take an agent with parameters $\theta$, let it interact with the environment, and record a batch of game play.

We can then plug the samples into the gradient formula and compute the gradient. That is, for each $(s, a)$ pair in the sampled data we compute its log probability, i.e. the log probability of taking that action in that state, take its gradient, and multiply it by a weight, namely the reward of that game. Once we have this, we update the model.

After updating the model, we need to collect data again before updating once more. Note that in the usual policy gradient the sampled data is used only once: sample data, use it to update the parameters, then discard it and sample again. There are ways around this, which we will introduce later.

2.3 Tips

(1) Add a baseline

  • In many games the reward is always positive, or at worst zero. Since the action probabilities must sum to 1, when all the rewards are positive it can happen that, after normalisation, the probabilities of actions with a large weight $R$ rise a lot while those with a small weight effectively fall. For example, suppose a state has three actions $a$, $b$ and $c$, and the rewards are all positive. According to the formula for $\nabla \bar{R}_{\theta}$, we want to push up the log probabilities of all three actions, but the weights $R$ in front of them differ: the larger the weight, the more the probability rises. Because the output is a probability distribution and the three probabilities must sum to 1, after normalisation the actions that rose more end up higher and the ones that rose less actually end up lower.

  • Ideally the expectation would sum over all possible $(s, a)$ pairs, but in real training we only sample a small number of them, and some actions may never be sampled. Again suppose a state has actions $a$, $b$ and $c$, but we happen to sample only $b$, or only $c$, and never $a$. Since all rewards are positive, according to the formula for $\nabla \bar{R}_{\theta}$ the probability of every sampled action should rise; because $a$ was never sampled, while the probabilities of all the other actions rise, the probability of $a$ falls. But $a$ is not necessarily a bad action, it just was not sampled, and yet its probability decreases, which is clearly a problem. So we do not want the reward to always be positive.

To fix the problem of the reward always being positive, we subtract a baseline $b$ from the reward, as shown below.


$$\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}}\left(R\left(\tau^{n}\right)-b\right) \nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right)$$

With $b$ subtracted, the term $R\left(\tau^{n}\right)-b$ can be either positive or negative. If the total reward $R\left(\tau^{n}\right)$ is greater than $b$, we increase the probability; if the total reward is less than $b$, then even if it is a small positive number it is not good enough, and we decrease the probability of that term. $b$ is usually taken to be the expected value of $R\left(\tau^{n}\right)$, i.e. the average over trajectories, $b \approx E[R(\tau)]$. In practice we keep recording the values of $R\left(\tau^{n}\right)$ and keep updating their running average to use as $b$.
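A minimal sketch of this trick (the smoothing factor is an illustrative assumption): keep a running average of observed trajectory returns as $b$, and weight each log probability by $R(\tau^{n}) - b$ instead of $R(\tau^{n})$ in the sampled-gradient formula.

```python
class RunningBaseline:
    """Keeps b as an exponentially weighted running average of trajectory returns."""

    def __init__(self, beta=0.9):        # beta is an illustrative smoothing factor
        self.beta, self.b = beta, 0.0

    def __call__(self, total_return):
        advantage = total_return - self.b                              # R(tau^n) - b, may be negative
        self.b = self.beta * self.b + (1 - self.beta) * total_return   # update the running average
        return advantage
```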

(2) Assign suitable credit

  • Within the same game or episode, every state-action pair is weighted by the same total reward, which is not fair: in the same game some actions may be good and others bad. A good final outcome does not mean every action was right, and vice versa. For example, in a short game, performing $a_1$ in $s_1$ gives reward $r_1 = 5$, performing $a_2$ in $s_2$ gives reward $r_2 = 0$, and performing $a_3$ in $s_3$ gives reward $r_3 = -2$; when the game ends the total reward is 3. But that does not mean performing $a_2$ in $s_2$ was good, because the positive score mainly comes from performing $a_1$ in $s_1$. Performing $a_2$ in $s_2$ may even be bad, because it led to $s_3$, where we were penalised. So a good overall outcome does not make every action right, and yet during training every state-action pair is multiplied by the same total reward of 3.

Ideally, this problem goes away if we sample enough data. The trouble is that we never sample enough, so when computing the weight for a state-action pair we do not use the total reward of the whole game; instead we only count the rewards obtained from the moment that action was performed onward. Whatever happened in the game before the action has nothing to do with it, so rewards collected earlier should not be credited to it; only the sum of all rewards that occur after the action reflects the action's true contribution, as in the following equation.


$$\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}}\left(\sum_{t^{\prime}=t}^{T_{n}} \gamma^{t^{\prime}-t} r_{t^{\prime}}^{n}-b\right) \nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right)$$

The discount factor $\gamma \in [0,1]$ is usually set to 0.9 or 0.99. If $\gamma = 0$, we only care about the immediate reward; if $\gamma = 1$, future rewards count exactly as much as immediate rewards.

Would you rather have a hundred dollars now, or a hundred dollars in ten years' time? A hundred dollars in ten years might only be worth ten dollars today. In other words, a dollar decades ago is not worth the same as a dollar today.
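A minimal sketch of computing these discounted rewards-to-go from the list of per-step rewards of one episode (the `train_net` method in the code section below does the same thing with its reversed loop):

```python
def rewards_to_go(rewards, gamma=0.99):
    """Return G_t = sum_{t'>=t} gamma^(t'-t) * r_{t'} for every time step t of an episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):          # walk the episode backwards
        running = r + gamma * running    # G_t = r_t + gamma * G_{t+1}
        returns.append(running)
    return list(reversed(returns))

# With the toy episode above: rewards_to_go([5, 0, -2], gamma=1.0) -> [3.0, -2.0, -2.0]
```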

3. Code implementation

Case study: simulate a lunar lander touching down on the lunar surface. The goal of the task is to land the craft safely on the flat ground between the two yellow flags. Test environment: LunarLander-v2.

Obs: this environment has eight observations: horizontal coordinate x, vertical coordinate y, horizontal velocity, vertical velocity, angle, angular velocity, whether leg 1 touches the ground, and whether leg 2 touches the ground.

Action: the agent can take four discrete actions: do nothing, fire the left engine, fire the main engine (downward), or fire the right engine.

Reward: crashing the lander gives -100; a successful landing between the two yellow flags gives 100 to 140 points; each downward firing of the main engine costs 0.3 points; a further 100 points are awarded when the lander finally comes to a complete rest; each leg touching the ground gives 10 points.

Although we use a discrete action space here, the overall code would not look very different; interested readers can try a continuous action space.
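Before the implementation, a small sanity check of the environment and its spaces (assuming gym and its Box2D extras are installed; this snippet is only illustrative and is not part of the training code below):

```python
import gym

env = gym.make('LunarLander-v2')       # requires gym's Box2D dependencies
print(env.observation_space)           # an 8-dimensional Box: the observations listed above
print(env.action_space)                # Discrete(4): the four actions listed above

state = env.reset()                                               # initial 8-dimensional observation
state, reward, done, info = env.step(env.action_space.sample())   # one random step
```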

Define the network structure:

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    def __init__(self, n_states_num, n_actions_num, hidden_size):
        super(PolicyNet, self).__init__()
        # Input: a state vector of length 8; output: probabilities of the 4 actions
        self.net = nn.Sequential(
            # Two linear layers connected by a ReLU activation, followed by a Softmax
            # that outputs the probability of each action
            nn.Linear(in_features=n_states_num, out_features=hidden_size, bias=False),
            nn.ReLU(),
            nn.Linear(in_features=hidden_size, out_features=n_actions_num, bias=False),
            nn.Softmax(dim=1))

    def forward(self, inputs):
        # inputs is the state s with shape [batch, 8]
        x = self.net(inputs)
        return x
```
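A quick check of the network's output (shapes as assumed above; the exact values will differ since the weights are random):

```python
net = PolicyNet(n_states_num=8, n_actions_num=4, hidden_size=128)
dummy_state = torch.randn(1, 8)   # a batch containing one 8-dimensional state
print(net(dummy_state))           # a [1, 4] tensor of action probabilities summing to 1
```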

Define the PG class:

```python
import numpy as np
import torch


class PolicyGradient():

    def __init__(self, n_states_num, n_actions_num, learning_rate=0.01, reward_decay=0.95):
        # The state is an 8-dimensional vector: horizontal coordinate x, vertical coordinate y,
        # horizontal velocity, vertical velocity, angle, angular velocity, leg 1 touching the
        # ground, leg 2 touching the ground
        self.n_states_num = n_states_num
        # The action is 4-dimensional and discrete: do nothing, fire the left engine,
        # fire the main engine, fire the right engine
        self.n_actions_num = n_actions_num
        # learning rate
        self.lr = learning_rate
        # discount factor gamma
        self.gamma = reward_decay
        # policy network
        self.pi = PolicyNet(n_states_num, n_actions_num, 128)
        # optimizer
        self.optimizer = torch.optim.Adam(self.pi.parameters(), lr=learning_rate)
        # Trajectory storage; each item is (reward, log probability of the chosen action)
        self.data = []
        self.cost_his = []

    # Store one step of trajectory data
    def put_data(self, item):
        # item is (r, log_p(a|s))
        self.data.append(item)

    def train_net(self):
        # Compute the gradient and update the policy network parameters
        R = 0  # the return of the terminal state is 0
        policy_loss = []
        for r, log_prob in self.data[::-1]:  # walk the episode backwards
            R = r + self.gamma * R  # discounted return at each time step
            # one loss term per time step
            loss = -log_prob * R
            policy_loss.append(loss)
        self.optimizer.zero_grad()
        policy_loss = torch.cat(policy_loss).sum()  # sum over the episode
        # print('policy_loss:', policy_loss.item())
        # backpropagation
        policy_loss.backward()
        self.optimizer.step()
        self.cost_his.append(policy_loss.item())
        # print('cost_his:', self.cost_his)
        self.data = []  # clear the stored trajectory

    # Feed the state into the neural network and select an action according to the probabilities
    def choose_action(self, state):
        # Convert the state to a tensor and expand the dimension [8] -> [1, 8]
        s = torch.Tensor(state).unsqueeze(0)
        prob = self.pi(s)  # probability distribution over the 4 actions
        # Build a categorical distribution from prob; sampling returns an integer index in
        # 0..k-1 (k = number of actions) drawn according to those probabilities,
        # i.e. not necessarily the index with the largest probability
        m = torch.distributions.Categorical(prob)
        action = m.sample()
        return action.item(), m.log_prob(action)

    def plot_cost(self, avage_reward):
        import matplotlib.pyplot as plt
        plt.plot(np.arange(len(avage_reward)), avage_reward)
        plt.ylabel('Reward')
        plt.xlabel('training steps')
        plt.show()
```

Training the model (the few setup lines at the top of this snippet, which create the environment and initialise the bookkeeping variables, are filled in as reasonable assumptions so that the fragment runs on its own):

```python
import gym

# Setup (assumed values; names follow the classes defined above)
env = gym.make('LunarLander-v2')
policyGradient = PolicyGradient(n_states_num=8, n_actions_num=4)
avage_reward = []        # smoothed reward curve used for plotting
running_reward = 0       # running average of the episode reward (initial value is an assumption)
print_interval = 20      # how often to print progress (an assumption)

for n_epi in range(1000):
    state = env.reset()  # back to the start of the game; returns s0
    ep_reward = 0
    for t in range(1001):  # LunarLander-v2's own time limit ends the episode within this bound
        # Feed the state into the policy network and select an action
        action, log_prob = policyGradient.choose_action(state)
        # print(action, log_prob)
        # Interact with the environment
        s_prime, reward, done, info = env.step(action)
        if n_epi > 1000:
            env.render()
        # Record the reward r and the log probability of the chosen action
        policyGradient.put_data((reward, log_prob))
        state = s_prime  # update the state
        ep_reward += reward
        if done:  # end of the current episode
            break
    # After the episode ends, train the network once
    running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
    avage_reward.append(running_reward)
    # Learn after the interaction with the environment is complete
    policyGradient.train_net()
    if n_epi % print_interval == 0:
        print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
            n_epi, ep_reward, running_reward))
    if running_reward > env.spec.reward_threshold:  # stop once the running reward exceeds the threshold
        print("Solved! Running reward is now {} and "
              "the last episode runs to {} time steps!".format(running_reward, t))
        torch.save(policyGradient.pi.state_dict(), 'pg.pkl')
        break
policyGradient.plot_cost(avage_reward)
```

Finally, the reward curve during training is shown below; it tends to converge:

The final animation looks like this:

4. Summary

Policy gradient handles scenarios with continuous action spaces well and can learn stochastic policies, which are sometimes the optimal policies. It often converges reasonably well, but it may converge to a local optimum rather than the global one, and its policy evaluation is relatively inefficient and has high variance. Overall it works well, and later we will discuss better algorithms that address these weaknesses.


We are Walker AI, and we keep moving forward in "AI + games". Come and discuss technical issues with us!
