Reinforcement learning scenarios

An agent observes the environment (input: state $s_t$) and takes an action (output: action $a_t$). Based on the action, the environment returns a reward $r_t$ and a new state $s_{t+1}$ to the agent.

Several categories

Policy-based

Common definition

In current deep reinforcement learning, the actor is usually a neural network that takes the state in some form (image, text, etc.) as input and outputs an action (e.g., a distribution over discrete action categories).

Some concepts

  • Episode: run one round of the game with the actor.
  • Trajectory: the sequence obtained in one episode, in the format $s_1, a_1, r_1, s_2, a_2, r_2, \dots$
  • Measuring the quality of an actor: run several episodes with the actor to obtain several trajectories and compute the expected total reward $\bar{R}_\theta$ over them. Strictly speaking, each trajectory should be weighted by its probability in this expectation; here the sampled trajectories are simply treated with (approximately) equal weight (see the sketch below).
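
For instance, a minimal sketch of estimating an actor's quality by averaging the total reward over several sampled episodes; the `actor` interface and the classic gym-style `env` are assumptions for illustration:

```python
import torch

# Hypothetical names: `actor` maps a state tensor to action logits, and `env`
# follows the classic gym interface (reset() -> obs, step() -> obs, r, done, info).
def evaluate_actor(actor, env, num_episodes=10):
    """Estimate the expected total reward by averaging over sampled episodes."""
    total = 0.0
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            with torch.no_grad():
                logits = actor(torch.as_tensor(state, dtype=torch.float32))
            action = torch.distributions.Categorical(logits=logits).sample().item()
            state, reward, done, _ = env.step(action)
            total += reward
    return total / num_episodes  # each sampled trajectory gets equal weight
```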

The optimization goal

The parameters are optimized by gradient ascent; the update direction is the direction that increases the expected total reward $\bar{R}_\theta$, estimated from the trajectories sampled by the actor. This simplifies to the policy-gradient expression:

$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\,\nabla \log p_\theta(\tau^n) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p_\theta(a_t^n \mid s_t^n)$$

The essence of the optimization is to increase the probability of every action that appears in a trajectory with a large total reward. The optimization goal is thus refined from increasing the total reward to adjusting the probability of each individual action in the trajectory.
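
As a concrete illustration, here is a minimal sketch of this update in PyTorch; the `policy` network and the `trajectory` data format are assumptions. The log-probability of every action in a trajectory is weighted by that trajectory's total reward:

```python
import torch

# Assumed setup: `policy` maps a state tensor to action logits, and `trajectory`
# is a list of (state, action, reward) tuples from one episode run by the actor.
def policy_gradient_loss(policy, trajectory):
    total_reward = sum(r for _, _, r in trajectory)  # R(tau)
    log_probs = []
    for state, action, _ in trajectory:
        dist = torch.distributions.Categorical(logits=policy(state))
        log_probs.append(dist.log_prob(torch.tensor(action)))
    # Gradient ascent on R(tau) * sum_t log p(a_t | s_t)
    # = gradient descent on the negated expression.
    return -total_reward * torch.stack(log_probs).sum()
```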

Why subtract a baseline b? Suppose actions A, B, and C should all have their probability increased, but only B and C appear in the sampled trajectories; then B and C get pushed up while A is, relatively, pushed down. So instead of simply increasing the probability of every sampled action, a baseline (such as the mean reward) is subtracted from the weight, so that even if only B and C are sampled, their probabilities are not necessarily increased.

However, even for actions within the same trajectory, the weights should not all be equal, i.e., they should not simply all be multiplied by the trajectory's total reward. Each action affects the trajectory differently; specifically, an action can only influence the rewards that come after it. So the formula is rewritten so that the weight of each action is the sum of the rewards from that time step onward (the reward-to-go).

One further refinement: each subsequent reward is multiplied by a discount factor before being added. That is, the further a reward is from the current action, the less it is attributed to that action.
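
Putting the two refinements together, a small sketch of computing the per-action weights; the discount factor and the choice of the mean as baseline are assumptions:

```python
import numpy as np

def action_weights(rewards, gamma=0.99):
    """Discounted reward-to-go for each step, minus a baseline.

    rewards: per-step rewards r_1, ..., r_T from one episode.
    Returns one weight per action: sum_{t' >= t} gamma^(t'-t) * r_t' - b.
    """
    weights = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # accumulate from the end backwards
        weights[t] = running
    return weights - weights.mean()  # mean return used as a simple baseline b
```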

The whole process

  1. Use the actor to sample some trajectories.
  2. Train the network with the data from these trajectories and update the actor. Each training sample is a state-action pair $(s_t, a_t)$ together with its reward weight $R(s_t, a_t)$.
  3. Use the updated actor to sample new trajectories, and repeat.

On-policy and off-policy

The training procedure of policy-based reinforcement learning is to use the actor to generate trajectory data and then update the actor itself with the rewards of those trajectories. The drawback is that as soon as the actor is updated, the previously collected trajectories no longer come from the current policy. Therefore, two actors are usually defined: one with fixed parameters that is used only to generate trajectory data, and another that uses those trajectories to update itself.

Of course, if one actor trains on data generated by another actor with fixed parameters, using the data directly is biased, because the two have different parameters. So each term is multiplied by the ratio of the two distributions (importance sampling).

The final objective function reduces to

$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t)\sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\right]$$

where $A^{\theta'}(s_t, a_t)$ is the reward weight (advantage) of the $(s_t, a_t)$ pair estimated from the fixed actor's samples.

The parameters of the two actors should not differ too much, so a KL-divergence term is added to constrain them. The constraint is specifically on the difference between the action distributions output by the two actors, not directly on the parameter values.
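
A minimal sketch of this off-policy objective with a KL penalty; the networks, batch fields, and the penalty coefficient `beta` are assumptions, and the penalty form (rather than a clipped form) is used here:

```python
import torch

def off_policy_loss(policy, old_policy, states, actions, advantages, beta=0.01):
    """Importance-weighted objective with a KL penalty (returned negated, to minimize)."""
    logits_new = policy(states)
    with torch.no_grad():
        logits_old = old_policy(states)  # fixed actor that collected the data

    dist_new = torch.distributions.Categorical(logits=logits_new)
    dist_old = torch.distributions.Categorical(logits=logits_old)

    # Importance ratio p_theta(a|s) / p_theta'(a|s)
    ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))

    # KL between the two action distributions (not between the parameters)
    kl = torch.distributions.kl_divergence(dist_old, dist_new)

    objective = ratio * advantages - beta * kl
    return -objective.mean()
```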

Q-Learning

Critic

The critic evaluates how good an actor is.

V(s)

V(s) takes a state as input and outputs the cumulative reward the actor can expect to obtain starting from that state.

How to train V(s)

The first training method is Monte Carlo. Play the game with the actor to obtain trajectory sequences $s_1, a_1, r_1, s_2, a_2, r_2, \dots$; then, for an input state, the training label is the cumulative reward collected from that state until the end of the episode.

The second training method is TD. Whereas Monte Carlo needs a complete episode before it can train, TD only needs a single transition: from the relation $V(s_t) = r_t + \gamma\, V(s_{t+1})$, the network is trained so that $V(s_t) - \gamma V(s_{t+1})$ regresses to the observed reward $r_t$.
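
A sketch of the two regression targets for a value network; the `V` network, the transition fields, and `gamma` are assumptions:

```python
import torch

def monte_carlo_target(rewards, t, gamma=0.99):
    """Label for V(s_t): discounted return from step t to the end of the episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

def td_loss(V, s_t, r_t, s_next, gamma=0.99):
    """One-transition TD target: V(s_t) should match r_t + gamma * V(s_next)."""
    with torch.no_grad():
        target = r_t + gamma * V(s_next)  # treat the bootstrapped target as a constant
    return (V(s_t) - target).pow(2).mean()
```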

Q(s, a)

Q(s, a) takes a state and an action as input and outputs the cumulative reward the actor is expected to obtain if it takes action a in state s and then follows the actor afterwards.

Training process: use the actor to collect some data samples, use those samples to learn the Q function, and then use the new Q function to define a new, better actor for the next round of data collection.

The "new actor" here is not a separate network: it simply selects, in each state, the action with the highest Q value as its output.
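
A minimal sketch of this implicit actor for a discrete action space; the Q network and its input convention are assumptions:

```python
import torch

def greedy_action(Q, state):
    """The 'new actor': pick the action with the highest Q value in this state."""
    with torch.no_grad():
        q_values = Q(state)  # shape: (num_actions,)
    return int(torch.argmax(q_values))
```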

Training technique: target network.

In TD training, a fixed target network $\hat{Q}$ is defined, and its output serves as a fixed label, so that the Q network being updated has a stable regression target. The target network's parameters are periodically copied over from the Q network being trained.
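
A sketch of the TD loss with a fixed target network; the network interfaces, batch shapes, and `gamma` are assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_td_loss(Q, Q_target, s, a, r, s_next, gamma=0.99):
    """Regress Q(s, a) toward r + gamma * max_a' Q_target(s_next, a')."""
    with torch.no_grad():
        target = r + gamma * Q_target(s_next).max(dim=1).values  # fixed label
    q_sa = Q(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q value of the taken actions
    return F.mse_loss(q_sa, target)

# Periodically re-synchronize the fixed target network, e.g.:
# Q_target.load_state_dict(Q.state_dict())
```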

Exploration

To avoid the sampling actor always choosing the currently best action and getting stuck in a local optimum, an exploration mechanism is needed. For example, epsilon-greedy: with a certain probability, choose a random action instead of the action with the highest score.
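
A small epsilon-greedy sketch; `num_actions` and `epsilon` are assumed parameters:

```python
import random
import torch

def epsilon_greedy_action(Q, state, num_actions, epsilon=0.1):
    """With probability epsilon explore randomly, otherwise exploit the best Q value."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(torch.argmax(Q(state)))
```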

Algorithm process

Double DQN

To prevent the Q function from outputting over-estimated values: the Q network being updated selects the action with the highest value (argmax), and that action is then evaluated by the target network. Thus action selection and value estimation are done by two different networks, which reduces the chance of over-estimating an action's value.
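
A sketch of the Double DQN target; the network interfaces and batch shapes are assumptions:

```python
import torch

def double_dqn_target(Q, Q_target, r, s_next, gamma=0.99):
    """Select the action with the online network, evaluate it with the target network."""
    with torch.no_grad():
        best_action = Q(s_next).argmax(dim=1, keepdim=True)     # selection: online network
        target_value = Q_target(s_next).gather(1, best_action)  # evaluation: target network
        return r + gamma * target_value.squeeze(1)
```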

Continuous Q

The Q-learning methods above mainly apply to discrete actions. When the action is continuous, $Q(s, a)$ takes the continuous action vector $a$ as an additional input, and the maximization over $a$ can no longer be done by enumerating discrete actions.

Actor-Critic

Limitations of the policy approach

When the earlier policy-gradient method updates the actor, the reward weight is highly random because it is obtained by sampling: taking the same action in the same state may yield different cumulative rewards in different episodes.

Introducing the critic

Therefore, the critic method is introduced. $Q(s_t, a_t)$ represents the return the actor obtains by taking action $a_t$ in state $s_t$, which matches the meaning of the original per-action weight. The original baseline b was a constant (the mean); $V(s_t)$, the return the actor obtains in state $s_t$ without conditioning on any particular action, can play exactly the role of that b.

However, training both a Q network and a V network is not ideal. So Q is also approximated in terms of V, via $Q(s_t, a_t) \approx r_t + V(s_{t+1})$, and only a single V network is needed; the per-action weight becomes the advantage $r_t + V(s_{t+1}) - V(s_t)$.
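
A tiny sketch of computing this advantage with a single value network; the `V` network, transition fields, and the discount factor are assumptions:

```python
import torch

def advantage(V, s_t, r_t, s_next, gamma=0.99):
    """Advantage with one value network: Q(s,a) - V(s) ~= r + gamma * V(s') - V(s)."""
    with torch.no_grad():
        return r_t + gamma * V(s_next) - V(s_t)
```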

Process

The process: first use the actor to sample trajectory data, train V(s) on that data, and then use V(s) to compute the advantage-weighted gradient-ascent update that produces the new actor.

Pathwise policy gradient

Here the critic directly tells the actor what action to take.

The actor's job is to solve the argmax problem for the critic, similar to the generator in a GAN.

When training this actor, the goal is to make the fixed Q network score the actor's output as highly as possible (again like a GAN, where the generator is trained to get a high score from a fixed discriminator).

The training process

Compared with traditional Q-learning, this method has an explicit actor. The actor is not updated by directly maximizing the sampled reward; instead it is updated so that its output actions receive high scores from the Q network.
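
A minimal sketch of this actor update in a deterministic, DDPG-style formulation; the network interfaces and the assumption that the critic takes (state, action) pairs are illustrative:

```python
import torch

def actor_loss(actor, Q, states):
    """Train the actor so that Q(s, actor(s)) is as high as possible, with Q held fixed."""
    actions = actor(states)            # continuous action vector for each state
    return -Q(states, actions).mean()  # gradient ascent on the critic's score

# The critic Q itself is trained separately with TD targets, as in Q-learning;
# in practice only the actor's parameters are passed to the optimizer for this loss.
```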