
Offline-RL

Concept

Offline RL stands for offline reinforcement learning, also called Batch RL. Its basic setting is: we are given a dataset of transitions $(s_t, a_t, r_t, s_{t+1})$ collected in a reinforcement learning environment, and the goal is to learn the best policy $\pi$ from this dataset alone, without interacting with the environment.

Virtually all off-policy algorithms can be used for offline RL. Off-policy methods such as DQN and DDPG maintain a replay buffer that stores previously collected data; when the replay buffer is large enough, learning from it can be regarded as offline RL.
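As a concrete illustration, here is a minimal sketch of batch Q-learning on a fixed set of transitions. The tabular sizes and the randomly generated dataset are hypothetical stand-ins; the point is only that the policy is derived from stored data, with no further environment interaction.

```python
import numpy as np

# Hypothetical sizes for a small tabular problem (illustration only).
N_STATES, N_ACTIONS = 10, 4
GAMMA, LR, EPOCHS = 0.99, 0.1, 50

# Offline dataset: a fixed collection of (s, a, r, s') transitions,
# assumed to have been gathered in advance by some behavior policy.
rng = np.random.default_rng(0)
dataset = [
    (int(rng.integers(N_STATES)), int(rng.integers(N_ACTIONS)),
     float(rng.normal()), int(rng.integers(N_STATES)))
    for _ in range(1000)
]

# Batch Q-learning: repeatedly sweep over the fixed dataset;
# no new transitions are ever collected.
Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(EPOCHS):
    for s, a, r, s_next in dataset:
        td_target = r + GAMMA * Q[s_next].max()
        Q[s, a] += LR * (td_target - Q[s, a])

# Greedy policy extracted purely from the offline data.
policy = Q.argmax(axis=1)
print(policy)
```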

Comparison with imitation learning

When the data quality is good enough, for example when the trajectories are generated by an expert policy, imitation learning can be applied directly. Offline RL differs from imitation learning in that offline RL can, in theory, recover the optimal policy from offline data sampled by any policy, whereas imitation learning must imitate data sampled by an expert policy.
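For contrast, a minimal behavior-cloning sketch (the simplest form of imitation learning) in the tabular case might look as follows. The expert dataset here is a hypothetical list of (state, action) pairs; note that rewards are never used, so the cloned policy can at best match the expert.

```python
import numpy as np

# Hypothetical expert demonstrations: (state, action) pairs only.
N_STATES, N_ACTIONS = 10, 4
rng = np.random.default_rng(1)
expert_data = [(int(rng.integers(N_STATES)), int(rng.integers(N_ACTIONS)))
               for _ in range(500)]

# Tabular behavior cloning: imitate the action the expert takes
# most often in each state. No rewards are involved.
counts = np.zeros((N_STATES, N_ACTIONS))
for s, a in expert_data:
    counts[s, a] += 1
bc_policy = counts.argmax(axis=1)
print(bc_policy)
```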

Difference between online, off-policy, and offline RL

Online RL: the policy $\pi_{k+1}$ is updated using the stream of data $(s_i, a_i, r_i, s'_i)$ obtained from the environment.

Off-policy RL: add an experience replay buffer to online RL, which stores the data sampled by previous policies $\pi_0, \dots, \pi_k$ (the observations collected under each policy); all of this data is used to update $\pi_{k+1}$.

Offline RL: use an offline dataset $D$ collected by an unknown behavior policy $\pi_\beta$. The dataset is collected only once and is not changed during training. The training process does not interact with the MDP, and the policy is deployed only after training is complete. Offline RL can therefore use large, pre-collected datasets; the offline dataset is fixed.
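The different data flows can be contrasted with a rough sketch. Everything here (the stub environment, the stub update function, the batch size) is a hypothetical placeholder; only the structure of the three loops matters.

```python
import random

# Hypothetical stubs so the three loops below actually run.
def env_step(state, action):
    """Dummy environment transition: returns (reward, next_state)."""
    return random.random(), random.randint(0, 9)

def update_policy(policy, batch):
    """Dummy update; a real agent would take a learning step here."""
    return policy

def act(policy, state):
    """Dummy action selection."""
    return random.randint(0, 3)

# 1) Online RL: update from the stream of transitions as they arrive.
def online_rl(policy, steps=100):
    s = 0
    for _ in range(steps):
        a = act(policy, s)
        r, s_next = env_step(s, a)
        policy = update_policy(policy, [(s, a, r, s_next)])  # only the newest data
        s = s_next
    return policy

# 2) Off-policy RL: keep interacting, but store everything in a replay
#    buffer and update from mini-batches of old and new data.
def off_policy_rl(policy, steps=100):
    buffer, s = [], 0
    for _ in range(steps):
        a = act(policy, s)
        r, s_next = env_step(s, a)
        buffer.append((s, a, r, s_next))
        policy = update_policy(policy, random.sample(buffer, min(32, len(buffer))))
        s = s_next
    return policy

# 3) Offline RL: the dataset is fixed in advance; no interaction at all.
def offline_rl(policy, dataset, epochs=10):
    for _ in range(epochs):
        policy = update_policy(policy, random.sample(dataset, min(32, len(dataset))))
    return policy
```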

Current challenges

Distributional shift

The main current challenge of offline reinforcement learning: offline RL tries to learn, from offline data, a policy that differs from the one that generated the observed data, which creates the problem of distributional shift. That is, the distribution of the offline dataset can differ greatly from the distribution of the data actually observed under the learned policy.

Simply put, the offline data is sampled by $\pi_\beta$, but the policy we actually want to learn is $\pi_\theta$, so the distributions induced by the two policies may differ greatly.
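One simple way to see the mismatch concretely is to compare the action distributions of the two policies at a given state, for example with a KL divergence. The two distributions below are hypothetical numbers chosen only to illustrate the computation.

```python
import numpy as np

# Hypothetical action distributions at one state: the behavior policy
# pi_beta that collected the data vs. the learned policy pi_theta.
pi_beta  = np.array([0.25, 0.25, 0.25, 0.25])
pi_theta = np.array([0.85, 0.05, 0.05, 0.05])

# KL(pi_theta || pi_beta): one rough measure of how far the learned
# policy drifts from the data-collecting policy at this state.
kl = float(np.sum(pi_theta * np.log(pi_theta / pi_beta)))
print(f"KL(pi_theta || pi_beta) = {kl:.3f}")
```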

To overcome this problem, methods such as importance sampling have been proposed, which will be studied and discussed further in later posts.
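As a rough sketch of the importance-sampling idea (assuming discrete actions, a logged behavior-policy probability for each action, and a state-independent target policy, all of which are simplifying assumptions), returns collected under $\pi_\beta$ can be re-weighted to estimate the value of $\pi_\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
N_ACTIONS, GAMMA = 4, 0.99

# Hypothetical logged trajectory: action taken, reward received, and the
# probability the behavior policy pi_beta assigned to that action.
trajectory = [
    {"action": int(rng.integers(N_ACTIONS)),
     "reward": float(rng.normal()),
     "pi_beta_prob": 0.25}          # uniform behavior policy (assumption)
    for _ in range(20)
]

# Target policy pi_theta: an arbitrary fixed action distribution (assumption).
pi_theta = np.array([0.7, 0.1, 0.1, 0.1])

# Per-trajectory importance weight: product over steps of
# pi_theta(a_t) / pi_beta(a_t), applied to the discounted return.
weight, ret = 1.0, 0.0
for t, step in enumerate(trajectory):
    weight *= pi_theta[step["action"]] / step["pi_beta_prob"]
    ret += (GAMMA ** t) * step["reward"]

# Importance-sampled estimate of the return of pi_theta from pi_beta's data.
print(weight * ret)
```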