• Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

What problem does it solve?

Planning has long been one of the hard problems pursued in artificial intelligence research. Tree-based planning algorithms such as AlphaGo have achieved huge success; however, tree-based planning requires a perfect model of the environment, a condition that is hard to meet in the real world.

Background

Model-based reinforcement learning first learns a model of the environment and then plans with that learned model, which reduces how much the agent has to interact with the real environment. Classical planning and control algorithms often rely on an exact model of the controlled system, which is a major obstacle to real-world deployment. The best model-based reinforcement learning algorithms do not try to reconstruct the entire environment; that is, they generally do not reconstruct the full observation sequence. Instead, approaches such as abstract MDPs and latent-space models can be used to estimate value functions efficiently.

What method does it use?

MuZero is an improved version of AlphaZero. It extends AlphaZero to single-agent domains and to environments with non-terminal (intermediate) rewards.

The main idea of the algorithm is to predict the future, but only those quantities that can be used directly for planning, such as the value function. Planning is then carried out on top of these predicted quantities.

  • MuZero algorithm

The model takes an observation (say, a frame of pixels from a game) and encodes it into a hidden state. In this hidden state space, learning and training proceed autoregressively by feeding in actions: at each step, the model must produce (predict) a policy, a value function (the predicted cumulative reward), and an immediate reward (the predicted reward for the current step). The model is trained end-to-end. Unlike Dreamer and PlaNet, the authors do not learn an environment model that reconstructs observations, arguing that it is unnecessary for the model to be able to recover the original pixels from the hidden state. The hidden state only needs to support correct estimates of the policy, the value function, and the immediate reward.
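As a concrete illustration, here is a minimal sketch of the three components just described: a representation function h, a dynamics function g, and a prediction function f. This is not the paper's architecture (which uses convolutional residual networks); the MLP layers, sizes, and names below are made-up placeholders that only show how the pieces fit together.

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- the real MuZero uses conv/residual towers, not small MLPs.
OBS_DIM, ACTION_DIM, HIDDEN_DIM = 64, 4, 32

class Representation(nn.Module):
    """h_theta: past observations -> initial hidden state s^0."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN_DIM), nn.ReLU(),
                                 nn.Linear(HIDDEN_DIM, HIDDEN_DIM))
    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):
    """g_theta: (s^{k-1}, a^k) -> (r^k, s^k)."""
    def __init__(self):
        super().__init__()
        self.state_net = nn.Sequential(nn.Linear(HIDDEN_DIM + ACTION_DIM, HIDDEN_DIM), nn.ReLU(),
                                       nn.Linear(HIDDEN_DIM, HIDDEN_DIM))
        self.reward_head = nn.Linear(HIDDEN_DIM, 1)
    def forward(self, state, action_onehot):
        x = torch.cat([state, action_onehot], dim=-1)
        next_state = self.state_net(x)
        reward = self.reward_head(next_state)
        return reward, next_state

class Prediction(nn.Module):
    """f_theta: s^k -> (policy logits p^k, value v^k)."""
    def __init__(self):
        super().__init__()
        self.policy_head = nn.Linear(HIDDEN_DIM, ACTION_DIM)
        self.value_head = nn.Linear(HIDDEN_DIM, 1)
    def forward(self, state):
        return self.policy_head(state), self.value_head(state)
```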

*(Figure: how MuZero plans with its model (A), acts in the environment (B), and trains the model (C); image omitted.)*

Figure A: (1) Given a hidden state $s^{k-1}$ and a candidate action $a^{k}$, the dynamics model $g$ produces an immediate reward $r^{k}$ and a new hidden state $s^{k}$. (2) The policy $p^{k}$ and value function $v^{k}$ are computed from $s^{k}$ by the prediction function $f$: $p^{k}, v^{k}=f_{\theta}\left(s^{k}\right)$. The initial state $s^{0}$ is obtained by feeding past observations into the representation function $h$, e.g. the image of an empty board. Figure B: The action $a_{t+1}$ is sampled from the search policy $\pi_{t}$. When the environment receives the action, it produces a new observation $o_{t+1}$ and an immediate reward $u_{t+1}$. Figure C: MuZero trains the whole model end to end. Steps A and B yield the training data; with this data in hand, the model is trained so that the policy $p^{k} \approx \pi_{t+k}$, the value function $v^{k} \approx z_{t+k}$, and the reward model $r^{k} \approx u_{t+k}$.
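One quantity in Figure C that the text does not define is the value target $z_{t+k}$: in the paper it is the final game outcome for board games and an $n$-step bootstrapped return for Atari. Below is a minimal sketch of the $n$-step form, assuming 0-indexed Python lists `rewards` (where `rewards[i]` holds $u_{i+1}$) and `root_values` (where `root_values[i]` holds the MCTS root value at step $i$); the function name and indexing convention are illustrative, not the paper's code.

```python
def n_step_value_target(rewards, root_values, t, n, gamma):
    """z_t = u_{t+1} + gamma*u_{t+2} + ... + gamma^(n-1)*u_{t+n} + gamma^n * nu_{t+n}."""
    z = 0.0
    horizon = min(n, len(rewards) - t)           # truncate at the end of the episode
    for i in range(horizon):
        z += (gamma ** i) * rewards[t + i]       # discounted observed rewards
    if t + n < len(root_values):                 # bootstrap with the search value, if still inside the episode
        z += (gamma ** n) * root_values[t + n]
    return z


# Toy usage (hypothetical numbers): 3-step target at t = 0 with gamma = 0.99.
z0 = n_step_value_target(rewards=[1.0, 0.0, 1.0, 0.0],
                         root_values=[0.5, 0.4, 0.6, 0.7],
                         t=0, n=3, gamma=0.99)
```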

Given a time step $t$, for each step $k=0, \ldots, K$, a model $\mu_{\theta}$ with parameters $\theta$ predicts the future conditioned on the past observations $o_{1}, \ldots, o_{t}$ and the future actions $a_{t+1}, \ldots, a_{t+k}$ (for $k>0$):

Policy


$$p_{t}^{k} \approx \pi\left(a_{t+k+1} \mid o_{1}, \ldots, o_{t}, a_{t+1}, \ldots, a_{t+k}\right)$$

Value function


$$v_{t}^{k} \approx \mathbb{E}\left[u_{t+k+1}+\gamma u_{t+k+2}+\ldots \mid o_{1}, \ldots, o_{t}, a_{t+1}, \ldots, a_{t+k}\right]$$

Immediate reward


$$r_{t}^{k} \approx u_{t+k}$$

Here $u$ is the true observed reward, $\pi$ is the policy used to select real actions, and $\gamma$ is the environment's discount factor.

To put it bluntly, you take past observations, encode them into the current hidden state, and then, given future actions, you can plan in the hidden state space.
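A minimal sketch of this idea, with three stand-in callables `h`, `g`, `f` whose signatures follow the definitions above; the function names and the toy usage are illustrative placeholders, not the paper's code.

```python
def unroll(h, g, f, past_obs, future_actions):
    """Encode past observations once, then predict (p^k, v^k, r^k) along a hypothetical action sequence.

    h(past_obs)      -> s0                 (representation)
    g(state, action) -> (reward, state')   (dynamics)
    f(state)         -> (policy, value)    (prediction)
    """
    state = h(past_obs)                    # s^0 from o_1, ..., o_t
    policy, value = f(state)               # k = 0: no reward is predicted at the root
    predictions = [(policy, value, None)]
    for action in future_actions:          # k = 1..K: planning happens purely in hidden-state space
        reward, state = g(state, action)   # r^k, s^k
        policy, value = f(state)           # p^k, v^k
        predictions.append((policy, value, reward))
    return predictions


# Toy usage with dummy stand-ins for h, g and f:
preds = unroll(h=lambda obs: 0.0,
               g=lambda s, a: (1.0, s + a),
               f=lambda s: ([0.5, 0.5], s),
               past_obs=[0, 1, 2],
               future_actions=[1, 0, 1])
```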

  1. To achieve the above, MuZero, like other model-based algorithms, takes two steps: learning an environment model, and planning with a policy.

The environment model needs to provide: (1) state transitions; (2) the actions allowed at each node of the search (to reduce the search space); (3) terminal nodes. The environment model actually consists of two parts, the representation model and the dynamics model $g_{\theta}$:


$$r^{k}, s^{k}=g_{\theta}\left(s^{k-1}, a^{k}\right)$$

The representation function $h_{\theta}$ encodes past observations into the current root node, $s^{0}=h_{\theta}\left(o_{1}, \ldots, o_{t}\right)$. Given such a model, it becomes possible to search over hypothetical future trajectories $a^{1}, \ldots, a^{k}$ conditioned on the past observations $o_{1}, \ldots, o_{t}$.

  2. The planning policy uses the same MCTS as AlphaGo Zero. The search produces a policy
    $$\pi_{t}=\mathrm{P}\left[a_{t+1} \mid o_{1}, \ldots, o_{t}\right]$$
    and a value function
    $$v_{t}=\mathbb{E}\left[u_{t+1}+\gamma u_{t+2}+\ldots \mid o_{1}, \ldots, o_{t}\right]$$
    The loss consists of three parts: a policy loss, a value loss, and a reward loss:


$$l_{t}(\theta)=\sum_{k=0}^{K} l^{\mathrm{p}}\left(\pi_{t+k}, p_{t}^{k}\right)+\sum_{k=0}^{K} l^{\mathrm{v}}\left(z_{t+k}, v_{t}^{k}\right)+\sum_{k=1}^{K} l^{\mathrm{r}}\left(u_{t+k}, r_{t}^{k}\right)+c\|\theta\|^{2}$$
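A minimal sketch of this loss, assuming lists of per-step predictions and targets and using simple cross-entropy / squared-error terms. The paper additionally uses a categorical representation of value and reward on Atari and scales gradients at each unroll step; those details are omitted here, and the function name and arguments are illustrative.

```python
import torch
import torch.nn.functional as F

def muzero_loss(policy_logits, pi_targets,   # p_t^k and pi_{t+k} for k = 0..K
                values, z_targets,           # v_t^k and z_{t+k}  for k = 0..K
                rewards, u_targets,          # r_t^k and u_{t+k}  for k = 1..K
                params, c=1e-4):
    loss = torch.zeros(())
    for logits, pi in zip(policy_logits, pi_targets):        # l^p: cross-entropy to the search policy
        loss = loss - (pi * F.log_softmax(logits, dim=-1)).sum()
    for v, z in zip(values, z_targets):                      # l^v: squared error to the value target
        loss = loss + (v - z) ** 2
    for r, u in zip(rewards, u_targets):                     # l^r: squared error to the observed reward
        loss = loss + (r - u) ** 2
    loss = loss + c * sum((p ** 2).sum() for p in params)    # c * ||theta||^2 regularisation
    return loss
```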

What are the results?

  • The results were impressive: MuZero matched AlphaZero's superhuman performance in Go, chess and shogi without being given the rules, and achieved a new state of the art on the Atari benchmark!

Publication and author information?

The paper was published in Nature (2020). First author: Julian Schrittwieser, a software engineer at DeepMind and a member of the AlphaGo and AlphaZero project teams.