Author: Li Li

Introduction: “No human knowledge required” is achieved through the model + MCTS booster training method. Built on top of the model, the MCTS booster is always stronger than the model itself, and therefore points out the direction in which the model should improve. A stronger model in turn makes the MCTS booster stronger, creating a positive feedback loop. A booster that is always stronger than the model is the key to establishing this loop.

AlphaGo Zero [1] has been out for a while. I should have written a popular-science piece about it as soon as it came out, but I was too lazy and only got around to it now.

The highlight of AlphaGo Zero is that it plays more strongly than previous versions without using any human knowledge at all. The main ideas are: 1) use Monte Carlo tree search to build a booster for the model; 2) during self-play, use the booster to guide the model’s improvement, which in turn further improves the booster.

1. Introduction to Monte Carlo tree search

Monte Carlo tree search (MCTS) is a tree search technique whose tree structure is shown below.

Each node s in the tree represents a Go board position and carries two numbers: the visit count N(s) and the quality Q(s). The visit count N(s) records how many times the node has been visited during the search; MCTS searches a position repeatedly, so a node may be visited many times, as discussed below. The quality Q(s) measures how favorable the position at this node is for AlphaGo, and it is computed as follows.
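In symbols, the two cases described next can be written roughly as follows (a sketch reconstructed from the description in this post; v_θ(s_L) is the value network’s estimate, z_L the rollout outcome, and λ the mixing weight):

```latex
Q(s) =
\begin{cases}
\dfrac{1}{|C(s)|}\displaystyle\sum_{s_i \in C(s)} Q(s_i), & \text{$s$ is a non-leaf node with children $C(s)$,}\\[2ex]
(1-\lambda)\, v_\theta(s_L) + \lambda\, z_L, & \text{$s = s_L$ is a leaf node.}
\end{cases}
```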

This formula says: 1) For a non-leaf node, the quality equals the mean quality of all of its child nodes currently in the tree. 2) For a leaf node, the quality depends both on the winning probability v_θ(s_L) estimated by the value network and on the outcome z_L of games simulated from that position by the fast rollout policy. The leaf’s quality is a weighted mixture of the two, with the mixing parameter λ between 0 and 1.

With the structure of MCTS in place, we can look at how MCTS actually searches. When the opponent plays a stone, AlphaGo takes the resulting board position as the root of the search. The MCTS search process is shown in the figure below and consists of four steps:

  1. Selection: Starting from the root node R, recursively select a child node until a leaf node L is reached. When we are at a node s, how do we choose a child s_i? We should not pick children at random, but should prefer the high-quality ones. AlphaGo’s selection rule is sketched after this list; in it, p(s_i|s) is the output of the policy network. An interesting point is that the more often a node has already been visited, the less likely it is to be selected, which keeps the search diverse.

  2. Expansion: If the game has not ended at node L, a child node C may be created.

  3. Simulation: Compute the quality of node C.

  4. Backpropagation: Update the quality of C’s parent, grandparent, and other ancestors according to C’s quality.
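The selection rule referred to in step 1 can be sketched as follows (a simplified form: the AlphaGo paper states it per action and with an explicit exploration term, which the constant c stands in for here):

```latex
s_i^{*} \;=\; \arg\max_{s_i}\;\Big( Q(s_i) \;+\; c\,\frac{p(s_i \mid s)}{1 + N(s_i)} \Big)
```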

The search steps above are repeated until some termination condition is reached. After the search, MCTS selects the root’s child node with the highest quality, and that child’s move is AlphaGo’s move.
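Putting the four steps together, here is a minimal Python sketch of the search loop just described. It is not AlphaGo’s implementation: `policy_net`, `value_net`, and the game interface (`is_over`, `play`, `rollout_result`) are hypothetical stand-ins, the constant `c` and the λ-mixture follow the description in this section, and for brevity the sketch ignores the sign flip between the two players’ perspectives.

```python
class Node:
    """A search-tree node: a board position plus its visit count N(s) and quality Q(s)."""

    def __init__(self, board, prior=1.0, parent=None):
        self.board = board
        self.prior = prior        # p(s_i | s) from the policy network
        self.parent = parent
        self.children = []
        self.N = 0                # visit count N(s)
        self.Q = 0.0              # quality Q(s)


def mcts_search(root, policy_net, value_net, n_simulations=1600, lam=0.5, c=1.0):
    """Repeat selection / expansion / simulation / backpropagation, then pick a move."""
    for _ in range(n_simulations):
        node = root
        # 1. Selection: walk down to a leaf, preferring children with high quality,
        #    high prior probability, and few visits so far.
        while node.children:
            node = max(node.children, key=lambda ch: ch.Q + c * ch.prior / (1 + ch.N))
        # 2. Expansion: if the game is not over at the leaf L, create child nodes
        #    and step into one of them (the node C of the text).
        if not node.board.is_over():
            for move, prior in policy_net(node.board):
                node.children.append(Node(node.board.play(move), prior, parent=node))
            node = max(node.children, key=lambda ch: ch.prior)
        # 3. Simulation: C's quality mixes the value network's estimate with a
        #    fast-rollout outcome, weighted by lam in [0, 1].
        leaf_value = (1 - lam) * value_net(node.board) + lam * node.board.rollout_result()
        # 4. Backpropagation: update N and Q for C, its parent, and all ancestors.
        while node is not None:
            node.N += 1
            node.Q += (leaf_value - node.Q) / node.N   # incremental mean
            node = node.parent
    # After searching, play the move of the root's highest-quality child.
    return max(root.children, key=lambda ch: ch.Q)
```

Note how the selection line encodes the trade-off from step 1: a child is favored when its quality Q and prior p(s_i|s) are high and its visit count N is still low.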

2. Network structure and training methods

AlphaGo Zero’s network structure differs from previous versions. AlphaGo Zero uses a ResNet architecture, while previous versions used a traditional CNN. In addition, AlphaGo Zero merges the policy network and the value network into a single network that simultaneously outputs action probabilities and an estimated win rate, as shown below.
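A minimal sketch of such a dual-head network, assuming PyTorch; the channel count and block depth are illustrative placeholders rather than the paper’s actual configuration.

```python
import torch
import torch.nn as nn


class DualHeadNet(nn.Module):
    """Sketch of a shared ResNet trunk with a policy head and a value head."""

    def __init__(self, board_size=19, in_planes=17, channels=64, n_blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU())
        # residual trunk shared by both heads
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels))
            for _ in range(n_blocks)])
        self.relu = nn.ReLU()
        # policy head: one logit per board point plus one for "pass"
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board_size * board_size, board_size * board_size + 1))
        # value head: a single win-rate estimate squashed into [-1, 1]
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board_size * board_size, channels), nn.ReLU(),
            nn.Linear(channels, 1), nn.Tanh())

    def forward(self, x):
        h = self.stem(x)
        for block in self.blocks:
            h = self.relu(h + block(h))          # residual connection
        return self.policy_head(h), self.value_head(h)
```

Feeding a batch of shape (N, 17, 19, 19) returns policy logits of shape (N, 362), whose softmax gives the move probabilities, and values of shape (N, 1).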

With the network structure defined, let’s look at how AlphaGo Zero trains through self-play. Plug the model above into MCTS, and MCTS can search out the probabilities of the different actions on the current board. Because these probabilities come from a search, they are better than the action probabilities output by the model alone, so MCTS can be regarded as the model’s booster. Self-play starts from the empty board: MCTS takes the current board s1 as input, outputs action probabilities p1, and an action is sampled from p1 as one player’s move; then MCTS, acting for the opponent, takes the resulting board s2 as input, outputs action probabilities p2, and an action is sampled from p2 as the opponent’s move. This continues until the winner z is determined. The collected data (s1, p1, z), … are then used as training data for the model. The whole training process is shown below.
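Sketching that loop in Python under a hypothetical game interface (a board with `is_over`/`play`/`winner`, an `mcts_policy` that returns a move-to-probability dict, and a `train_step` that fits the network to the search probabilities and the outcome; none of these are AlphaGo Zero’s actual code):

```python
import random


def self_play_game(mcts_policy, board):
    """Play one game in which MCTS (the booster) chooses both sides' moves,
    recording (board, search probabilities) pairs along the way."""
    history = []
    while not board.is_over():
        probs = mcts_policy(board)                    # dict: move -> search probability
        history.append((board, probs))
        moves, weights = zip(*probs.items())
        board = board.play(random.choices(moves, weights=weights)[0])
    z = board.winner()                                # game outcome, e.g. +1 or -1
    # Label every recorded position with the final outcome z.
    # (A full implementation would flip the sign of z for the side to move.)
    return [(s, p, z) for s, p in history]


def train(model, mcts_policy, new_board, train_step, n_games=1000):
    """Alternate between collecting self-play data and updating the model."""
    for _ in range(n_games):
        data = self_play_game(mcts_policy, new_board())
        # Fit the model's action probabilities to the MCTS probabilities p
        # and its value output to the outcome z.
        train_step(model, data)
```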

Here I personally have a question. This training method is clearly different from reinforcement learning based on Markov decision processes (MDPs), yet it is still called reinforcement learning. Is there a broader definition of reinforcement learning?

3. Experimental results

3.1 Comparison of different network structures

There are two changes in AlphaGo Zero’s network structure: 1) ResNet replaces the traditional CNN, and 2) the policy network and value network are combined. As the figure below shows, both changes improve AlphaGo Zero’s performance (sep means the policy and value networks are separate, dual means they are combined; res is the ResNet architecture, CNN the traditional CNN).

3.2 Comparison of different versions of AlphaGo

As the chart below shows, AlphaGo Zero outperforms previous versions despite using no human knowledge. In addition, the figure shows that even after training, the MCTS booster + model combination is still stronger than the model alone.

4. Summary

Other teams are still trying to improve their Go programs in the old ways; who would have expected DeepMind to pull off such a major piece of research that needs no human knowledge at all. “No human knowledge required” is achieved through the model + MCTS booster training method. Built on top of the model, the MCTS booster is always stronger than the model itself, and therefore points out the direction in which the model should improve. A stronger model in turn makes the MCTS booster stronger, creating a positive feedback loop. A booster that is always stronger than the model is the key to establishing this loop.

Many self-media outlets have begun to trumpet this as an important step toward general intelligence. That is not the case. In Go, a game with clear rules and complete information, we have MCTS, a booster that is always better than the model. In more general domains, such boosters are much harder to find.

This article first appeared on the blog www.algorithmdog.com/alphago-zer… and on the WeChat public account AlgorithmDog.


This article is published on the Tencent Cloud technology community with the author’s authorization.

The original link: https://cloud.tencent.com/community/article/192908?fromSource=gwzcw.631407.631407.631407
