In October 2017, AlphaGo Zero arrived, trained completely from scratch: by playing Go only against itself, it became unbeatable. It instantly flooded WeChat Moments, and interpretations from all kinds of experts followed, marveling at the simplicity of the idea and the magic of the result. An open-source reimplementation of AlphaGo Zero was released soon afterwards, but it contained only code and no trained model, because, according to the experts, training AlphaGo Zero on a consumer computer would take about 1,700 years! DeepMind's AlphaGo Zero paper, however, only emphasized that four TPUs were needed at inference time, and did not mention at all that the biggest computational cost of training lies in generating the self-play data, which caused a little controversy. Fortunately, less than two months later, in early December, DeepMind quietly released the more general AlphaZero paper on arXiv. AlphaZero's feat of conquering Go, chess and shogi in just a few hours once again amazed the world, but the 5,000 TPUs DeepMind generously used in the self-play phase also made everyone sigh that "poverty limits our imagination".

That digression aside, let's get back to the main topic of this article: AlphaZero in practice, training an AI from scratch by ourselves, to experience the key ideas and some important technical details behind AlphaZero's success in learning to play through self-play. We chose gobang (Gomoku, five-in-a-row) as the object of practice, because gobang is relatively simple and familiar, so we can focus on the training process of AlphaZero itself, and at the same time we can feel the AI we train gradually getting stronger by playing against it. In practice, we found that for four-in-a-row on a 6*6 board, about 500~1000 games of self-play training (roughly 2 hours) are enough to get a fairly reliable AI; for five-in-a-row on an 8*8 board, about 2,000~3,000 games of self-play training (roughly 2 days) also give a fairly reliable AI. So even though we are poor, we can still experience the charm of the most cutting-edge results! The complete code and 4 trained models have been uploaded to GitHub: github.com/junxiaosong…

Let's first look at two games played by trained AI models (after 3,000 games of self-play training), to get a rough feeling:




Game 1: 400 MCTS simulations per move
Game 2: 800 MCTS simulations per move

As can be seen from the games above, the AI has learned how to play five-in-a-row: it knows when to block and how to win. From my own experience playing against it, beating the AI is not easy; the result is often a draw, and a moment of carelessness can cost you the game. One thing to note is that in the two games shown above, the AI performed only 400 and 800 MCTS simulations per move, respectively. Further increasing the number of simulations can significantly improve the AI's strength (as shown in Figure 2 of the AlphaZero paper: AlphaZero performs only 800 MCTS simulations per move during training, but may perform hundreds of thousands or even millions of MCTS simulations per move when its playing strength is evaluated).

Below, combining the AlphaZero algorithm itself with the concrete implementation on GitHub, I will introduce the whole training process from two aspects, self-play and policy-value network training, together with some observations and experiences from the experiments.

Self-play

Diagram of the self-play process

Learning and improving entirely through self-play is AlphaZero's biggest selling point, and it is also the most critical and time-consuming part of the whole training process. A few key points need to be made here:

1. Which model is used to generate self-play data?

In AlphaGo Zero, we need to keep both the current latest model and the historical best model obtained through evaluation. Self-play data is always generated by the best model and used to continuously train and update the current latest model; every so often the latest model is evaluated against the best model to decide whether to replace the best model. In the AlphaZero version this process is simplified: we only keep the current latest model, and the self-play data is generated directly by this latest model and used to train and update itself. Intuitively, we might expect that self-play data generated by the current best model has higher quality and leads to better convergence. However, after trying both schemes, we found that for four-in-a-row on a 6*6 board, directly using the latest model to generate self-play data takes about 500 games of training to obtain a decent model, while keeping a best model and generating self-play data from it takes about 1,500 games of training to reach a similar level. This is also consistent with the result in the AlphaZero paper that AlphaZero trained for 34 hours beat AlphaGo Zero trained for 72 hours. Using the latest model to generate self-play data may also be an effective means of exploration: on the one hand, the current latest model is usually not much worse than the historical best model, so the quality of the data is still reasonably guaranteed; on the other hand, the constantly changing model lets us cover more typical situations, which speeds up convergence.
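The structural difference between the two schemes is easiest to see as a loop skeleton. The sketch below is a deliberately toy illustration, not real training code: the stand-in functions just manipulate a version counter, and all names are my own.

```python
import random

# Toy stand-ins so the sketch runs on its own; in reality these would wrap
# MCTS self-play, network training, and model-vs-model evaluation.
def selfplay(model):
    return ["(s, pi, z) data generated by model version", model]

def train(model, data):
    return model + 1                      # the "model" here is just a version counter

def latest_beats_best(latest, best):
    return random.random() < 0.55         # placeholder for the 10-game evaluation

def alphago_zero_style(n_iters=10):
    latest = best = 0
    for _ in range(n_iters):
        data = selfplay(best)             # data always comes from the best model
        latest = train(latest, data)
        if latest_beats_best(latest, best):   # periodic evaluation gate
            best = latest
    return latest

def alphazero_style(n_iters=10):
    latest = 0
    for _ in range(n_iters):
        data = selfplay(latest)           # data comes directly from the latest model
        latest = train(latest, data)      # and there is no evaluation gate at all
    return latest
```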

2. How to ensure the diversity of data generated by self-play?

An effective policy-value model needs to be able to accurately evaluate, in all kinds of situations, how good the current position is and how good each possible action in that position is. To train such a policy-value model, the self-play process must cover as many different situations as possible. As mentioned above, always using the latest model to generate self-play data may help cover more situations to some extent, but model variation alone is not enough, so reinforcement learning algorithms usually include a deliberately designed exploration mechanism, which is crucial. In the AlphaGo Zero paper, for the first 30 moves of each self-play game, the action is sampled with probability proportional to the visit count of each branch at the MCTS root node (i.e. a_t ~ π_t in the self-play diagram above), similar to the exploration used in stochastic policy gradient methods; exploration in later moves is done by directly adding Dirichlet noise to the prior probabilities at the root node, P(s,a) = (1 − ε)p_a + ε·η_a, similar to the exploration used in deterministic policy gradient methods. In our implementation, every step of self-play uses both kinds of exploration, with the Dirichlet noise parameter set to 0.3, i.e. η ~ Dir(0.3), and ε = 0.25.
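As a concrete illustration, here is a minimal NumPy sketch (not the repository's exact code; the function and variable names are my own) of selecting a self-play move from the root visit counts with a temperature parameter, with Dirichlet noise mixed directly into the sampling distribution, which is a simplification relative to adding it to the root priors as in the paper:

```python
import numpy as np

def select_selfplay_move(moves, visit_counts, temp=1.0, eps=0.25, alpha=0.3):
    """Sample a self-play move from MCTS root visit counts.

    moves         -- list of legal move indices at the root
    visit_counts  -- visit count N(s, a) for each move in `moves`
    temp          -- temperature; ~1.0 keeps exploration, smaller values become greedy
    eps, alpha    -- Dirichlet-noise mixing weight and concentration
    """
    visit_counts = np.asarray(visit_counts, dtype=np.float64)
    # pi(a) is proportional to N(s, a)^(1 / temp); work in log space for stability
    logits = (1.0 / temp) * np.log(visit_counts + 1e-10)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # exploration: sample from (1 - eps) * pi + eps * Dir(alpha)
    noise = np.random.dirichlet(alpha * np.ones(len(moves)))
    move = np.random.choice(moves, p=(1 - eps) * probs + eps * noise)
    return move, probs   # store the un-noised probs as the training target pi

# Example: 4 legal moves with root visit counts from 400 MCTS simulations
move, pi = select_selfplay_move([5, 12, 20, 33], [10, 250, 120, 20])
```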

3. Always save the self-play data from the current player's perspective

During self-play we collect a series of (s, π, z) data, where s is the situation (the board state), π is the probability distribution computed from the visit counts of each branch at the MCTS root node, and z is the result of the self-play game. Special care is needed here: both s and z are represented from the perspective of the current player at each step. For example, if s uses two binary matrices to represent the two players' stones, then the first matrix represents the stones of the current player and the second matrix represents the stones of the other player; in other words, the first matrix alternately represents the stones of the first-hand and second-hand player, depending on who the current player is in the situation s. z is similar, except that the z value of each (s, π, z) tuple can only be determined after a full game has finished: if the final winner is the current player of situation s, then z = 1; if the final loser is the current player of situation s, then z = -1; and if the game ends in a tie, then z = 0.
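The sketch below (my own illustration, not necessarily the repository's exact code) shows how the z labels might be filled in once a game ends, assuming we recorded which player was to move at every stored step:

```python
import numpy as np

def assign_game_result(current_players, winner):
    """Return the z label for every stored step of one self-play game.

    current_players -- list of player ids (e.g. 1 or 2), one per stored step,
                       giving the player to move in that step's situation s
    winner          -- id of the winning player, or None for a tie
    """
    players = np.asarray(current_players)
    z = np.zeros(len(players), dtype=np.float32)   # tie: all zeros
    if winner is not None:
        z[players == winner] = 1.0     # the current player went on to win
        z[players != winner] = -1.0    # the current player went on to lose
    return z

# Example: a 5-move game won by player 2
print(assign_game_result([1, 2, 1, 2, 1], winner=2))  # [-1.  1. -1.  1. -1.]
```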

4. Expansion of self-play data

The game of Go is invariant under rotation and mirror reflection, and gobang has the same property. In AlphaGo Zero, this property is fully exploited both to augment the self-play data and to improve the robustness of position evaluation when MCTS evaluates leaf nodes. In AlphaZero, however, this property was not used, because chess and shogi, which do not satisfy rotational symmetry, were also considered. In our implementation, the generation of self-play data is itself the computational bottleneck, so in order to collect data and train the model as quickly as possible with very limited computing power, after each self-play game ends we rotate and mirror-flip its data and store all 8 equivalent variants in the self-play data buffer. This rotation-and-flip augmentation also improves the diversity and balance of the self-play data to some extent.
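Here is a minimal NumPy sketch of such 8-fold augmentation (the repository implements the same idea; the exact interface here is my own): each board-shaped feature plane and the move-probability map are rotated by 0/90/180/270 degrees, and each rotation is also mirrored.

```python
import numpy as np

def augment(state_planes, move_probs, board_size, z):
    """Yield the 8 symmetric variants of one training example.

    state_planes -- array of shape (n_planes, board_size, board_size)
    move_probs   -- flat array of length board_size * board_size (the MCTS pi)
    z            -- game result label, unchanged by the symmetry
    """
    probs_2d = move_probs.reshape(board_size, board_size)
    for k in range(4):                       # rotate by 0, 90, 180, 270 degrees
        rot_state = np.rot90(state_planes, k, axes=(1, 2))
        rot_probs = np.rot90(probs_2d, k)
        yield rot_state, rot_probs.flatten(), z
        # mirror flip of each rotation (flip the last board axis in both arrays)
        yield rot_state[:, :, ::-1].copy(), np.fliplr(rot_probs).flatten(), z

# Example: a dummy 4-plane state on a 6*6 board
planes = np.zeros((4, 6, 6), dtype=np.float32)
pi = np.full(36, 1.0 / 36, dtype=np.float32)
print(len(list(augment(planes, pi, board_size=6, z=1.0))))  # 8
```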

Policy-value network training

Schematic diagram of policy-value network training

The so-called policy-value network is a model that, given the current situation, returns the probability of each feasible action in that situation together with a score for the situation. The data collected through self-play is used to train the policy-value network, and the updated policy-value network is immediately plugged back into MCTS for subsequent self-play, so as to generate better self-play data. The two processes are nested and promote each other, forming the whole training loop. The following points are explained separately:

1. Situation description

In AlphaGo Zero, a total of 17 binary feature planes of size 19*19 are used to describe the current situation: the first 16 planes describe the stone positions of both players over the last 8 moves, and the last plane describes the colour of the current player's stones, which effectively indicates who moves first. In our implementation, the description of the situation is greatly simplified. Taking the 8*8 board as an example, we only use four 8*8 binary feature planes. The first two planes represent the stone positions of the current player and the opponent respectively, with 1 at positions holding a stone and 0 elsewhere. The third plane marks the position of the opponent's most recent move, so only one entry of that plane is 1 and all the rest are 0. The fourth and final plane indicates whether the current player is the first-hand player: it is all 1s if so, and all 0s otherwise. In fact, at the very beginning I only used the first two planes, the stone positions of the two sides, because these two planes already express the whole situation intuitively. However, after the latter two feature planes were added, the training effect improved noticeably. My personal guess is that in gobang our next move is usually near the opponent's previous move, so the added third plane is a strong hint for the policy network about which positions should have a higher move probability, which may help training. At the same time, because moving first is a big advantage in gobang, with similar stone positions the value of the current situation depends strongly on whether the current player moves first or second, so the fourth plane indicating the move order may be very meaningful for the value network.
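A minimal NumPy sketch of building these four planes (my own illustration; the repository builds an equivalent state inside its board class, and the data structures assumed here, such as the `stones` dictionary, are hypothetical):

```python
import numpy as np

def build_state(stones, current_player, last_move, first_player, size):
    """Build the 4 binary feature planes described above.

    stones         -- dict mapping (row, col) -> player id for every stone played
    current_player -- id of the player to move in this situation
    last_move      -- (row, col) of the opponent's most recent move, or None
    first_player   -- id of the player who moved first in the game
    size           -- board size (e.g. 8)
    """
    state = np.zeros((4, size, size), dtype=np.float32)
    for (r, c), player in stones.items():
        plane = 0 if player == current_player else 1   # 0: own stones, 1: opponent's
        state[plane, r, c] = 1.0
    if last_move is not None:                          # plane 2: opponent's last move
        state[2, last_move[0], last_move[1]] = 1.0
    if current_player == first_player:                 # plane 3: am I the first-hand player?
        state[3] = 1.0
    return state

# Example: player 1 opened at (3, 3), player 2 replied at (3, 4); player 1 to move
s = build_state({(3, 3): 1, (3, 4): 2}, current_player=1,
                last_move=(3, 4), first_player=1, size=8)
```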

2. Network structure

In AlphaGo Zero, the input position first passes through 20 or 40 convolution-based residual blocks, and then two separate heads of 2 to 3 layers produce the policy and value outputs. The whole network has more than 40 or 80 layers, and both training and prediction are very slow. In our implementation, this network structure is greatly simplified. The first part is a common three-layer fully convolutional network using 32, 64 and 128 3*3 filters respectively, with ReLU activations. The output then splits into a policy head and a value head. In the policy head, four 1*1 filters are first used for dimensionality reduction, followed by a fully connected layer with a softmax nonlinearity that directly outputs the move probability for every position on the board. In the value head, two 1*1 filters are first used for dimensionality reduction, followed by a fully connected layer of 64 neurons and a final fully connected layer that uses a tanh nonlinearity to directly output a situation score between -1 and 1. The whole policy-value network is only 5~6 layers deep, so training and prediction are relatively fast.
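The following PyTorch sketch follows the structure described above; the class and argument names are my own, so treat it as an illustration rather than the repository's exact implementation. Note that the policy head returns log-probabilities, which combine conveniently with the loss in the next subsection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """3 conv layers plus a small policy head and a small value head."""

    def __init__(self, board_width, board_height):
        super().__init__()
        # common 3-layer fully convolutional trunk: 32, 64, 128 3x3 filters
        self.conv1 = nn.Conv2d(4, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        # policy head: 4 1x1 filters -> fully connected -> (log-)softmax over moves
        self.policy_conv = nn.Conv2d(128, 4, kernel_size=1)
        self.policy_fc = nn.Linear(4 * board_width * board_height,
                                   board_width * board_height)
        # value head: 2 1x1 filters -> 64-unit fully connected -> scalar in [-1, 1]
        self.value_conv = nn.Conv2d(128, 2, kernel_size=1)
        self.value_fc1 = nn.Linear(2 * board_width * board_height, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, state):                     # state: (batch, 4, H, W)
        x = F.relu(self.conv1(state))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        p = F.relu(self.policy_conv(x)).flatten(1)
        log_probs = F.log_softmax(self.policy_fc(p), dim=1)
        v = F.relu(self.value_conv(x)).flatten(1)
        v = F.relu(self.value_fc1(v))
        value = torch.tanh(self.value_fc2(v))
        return log_probs, value

# Example: a batch of one 8*8 position with 4 feature planes
net = PolicyValueNet(8, 8)
log_probs, value = net(torch.zeros(1, 4, 8, 8))
```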

3. Training objectives

As mentioned earlier, the input to the policy-value network is the current situation description s, and the outputs are the probability p of each feasible action in that situation and the score v of the situation; the network is trained on the series of (s, π, z) data collected during self-play. According to the policy-value network training diagram above, the goal of training is to make the action probabilities p output by the network closer to the probabilities π output by MCTS, and to make the situation score v output by the network predict the real game outcome z more accurately. From the optimization point of view, we continuously minimize the following loss function on the self-play dataset: l = (z − v)^2 − π^T log p + c||θ||^2, where the third term is a regularization term used to prevent overfitting. Since we are minimizing the loss, during normal training we should observe the loss decreasing slowly. The figure below shows how the loss changed with the number of self-play games while training five-in-a-row on an 8*8 board; in this experiment a total of 3,050 self-play games were played, and the loss gradually decreased from a little over 4 at the start to about 2.2.
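Continuing the PyTorch sketch above (again my own illustration, with assumed variable names), one training step on a mini-batch could look like this. Because the network outputs log-probabilities, the policy term −π^T log p is just a dot product, and the L2 regularization c||θ||^2 can be handled via the optimizer's weight_decay.

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, state_batch, mcts_probs, winner_batch):
    """One update of l = (z - v)^2 - pi^T log p (+ L2 via weight_decay).

    state_batch  -- float tensor (batch, 4, H, W)
    mcts_probs   -- float tensor (batch, H*W), the pi targets from MCTS
    winner_batch -- float tensor (batch,), the z labels in {-1, 0, 1}
    """
    optimizer.zero_grad()
    log_probs, value = net(state_batch)
    value_loss = F.mse_loss(value.view(-1), winner_batch)                 # (z - v)^2
    policy_loss = -torch.mean(torch.sum(mcts_probs * log_probs, dim=1))   # -pi^T log p
    loss = value_loss + policy_loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (hypothetical batch tensors for an 8*8 board):
# net = PolicyValueNet(8, 8)
# optimizer = torch.optim.Adam(net.parameters(), weight_decay=1e-4)  # the c||theta||^2 term
# train_step(net, optimizer, states, probs, winners)
```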


During training, in addition to watching the loss slowly decrease, we usually also pay attention to the entropy of the policy output by the policy-value network (i.e. of the move-probability distribution it outputs). Normally, the policy network starts out producing nearly uniform random probabilities, so the entropy is high. As training progresses, the policy network learns which positions should have a higher move probability in different situations; in other words, the move-probability distribution is no longer uniform, so the entropy becomes smaller. It is precisely because the policy network's output probabilities are biased towards promising moves that MCTS can spend more of its simulations on the more promising positions during the search, and thus achieve good performance with fewer simulations. The figure below shows the observed change in the entropy of the policy network's output during the same training run.
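For reference, this entropy can be monitored with a small NumPy computation per batch (a sketch with assumed shapes, averaging the entropy of the predicted distributions over a batch):

```python
import numpy as np

def mean_policy_entropy(probs_batch):
    """Average entropy of a batch of move-probability distributions.

    probs_batch -- array of shape (batch, n_moves), each row summing to 1
    """
    probs_batch = np.asarray(probs_batch, dtype=np.float64)
    return float(-np.mean(np.sum(probs_batch * np.log(probs_batch + 1e-10), axis=1)))

# A uniform policy on a 6*6 board has entropy log(36) ~ 3.58; it shrinks during training
print(mean_policy_entropy(np.full((2, 36), 1.0 / 36)))
```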


And, of course, the thing we most want to see during the long training process is that the AI we are training keeps getting stronger. So although the AlphaZero algorithm no longer needs regular evaluation to update the best model, our implementation still evaluates the current AI model every 50 self-play games. The evaluation is done by playing 10 games between the current latest AI model and a pure MCTS AI (based on random rollouts). The pure MCTS AI starts with 1,000 simulations per move; whenever it is beaten 10:0 by our trained AI model, it is upgraded by 1,000 simulations per move (2,000, then 3,000, and so on), while our trained AlphaZero AI always uses only 400 simulations per move (a sketch of this evaluation schedule is given after the list below). Over the 3,050 games of self-play training mentioned above, we observed:

  • After 550 games, AlphaZero VS pure_MCTS 1000 reached 10:0 for the first time
  • After 1300 games, AlphaZero VS pure_MCTS 2000 reached 10:0 for the first time
  • After 1750 games, AlphaZero VS pure_MCTS 3000 reached 10:0 for the first time
  • After 2,450 games, AlphaZero VS pure_MCTS 4000 scored 8 wins, 1 draw and 1 loss
  • After 2,850 games, AlphaZero VS pure_MCTS 4000 scored 9:1
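As mentioned above, here is a rough sketch of that evaluation schedule. The `play_one_game` stand-in below just returns a random outcome so the snippet runs on its own; in the real setup it would pit the trained AlphaZero player (400 simulations per move) against the pure-MCTS player.

```python
import random

def play_one_game(alphazero_playouts, pure_mcts_playouts):
    """Stand-in for one evaluation game: 1.0 = win, 0.5 = tie, 0.0 = loss."""
    return random.choice([1.0, 1.0, 1.0, 0.5, 0.0])   # placeholder outcome only

def evaluate(pure_mcts_playouts, n_games=10, alphazero_playouts=400):
    """Play n_games against pure MCTS and decide whether to upgrade the opponent."""
    score = sum(play_one_game(alphazero_playouts, pure_mcts_playouts)
                for _ in range(n_games))
    if score == n_games:                 # a 10:0 sweep makes the opponent stronger
        pure_mcts_playouts += 1000       # 1000 -> 2000 -> 3000 -> ...
    return score, pure_mcts_playouts

# Example: called every 50 self-play games during training
score, next_playouts = evaluate(pure_mcts_playouts=1000)
print(score, next_playouts)
```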

OK, that is basically the whole AlphaZero walkthrough. Interested readers can download my code from GitHub and try it out. To make it easy to play against the trained models directly, I have implemented a pure NumPy version of the policy-value forward network, so as long as you have Python and NumPy installed you can play against the machine right away. Have fun! ^_^

References:

  1. AlphaZero: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
  2. AlphaGo Zero: Mastering the game of Go without human knowledge