Preface

Project link: github.com/hijkzzz/alp…

The AlphaZero algorithm has been around for over a year now, and GitHub hosts a variety of implementations, ranging from low-performance, single-threaded versions in a thousand lines of Python to distributed versions in tens of thousands of lines of C++. None of them, however, satisfies the average algorithm enthusiast's need for a simple, stand-alone, runnable, high-performance AlphaZero implementation.

Demystifying AlphaZero

First, let's take a look at the principle of the AlphaZero algorithm through a diagram.

As the diagram shows, the algorithm flow of AlphaGo Zero is divided into three phases:

  1. Self-play N games (using Monte Carlo tree search) to generate game records
  2. Train the network on the generated game records
  3. Evaluate the newly trained network

Analysis

For a Python version of AlphaZero, which is limited by the GIL, the most time-consuming phase of the process, self-play (see the figure below), cannot be parallelized. So the most direct optimization is to implement the low-level details in a high-performance language such as C++.

The solution

The thread pool

github.com/hijkzzz/alp…

To parallelize the self-play process, we first need a C++ thread pool. There are plenty of resources about thread pools on the web, so I won't go into the details here; a minimal sketch is shown below.
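
The sketch below is a minimal, generic thread pool in the spirit of what the project needs (it is not the project's exact code): worker threads block on a shared task queue, and enqueue() returns a std::future so the caller can collect each task's result later.

    // Minimal thread pool: workers block on a shared task queue until stopped.
    #include <condition_variable>
    #include <functional>
    #include <future>
    #include <memory>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class ThreadPool {
    public:
        explicit ThreadPool(size_t n) {
            for (size_t i = 0; i < n; ++i)
                workers_.emplace_back([this] {
                    for (;;) {
                        std::function<void()> task;
                        {
                            std::unique_lock<std::mutex> lock(mutex_);
                            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                            if (stop_ && tasks_.empty()) return;
                            task = std::move(tasks_.front());
                            tasks_.pop();
                        }
                        task();  // run the task outside the lock
                    }
                });
        }

        // Submit a no-argument callable; the returned future yields its result.
        template <class F>
        auto enqueue(F&& f) -> std::future<decltype(f())> {
            auto job = std::make_shared<std::packaged_task<decltype(f())()>>(std::forward<F>(f));
            {
                std::lock_guard<std::mutex> lock(mutex_);
                tasks_.emplace([job] { (*job)(); });
            }
            cv_.notify_one();
            return job->get_future();
        }

        ~ThreadPool() {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                stop_ = true;
            }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }

    private:
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> tasks_;
        std::mutex mutex_;
        std::condition_variable cv_;
        bool stop_ = false;
    };

The tasks submitted to such a pool are then either whole self-play games or individual MCTS simulations, as described in the next two sections.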

Root Parallelization

As can be seen from the algorithm flow chart, the self-play process is implemented with Monte Carlo tree search, so there are two dimensions along which self-play can be parallelized: Root Parallelization and Tree Parallelization. Root Parallelization means running N games at the same time, with each thread responsible for one game. Tree Parallelization means parallelizing the Monte Carlo tree search (MCTS) within a single game. Root Parallelization is easy to achieve with N threads (a sketch follows); Tree Parallelization is discussed below.
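
As a sketch of Root Parallelization built on the thread pool above (GameRecord and self_play_one_game are hypothetical placeholders, not the project's real names), each game is simply submitted as an independent task:

    #include <future>
    #include <vector>

    struct GameRecord { /* board states, MCTS policies, final result, ... */ };

    GameRecord self_play_one_game();  // plays one full game with MCTS; defined elsewhere

    std::vector<GameRecord> parallel_self_play(ThreadPool& pool, int num_games) {
        std::vector<std::future<GameRecord>> futures;
        for (int i = 0; i < num_games; ++i)
            futures.push_back(pool.enqueue([] { return self_play_one_game(); }));

        std::vector<GameRecord> records;
        for (auto& f : futures)
            records.push_back(f.get());  // block until every game has finished
        return records;
    }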

Tree Parallelization

First, let's analyze how Monte Carlo Tree Search (MCTS) operates:

For each move, MCTS performs M simulations, and each simulation is a recursive process, as follows:

  1. Select. If the current node is not a leaf node, use the UCT selection rule (an exploration-exploitation formula that scores each child from its win-rate value (Q) and the prior probability predicted by the neural network; the higher the win rate or the prior, the more likely the child is to be chosen) to pick the best move, descend to that child, and repeat until the current node is a leaf node. A concrete sketch of this selection score is given after this list.

  2. Expand and evaluate. If the current node is a leaf node, there are two cases:

    • If the game has ended at the current node and one side has won, Backup is performed to update the win-rate values along the search path.
    • If the game is not over, the neural network predicts the win rate of the current node and the prior probabilities of the next layer of moves; these are used to expand the node, and Backup is then performed to update the win-rate values (Q values) along the search path.
  3. Backup. Each node stores a win-rate value (Q value), equal to the accumulated wins divided by the number of visits. Backup updates this value and the visit count for every node on the path, working back from the terminal (or newly evaluated) state.

  4. Play. In the actual game, the move is made by choosing the child of the root node with the most visits (nodes with larger Q values are selected more often during search and therefore accumulate more visits).
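
To make step 1 concrete, the sketch below shows the PUCT-style score from the AlphaGo Zero paper, Q(s,a) + c_puct · P(s,a) · √N(parent) / (1 + N(s,a)); whether the project uses exactly this formula and this value of c_puct is an assumption.

    #include <cmath>

    // PUCT-style selection score: exploitation (q) plus an exploration bonus
    // that favours children with a high prior probability and few visits.
    //   q        - mean win-rate value of the child
    //   p        - prior probability of the move, predicted by the network
    //   n        - visit count of the child
    //   parent_n - visit count of the parent
    double puct_score(double q, double p, unsigned n, unsigned parent_n,
                      double c_puct = 5.0) {
        double u = c_puct * p * std::sqrt(static_cast<double>(parent_n)) / (1.0 + n);
        return q + u;
    }

During Select, the child with the highest puct_score is chosen at every level of the tree.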

Therefore, we can run M' (less than M) simulations at the same time, which requires locking some key data, such as the parent-child relationships of the Monte Carlo tree, the visit counts, and the Q values. Lock-free algorithms have also been developed [5], but because they pre-allocate the tree nodes they use a large amount of memory, which ordinary machines cannot afford, so a locking version of parallel Monte Carlo tree search is used here; a sketch of such a locked node follows.
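
The sketch below illustrates the kind of locking described above (the field names are placeholders, not necessarily the project's): the per-node statistics that several simulation threads may touch at once are guarded by a mutex, and Backup walks up the tree under that protection.

    #include <map>
    #include <memory>
    #include <mutex>

    struct TreeNode {
        TreeNode* parent = nullptr;
        std::map<int, std::unique_ptr<TreeNode>> children;  // move index -> child

        std::mutex lock;        // guards the statistics below
        unsigned n_visits = 0;  // N: visit count
        double w_value = 0;     // W: accumulated simulation results
        double prior = 0;       // P: prior probability from the network

        // Q = W / N (call while holding `lock` if other threads may be backing up)
        double q() const { return n_visits ? w_value / n_visits : 0.0; }

        // Backup (step 3 above): propagate a simulation result towards the root.
        void backup(double value) {
            {
                std::lock_guard<std::mutex> guard(lock);
                ++n_visits;
                w_value += value;
            }
            if (parent) parent->backup(-value);  // the sign flips between the two players
        }
    };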

Virtual Loss

For Tree Parallelization, if we simply run several Monte Carlo searches in parallel, we hit a problem: the M' threads tend to search the same node, and the parallelization becomes meaningless because searching the same node just repeats the same work. Therefore, in the UCT selection rule, when a node is being visited by one thread we add a virtual-loss penalty, so that other threads are unlikely to choose the same node for their searches. A sketch is given below.
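
Here is a sketch of how virtual loss can be layered on the TreeNode above (the constant and the exact bookkeeping are assumptions; implementations vary): while a thread is passing through a node during Select, the node is temporarily treated as if it had already been visited and lost, which lowers its selection score for other threads, and the penalty is removed when the real result is backed up.

    constexpr unsigned kVirtualLoss = 3;  // penalty size; a tunable assumption

    // Called when Select passes through `node`: pretend the visits already happened
    // and were all losses, so other threads see a lower Q and a higher visit count.
    void add_virtual_loss(TreeNode& node) {
        std::lock_guard<std::mutex> guard(node.lock);
        node.n_visits += kVirtualLoss;
        node.w_value  -= kVirtualLoss;
    }

    // Called during Backup, before the real result is added, to undo the penalty.
    void revert_virtual_loss(TreeNode& node) {
        std::lock_guard<std::mutex> guard(node.lock);
        node.n_visits -= kVirtualLoss;
        node.w_value  += kVirtualLoss;
    }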

LibTorch

Since MCTS needs the neural network to predict the win rate and the prior probabilities, C++ would have to call a prediction method implemented in Python, but that would take us back to square one: Python's GIL would force the parallelized self-play back into serial execution. So we use LibTorch, PyTorch's C++ front end, to implement the neural network.
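
A minimal LibTorch inference sketch (the model file name, input encoding, and output format are placeholders): a TorchScript model exported from Python is loaded and evaluated entirely in C++, so no Python interpreter, and therefore no GIL, is involved.

    #include <torch/script.h>

    int main() {
        // load a TorchScript model previously exported with torch.jit from Python
        torch::jit::script::Module net = torch::jit::load("policy_value_net.pt");
        net.to(torch::kCUDA);
        net.eval();

        // one 15x15 board encoded as a float tensor; the channel layout is an assumption
        torch::Tensor board = torch::zeros({1, 4, 15, 15}).to(torch::kCUDA);

        torch::NoGradGuard no_grad;
        auto outputs = net.forward({board}).toTuple();             // assumed (priors, value)
        torch::Tensor priors = outputs->elements()[0].toTensor();  // move probabilities
        torch::Tensor value  = outputs->elements()[1].toTensor();  // predicted win rate
        return 0;
    }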

CUDA Stream

Even after the above work, when the neural network runs on the GPU our program is still not truly parallel. This is because LibTorch issues its forward passes on the default CUDA stream, which is serial by default, so multithreaded prediction calls block one another. There are two ways to avoid this problem: 1. use multiple CUDA streams; 2. merge prediction requests. The approach used here is the second: multiple predictions are merged in a buffer queue and pushed to the GPU at once, which avoids the thread blocking caused by contention for the GPU work stream. A sketch of such a queue follows.
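
The sketch below illustrates the buffer-queue idea (all names are hypothetical and the code is simplified): search threads submit single positions and wait on a future, while one dedicated inference thread drains the queue, stacks the positions into a batch, and runs a single forward pass on the GPU.

    #include <condition_variable>
    #include <future>
    #include <mutex>
    #include <utility>
    #include <vector>
    #include <torch/script.h>

    struct PredictRequest {
        torch::Tensor board;                                           // one encoded position
        std::promise<std::pair<torch::Tensor, torch::Tensor>> result;  // (priors, value)
    };

    class PredictQueue {
    public:
        // Called by the search threads: enqueue a position and wait on the future.
        std::future<std::pair<torch::Tensor, torch::Tensor>> submit(torch::Tensor board) {
            PredictRequest req{std::move(board), {}};
            auto fut = req.result.get_future();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                pending_.push_back(std::move(req));
            }
            cv_.notify_one();
            return fut;
        }

        // Run by a single dedicated inference thread (loops forever in this sketch).
        void serve(torch::jit::script::Module& net) {
            for (;;) {
                std::vector<PredictRequest> batch;
                {
                    std::unique_lock<std::mutex> lock(mutex_);
                    cv_.wait(lock, [this] { return !pending_.empty(); });
                    batch.swap(pending_);  // take everything queued so far
                }
                std::vector<torch::Tensor> boards;
                for (auto& r : batch) boards.push_back(r.board);

                // one forward pass for the whole batch
                auto out = net.forward({torch::stack(boards).to(torch::kCUDA)}).toTuple();
                torch::Tensor priors = out->elements()[0].toTensor().to(torch::kCPU);
                torch::Tensor values = out->elements()[1].toTensor().to(torch::kCPU);

                for (size_t i = 0; i < batch.size(); ++i)
                    batch[i].result.set_value({priors[i], values[i]});
            }
        }

    private:
        std::vector<PredictRequest> pending_;
        std::mutex mutex_;
        std::condition_variable cv_;
    };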

SWIG

www.swig.org/tutorial.ht…

Finally, we wrap the C++ code with SWIG into a Python interface for the main program to call. Although this incurs some performance overhead, it greatly improves development efficiency.
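
As an illustration, a SWIG interface file for this kind of wrapping might look like the sketch below (the module and header names are placeholders, not the project's actual layout); running swig -c++ -python mcts.i then generates the Python binding that the training script can import.

    /* mcts.i -- SWIG interface sketch */
    %module mcts

    %{
    // headers whose declarations are compiled into the generated wrapper
    #include "mcts.h"
    %}

    %include "std_vector.i"   // std::vector <-> Python list conversions
    %include "mcts.h"         // expose the declared classes and functions to Python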

Test

After testing, the parallelized version trains at least 10 times faster. As a rough calculation: if each MCTS uses 4 threads and 4 games are played simultaneously, that is 4×4 = 16x, so a speedup of this order is reasonable once the overhead of locking, the buffer queue, and the Python interface is taken into account. And as long as the GPU is powerful enough, increasing the number of threads continues to improve performance. In the end, I spent a day training a standard 15×15 Gomoku model on a GTX 1070, and it was able to beat an average human player.

References

  1. Mastering the Game of Go without Human Knowledge
  2. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
  3. Parallel Monte-Carlo Tree Search
  4. An Analysis of Virtual Loss in Parallel MCTS
  5. A Lock-free Multithreaded Monte-Carlo Tree Search Algorithm