Video Introduction: Menger for Large-Scale Distributed Reinforcement Learning

In the past decade, reinforcement learning (RL) has emerged as one of the most promising research areas in machine learning, both for solving complex real-world problems (such as chip placement and resource management) and for mastering challenging games (e.g., Go, Dota 2, and hide-and-seek). Simply put, an RL infrastructure is a loop of data collection and training, in which actors explore the environment and collect samples that are then sent to a learner to train and update the model. Most current RL techniques require many iterations over batches of millions of samples from the environment to learn the target task (for example, Dota 2 learns from batches of 2 million frames every 2 seconds). Therefore, an RL infrastructure should not only scale efficiently (for example, by increasing the number of actors) to collect a large number of samples, but also be able to iterate quickly over these samples during training.
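To make this collect-and-train cycle concrete, here is a minimal, self-contained Python sketch of the loop. The toy environment, random policy, and in-memory buffer are illustrative stand-ins only, not Menger components.

```python
import random
from collections import deque

class ToyEnv:
    """Trivial stand-in environment used only to illustrate the loop."""
    def reset(self):
        return 0.0
    def step(self, action):
        obs, reward, done = random.random(), float(action == 1), random.random() < 0.1
        return obs, reward, done

def actor_loop(env, policy, buffer, num_steps):
    """Actor: explore the environment and collect samples into the buffer."""
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs)                        # inference on the actor
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

def learner_step(buffer, batch_size):
    """Learner: draw a batch of collected samples; a real learner would take a gradient step here."""
    return random.sample(list(buffer), min(batch_size, len(buffer)))

buffer = deque(maxlen=100_000)
actor_loop(ToyEnv(), lambda obs: random.randint(0, 1), buffer, num_steps=1_000)
batch = learner_step(buffer, batch_size=256)
```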

Today we introduce Menger, a large-scale distributed RL infrastructure with localized inference that scales to thousands of actors across multiple processing clusters (such as Borg cells), thereby reducing the overall training time for the chip placement task. In this article, we describe how we implemented Menger with Google TPU accelerators for fast training iterations and demonstrate its performance and scalability on the challenging chip placement task. Menger achieved up to an 8.6x reduction in training time compared to the baseline.

Menger system design

There are various distributed RL systems, such as Acme and SEED RL, each focused on optimizing a single particular design point in the space of distributed reinforcement learning systems. For example, while Acme uses local inference on each actor and frequently retrieves the model from the learner, SEED RL benefits from a centralized inference design by allocating a portion of the TPU cores to perform batched inference calls. The tradeoffs between these design points are (1) paying the communication cost of sending/receiving observations and actions to/from a centralized inference server, versus paying the communication cost of retrieving the model from the learner, and (2) the cost of inference on the actors (e.g., CPUs) versus on the accelerators (e.g., TPUs/GPUs). Because of the requirements of our target application (e.g., the sizes of observations, actions, and the model), Menger uses local inference in a similar way to Acme, but pushes the scalability of the actors to a virtually unbounded limit. The key challenges to achieving massive scalability and fast training on accelerators include:

1. Servicing a large number of read requests from actors to the learner for model retrieval can easily throttle the learner and quickly becomes a major bottleneck as the number of actors increases (e.g., significantly increasing the convergence time).

2. TPU performance is often limited by the efficiency of the input pipeline that feeds training data to the TPU compute cores. As the number of TPU compute cores increases (as in a TPU Pod), the performance of the input pipeline becomes even more critical to the overall training run time.

Efficient model retrieval

To address the first challenge, we introduce transparent, distributed caching components between the learner and the actors, optimized in TensorFlow and backed by Reverb (a similar approach was used in Dota). The primary responsibility of the caching components is to balance the large number of requests from actors against the learner's workload. Adding these caching components not only significantly reduces the pressure on the learner to service read requests, but also allows the actors to be distributed across multiple Borg cells with minimal communication overhead. Our study shows that for a 16 MB model with 512 actors, the introduced caching components reduce the average read latency by about 4.0x, leading to faster training iterations, especially for on-policy algorithms such as PPO.
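A rough sketch of this caching idea is shown below. It assumes a hypothetical learner_client.get_weights() RPC and uses plain Python threads, whereas Menger's actual caching components are built with TensorFlow and backed by Reverb.

```python
import threading
import time

class ModelCache:
    """Illustrative cache sitting between the learner and the actors of one Borg cell.

    Actors read weights from the cache rather than from the learner, so the
    learner serves one refresh request per cache per interval instead of one
    request per actor.
    """

    def __init__(self, learner_client, refresh_interval_s=10.0):
        # `learner_client` is a hypothetical stub exposing get_weights().
        self._learner_client = learner_client
        self._refresh_interval_s = refresh_interval_s
        self._lock = threading.Lock()
        self._weights = learner_client.get_weights()  # initial copy from the learner
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            time.sleep(self._refresh_interval_s)
            fresh = self._learner_client.get_weights()  # one read against the learner
            with self._lock:
                self._weights = fresh

    def get_weights(self):
        """Called by many actors; never touches the learner directly."""
        with self._lock:
            return self._weights
```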

High-throughput input pipeline

To provide a high-throughput input data pipeline, Menger uses Reverb, a recently open-sourced data storage system designed for machine learning applications that provides an efficient and flexible platform to implement experience replay for a variety of on-policy/off-policy algorithms. However, in a distributed RL setup with thousands of actors, a single Reverb replay buffer service does not currently scale well and becomes inefficient in terms of the write throughput from actors.
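For reference, the snippet below shows what a single-server Reverb replay buffer looks like using Reverb's public API; the table name, capacity, and rate limiter are arbitrary illustration values rather than Menger's configuration.

```python
import reverb

# One Reverb server hosting one replay table (illustrative settings).
server = reverb.Server(tables=[
    reverb.Table(
        name='replay_buffer',
        sampler=reverb.selectors.Uniform(),
        remover=reverb.selectors.Fifo(),
        max_size=1_000_000,
        rate_limiter=reverb.rate_limiters.MinSize(1),
    ),
])

# Actors push experience through a client...
client = reverb.Client(f'localhost:{server.port}')
client.insert([0.0, 1.0, 0.5], priorities={'replay_buffer': 1.0})

# ...and the learner's input pipeline samples it back from the same table.
for sample in client.sample('replay_buffer', num_samples=1):
    print(sample)
```

With thousands of actors, all of that insert traffic lands on the single server above, which is exactly the write-throughput limitation examined next.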

To better understand the efficiency of the replay buffer in a distributed setting, we evaluated the average write latency for payload sizes ranging from 16 MB to 512 MB and for 16 to 2048 actors, with the replay buffer and the actors placed in the same Borg cell. As the number of actors increases, the average write latency increases significantly: growing the number of actors from 16 to 2048 increases the average write latency by ~6.2x and ~18.9x for payload sizes of 16 MB and 512 MB, respectively. This increase in write latency negatively affects the data collection time and leads to overall inefficiency in training time.

To mitigate this, we use the sharding capability provided by Reverb to increase the throughput between the actors, the learner, and the replay buffer services. Sharding balances the write load from a large number of actors across multiple replay buffer servers instead of throttling a single replay buffer server, and it also minimizes the average write latency on each replay buffer server (because fewer actors share the same server). This enables Menger to scale effectively to thousands of actors across multiple Borg cells.
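The sketch below illustrates the sharding idea under simple assumptions: several independent Reverb servers act as shards, and each actor writes to exactly one of them, chosen by actor id. The shard count and table settings are illustrative, and Menger's actual shard placement across Borg cells is more involved than this single-process example.

```python
import reverb

NUM_SHARDS = 4  # illustrative; real deployments size this to the actor count

# Several independent replay buffer servers, each holding one shard.
servers = [
    reverb.Server(tables=[
        reverb.Table(
            name='replay_buffer',
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            max_size=250_000,
            rate_limiter=reverb.rate_limiters.MinSize(1),
        ),
    ])
    for _ in range(NUM_SHARDS)
]
clients = [reverb.Client(f'localhost:{s.port}') for s in servers]

def shard_for_actor(actor_id: int) -> reverb.Client:
    """Each actor writes to one shard, so fewer actors share any one server."""
    return clients[actor_id % NUM_SHARDS]

# Actor 17 writes only to its own shard.
shard_for_actor(17).insert([0.0, 1.0, 0.5], priorities={'replay_buffer': 1.0})
```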

Case study: Chip placement

We examined the benefits of Menger on the complex task of chip placement for a large netlist. Using 512 TPU cores, Menger achieved a significant improvement in training time (up to ~8.6x, reducing the training time from ~8.6 hours down to just one hour in the fastest configuration) compared to a strong baseline. While Menger is optimized for TPUs, we believe the key factor behind this performance gain is the architecture, and we expect to see similar improvements when it is tailored to GPUs.

We believe that the Menger infrastructure and its promising results on the complex chip placement task demonstrate a path to further shorten the chip design cycle, and have the potential to enable further innovation not only in the chip design process but also in other challenging real-world tasks.
