The next generation of AI applications will need to interact continuously with their environment and learn from those interactions. This places new demands on the performance and flexibility of the underlying system, demands that most existing machine learning frameworks cannot meet. To address this, UC Berkeley developed a new distributed framework called Ray, and recently published a paper on arXiv: "Ray: A Distributed Framework for Emerging AI Applications."

The paper's first authors are Philipp Moritz and Robert Nishihara, PhD students at the UC Berkeley AMPLab; Michael I. Jordan and Ion Stoica are also among the authors.

Michael I. Jordan is a Distinguished Professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at UC Berkeley. He is a member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences. In 2016, Semantic Scholar named him the "most influential computer scientist."

Ion Stoica is a Professor of Computer Science at UC Berkeley, a co-founder of AMPLab, and a core author of Chord, Spark, and Mesos.

Shortcomings of current computing frameworks

Most AI applications today are developed around a rather restricted paradigm, supervised learning, in which a model is trained offline and then deployed to a server for online prediction. As the field matures, machine learning applications will increasingly need to operate in dynamic environments, react to environmental changes, and take sequences of actions to accomplish a given goal. These requirements fit naturally into the Reinforcement Learning (RL) paradigm, i.e. continuous learning in an uncertain environment.

RL applications differ from traditional supervised learning applications in three ways:

1) RL applications rely heavily on simulation to explore states and the outcomes of actions. This requires enormous amounts of computation; a real application might run hundreds of millions of simulations.

2) The computation graph of an RL application is heterogeneous and changes dynamically. A simulation may take anywhere from a few milliseconds to a few minutes, and its results determine the parameters of future simulations.

3) Many RL applications, such as robot control or autonomous driving, must act quickly in response to a changing environment.

Therefore, we need a computing framework that supports heterogeneous and dynamic computation graphs while processing millions of tasks per second with millisecond-level latency. Current frameworks either cannot meet the latency requirements of common RL applications (MapReduce, Apache Spark, CIEL) or use static computation graphs (TensorFlow, Naiad, MPI, Canary).

RL applications require flexibility, performance, and ease of development; the Ray system is designed to meet all three requirements.

Example

Classical RL training application pseudocode

Sample Python code implemented with Ray

In Ray, you declare remote functions and actors via @ray.remote. When a remote function or an actor method is called, a future (an object ID) is returned immediately. Calling ray.get() on that ID synchronously retrieves the corresponding object, and the ID can be passed into subsequent remote function and actor method calls to encode task dependencies. Each actor has an environment object, self.env, whose state is shared between its tasks.
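Since the paper's code listings are not reproduced here, the pattern can be illustrated with a toy model in plain Python (this is not the real Ray API): a "remote" call returns an object ID immediately, and get() resolves it, so IDs passed between calls encode task dependencies.

```python
import uuid

# Toy model of the future/object-ID pattern described above.
# A real system would execute tasks asynchronously; this toy runs them
# eagerly just to show how IDs flow between calls.
object_store = {}

def remote(func, *args):
    """Run func, store the result, and return only its object ID (a future)."""
    resolved = [object_store[a] if isinstance(a, uuid.UUID) else a for a in args]
    oid = uuid.uuid4()
    object_store[oid] = func(*resolved)
    return oid

def get(oid):
    """Synchronously fetch the object behind an ID, as ray.get() does."""
    return object_store[oid]

# Object IDs can be passed to later calls to encode task dependencies.
a = remote(lambda: 2)
b = remote(lambda x: x * 10, a)  # depends on a via its object ID
```

Here `get(b)` returns 20 only after the task behind `a` has produced its value, which is exactly the dependency the ID encodes.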

The figure above shows the task graph produced by calling train_policy.remote(). Remote function and actor method invocations create tasks in the task graph. The figure contains two actors; the stateful edges between each actor's tasks mean that they share mutable state. There are control edges between train_policy and the tasks it invokes. To train multiple policies in parallel, train_policy.remote() can be called several times.

Design

To support the heterogeneous and dynamic workloads of RL applications, Ray uses a dynamic task graph computation model similar to CIEL's. In addition to CIEL's task-parallel abstraction, Ray provides an actor abstraction on top of the execution model, enabling stateful components such as third-party simulators.

Ray system architecture

To meet stringent performance requirements while supporting dynamic computation graphs, Ray adopts a new distributed architecture that scales horizontally. The architecture consists of two parts: an application layer and a system layer. The application layer implements the API and the computation model and executes distributed computing tasks. The system layer handles task scheduling and data management to meet the performance and fault-tolerance requirements.

Ray system architecture

The architecture is based on two key ideas:

1) Global Control Store (GCS). All of the system's control state is stored in the GCS, so the other components of the system can be stateless. This not only simplifies fault tolerance (on failure, a component simply reads the latest state from the GCS and restarts), but also lets those components scale horizontally (replicas or shards of a component can share state through the GCS).

2) Bottom-up distributed scheduler. Drivers and workers submit tasks bottom-up to their local scheduler. The local scheduler can choose to schedule a task locally or forward it to the global scheduler. Allowing local decisions reduces task latency, and relieving the global scheduler increases system throughput.

Bottom-up distributed scheduler
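The two-level decision can be sketched in a few lines of plain Python (the capacity threshold and class names are invented for illustration, not Ray's actual implementation): a local scheduler accepts a task while it has spare capacity, otherwise it escalates the task to the global scheduler.

```python
# Toy sketch of bottom-up scheduling: schedule locally when possible,
# forward to the global scheduler only when the node is overloaded.
GLOBAL_QUEUE = []  # stands in for the global scheduler's queue

class LocalScheduler:
    def __init__(self, capacity):
        self.capacity = capacity
        self.running = []

    def submit(self, task):
        if len(self.running) < self.capacity:
            self.running.append(task)   # local decision: low latency
            return "local"
        GLOBAL_QUEUE.append(task)       # overloaded: defer to global
        return "global"

node = LocalScheduler(capacity=2)
placements = [node.submit(f"task{i}") for i in range(3)]
# placements == ["local", "local", "global"]
```

The design point is that the common case never touches the global scheduler at all, which is how local decisions cut latency while reducing the global scheduler's load.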

Performance

1) Scalability and performance

End-to-end scalability. The main benefit of the GCS is the system's enhanced horizontal scalability: task throughput grows almost linearly. Ray exceeds 1 million tasks per second on 60 nodes and scales linearly past 1.8 million tasks per second on 100 nodes. The rightmost data point shows that Ray can process 100 million tasks in under a minute (54 s).

The global scheduler's main responsibility is to maintain load balance across the system. The driver submits 100K tasks on the first node, which the global scheduler balances across the 21 available nodes.

Object store performance. For large objects, single-client throughput exceeds 15 GB/s (red); for small objects, the object store reaches 18K IOPS (cyan), about 56 microseconds per operation.

2) Fault tolerance

Recovering from object failure. As worker nodes are terminated, the surviving local schedulers automatically trigger reconstruction of the lost objects. During reconstruction, tasks originally submitted by the driver stall because their dependencies cannot be satisfied, but overall task throughput remains stable, fully utilizing the available resources until the lost dependencies are rebuilt.

Fully transparent fault tolerance for distributed tasks. The dotted line shows the number of nodes in the cluster; the curves show the throughput of new tasks (cyan) and re-executed tasks (red). By 210 s, as more and more nodes are added back to the system, Ray fully recovers its original task throughput.
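The reconstruction mechanism behind this can be modeled as a toy sketch (plain Python, not Ray's implementation): each object records the task, i.e. the function and input IDs, that produced it, so a lost object can be recomputed recursively from its lineage.

```python
# Toy sketch of lineage-based recovery.
store = {}    # object id -> value (the "object store")
lineage = {}  # object id -> (function, input ids) that produced it

def run_task(oid, fn, deps=()):
    """Execute a task, recording its lineage alongside its result."""
    lineage[oid] = (fn, deps)
    store[oid] = fn(*(store[d] for d in deps))

def reconstruct(oid):
    """Recompute a lost object, rebuilding missing dependencies first."""
    if oid not in store:
        fn, deps = lineage[oid]
        for d in deps:
            reconstruct(d)
        store[oid] = fn(*(store[d] for d in deps))
    return store[oid]

run_task("x", lambda: 3)
run_task("y", lambda x: x + 4, deps=("x",))
del store["x"], store["y"]        # simulate losing a node's objects
assert reconstruct("y") == 7      # recomputed transitively from lineage
```

Because lineage, not data, is what must survive a failure, storing it in the GCS is enough for any node to redo the lost work.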

Recovering from actor failure. By encoding each actor's method calls into the dependency graph, we can reuse the same object reconstruction mechanism.

At t = 200 s, we stop 2 of the 10 nodes, so 400 of the 2,000 actors in the cluster must be recovered on the remaining nodes. (a) shows the extreme case where no intermediate actor state is stored: methods that called the lost actors must be re-executed serially (t = 210-330 s). The lost actors are automatically redistributed across the available nodes, and throughput fully recovers after reconstruction. (b) Under the same workload, a checkpoint is stored automatically every 10 method calls per actor; after a node fails, actor state is reconstructed from the checkpoints (t = 210-270 s).
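The checkpoint-plus-replay scheme in case (b) can be sketched as follows (the counter actor and the cadence constant are illustrative stand-ins, not Ray's API): state is persisted every 10 method calls, and recovery loads the checkpoint and replays only the calls made after it.

```python
# Toy sketch of actor checkpointing: checkpoint every 10 calls,
# recover by replaying only the calls since the last checkpoint.
CHECKPOINT_EVERY = 10

class CounterActor:
    def __init__(self):
        self.state = 0
        self.log = []         # method calls since the last checkpoint
        self.checkpoint = 0   # last persisted state

    def add(self, n):
        self.state += n
        self.log.append(n)
        if len(self.log) == CHECKPOINT_EVERY:
            self.checkpoint = self.state  # persist a checkpoint
            self.log = []                 # truncate the replay log

    def recover(self):
        """Rebuild state after a failure: checkpoint + replayed calls."""
        state = self.checkpoint
        for n in self.log:
            state += n
        return state

actor = CounterActor()
for _ in range(13):
    actor.add(1)
# checkpoint holds 10; only the last 3 calls need to be replayed
```

This is why recovery in (b) finishes faster than in (a): the replay window is bounded by the checkpoint interval instead of the actor's whole history.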

GCS replication overhead. To make the GCS fault tolerant, we replicate each database shard: when a client writes to a GCS shard, the write is replicated to all of its replicas. By reducing the number of GCS shards, we artificially make the GCS the bottleneck of the workload; even then, two-way replication adds less than 10% overhead.

3) RL application

We implemented two RL algorithms with Ray. Compared with systems built specifically for these algorithms, Ray matches and even surpasses them. Moreover, distributing these algorithms across a cluster with Ray required changing very few lines of the algorithm implementations.

ES Algorithm (Evolution Strategies)

Time required for Ray and the reference system to reach a score of 6000 on the Humanoid-v1 task with the ES algorithm.

The ES algorithm implemented on Ray scales well up to 8192 cores, while the purpose-built system fails beyond 1024 cores. On 8192 cores, we achieve a median time of 3.7 minutes, twice as fast as the previous best.

Proximal Policy Optimization

To evaluate Ray's performance on a single node and on smaller RL workloads, we implemented the PPO algorithm on Ray and compared it with an implementation built on OpenMPI.

Time required for the MPI and Ray implementations of PPO to reach a score of 6000 on the Humanoid-v1 task.

The PPO algorithm implemented with Ray outperforms the specialized MPI implementation while using fewer GPUs.

Controlling a simulated robot

Experimental results show that Ray meets the soft real-time requirements of controlling a simulated robot. Ray's driver can run the simulated robot and take actions at fixed intervals, from 1 millisecond to 30 milliseconds, to simulate different real-time requirements.
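A fixed-interval action loop of this kind can be sketched in plain Python (the period, step count, and step function are placeholders, not values from the paper): each iteration takes an action and then sleeps for whatever time remains in the period.

```python
import time

# Toy sketch of a soft real-time control loop: act at a fixed period
# (here 5 ms, within the 1-30 ms range mentioned above) by sleeping
# away the time the simulation step left over.
def control_loop(step_fn, period_s=0.005, steps=10):
    actions = 0
    for _ in range(steps):
        start = time.monotonic()
        step_fn()                                 # simulate + choose an action
        actions += 1
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, period_s - elapsed))  # hold the fixed interval
    return actions

n = control_loop(lambda: None)
```

The "soft" in soft real-time is visible in the `max(0.0, ...)`: if a step overruns its period, the loop simply carries on rather than failing, so a deadline miss degrades control quality instead of crashing the driver.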

Future work

Given the generality of the workloads, specialized optimizations are difficult. For example, scheduling decisions must be made without full knowledge of the computation graph, so Ray's scheduling may require more sophisticated designs. In addition, the lineage stored for each task requires a garbage-collection policy to bound storage costs in the GCS; this is currently under development.

When the GCS becomes a bottleneck, it can be extended by adding more shards, and the global scheduler can be scaled out similarly. Currently, the number of GCS shards and global schedulers must be set manually; in the future, adaptive algorithms will be developed to adjust these numbers automatically. Given the advantages of the GCS architecture, the authors believe that centralized control state will be a key design element of future distributed systems.

Ray: A Distributed Framework for Emerging AI Applications

The open source project site: ray.readthedocs.io/en/latest/…

Thanks to CAI Fangfang for proofreading this article.