Dynamic graph computing

Fuses stream computing and graph computing in one system.

Applied to risk control, data analysis, graph machine learning, and other scenarios.

Online machine learning

Runs the entire online learning pipeline in one system: stream computing, model training, and model serving.

Applied to all kinds of recommendation and advertising systems.
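The points above describe the architecture at a high level. As a rough, hypothetical sketch (the actor names and training logic below are illustrative, not Ant's actual implementation), such a pipeline can be expressed with plain Ray actors, with stream ingestion, training, and serving living in one Ray application:

```python
import ray
import numpy as np

ray.init()

@ray.remote
class Trainer:
    """Consumes streaming mini-batches and keeps an up-to-date model."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def train(self, x, y, lr=0.01):
        # One SGD step on a linear model (illustrative only).
        grad = x.T @ (x @ self.w - y) / len(y)
        self.w -= lr * grad

    def get_weights(self):
        return self.w

@ray.remote
class Server:
    """Serves predictions using the latest weights pulled from the trainer."""
    def __init__(self, trainer):
        self.trainer = trainer

    def predict(self, x):
        w = ray.get(self.trainer.get_weights.remote())
        return x @ w

trainer = Trainer.remote(dim=8)
server = Server.remote(trainer)

# "Stream computing": feed mini-batches to the trainer as they arrive.
for _ in range(100):
    x = np.random.randn(32, 8)
    y = x @ np.arange(8) + 0.1 * np.random.randn(32)
    trainer.train.remote(x, y)

print(ray.get(server.predict.remote(np.random.randn(4, 8))))
```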

Online computing

Implements distributed computing inside online services.

Applied to financial decision-making systems.

Distributed operations research

Lets users quickly develop highly available and highly reliable distributed optimization algorithms.

Applied to online resource allocation problems.
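As a hedged illustration of what distributed operations research on Ray can look like (a generic sketch, not Ant's actual framework), many small allocation subproblems can be fanned out as Ray tasks; the `allocate` function and its LP formulation below are hypothetical:

```python
import ray
import numpy as np
from scipy.optimize import linprog

ray.init()

@ray.remote
def allocate(budget, returns, costs):
    """Solve one small allocation subproblem: maximize expected return
    subject to a budget constraint (a tiny LP, purely illustrative)."""
    # linprog minimizes, so negate the objective to maximize returns.
    res = linprog(
        c=-returns,
        A_ub=[costs],
        b_ub=[budget],
        bounds=[(0, 1)] * len(returns),
        method="highs",
    )
    return res.x

# Fan out many independent subproblems, e.g. one per user segment.
futures = [
    allocate.remote(
        budget=10.0,
        returns=np.random.rand(5),
        costs=np.random.rand(5) * 4,
    )
    for _ in range(1000)
]
allocations = ray.get(futures)
print(len(allocations), allocations[0])
```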

Massive data processing (WIP): Mars (Python, github.com/mars-project/mars)

Distributed NumPy, pandas, and scikit-learn.
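For context, Mars exposes NumPy/pandas-style APIs whose execution is chunked and distributed. A minimal open-source usage sketch (independent of Ant's internal deployment; the shapes and chunk sizes are arbitrary):

```python
import mars.tensor as mt
import mars.dataframe as md

# Tensor computation with a NumPy-like API, split into chunks and run in parallel.
t = mt.random.rand(10000, 10000, chunk_size=2000)
print((t + 1).sum().execute())

# DataFrame computation with a pandas-like API.
df = md.DataFrame(mt.random.rand(10000, 4), columns=list("abcd"))
print(df.describe().execute())
```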

Core takeaway: Ray serves as the unified chassis for all of these distributed systems!

Take a look at Ray’s scale at Ant Financial

The 2018 Ray architecture is shown below:

The next four parts explain the challenges Ray faced at Ant and the corresponding optimizations.

Challenge 1: Actor call performance

Online learning is now needed in many scenarios, and its streaming data processing requires high-performance actor calls from Ray.
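To make the hot path concrete: an actor call here is a remote method invocation on a stateful Ray actor, which streaming pipelines issue at very high rates. A minimal example (the `Counter` actor is illustrative):

```python
import ray

ray.init()

@ray.remote
class Counter:
    """A stateful actor; every .remote() invocation below is one actor call."""
    def __init__(self):
        self.value = 0

    def add(self, delta):
        self.value += delta
        return self.value

counter = Counter.remote()
# Streaming workloads issue calls like this at very high frequency,
# so per-call overhead dominates end-to-end latency.
refs = [counter.add.remote(1) for _ in range(1000)]
print(ray.get(refs)[-1])  # -> 1000
```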

However, all actor tasks originally had to go through the Raylet, so the Raylet's performance bottleneck became a challenge for online learning. We therefore optimized actor calls with a Direct mode, as shown below:

Actors communicate directly with each other through gRPC.

The caller obtains the callee's address through Redis Pub/Sub.

The RPC layer is implemented in C++.

In later versions, the Ray community also switched normal tasks to Direct mode.

The performance improvement is shown below:

Figure: actor call throughput benchmarks (Java)
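The chart comes from a Java micro-benchmark. A comparable, purely illustrative way to measure actor call throughput in Python (not the benchmark behind the figure) is:

```python
import time
import ray

ray.init()

@ray.remote
class Echo:
    def ping(self, x):
        return x

actor = Echo.remote()
n = 10_000

start = time.time()
# Fire all calls first, then wait on the results, to measure pipelined throughput.
ray.get([actor.ping.remote(i) for i in range(n)])
elapsed = time.time() - start
print(f"{n / elapsed:.0f} actor calls/second")
```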

Other RPC improvements, such as optimizing the core code path of Ray calls, reducing task spec copies, caching in the Java JNI layer, and using a separate IO thread, yield the following performance improvements:

Challenge 2: Actor failure recovery

In the dynamic graph computing system, when one actor fails, all actors need to be restarted.

Requirement: a fast and reliable actor failover mechanism.

As shown above:

Actors could be created multiple times, especially in large clusters or under poor network conditions; the workaround of increasing the timeout compromises actor restart speed.

Actor management mechanism based on GCS

To address these problems, we optimized with a GCS Service; the architecture design is shown in the figure below:

The newly designed GCS architecture = RPC service + pluggable backend storage.

This realizes a fast and reliable actor failure recovery mechanism:

100% restart success rate;

Restart speed: 1 actor in ~1.5 seconds; 10,000 actors in ~70 seconds.
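The GCS-driven restart described above is internal to Ray. From application code, recovery behavior is expressed through actor restart options; a minimal sketch using the open-source `max_restarts`/`max_task_retries` options (illustrative, not Ant's internal configuration):

```python
import ray

ray.init()

@ray.remote(max_restarts=-1, max_task_retries=2)
class Model:
    """An actor that Ray restarts automatically if its process dies."""
    def __init__(self):
        self.version = 0

    def update(self):
        self.version += 1
        return self.version

model = Model.remote()
print(ray.get(model.update.remote()))
# If the actor process is killed, Ray restarts it (re-running __init__)
# and retries in-flight calls up to max_task_retries times.
```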

GCS fault tolerance:

The GCS Service can recover data from back-end storage.

Users are free to choose a reliable backend storage system.

Other GCS service-based features:

Node management.

Job management.

Placement groups (see the sketch after this list).

Optimized actor scheduling policies.
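Placement groups are also part of the open-source Ray API. A minimal sketch of reserving resource bundles and scheduling an actor into them (exact option names vary across Ray versions; this assumes `ray.util.placement_group` and the scheduling-strategy API):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve two 1-CPU bundles, spread across different nodes when possible.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="SPREAD")
ray.get(pg.ready())  # Block until the reservation is fulfilled.

@ray.remote(num_cpus=1)
class Worker:
    def hello(self):
        return "hello"

w = Worker.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
print(ray.get(w.hello.remote()))
```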

Challenge 3: Scalability and stability

In production, failures due to environmental problems or occasional bugs are common. As clusters grow, Ray becomes less stable, and the growing number of clusters makes operations and maintenance (O&M) harder.

System fault tolerance

Ray itself should be fault-tolerant:

GCS: the GCS Service recovers its state from backend storage.

Raylet: a daemon monitors and restarts the Raylet.

Node: when a node fails, the K8s operator automatically provisions a replacement node.

With Ray, users should be able to write fault-tolerant code easily:

Tasks/Actors: retry/restart APIs (sketched after this list).

Placement Group: Implements an automatic failover mechanism.

We have also implemented libraries for several common fault-tolerance scenarios, for example, groups of actors that live and die together.
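For completeness, the task-level retry API in open-source Ray looks like the sketch below (the "live and die together" library mentioned above is Ant-internal and not shown here):

```python
import ray

ray.init()

@ray.remote(max_retries=3)
def flaky_task(x):
    # If the worker process dies mid-execution, Ray re-executes this task
    # up to max_retries times before raising an error to the caller.
    return x * 2

print(ray.get(flaky_task.remote(21)))  # -> 42
```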

Improved scalability of a single Ray cluster

Scalability optimizations:

Separate heartbeat and resource reporting requests.

Optimize data structures to reduce the size of RPC and Pub/Sub messages.

Reduce the number of connections between components.

Allow multiple Java actors to share one JVM process.

Currently, a single Ray cluster can support thousands of nodes and thousands of actors.

Deployment pattern optimization

Challenge 4: Debuggability and ease of use

Problems:

Distributed systems are often difficult to debug.

The Ray core team often spends a lot of time helping users debug.

How can we make Ray easier to debug and use?

The new dashboard

Various state information.

Resource usage, node/actor status, cluster/job configuration, etc.

Logs/events.

(WIP) Automatically analyzes errors based on log and event data.

Integrates common debugging tools:

C++/Python/Java stack dump tools, profiling tools, memory analysis tools, etc.

Job submission

Simplify the Job submission process:

Users can submit jobs using the Web platform or RESTful interfaces.

When submitting a job, users can declare its dependencies directly, and Ray automatically downloads them to the cluster (see the sketch after this list).

Dependency caching is implemented for Python/Java. Submitting and starting a Job typically takes only a few seconds.

Integrating with the community’s Runtime Env functionality.
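As a point of reference, the community's job submission plus Runtime Env workflow can be sketched with the open-source `JobSubmissionClient`; the address, entrypoint, and dependencies below are placeholders:

```python
from ray.job_submission import JobSubmissionClient

# Address of the Ray cluster's dashboard / job server (placeholder).
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    # Command to run on the cluster.
    entrypoint="python train.py",
    # Dependencies declared at submission time; Ray downloads them to the
    # cluster through the runtime environment mechanism.
    runtime_env={
        "working_dir": "./",
        "pip": ["numpy", "pandas"],
    },
)
print(client.get_job_status(job_id))
```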

That concludes the walkthrough of how Ray is applied at Ant Financial. Thanks for reading!

End


About the author: "Against the Current" Mr Li, who received 7 offers in autumn campus recruitment. CSDN blog: blog.csdn.net/weixin_382…