This is the 22nd day of my participation in the November Gwen Challenge (the last Gwen Challenge of 2021).
The purpose of artificial intelligence systems
Provide more efficient programming languages, frameworks, and tools.
- More expressive and concise neural network computing primitives and programming languages
- More intuitive editing, debugging and experimentation tools
- System issues throughout the deep learning lifecycle: model compression, inference, security, privacy protection, etc.
- Support for a full range of learning paradigms: reinforcement learning, automated machine learning (AutoML), etc.
Provide more powerful and scalable computing capability.
- Automatic compilation and optimization algorithms
- Automatic derivation of the computation graph
- Automatic parallelization for different architectures
- Automatic distribution and scaling out to multiple compute nodes
- Continuous optimization of model performance
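To make "distribution to multiple compute nodes" a little more concrete, here is a minimal sketch of data parallelism, one common way to spread training across workers. It is an illustration only: the "workers" are ordinary function calls rather than real nodes, and the names `local_gradient`, `split_evenly`, `all_reduce_mean`, and `data_parallel_step` are hypothetical helpers invented for this example, not part of any framework.

```python
# Illustrative sketch: simulated data parallelism for a 1-D linear model y = w * x.
# Each "worker" is just a function call here; a real system would run them on
# separate devices or nodes and use a collective such as all-reduce.

def local_gradient(w, shard):
    """Mean gradient of the loss 0.5 * (w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def split_evenly(data, num_workers):
    """Split data into num_workers equal-sized contiguous shards."""
    if len(data) % num_workers != 0:
        raise ValueError("data size must be divisible by num_workers")
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def all_reduce_mean(values):
    """Stand-in for an all-reduce collective: average a value across workers."""
    return sum(values) / len(values)


def data_parallel_step(w, data, num_workers, lr=0.1):
    """One gradient-descent step with gradients computed per shard, then averaged."""
    shards = split_evenly(data, num_workers)
    grads = [local_gradient(w, shard) for shard in shards]  # would run in parallel
    return w - lr * all_reduce_mean(grads)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=2, lr=0.05)
print(w)  # converges toward 2.0
```

Because the shards are equal in size, averaging the per-shard gradients gives the same update as computing the gradient over the whole batch on one worker, which is what lets a system distribute the work without changing the result.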
Explore and solve problems of system design, implementation and evolution under new challenges.
Basic components of artificial intelligence systems
- Experience: an end-to-end AI user experience covering models, algorithms, pipelines, experiments, tools, and lifecycle management
- Framework: programming interfaces, the computation graph, automatic gradient computation, IR (intermediate representation), and compiler infrastructure
- Runtime: the deep learning runtime, including the optimizer, planner, and executor
- Architecture (single node and cloud): hardware APIs (GPU, CPU, FPGA, ASIC), resource management and scheduling, and a scalable network stack (RDMA, IB, NVLink)
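The framework layer's "computation graph" and "automatic gradient computation" can be illustrated with a very small reverse-mode differentiation sketch. This is a toy written for this article, not how any particular framework implements it; the `Node` class and `backward` function are hypothetical names, and only addition and multiplication are supported.

```python
class Node:
    """A value in a computation graph, linked to the nodes it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse-mode differentiation: accumulate d(output)/d(node) on every node."""
    order, seen = [], set()

    def visit(node):
        # Depth-first traversal that records nodes in topological order.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


x = Node(3.0)
y = Node(4.0)
z = x * y + x  # building the expression also builds the graph
backward(z)
print(z.value, x.grad, y.grad)  # 15.0 5.0 3.0
```

The user only writes the forward expression `x * y + x`; the graph recorded while evaluating it is enough to compute the gradients, which is the division of labor the framework layer is meant to provide.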
Artificial intelligence system ecology
The first is the core system hardware and software, which mainly includes:
- Runtime and optimization environment for deep learning tasks
- General-purpose resource management and scheduling systems
- New hardware and the associated high-performance networking and computing stacks
On top of this, we have deep learning algorithms and frameworks, which mainly include:
- Efficient new general-purpose AI algorithms for a wide range of uses
- Support and evolution of multiple deep learning frameworks
- Deep neural network compiler architecture and optimization
Finally, the broader AI ecosystem includes:
- New machine learning paradigms such as reinforcement learning (RL)
- Automatic Machine Learning (AutoML)
- Security and Privacy
- Model inference, compression, and optimization
Therefore, solving problems systematically requires a comprehensive and deep understanding of the whole problem space, because a weakness in any one dimension may affect the entire system.
A typical AI system platform example
Taking OpenPAI as an example, the basic functions of a typical AI platform are shown below:
- Provides interfaces and tools for users to submit training tasks
- Provides a file management system for managing large-scale training data
- Provides a runtime environment through which deep learning frameworks access computing resources (GPU, FPGA, ASIC, TPU, and CPU), network resources (IB/RDMA), and storage resources (HDFS, NFS, and Ceph)
- Efficient scheduling algorithms
- Allocation of heterogeneous computing resources
- Error recovery and fault-tolerance management
- Log and performance monitoring system
- User and security management
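As a rough illustration of what "scheduling" and "allocating heterogeneous computing resources" mean in a platform like this, the sketch below places jobs on nodes with a simple first-fit policy. It is not OpenPAI's scheduler or API; `ComputeNode`, `Job`, and `schedule` are hypothetical names invented for this example, and a production scheduler would also handle priorities, preemption, and failures.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    name: str
    free: dict  # device type -> number of free devices, e.g. {"GPU": 4}


@dataclass
class Job:
    name: str
    device: str  # required device type, e.g. "GPU" or "FPGA"
    count: int   # number of devices requested


def schedule(jobs, nodes):
    """First-fit: place each job on the first node with enough free devices."""
    placement = {}
    for job in jobs:
        for node in nodes:
            if node.free.get(job.device, 0) >= job.count:
                node.free[job.device] -= job.count  # reserve the devices
                placement[job.name] = node.name
                break
        else:
            placement[job.name] = None  # no node fits; the job stays pending
    return placement


nodes = [
    ComputeNode("node-a", {"GPU": 2, "CPU": 16}),
    ComputeNode("node-b", {"GPU": 4, "FPGA": 1}),
]
jobs = [
    Job("train-1", "GPU", 4),
    Job("train-2", "GPU", 2),
    Job("infer-1", "FPGA", 1),
    Job("train-3", "GPU", 2),
]
placement = schedule(jobs, nodes)
for name, node in placement.items():
    print(name, "->", node or "pending")
```

In this run the last GPU job stays pending because both nodes have already handed out their GPUs, which is the kind of situation where smarter scheduling policies and the monitoring functions listed above become important.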