MindSpore's "Four Ways" to meet the challenges of all-scenario AI framework deployment

Abstract: All-scenario AI refers to the ability to quickly apply deep learning to hardware devices in different scenarios across cloud, edge, and device, including cloud servers, mobile terminals, and IoT devices, so that they run efficiently and collaborate effectively.

This article is shared from Huawei Cloud Community “AI Framework Challenges and MindSpore Solutions”, by HWCloudai.

Challenges of a unified AI framework for all scenarios

All-scenario AI means that deep learning can be applied quickly to hardware in different scenarios across cloud, edge, and device, including cloud servers, mobile terminals, and IoT devices, running efficiently and collaborating effectively. For a framework, this brings three major challenges: rapid deployment, efficient execution, and device-cloud collaboration.

Rapid deployment

How can a trained model be deployed quickly to cloud servers, mobile terminals, and various IoT devices for inference, or even incremental training?

Inference on a cloud server is usually deployed as a service: the trained model is pushed to the cloud server through a remote interface call (gRPC/REST), and users invoke the cloud inference service interface to run inference. On mobile terminals and IoT devices, hardware resources are limited, and the cloud-side model and inference runtime are too large to deploy directly. Model compression and a lightweight runtime therefore become the key to deployment on mobile terminals and IoT devices.

To meet the lightweight challenge on mobile terminals and IoT devices, providing an independent lightweight device-side AI framework is a good solution. Such a framework may need more than one form, because rich terminals such as smartphones and thin terminals such as earring-sized wearables face different challenges. Rich terminals generally have fairly ample storage and a certain amount of computing power. Thin terminals are far more constrained: the framework's baseline footprint has to be kept on the order of 100 KB, so a full runtime cannot be put on the device, and AI developers still need a general-purpose solution.

Do a lightweight device-side framework and good model compression and conversion technology make rapid deployment possible? Problems remain. If the device-side architecture is separate from the cloud-side architecture and the two implementations are inconsistent (for example, different model IR definitions, different operator definitions, or different inference APIs), a model trained on the cloud side may not convert smoothly for execution on the device side, and cloud-side inference code cannot be reused on the device side.

In a typical framework, the path from a cloud-side trained model to device-side deployment is as follows:

This approach currently has several problems:

The first problem: it is difficult to keep the two sets of model definitions consistent. For example, an operator that exists on one side is often missing on the other, which causes model conversion to fail.
The second problem: functions needed on both the cloud and the device are developed twice and may be inconsistent. For example, the fusion optimizations that improve inference performance have to be implemented on both sides, and inconsistent data processing leads to accuracy problems.
The third problem: a model trained on the cloud side needs a relatively complex conversion before it can be used for online training on the device side.
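
The first problem can be made concrete with a small sketch. The operator names and the device operator set below are hypothetical, chosen only to show how a converter might detect that a cloud-trained model uses an operator the device runtime does not implement.

```python
# Hypothetical set of operators implemented by a device-side runtime.
DEVICE_OPS = {"Conv2D", "BatchNorm", "ReLU", "MatMul", "Softmax"}

def missing_on_device(model_ops, device_ops=DEVICE_OPS):
    """Return the operators a model uses that the device runtime lacks."""
    return sorted(set(model_ops) - set(device_ops))

# Hypothetical cloud-trained model that uses one operator the device lacks.
model = ["Conv2D", "BatchNorm", "ReLU", "LayerNorm", "MatMul", "Softmax"]
unsupported = missing_on_device(model)
if unsupported:
    print("conversion would fail, unsupported operators:", unsupported)
```

When both sides share one set of operator definitions, this check cannot fail by construction, which is the motivation for a unified definition.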

Can the inconsistencies of a separated device-cloud framework be resolved by a standard such as ONNX? In practice this is difficult: the AI industry is developing rapidly and new operator types keep emerging, so standards struggle to keep up. The solution should therefore come from the AI framework itself.

Efficient execution

Efficient execution across all scenarios breaks down into efficient operators, an efficient runtime, and efficient models, with the goal of extracting the maximum computing power from heterogeneous hardware and improving the performance and energy efficiency of AI algorithms.

Operator performance has to be optimized at several levels, from the algorithm down to low-level instructions. At the algorithm level, for example, the Winograd algorithm delivers a substantial performance improvement over im2col+GEMM on many classic convolutional neural networks.

However, Winograd does not outperform im2col+GEMM in every scenario. In the figure below, Winograd performs worse when the shape is 224x224x3x64. Choosing the optimal algorithm for the conditions at hand is therefore crucial to performance.
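
To see where Winograd's saving comes from, here is a minimal sketch of the one-dimensional case F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is a textbook illustration, not MindSpore's kernel, and the function names are ours.

```python
def direct_conv_1d(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """The same two outputs with 4 multiplications (Winograd F(2,3))."""
    # Filter transform; it can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four multiplications on the transformed input.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
print(direct_conv_1d(d, g), winograd_f23(d, g))
```

The price of the saved multiplications is the extra additions in the input and output transforms, which is one reason the algorithm does not win for every shape.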

Algorithm-level optimization mainly improves performance by reducing the number of computations (multiplications) at run time, while instruction-level optimization aims to make full use of the hardware's computing power. On CPUs, the key factors that affect instruction execution speed include the L1/L2 cache hit rate and instruction pipelining. Common optimization methods are:

Choose a suitable data layout, such as NHWC or NC4HW4.
Allocate registers sensibly. Registers can be divided by purpose into feature map registers, weight registers, and output registers; a sensible allocation reduces the number of data loads.
Prefetch data. Prefetch/preload instructions bring data into the cache ahead of time.
Reorder instructions to minimize pipeline stalls.
Vectorize computation with SIMD instructions, such as ARM NEON or x86 SSE/AVX.

These optimizations require a deep understanding of the hardware architecture.
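
As an illustration of the data layout point, the sketch below repacks a tensor from NCHW into NC4HW4, where channels are grouped in blocks of four so that the four channel values of one pixel sit next to each other in memory (a natural fit for a 4-lane float SIMD register). The function name is ours and the code is written for clarity rather than speed.

```python
def nchw_to_nc4hw4(data, n, c, h, w):
    """Repack a flat NCHW list into NC4HW4: shape [N, ceil(C/4), H, W, 4]."""
    c4 = (c + 3) // 4                      # number of 4-channel blocks
    out = [0.0] * (n * c4 * h * w * 4)     # padded channels stay zero
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    block = ((ni * c4 + ci // 4) * h + hi) * w + wi
                    out[block * 4 + ci % 4] = data[src]
    return out

# 1 image, 3 channels, 1x2 pixels: channel values are 1.x, 2.x, 3.x.
nchw = [1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
packed = nchw_to_nc4hw4(nchw, n=1, c=3, h=1, w=2)
print(packed)
```

With three channels, the fourth slot of each block is zero padding, which is the memory cost paid for aligned vector loads.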

The device-side runtime mainly faces the challenges of heterogeneity and asynchronous parallelism. At the model level, most models appear to execute serially during inference. But if operators are broken down internally into fine-grained kernels, the overall execution flow is still a dataflow graph with many opportunities for asynchronous parallelism. The device side also has a large number of heterogeneous devices, and if a model executes on several types of device, different pipelines arise between them as well.

Model performance depends mainly on offline optimization and tuning, an area the industry has practiced for many years. The general idea is to combine rule-based optimization passes with offline operator tuning.
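
The parallelism hidden in a dataflow graph can be sketched by grouping kernels into levels: every kernel in a level depends only on earlier levels, so kernels within one level could run concurrently. The graph and kernel names below are hypothetical.

```python
def parallel_levels(deps):
    """Group the nodes of a DAG into levels of mutually independent nodes.

    deps maps each kernel name to the kernels whose outputs it consumes.
    """
    remaining = {node: set(inputs) for node, inputs in deps.items()}
    levels = []
    while remaining:
        ready = sorted(n for n, inputs in remaining.items() if not inputs)
        if not ready:
            raise ValueError("graph has a cycle")
        levels.append(ready)
        for n in ready:
            del remaining[n]
        for inputs in remaining.values():
            inputs.difference_update(ready)
    return levels

# Hypothetical two-branch block: both branches read the same input.
graph = {
    "input": [],
    "conv_a": ["input"],
    "conv_b": ["input"],
    "pool_b": ["conv_b"],
    "add": ["conv_a", "pool_b"],
}
levels = parallel_levels(graph)
print(levels)
```

A real runtime would schedule asynchronously rather than level by level, but the levels already show that "conv_a" and "conv_b" do not have to wait for each other.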

Device-cloud collaboration

Device-cloud collaboration mainly covers three modes: cloud-side training with device-side inference; cloud-side training with device-side incremental training and device-side inference; and cloud-side/device-side federated learning.

Cloud-side training with device-side inference focuses on how to generate the best device-side model, which includes model compression and adaptive model generation. Model compression was described earlier. Neural architecture search (NAS) is often used to generate models that satisfy given constraints (for example, the extreme memory limits of a microcontroller), and its biggest problem is how to shorten the search time.

Cloud-side training with device-side incremental training focuses on efficient model conversion between the cloud and the device, as described in the previous section.

Federated learning in the industry currently has two main schools. The first is horizontal federated learning, which aggregates along the sample dimension; a typical application is privacy protection on mobile devices. The second is vertical federated learning, which aggregates along the feature dimension and targets big-data cooperation across institutions and organizations, especially data security and privacy protection in banking and finance.
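
For the horizontal case, the aggregation step can be sketched as a weighted average of client model weights, in the style of the well-known FedAvg algorithm. The article does not say which aggregation rule MindSpore uses, so treat this as a generic illustration with made-up numbers.

```python
def federated_average(client_updates):
    """Weighted average of client weights (a FedAvg-style aggregation step).

    client_updates is a list of (num_samples, weights) pairs, where weights
    is a flat list of floats with the same length for every client.
    """
    total = sum(n for n, _ in client_updates)
    size = len(client_updates[0][1])
    avg = [0.0] * size
    for n, weights in client_updates:
        for i, w in enumerate(weights):
            avg[i] += w * n / total
    return avg

# Two hypothetical clients; the second holds three times as much data.
updates = [
    (100, [0.2, 1.0]),
    (300, [0.6, 2.0]),
]
global_weights = federated_average(updates)
print(global_weights)
```

Only model weights leave each client; the raw training samples stay on the device.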

Architecture for privacy protection on mobile devices

Cross-agency, cross-organization big data collaboration architecture

Federated learning still faces many technical challenges. System heterogeneity across devices and communication during algorithm iterations both affect the efficiency and accuracy of the final federated aggregation. Model encryption during federated learning is another concern, because part of the private information can be inferred even from the weights alone, and so are client poisoning attacks and adversarial examples. A further challenge is architectural: there is currently no unified framework that supports both horizontal and vertical federated learning.

MindSpore’s solution for a unified architecture for all scenarios

Unified device-cloud core

MindSpore uses a layered framework design that decouples the data structures and modules shared by the device and the cloud. This keeps the device-cloud architecture consistent while keeping the device side lightweight, so that a model is trained once and deployed seamlessly, and the same model can be trained on both the device and the cloud.

[Unified IR] The unified IR in MindSpore Core keeps model and operator definitions consistent between the device and the cloud, so that models trained on the cloud side can be deployed seamlessly on the device side. For device-side training, the same IR as on the cloud side can be used to retrain the model.

The unified IR defines the logical structure of the model and the attributes of its operators, and is decoupled from how the model is persisted. ProtoBuffer and FlatBuffer are the most widely used ways to persist data in open source projects. ProtoBuffer is more powerful and more flexible to use, but correspondingly heavier. FlatBuffer is lighter and faster to deserialize. MindSpore persists the logical data of the unified IR in different physical forms: ProtoBuffer on the cloud side and FlatBuffer on the device side, which gives both data consistency and lightweight deployment.
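
The idea of one logical definition with two physical forms can be sketched with standard-library stand-ins. Here JSON plays the role of the richer format and a fixed-layout `struct` the role of the compact one; neither is ProtoBuffer or FlatBuffer, and the `OpNode` class and `OP_CODES` table are hypothetical.

```python
import json
import struct
from dataclasses import dataclass, asdict

@dataclass
class OpNode:
    """Logical IR node: independent of how it is stored."""
    op_type: str
    kernel_size: int
    stride: int

OP_CODES = {"Conv2D": 1, "MaxPool": 2}   # hypothetical op-type table

def to_verbose(node):
    """Self-describing text form (stand-in for the cloud-side format)."""
    return json.dumps(asdict(node)).encode("utf-8")

def to_compact(node):
    """Fixed-layout binary form (stand-in for the device-side format)."""
    return struct.pack("<BBB", OP_CODES[node.op_type],
                       node.kernel_size, node.stride)

def from_compact(blob):
    """Rebuild the same logical node from the compact form."""
    code, kernel_size, stride = struct.unpack("<BBB", blob)
    names = {v: k for k, v in OP_CODES.items()}
    return OpNode(names[code], kernel_size, stride)

node = OpNode("Conv2D", kernel_size=3, stride=1)
print(len(to_verbose(node)), len(to_compact(node)))
```

Both forms round-trip to the same `OpNode`, so code that works on the logical structure does not care which form was loaded.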

[Common Pass] To improve performance, a trained model goes through a set of optimizations before inference, including fusion, constant folding, and data layout adjustment. The optimizations shared by the device and the cloud are also part of the MindSpore Core module. For cloud-side inference they are performed online at inference time, while for mobile terminals they are completed offline before inference runs.
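
Constant folding is the easiest of these passes to show in a few lines. The sketch below works on a toy expression format of our own (nested tuples), not on MindSpore's IR: any operation whose inputs are all constants is replaced by its result before inference.

```python
def fold_constants(expr):
    """Recursively replace operations on constants with their result.

    expr is a number (constant), a string (runtime input), or a tuple
    (op, left, right) with op in {"add", "mul"}.
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return left + right if op == "add" else left * right
    return (op, left, right)

# x * (2 * 0.5) + (3 + 4): only x is unknown before inference.
graph = ("add", ("mul", "x", ("mul", 2, 0.5)), ("add", 3, 4))
folded = fold_constants(graph)
print(folded)
```

Because the pass only depends on the graph structure, the same code can run online on the cloud side or offline in a conversion tool for mobile terminals.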

[Unified Interface] MindSpore provides a unified C++ interface for the device and the cloud. Its usage is kept as consistent as possible with the Python interface, which lowers the learning cost. With the unified interface, users can run inference on different hardware with a single set of code.

Lightweight technology

[MindSpore for Micro] Compared with mobile terminals, the MCU chips in IoT devices have even more limited resources, so deploying deep learning models on IoT devices is more challenging.

In the table above, the left side shows memory and storage sizes on cloud, mobile, and MCU hardware, and the right side shows the storage and memory required by ResNet-50, MobileNetV2, and INT8-quantized MobileNetV2.

MindSpore has designed the MindSpore for Micro solution for IoT devices.

Inference frameworks deployed on cloud servers and mobile terminals run models by interpreting them. This supports multiple models and multiple hardware platforms, but requires extra run-time memory (the most expensive resource on an MCU) to store meta-information such as model structure and parameters. MindSpore for Micro’s CodeGen approach moves the work of sequencing the model's operators from run time to compile time and generates only the code needed to execute the model. This avoids the time spent on run-time interpretation and also frees memory so that larger models can run. The generated binary is very small, which makes it efficient in storage as well.
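
The difference between interpreting a model and generating code for it can be sketched on a toy operator list. A generator targeting an MCU would emit code for the device's toolchain; to keep the example self-contained, this one emits Python source, and `exec` stands in for the ahead-of-time compile step. All names are hypothetical.

```python
def interpret(ops, x):
    """Interpreter style: the op list is kept in memory and walked at run time."""
    for name, arg in ops:
        if name == "scale":
            x = x * arg
        elif name == "add":
            x = x + arg
        elif name == "relu":
            x = max(x, 0.0)
    return x

def generate_source(ops):
    """CodeGen style: emit straight-line code, so no op list exists at run time."""
    lines = ["def model(x):"]
    for name, arg in ops:
        if name == "scale":
            lines.append(f"    x = x * {arg!r}")
        elif name == "add":
            lines.append(f"    x = x + {arg!r}")
        elif name == "relu":
            lines.append("    x = max(x, 0.0)")
    lines.append("    return x")
    return "\n".join(lines)

ops = [("scale", 2.0), ("add", -5.0), ("relu", None)]
source = generate_source(ops)
namespace = {}
exec(source, namespace)          # stands in for the ahead-of-time compile step
model = namespace["model"]
print(source)
```

The generated function contains no dispatch loop and no model description, only the operations themselves, which is where the memory saving comes from.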

MindSpore for Micro features will be open source in version 1.2.

[Quantization]

MindSpore adaptive mixed low-bit quantization: this technique automatically searches for the quantization bit width of each layer based on the model structure and the target compression ratio, without deep involvement from quantization experts. The quantization factors are trainable, which greatly improves training efficiency and reduces quantization loss in low-bit scenarios. Verified on image classification and object detection models at 8x to 10x compression, its accuracy is better than that of current industry quantization algorithms.

MindSpore post-training quantization: compared with quantization-aware retraining, post-training quantization has two clear advantages. It does not need a large training dataset, and it does not need retraining, so a model can be converted offline quickly. MindSpore uses a pipeline combination quantization method. In the first stage, weights and activations are quantized with conventional linear quantization. In the second stage, the quantization error is analyzed and the quantized model is corrected with statistical methods to compensate for the accuracy loss caused by quantization.

Pipeline combination quantization
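
The two stages can be sketched on a handful of weights. Stage one is ordinary symmetric linear quantization. For stage two, the article does not name the statistical method, so the sketch assumes the simplest possibility, removing the mean quantization error; the weights and the 4-bit setting are made up for illustration.

```python
def quantize_linear(values, num_bits=8):
    """Stage 1: symmetric linear quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integers back to floating-point values."""
    return [v * scale for v in q]

def mean_error(reference, approx):
    """Stage 2 statistic: average signed error introduced by quantization."""
    return sum(r - a for r, a in zip(reference, approx)) / len(reference)

weights = [0.11, -0.42, 0.37, 0.90, -0.05]
q, scale = quantize_linear(weights, num_bits=4)
restored = dequantize(q, scale)
correction = mean_error(weights, restored)       # assumed correction rule
corrected = [v + correction for v in restored]
print(q, scale, correction)
```

Neither stage needs training data or gradient updates, which is what makes post-training quantization fast to apply offline.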

Efficient runtime

[Device-Cloud Unified Runtime] To provide a unified parallel execution framework for device-cloud training and inference, MindSpore designed a device-cloud unified runtime based on the Actor model.

AI training and inference both come down to executing a DAG computation graph in which each node is an op and each edge is a tensor (or a group of tensors). In the figure below, the left side shows a schematic of the Actor model and the right side shows a schematic of an AI computation task.

We define each op as an actor, and actors pass tensors to one another. In the Actor model, messages are stateless and are not reused, but in AI computation tasks tensors are often reused for efficiency. To solve this, MindRT introduces a Tensor Manager that manages tensors centrally, and all ops obtain their tensors through it.

The Tensor Manager supports reference counting and memory management for tensors. The device-cloud unified runtime will be open-sourced in MindSpore 1.2/1.3.
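
A reference-counted tensor store of this kind can be sketched in a few lines. The class and method names below are hypothetical and are not MindRT's interface; the point is only that a tensor read by several ops is stored once and released when its last consumer has taken it.

```python
class TensorManager:
    """Owns tensor buffers and frees each one after its last consumer."""

    def __init__(self):
        self._buffers = {}
        self._refs = {}

    def put(self, name, data, consumers):
        """Register a tensor produced by an op, with its consumer count."""
        if consumers > 0:
            self._buffers[name] = data
            self._refs[name] = consumers

    def get(self, name):
        """Hand a tensor to one consumer and drop one reference."""
        data = self._buffers[name]
        self._refs[name] -= 1
        if self._refs[name] == 0:       # last consumer: release the buffer
            del self._buffers[name]
            del self._refs[name]
        return data

    def live(self):
        """Names of tensors that still have pending consumers."""
        return sorted(self._buffers)

manager = TensorManager()
manager.put("feature_map", [1.0, 2.0, 3.0], consumers=2)  # read by two ops
first = manager.get("feature_map")
still_live = manager.live()
second = manager.get("feature_map")
print(still_live, manager.live())
```

Both consumers receive the same buffer without a copy, and the manager forgets it as soon as the second one has been served.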

[Software-Hardware Collaboration] MindSpore is natively and deeply integrated with device-side NPU chips to get the most out of the performance advantages of dedicated chips.

[Operator Optimization] On mobile phone CPUs, MindSpore supports a variety of convolution algorithms, including sliding window, im2col+GEMM, Strassen, Winograd, indirect convolution, and FFT. There are usually three ways to choose the optimal convolution algorithm for given conditions:

Manual settings based on experience.
A cost model based on mathematical modeling.
A machine learning model trained offline on existing data, which yields a reliable convolution operator selector.

Currently, MindSpore supports the second and third approaches for selecting the optimal convolution algorithm.
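
A cost-model selector can be sketched as follows for a 3x3, stride-1 convolution. The model counts multiplications for both algorithms and adds a penalty for Winograd's input and output transforms. The `transform_weight` value is a made-up constant chosen for illustration, not a measurement, and this is not MindSpore's cost model; a real one would be calibrated per device.

```python
def direct_cost(out_h, out_w, c_in, c_out, k=3):
    """Multiplications for direct or im2col+GEMM convolution."""
    return out_h * out_w * c_in * c_out * k * k

def winograd_cost(out_h, out_w, c_in, c_out, transform_weight=4.0):
    """Estimated cost of Winograd F(2x2, 3x3) for a 3x3, stride-1 convolution.

    transform_weight is a made-up factor for the memory traffic of the
    input and output transforms; it is an assumption, not a measured value.
    """
    tiles = ((out_h + 1) // 2) * ((out_w + 1) // 2)   # one tile = 2x2 outputs
    multiplications = tiles * 16 * c_in * c_out
    transform_adds = tiles * (32 * c_in + 24 * c_out)  # approximate counts
    return multiplications + transform_weight * transform_adds

def choose_algorithm(out_h, out_w, c_in, c_out):
    """Pick the algorithm with the lower estimated cost."""
    costs = {
        "im2col_gemm": direct_cost(out_h, out_w, c_in, c_out),
        "winograd": winograd_cost(out_h, out_w, c_in, c_out),
    }
    return min(costs, key=costs.get)

first_layer = choose_algorithm(224, 224, c_in=3, c_out=64)
middle_layer = choose_algorithm(56, 56, c_in=64, c_out=64)
print(first_layer, middle_layer)
```

With few input channels the transforms dominate and the estimate favors im2col+GEMM; with many channels the saved multiplications dominate and it favors Winograd. Where exactly the crossover lies depends entirely on the assumed weight.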

Besides performance, algorithm selection also has to take the memory limits of the target scenario into account. On hardware for IoT scenarios, for example, the common im2col+GEMM algorithm has to unroll the input and the convolution kernel in memory during computation, which takes a large amount of memory. For such scenarios, MindSpore selects the indirect convolution algorithm, which uses less memory.

Federated learning

MindSpore’s federated learning supports both cross-device (ToC) and cross-silo (ToB) scenarios. It enables multi-party joint modeling while data stays within its own domain, helping enterprise applications improve efficiency and cut costs and supporting the intelligent upgrade of different industries. For security, MindSpore provides several model encryption methods suitable for large-scale stateless devices, including differential privacy, secret sharing, and secure aggregation, and users can customize the security level.
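
The memory cost of im2col is easy to quantify: every output position gets its own copy of the k x k x C_in input patch. The layer dimensions below are hypothetical.

```python
def im2col_buffer_bytes(h, w, c_in, k, stride=1, pad=0, bytes_per_elem=4):
    """Size of the unrolled input matrix that im2col materializes."""
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    return out_h * out_w * k * k * c_in * bytes_per_elem

def input_bytes(h, w, c_in, bytes_per_elem=4):
    """Size of the original input feature map."""
    return h * w * c_in * bytes_per_elem

# Hypothetical small layer: 32x32 input, 16 channels, 3x3 kernel, padding 1.
buffer_size = im2col_buffer_bytes(32, 32, 16, k=3, pad=1)
original_size = input_bytes(32, 32, 16)
print(buffer_size, original_size, buffer_size / original_size)
```

For a 3x3 kernel with stride 1 the unrolled buffer is nine times the input, which is hard to afford on a device with a few hundred KB of RAM.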
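
Of the methods listed, secure aggregation is the simplest to illustrate. In the toy sketch below, each pair of clients shares a random mask that one adds and the other subtracts, so individual uploads look random while their sum is unchanged. Real protocols derive the masks through key agreement and handle client dropouts; the article does not describe MindSpore's scheme, so this is only the general idea.

```python
import random

def masked_updates(updates, seed=0):
    """Add pairwise masks that cancel in the sum (toy secure aggregation).

    For each pair (i, j) with i < j, client i adds a random mask and
    client j subtracts the same mask, so no single upload reveals its value.
    """
    rng = random.Random(seed)
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.randint(-10**6, 10**6)
            masked[i] += mask
            masked[j] -= mask
    return masked

client_values = [12, 7, 30]            # integer-encoded model updates
uploads = masked_updates(client_values)
print(uploads, sum(uploads))
```

The server only needs the sum to update the global model, so it never has to see any single client's contribution in the clear.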

