Abstract: Describes how to add a new hardware backend to MindSpore.

This article was shared from Huawei Cloud Community "How to add a new hardware backend to MindSpore? Quickly build a test environment!", by HWCloudai.

MindSpore is a new-generation open-source AI computing framework developed by Huawei. As the all-scenario deep learning framework best matched to the computing power of the Ascend AI processor, it provides data scientists and algorithm engineers with a friendly design and efficient development experience, and promotes the flourishing of the AI software and hardware application ecosystem.

MindSpore supports heterogeneous computing power. In addition to the Ascend NPU built on Huawei's own DaVinci architecture, MindSpore also supports running operators on CPU (e.g. MKL-DNN) and GPU (e.g. CUDA kernels). (Note: MindSpore runs the entire network on one hardware platform; it does not support running different partitions of the same network on different hardware platforms, which is different from TensorFlow's graph partitioning.)

At present, the AI chip industry is bustling: at home and abroad, manufacturers large and small, new and established, are launching their own AI acceleration chips. By now it should be clear to everyone that the success of hardware depends on the support of a software stack and ecosystem. MindSpore not only supports Huawei's own AI hardware and software stack, but also aims to take its place in the broader AI ecosystem.

MindSpore is still in the stage of promotion and development. This article introduces how to add a new hardware backend to MindSpore, along the way touring the directory structure of the MindSpore source code. The hope is to provide useful information and a reference for interested AI hardware manufacturers and developers at home and abroad, so that they can use MindSpore as a framework for testing and integrating AI chips and quickly build a test environment for full network models.

This article targets the MindSpore r1.1 source code: https://gitee.com/mindspore/m… For how to build MindSpore from source and install it, as well as the required software versions, please refer to: https://www.mindspore.cn/inst…

The test case

This article focuses on a simple Dense layer network (https://www.mindspore.cn/doc/…) to demonstrate how to run this layer on a new hardware backend.

Note: this article is based on the static graph execution mode: https://www.mindspore.cn/doc/…

import mindspore
import numpy as np
import mindspore.nn as nn
from mindspore import context, Tensor

context.set_context(device_target="CPU", mode=context.GRAPH_MODE)

# 32, 16
net = nn.Dense(32, 16, weight_init='ones', bias_init=1.2)  #, activation='relu'
# 48, 32
input_data = Tensor(np.ones([48, 32]).astype(np.float32), mindspore.float32)
output = net(input_data)
print(output.asnumpy())

Note: I have commented out the ReLU activation here, so the Dense layer is equivalent to a small network with only two nodes (MatMul + BiasAdd). The result of this use case is a 48 * 16 two-dimensional matrix in which every element has the value 33.2 (each output is the sum of 32 ones times the weight 1, plus the bias 1.2).

This article describes, top-down, the components of MindSpore that need to be modified to support a new hardware backend; we will call the new hardware XPU. The effect we want after changing the MindSpore code is that setting device_target to XPU in the use case above makes the Dense layer run on the XPU accelerator, e.g.

context.set_context(device_target="XPU", mode=context.GRAPH_MODE)

Note: This article does not show the implementation details of specific classes and functions. For specific implementations, refer to the existing hardware backend implementations in the corresponding directories, such as CPU, GPU, and Ascend.

First, in the front-end ME Python layer, we need to add the new backend to valid_targets:

https://gitee.com/mindspore/mindspore/blob/r1.1/mindspore/context.py

def set_device_target(self, target):
    valid_targets = ["CPU", "GPU", "Ascend", "Davinci", "XPU"]  # add new backend to the list
    if not target in valid_targets:
        raise ValueError(f"Target device name {target} is invalid! It must be one of {valid_targets}")
    if target == "Davinci":
        target = "Ascend"
    self.set_param(ms_ctx_param.device_target, target)
    if self.enable_debug_runtime and target == "CPU":
        self.set_backend_policy("vm")

Then we need to add the new target to the ms context component in C++: https://gitee.com/mindspore/m…

const int kGraphMode = 0;
const int kPynativeMode = 1;
const char kCPUDevice[] = "CPU";
const char kGPUDevice[] = "GPU";
const char kXPUDevice[] = "XPU";  // add new hardware target
const char kAscendDevice[] = "Ascend";
const char kDavinciInferenceDevice[] = "AscendInference";
const char kDavinciDevice[] = "Davinci";
const char KNpuLog[] = "_npu_log";
const unsigned int MAX_CALL_DEPTH_DEFAULT = 1000;
// add new hardware to the target set
const std::set<std::string> kTargetSet = {kCPUDevice, kGPUDevice, kXPUDevice, kAscendDevice, kDavinciDevice};

Add a new Runtime Device

The runtime device directory (https://gitee.com/mindspore/m…) holds the components tied to each specific backend's hardware features, such as the device-side address space, device-side memory management (allocation and recycling), and the kernel runtime component, as well as communication components related to the hardware device, such as the MPI component supporting distributed communication. We start by adding a folder called xpu to this directory (remember to modify CMakeLists.txt to add the new folder):

Here are three new basic components to create for the XPU Accelerator:

· XPU_DEVICE_ADDRESS: mainly represents the memory address information on the accelerator's device side, as well as the API for memory transfer between the host side and the device side. For example, on an NVIDIA GPU it can be a wrapper around cudaMemcpyAsync. xpu_device_address.h:

#include <string>
#include <vector>
#include "runtime/device/device_address.h"
#include "utils/shape_utils.h"

namespace mindspore {
namespace device {
namespace xpu {
class XPUDeviceAddress : public DeviceAddress {
 public:
  XPUDeviceAddress(void *ptr, size_t size) : DeviceAddress(ptr, size) {}

  XPUDeviceAddress(void *ptr, size_t size, const string &format, TypeId type_id)
      : DeviceAddress(ptr, size, format, type_id) {}

  ~XPUDeviceAddress() override = default;

  bool SyncDeviceToHost(const ShapeVector &shape, size_t size, TypeId type, void *host_ptr) const override;
  bool SyncHostToDevice(const ShapeVector &shape, size_t size, TypeId type, const void *host_ptr) const override;

  DeviceAddressType DeviceType() const override { return DeviceAddressType::kXPU; }
};
}  // namespace xpu
}  // namespace device
}  // namespace mindspore
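To make this component concrete, below is a minimal, purely illustrative xpu_device_address.cc. It assumes (for a first bring-up) that the XPU "device" memory is host-visible, so a plain memcpy suffices, and that the base DeviceAddress exposes its raw pointer and size as ptr_ and size_, as the CPU backend does; a real accelerator would call the vendor's host/device copy API here instead.

#include "runtime/device/xpu/xpu_device_address.h"
#include <cstring>

namespace mindspore {
namespace device {
namespace xpu {
bool XPUDeviceAddress::SyncDeviceToHost(const ShapeVector &, size_t size, TypeId, void *host_ptr) const {
  // Copy the device buffer back to the host; replace memcpy with the vendor copy API on real hardware.
  if (ptr_ == nullptr || host_ptr == nullptr || size > size_) {
    return false;
  }
  memcpy(host_ptr, ptr_, size);
  return true;
}

bool XPUDeviceAddress::SyncHostToDevice(const ShapeVector &, size_t size, TypeId, const void *host_ptr) const {
  // Copy host data into the device buffer.
  if (ptr_ == nullptr || host_ptr == nullptr || size > size_) {
    return false;
  }
  memcpy(ptr_, host_ptr, size);
  return true;
}
}  // namespace xpu
}  // namespace device
}  // namespace mindspore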

· XPU_RESOURCE_MANAGER: mainly responsible for the management, allocation, and scheduling of memory and other resources on the device side. xpu_resource_manager.h:

#include <vector>
#include <map>
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/device_address.h"
#include "runtime/device/xpu/xpu_simple_mem_plan.h"
namespace mindspore {
namespace device {
namespace xpu {
class XPUResourceManager {
 public:
  XPUResourceManager() = default;
  ~XPUResourceManager();

  void AssignMemory(const session::KernelGraph *graph);
  void IncreaseAddressRefCount(const session::KernelGraph *graph);
  void DecreaseAddressRefCount(const AnfNodePtr &kernel);
  void *MemMalloc(size_t mem_size);
  void MemFree(void *ptr);

 private:
  void MemFree();
  XPUSimpleMemPlan mem_plan_;

  size_t mem_size_{0};
  uint8_t *mem_ptr_{nullptr};
  bool dynamic_malloc_{false};
  std::map<void *, size_t> dynamic_mem_;
};
}  // namespace xpu
}  // namespace device
}  // namespace mindspore
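
As a rough illustration, a first version of the allocation interface in xpu_resource_manager.cc could simply be backed by host malloc/free; this is a sketch under that assumption, and a real backend would call the device's own allocator here.

#include "runtime/device/xpu/xpu_resource_manager.h"
#include <cstdlib>
#include <cstring>

namespace mindspore {
namespace device {
namespace xpu {
void *XPUResourceManager::MemMalloc(size_t mem_size) {
  // Placeholder allocator: zero-initialized host memory stands in for device memory.
  void *ptr = malloc(mem_size);
  if (ptr != nullptr) {
    memset(ptr, 0, mem_size);
    dynamic_mem_[ptr] = mem_size;  // remember the block so MemFree can validate it
  }
  return ptr;
}

void XPUResourceManager::MemFree(void *ptr) {
  auto iter = dynamic_mem_.find(ptr);
  if (iter != dynamic_mem_.end()) {
    dynamic_mem_.erase(iter);
    free(ptr);
  }
}
}  // namespace xpu
}  // namespace device
}  // namespace mindspore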

· XPU_KERNEL_RUNTIME: the execution control module for hardware operators, mainly responsible for starting up the hardware runtime (Init()), executing the network on the hardware (Run(..)), and releasing device resources at the end (ReleaseDeviceRes()). xpu_kernel_runtime.h:

#include <memory>
#include <vector>
#include <string>
#include <map>
#include <set>
#include "runtime/device/kernel_runtime.h"
#include "runtime/device/kernel_runtime_manager.h"
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/xpu/xpu_resource_manager.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/any.h"
namespace mindspore {
namespace device {
namespace xpu {
class XPUKernelRuntime : public KernelRuntime {
 public:
  XPUKernelRuntime() = default;
  ~XPUKernelRuntime() override = default;

  bool Init() override;
  void ReleaseDeviceRes() override;
  bool Run(session::KernelGraph *graph, bool is_task_sink) override;
  void AssignKernelAddress(session::KernelGraph *kernel_graph);
  void CreateOutputTensors(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,
                           VectorRef *outputs);
  void BindInputOutput(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,
                       VectorRef *outputs);

 protected:
  bool SyncStream() override { return true; };
  DeviceAddressPtr CreateDeviceAddress(void *device_ptr, size_t device_size, const string &format,
                                       TypeId type_id) override;

 private:
  XPUResourceManager resource_manager_;
  std::set<DeviceAddressPtr> bound_addresses_;
  std::map<AnfNodePtr, tensor::TensorPtr> input_param_tensor_map_;
};

MS_REG_KERNEL_RUNTIME(kXPUDevice, XPUKernelRuntime);

}  // namespace xpu
}  // namespace device
}  // namespace mindspore
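
The sketch below shows roughly what the corresponding xpu_kernel_runtime.cc could look like: Init() is where the device would be opened and streams or command queues created, CreateDeviceAddress(..) hands out the XPUDeviceAddress defined earlier, and Run(..) walks the graph's execution order and launches every kernel. The per-node assembly of input/workspace/output addresses is omitted here, and the vendor-specific device calls are only hinted at in comments; treat this as an assumed skeleton, not the definitive implementation.

#include "runtime/device/xpu/xpu_kernel_runtime.h"
#include "runtime/device/xpu/xpu_device_address.h"

namespace mindspore {
namespace device {
namespace xpu {
bool XPUKernelRuntime::Init() {
  // Open the device, create streams/command queues, etc. (vendor specific, omitted).
  return true;
}

void XPUKernelRuntime::ReleaseDeviceRes() {
  // Return device memory to the resource manager and tear down the device (omitted).
}

DeviceAddressPtr XPUKernelRuntime::CreateDeviceAddress(void *device_ptr, size_t device_size, const string &format,
                                                       TypeId type_id) {
  return std::make_shared<XPUDeviceAddress>(device_ptr, device_size, format, type_id);
}

bool XPUKernelRuntime::Run(session::KernelGraph *graph, bool /*is_task_sink*/) {
  // Launch every kernel of the graph in topological (execution) order.
  for (const auto &kernel : graph->execution_order()) {
    auto kernel_mod = AnfAlgo::GetKernelMod(kernel);
    MS_EXCEPTION_IF_NULL(kernel_mod);
    // Gathering the per-node input/workspace/output AddressPtr lists
    // (prepared by AssignKernelAddress) is omitted in this sketch.
    std::vector<kernel::AddressPtr> inputs, workspaces, outputs;
    if (!kernel_mod->Launch(inputs, workspaces, outputs, nullptr)) {
      return false;
    }
  }
  return true;
}
}  // namespace xpu
}  // namespace device
}  // namespace mindspore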

Add a new Target Session

MindSpore's Session provides the context for op kernel execution and Tensor evaluation. The Session is the core module that controls the data flow graph representing the neural network. Its work consists of three main steps: graph compilation (kernel generation), graph optimization, and graph execution. MindSpore has its own Session component for each backend hardware platform, with the code in the backend/session directory: https://gitee.com/mindspore/m… We create a new session class for XPU: xpu_session.h:

#include <string>
#include <memory>
#include <map>
#include <vector>
#include "backend/session/session_basic.h"
#include "backend/session/kernel_graph.h"
#include "runtime/device/xpu/xpu_kernel_runtime.h"  // use the new xpu kernel runtime
#include "backend/session/session_factory.h"

namespace mindspore {
namespace session {
class XPUSession : public SessionBasic {
 public:
  XPUSession() = default;
  ~XPUSession() override = default;

  void Init(uint32_t device_id) override { InitExecutor(kXPUDevice, device_id); }

  GraphId CompileGraphImpl(const AnfNodePtrList &lst, const AnfNodePtrList &outputs) override;
  void RunGraphImpl(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &inputs, VectorRef *outputs) override;
  void Optimize(const std::shared_ptr<KernelGraph> &kernel_graph);

 protected:
  void UnifyMindIR(const KernelGraphPtr &graph) override { return; }
  void CreateOutputTensors(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &input_tensors, VectorRef *,
                           std::map<tensor::TensorPtr, session::KernelWithIndex> *tensor_to_node) override;

 private:
  void SetKernelInfo(const KernelGraph *kernel_graph);
  void BuildKernel(const KernelGraph *kernel_graph);
  device::xpu::XPUKernelRuntime *runtime_ = dynamic_cast<device::xpu::XPUKernelRuntime *>(
      device::KernelRuntimeManager::Instance().GetKernelRuntime(kXPUDevice, 0));
};

MS_REG_SESSION(kXPUDevice, XPUSession);
}  // namespace session
}  // namespace mindspore

In the graph compilation step (CompileGraphImpl(..)), the main task is to generate (BuildKernel(..)) the kernel corresponding to each node op in the neural network data flow graph, and to save each node's kernel information in the graph (SetKernelInfo(..)) so that it can be called later during graph execution (RunGraphImpl(..)).
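
A skeleton xpu_session.cc might wire these steps together roughly as follows. This is a sketch modeled on the CPU session; helper methods such as ConstructKernelGraph and GetGraph come from SessionBasic, and error handling is omitted.

#include "backend/session/xpu_session.h"

namespace mindspore {
namespace session {
GraphId XPUSession::CompileGraphImpl(const AnfNodePtrList &lst, const AnfNodePtrList &outputs) {
  // Build the kernel graph from the ANF node list.
  auto graph = ConstructKernelGraph(lst, outputs);
  // Choose a kernel implementation (device format / data type) for every node.
  SetKernelInfo(graph.get());
  // Run backend graph optimization passes.
  Optimize(graph);
  // Create the kernel object (KernelMod) for every node.
  BuildKernel(graph.get());
  return graph->graph_id();
}

void XPUSession::RunGraphImpl(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &inputs,
                              VectorRef *outputs) {
  auto kernel_graph = GetGraph(graph_id);
  // Hand device addresses to the kernels, bind input/output tensors, then execute the graph.
  runtime_->AssignKernelAddress(kernel_graph.get());
  runtime_->CreateOutputTensors(kernel_graph.get(), inputs, outputs);
  runtime_->BindInputOutput(kernel_graph.get(), inputs, outputs);
  runtime_->Run(kernel_graph.get(), false);
}
}  // namespace session
}  // namespace mindspore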

Add a kernel for new hardware

The kernels MindSpore supports for each op on each hardware backend live in the backend/kernel_compiler directory: https://gitee.com/mindspore/m…

Here we can see the existing hardware backends; each folder represents a different kernel type:

  • CPU: operators that call MKL-DNN (oneDNN), as well as operators written in pure C++.
  • GPU: operators that call cuDNN/cuBLAS, operators written in CUDA, and NCCL-related operators supporting distributed training.
  • Ascend: the operator kernel folders associated with Huawei DaVinci AI chips: TBE, AICPU, AKG, HCCL, etc.

To add the necessary kernel support for our new hardware backend, we start by creating a folder called xpu in the above directory (remember to modify CMakeLists.txt to add the new folder). In the new folder we first create the base class for XPU kernels:

xpu_kernel.h:

#include <string>
#include <vector>
#include <memory>
#include <numeric>
#include <functional>
#include "backend/kernel_compiler/kernel.h"
#include "ir/anf.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/ms_utils.h"

using mindspore::kernel::Address;
using mindspore::kernel::AddressPtr;
namespace mindspore {
namespace kernel {

class XPUKernel : public kernel::KernelMod {
 public:
  XPUKernel() = default;
  ~XPUKernel() override = default;

  void Init(const CNodePtr &kernel_node);
  virtual void InitKernel(const CNodePtr &kernel_node) = 0;
  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs, void * stream_ptr) override {
    return Launch(inputs, workspace, outputs);
  };

  virtual bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
                      const std::vector<AddressPtr> &outputs) = 0;
  const std::vector<size_t> &GetInputSizeList() const override { return input_size_list_; }
  const std::vector<size_t> &GetOutputSizeList() const override { return output_size_list_; }
  const std::vector<size_t> &GetWorkspaceSizeList() const override { return workspace_size_list_; }

  void SetOpName(const std::string &op_name) { op_name_ = op_name; }
  const std::string GetOpName() const { return op_name_; }

 protected:
  virtual void InitInputOutputSize(const CNodePtr &kernel_node);
  std::vector<size_t> input_size_list_ = {};
  std::vector<size_t> output_size_list_ = {};
  std::vector<size_t> workspace_size_list_ = {};

  std::string bin_path_ = {};
  std::string tilingName_ = {};

};
}  // namespace kernel
}  // namespace mindspore
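
For orientation, a minimal xpu_kernel.cc for the base class might pre-compute the input and output buffer sizes from the node's device shapes. The sketch below assumes float32 tensors throughout, which is enough for the test case in this article; a fuller implementation would look up each tensor's actual data type.

#include "backend/kernel_compiler/xpu/xpu_kernel.h"

namespace mindspore {
namespace kernel {
void XPUKernel::Init(const CNodePtr &kernel_node) {
  InitKernel(kernel_node);            // operator-specific initialization (pure virtual)
  InitInputOutputSize(kernel_node);   // pre-compute buffer sizes for the runtime
}

void XPUKernel::InitInputOutputSize(const CNodePtr &kernel_node) {
  // Record the byte size of every input/output tensor so the runtime can allocate memory.
  size_t input_num = AnfAlgo::GetInputTensorNum(kernel_node);
  for (size_t i = 0; i < input_num; ++i) {
    auto shape = AnfAlgo::GetInputDeviceShape(kernel_node, i);
    size_t size = sizeof(float);  // float32 assumed
    for (auto dim : shape) {
      size *= dim;
    }
    input_size_list_.push_back(size);
  }
  size_t output_num = AnfAlgo::GetOutputTensorNum(kernel_node);
  for (size_t i = 0; i < output_num; ++i) {
    auto shape = AnfAlgo::GetOutputDeviceShape(kernel_node, i);
    size_t size = sizeof(float);  // float32 assumed
    for (auto dim : shape) {
      size *= dim;
    }
    output_size_list_.push_back(size);
  }
}
}  // namespace kernel
}  // namespace mindspore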

Popular frameworks generally organize operator kernel support by operator name (op code), one kernel class per operator, as the MKLDNN CPU kernels in MindSpore do. The advantage of this layout is that the repository's code files are clear and the specific properties of each operator are easily expressed; the disadvantage is that it can lead to some duplicated code logic. Since the use case targeted in this article is very simple and only two operators actually need to be supported, MatMul and BiasAdd, we will instead name and implement the kernel class according to the number of input and output tensors.

Since both MatMul and BiasAdd are operators with two inputs and one output, we define our kernel class in two_in_one_out_xpu_kernel.h:

#include "backend/kernel_compiler/xpu/xpu_kernel.h" // xpu kernel base class #include "backend/kernel_compiler/xpu/xpu_kernel_factory.h" #include <stdio.h> #include <limits.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <dirent.h> #include <algorithm> #include <fstream> #include <iostream> namespace mindspore { namespace kernel { class TwoInOneOutXPUKernel : public XPUKernel { public: TwoInOneOutXPUKernel() = default; ~TwoInOneOutXPUKernel() override = default; void InitKernel(const CNodePtr &kernel_node) override; bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace, const std::vector<AddressPtr> &outputs) override; private: bool NeedsFormatTransformation(); char trans_a_{TRANSPOSE_NO}; char trans_b_{TRANSPOSE_NO}; int32_t dim_m_{0}; int32_t dim_n_{0}; int32_t dim_k_{0}; std::vector<size_t> inputA_shape_; std::vector<size_t> inputB_shape_; std::vector<size_t> output_shape_; size_t input_a_size_ = 0; size_t input_b_size_ = 0; size_t output_size_ = 0; void *inputA_data_ = nullptr; void *inputB_data_ = nullptr; void *output_data_ = nullptr; }; MS_REG_XPU_KERNEL( TwoInOneOutXPU, mindspore::device::xpu::KernelAttr().AddInputAttr(kNumberTypeFloat32).AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNu mberTypeFloat32), TwoInOneOutXPUKernel); } // namespace kernel } // namespace mindspore

Here we use "backend/kernel_compiler/xpu/xpu_kernel_factory.h"; we will not go into detail about creating the kernel factory class. For specifics, consult cpu_kernel_factory.h: https://gitee.com/mindspore/m…

The two basic functions of each kernel are InitKernel(..) and LaunchKernel(..), responsible for the initialization and the execution of the kernel respectively. Note that for typical static graphs such as CNNs, InitKernel(..) runs only once, when the kernel is created (during the session's graph compilation described above), whereas LaunchKernel(..) is called on every graph execution. For example, if a CNN inference has to process 64 images and the network batch size is 32, the whole graph needs to be executed twice; that is, for each kernel, InitKernel(..) is called once while LaunchKernel(..) is called twice.

We will not discuss the specific implementation of the MatMul and BiasAdd kernels in detail; we will only introduce some basic APIs for operator kernels in MindSpore (a combined Launch sketch follows at the end of this list):

· Get the input and output shape information in TwoInOneOutXPUKernel:

inputA_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
inputB_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
output_shape_ = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);

· Get operator attribute information, e.g. the transpose flags of MatMul:

bool trans_a = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_A);
bool trans_b = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_B);

· Get a pointer to input and output memory in Launch:

auto input_a = reinterpret_cast<float *>(inputs[0]->addr);
auto input_b = reinterpret_cast<float *>(inputs[1]->addr);
auto output = reinterpret_cast<float *>(outputs[0]->addr);
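
Putting these APIs together, here is a purely illustrative Launch body for the MatMul case of TwoInOneOutXPUKernel, written as a naive host-side loop so the data flow is visible; a real backend would hand the buffers to the accelerator's GEMM routine instead. It assumes dim_m_, dim_n_, and dim_k_ were filled in by InitKernel(..) from the shapes above, and that the second input is stored transposed (transpose_b=True), matching the nn.Dense property shown further below.

bool TwoInOneOutXPUKernel::Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
                                  const std::vector<AddressPtr> &outputs) {
  // Raw float32 buffers handed over by the runtime.
  auto input_a = reinterpret_cast<float *>(inputs[0]->addr);
  auto input_b = reinterpret_cast<float *>(inputs[1]->addr);
  auto output = reinterpret_cast<float *>(outputs[0]->addr);
  // Naive reference MatMul: output[m, n] = sum_k input_a[m, k] * input_b[n, k]
  // (input_b is laid out as [n, k] because transpose_b=True in the Dense layer).
  for (int32_t m = 0; m < dim_m_; ++m) {
    for (int32_t n = 0; n < dim_n_; ++n) {
      float acc = 0.0f;
      for (int32_t k = 0; k < dim_k_; ++k) {
        acc += input_a[m * dim_k_ + k] * input_b[n * dim_k_ + k];
      }
      output[m * dim_n_ + n] = acc;
    }
  }
  return true;
}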

Other Points to Note

As with any major framework, MindSpore has its own set of standards and conventions. Here are some pitfalls I stumbled over along the way:

· The default Tensor format in MindSpore is NCHW. If the hardware backend you are adding supports a different format, be careful to add format conversions. Conversions can be done before and after each kernel call (inefficient), or you can use a graph optimization pass to insert format conversion nodes efficiently with a whole-network view.

· Precision conversion. If your hardware platform only supports certain precisions, such as FP16, while your network is FP32, pay attention to precision conversion. Like the format conversion above, precision conversion can be done on the host side or, if the hardware supports it, on the device side.

· The code logic of each kernel should distinguish which data is invariant and which changes and needs to be re-initialized before each execution, so that the different pieces of logic can be correctly placed in either InitKernel(..) or LaunchKernel(..).

· For some Python front-end layer APIs, MindSpore sets its own properties. For example, in the Dense layer (https://gitee.com/mindspore/m…) the second input matrix of MatMul is transposed:

self.matmul = P.MatMul(transpose_b=True)
self.batch_matmul = P.BatchMatMul(transpose_b=True)
self.activation = get_activation(activation) if isinstance(activation, str) else activation
if activation is not None and not isinstance(self.activation, (Cell, Primitive)):
    raise TypeError("The activation must be str or Cell or Primitive,"" but got {}.".format(activation))
self.activation_flag = self.activation is not None

· For Debug, you can add the following environment variables to help output information:

export GLOG_v=1
export SLOG_PRINT_TO_STDOUT=1

· For changes to the CMake files, at the beginning of testing you can simply add all the new files under if (ENABLE_CPU). The CPU is effectively a baseline platform for MindSpore: whether you build for the GPU or for Huawei's D/Ascend target, the CPU-related files are always built.

Conclusion

Based on the author's understanding of MindSpore, this article has shared how to modify the MindSpore source code to add a new hardware backend. The success of an open-source software framework depends on the support of the community and the participation of many vendors. I hope this article serves as a catalyst for more hardware manufacturers and developers to participate in building MindSpore's ecosystem. You are welcome to join the discussion! Finally, I wish you all a happy New Year, and may MindSpore grow better and stronger in 2021!
