Author: MegEngine architects, Megvii Technology

Background

In the course of algorithm research, algorithm researchers often want to try out new neural network layers, such as Layer Norm, Deformable Conv, and so on. To experiment with such a layer, they can compose it from the basic operations provided by the neural network framework or by NumPy (tensor/scalar addition, subtraction, multiplication, division, etc.). However, this usually makes the layer's performance fall off a cliff, greatly hurting the efficiency of trying new algorithms. Therefore, in many cases, researchers choose to implement a high-performance kernel for their own layer and hope to integrate it into the framework as an Op (Operator).

In general, however, this requires a good understanding of the framework itself before a kernel can be hooked into it flexibly, and that is not a simple matter. As a large-scale system, a neural network framework is far more complex than a conventional software project. To keep such a machine learning system maintainable and extensible, it is usually designed with many layers and modules, abstracting various concepts (such as Op), and the interactions between these layers and modules are intricate.

Taking the Op system in MegEngine as an example, Figure 1 shows how Op, the most basic concept, is abstracted at different levels of MegEngine.

Figure 1: Abstraction of operators at different levels in MegEngine

  • In the underlying MegDNN operator library, an Op is abstracted into the MegDNNOpr class, which encapsulates the concrete kernel implementations of each Op on x86, NVIDIA GPU, and other hardware platforms, as well as the context management of the related hardware.
  • In the graph runtime, an Op is abstracted into the OpNode class, whose primary purpose is not computation but graph optimization, so there are quite a few considerations in the design of this data structure.
  • In the imperative runtime, an Op is abstracted into the OpDef class, which cooperates with the dynamic graph interpreter for task scheduling.
  • In Python, an Op is wrapped in functional and Module, which is the Op as most algorithm researchers perceive it.

When an operation is launched from Python, these Ops are called layer by layer, each layer doing part of the work, until the concrete kernel in the MegDNN operator library is finally invoked. In this process, the Op concept at every level is indispensable. And it is not just Op: many other concepts, including Tensor, have similar multi-level abstractions in MegEngine. Learning to understand such a framework design takes a lot of time and energy, a cost that is often hard for algorithm researchers to accept.

Moreover, even after algorithm researchers have sacrificed plenty of time and hair to learn the design of MegEngine and integrate their kernel into it, things are far from over. The kernel integration process is usually highly coupled to the framework itself: building the Op requires obtaining the full source code of the framework and modifying, then recompiling, most of its modules. If the relevant abstractions inside the framework later change, the previously built Op becomes unusable again.

To let researchers' kernels be integrated into the framework quickly, to make the integrated Op behave the same as the framework's native Ops, and to decouple the integration from the framework itself, MegEngine provides a set of tools called Custom Op. It makes it simple and convenient to encapsulate a C++/CUDA kernel written by algorithm researchers into an Op, automatically compile it into a dynamic link library, and integrate it into MegEngine.

However, writing a high-performance C++/CUDA kernel is still a challenge for algorithm researchers without an architecture or parallel-computing background. To spare them from writing kernels themselves, MegEngine further builds a Custom Op Generator on top of Custom Op. It tries to use the code generation capability of a neural network compiler to automatically generate the kernel and the Custom Op code end to end and integrate them into MegEngine, so that algorithm researchers can add a high-performance kernel to MegEngine and use it without writing any C++ code.

Normal Op integration vs. Custom Op integration

To make the design of Custom Op easier to understand, we first compare the traditional Op integration process with the Custom Op integration process, to establish a preliminary impression of Custom Op.

Generally speaking, if an algorithm researcher wants to integrate a C++/CUDA kernel into MegEngine as an Op, they must first understand:

  • The overall architecture of MegEngine.
  • The functions of the modules at each level of MegEngine.
  • How concepts such as Op and Tensor are designed and understood at the different levels and modules.

After thoroughly understanding such a system, they then need to:

  • Encapsulate the kernel as a MegDNNOpr class in the MegDNN operator layer.
  • Encapsulate the Op as an OpNode class based on the static graph components in MegEngine.
  • Encapsulate the Op as an OpDef class based on the dynamic graph components in MegEngine.
  • Write the Python/C++ interaction code to expose the Op to the Python environment.

To simplify this process, Custom Op provides a very concise, framework-independent Op model. Users can add an Op without any understanding of the framework itself. All they need to do is fill in some basic information about the Op against this model, such as the number of inputs and outputs and which kernel to call, thereby establishing a description of their own Op. The registration generally looks like the following code.

CUSTOM_OP_REG(MatMulScale)                 // Define an Op named MatMulScale
    .add_inputs(2)                         // Two input Tensors
    .add_outputs(1)                        // One output Tensor
    .add_param("scale", 1.0f)              // A Param named scale with a default value of 1.0f
    .set_shape_infer(shape_infer)          // Set the Op's shape inference function
    .set_compute("cuda", compute);         // Set the Op's compute function

These settings take only a few to a dozen lines of code, which greatly reduces the integration workload for users. See the MegEngine Custom Op documentation for more details on how to use Custom Op.

Custom Op then automatically encapsulates the user's Op into the OpNode and OpDef of MegEngine's static and dynamic graphs, and generates a Python interface for it that is consistent with the native operators. As a result, the user's Custom Op is unified with the native Ops in both interface and underlying behavior.

Custom Op design

Custom Op faces both users and the MegEngine system, and needs to make it easy and convenient for the two sides to interact. To achieve this, Custom Op needs the following features:

  • User-facing: Custom Op needs to provide a concise, unified, framework-independent set of abstractions covering exactly what is needed to write an Op. Users can rely on these abstract interfaces to encapsulate their C++/CUDA kernels as Ops.
  • System-facing: Custom Op needs to provide MegEngine with a complete set of Op adaptation and management tools, allowing the system to call and maintain Custom Ops.

Based on this, we design the overall architecture of Custom Op, as shown in Figure 2.

Figure 2: The overall structure of the Custom Op

In the perception of a typical algorithm user, an Op is a computation function: it accepts some inputs, performs the computation, and produces some outputs. The inputs here are divided into Tensor values (the ordinary input Tensors) and non-Tensor values (Params, such as the padding and stride of a convolution), as shown in Figure 3.

Figure 3: User perspective Op

Accordingly, Custom Op provides users with three abstractions: Tensor, Param, and Op.

  • Tensor is the main object that the Op kernel computes on and manipulates, with roughly the same abstraction and behavior as the Tensor in MegEngine's Python layer.
  • Param records the non-Tensor inputs of the Op (such as the padding and stride of a convolution).
  • Op is a wrapper around the user's C++/CUDA kernel computation functions; it also records information about the Op's inputs, outputs, and Params.

For the MegEngine system, Custom Op provides a complete set of Adaptors according to the needs of the different levels of MegEngine, translating the user's Op and Tensor into the Op and Tensor that the MegEngine runtime can handle. On the other hand, Custom Op also provides MegEngine with a set of Managers for the user's Ops, allowing MegEngine to maintain and manage Custom Ops.

We’ll look at each of the Custom Op modules in the following sections.

Tensor

From the algorithm user's point of view, a Tensor is a multidimensional array with properties such as shape and quantization information, so the Tensor in Custom Op is designed as a collection of data plus its properties. The data is managed through a pointer to the data storage space, and the properties tell us how to interpret the data in that space. As shown in the Tensor part of Figure 2, these properties include Device, DType, and Shape:

  • Shape represents the dimensions of the Tensor.
  • DType corresponds to the data type of the elements in the Tensor, such as float32, uint8, etc.
  • Device indicates which device (CPU/GPU) the Tensor resides on.

Together, these properties are enough to describe a Tensor well.

In the MegEngine system itself, Tensor and its related properties actually have another, richer but somewhat redundant, representation. To make them easier to use, Custom Op simplifies these concepts, keeping only the functionality needed to write an Op.

In particular, Shape provides users with behavior similar to a native C++ array; we can construct and use it with code like the following:

Shape shape = {16, 3, 224, 224};   // Build a shape
bool equal = (shape[3] == 224);    // Get the value of a specific dimension of the shape
shape = {128, 100};                // Change the value of the shape

As for Device and DType, users do not need to know the implementation behind them, only which device the data is on and what type it is. Therefore, Device and DType in Custom Op behave like strings, and users can set a specific Device or DType with string values.

Device device = "x86";              // Build an x86 device type
device = "cuda";                    // Change the device type to cuda
bool equal = (device == "cuda");    // Check whether the device is cuda

DType dtype = "float32";            // Build a dtype
bool is_int8 = (dtype == "int8");   // Check whether the dtype is int8

To decouple the interfaces of these types from MegEngine, the pimpl (pointer to implementation) idiom is used to hide their memory layouts; users set and get their data through a set of interfaces.
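
As a rough illustration of this idea (a minimal sketch of the pimpl idiom, not MegEngine's actual code; the class and member names below are hypothetical), the publicly visible Device only holds a pointer, while the real layout stays in the source file:

#include <memory>
#include <string>

// Public header: the layout of DeviceImpl is not exposed to users.
class Device {
public:
    Device(const std::string &name);       // e.g. Device device{"cuda"};
    ~Device();
    bool operator==(const Device &rhs) const;
    std::string str() const;
private:
    class DeviceImpl;                       // Forward-declared implementation
    std::unique_ptr<DeviceImpl> m_impl;     // Only a pointer appears in the header
};

// Source file: the implementation can change freely without breaking users.
class Device::DeviceImpl {
public:
    explicit DeviceImpl(std::string name) : m_name(std::move(name)) {}
    std::string m_name;                     // Could instead wrap a framework-internal handle
};

Device::Device(const std::string &name) : m_impl(new DeviceImpl(name)) {}
Device::~Device() = default;
bool Device::operator==(const Device &rhs) const { return m_impl->m_name == rhs.m_impl->m_name; }
std::string Device::str() const { return m_impl->m_name; }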

Param

Param is mainly used to express the non-Tensor inputs of an Op (padding, stride, and so on), but these inputs vary greatly between Ops: Op A's Param may be a list of strings, while Op B's Param is an int. So we need a mechanism that unifies these very different Params for both users and the MegEngine system. For this purpose, Custom Op designs ParamVal. ParamVal erases the static types of each Op's Params to solve the problem that different Ops' Param types are inconsistent, and then defines a dynamic type system at runtime to manage the actual types of these Params.

It may sound complicated, but it essentially boils down to the following data structure: a void * is used for type erasure, the erased data is stored in data, and type records the corresponding dynamic type of that data.

class ParamVal {
    void *data;             // Type erased data
   DynamicDataType type;   // The dynamic type of data
};

This design not only solves the problem of inconsistent Param types; the presence of dynamic types also greatly alleviates the inconvenience caused by C++'s lack of reflection. Based on this dynamic type system, Custom Op provides a unified Param parsing and serialization mechanism, so users do not need to write that code for their own Params.

Custom Op also defines a number of operators for ParamVal, as well as functions that convert between it and static types. In the end, from the user's perspective, ParamVal behaves much like a Python variable.

ParamVal a = 1.0, b = 2, c = false, d = {1, 2};   // Express data of different types
ParamVal e = a + b;                               // ParamVals can be computed with each other
ParamVal f = e - 2;                               // ParamVals can be computed with statically typed data
d = c;                                            // ParamVals can be assigned to each other
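To make the mechanism concrete, here is a minimal sketch of how a type-erased value with a runtime type tag and overloaded operators could be implemented. It is an illustration under simplified assumptions (only int and double are supported), not MegEngine's actual ParamVal, and all names are hypothetical:

#include <cassert>

// A tiny dynamically typed value, for illustration only.
class DynVal {
public:
    enum class Type { Int, Double };

    DynVal(int v) : m_type(Type::Int) { m_int = v; }
    DynVal(double v) : m_type(Type::Double) { m_double = v; }

    // Operators dispatch on the runtime type tags of both operands.
    friend DynVal operator+(const DynVal &lhs, const DynVal &rhs) {
        if (lhs.m_type == Type::Int && rhs.m_type == Type::Int)
            return DynVal(lhs.m_int + rhs.m_int);
        return DynVal(lhs.as_double() + rhs.as_double());
    }

    // Conversion back to a static type checks the tag at runtime.
    double as_double() const {
        return m_type == Type::Int ? static_cast<double>(m_int) : m_double;
    }

private:
    Type m_type;                        // The dynamic type of the stored value
    union { int m_int; double m_double; };
};

int main() {
    DynVal a = 1.0, b = 2;              // Different static types, one dynamic value type
    DynVal c = a + b;                   // Dispatch happens at runtime
    assert(c.as_double() == 3.0);
}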

Op

From the user's perspective, an Op is a description of a computational process that does not record any data. To match this perception, the Op in Custom Op is also designed to be stateless; that is, the Op only stores the related functions and some layout information about its inputs and outputs, not the concrete input and output values.

Figure 4: Op and its components

Specifically, in Custom Op, an Op is the combination of the input/output Tensor information (TensorInfo), the Param information (ParamInfo), and the set of functions related to the Op (Functions), as shown in Figure 4. TensorInfo records the number of input and output Tensors of the Op, together with their names, legal types, dimensions, and memory allocation strategies. ParamInfo records the name, default type, and default value of each Param of the Op. Functions contains two parts: the kernel computation functions and the Tensor attribute inference functions.

  • The kernel computation function receives the Tensor and Param values and forwards them to the user's C++/CUDA kernel.
  • The Tensor attribute inference function deduces the layout of the output Tensors from the attributes of the input Tensors and Params before the kernel runs, which decouples operator execution from operator memory allocation and enables memory planning.

Most of these functions in Custom Op have a default implementation, and users can customize their Op by overriding the default behavior of these functions.

Manager and Adaptor

Since Custom Op is a set of Op abstractions decoupled from the MegEngine system, MegEngine cannot directly interact with the Ops defined with Custom Op. To solve this problem, Custom Op additionally designs a set of Adaptors and Managers: the Adaptors allow MegEngine to use Custom Ops, and the Managers allow MegEngine to perceive and manage Custom Ops.

The purpose of the Adaptors is twofold: to let MegEngine use Custom Ops to build networks and so on, and to let MegEngine exchange data with Custom Ops to complete the computation.

  • For the former, the Adaptors wrap a Custom Op into the Op abstractions of MegEngine's dynamic and static graphs, making it behave the same as MegEngine's built-in Ops.
  • For the latter, the Adaptors connect Custom Op's user-facing Tensor abstraction with MegEngine's Tensor abstraction, translating between the two so that data can flow freely between MegEngine and Custom Op.

The Manager module manages both the dynamic link libraries produced by compiling Custom Ops and the Custom Ops themselves. Specifically, a Custom Op is loaded into the MegEngine system as a dynamic link library when it is used, so these dynamic libraries are managed with RAII: the loading and unloading of a library is tied to the construction and destruction of a Lib class, which avoids resource leaks. For the Ops themselves, the Manager provides basic add, delete, update, and query operations that allow MegEngine to manage Custom Ops.
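
As a rough illustration of this RAII approach, here is a minimal sketch assuming a POSIX dlopen/dlclose environment; it only demonstrates the technique and is not MegEngine's actual Lib class, whose name and interface are used hypothetically here:

#include <dlfcn.h>
#include <stdexcept>
#include <string>

// Owns a dynamically loaded library; unloading is bound to destruction (RAII).
class Lib {
public:
    explicit Lib(const std::string &path) : m_handle(dlopen(path.c_str(), RTLD_LAZY)) {
        if (!m_handle)
            throw std::runtime_error("cannot load " + path + ": " + dlerror());
    }
    ~Lib() {
        if (m_handle)
            dlclose(m_handle);          // Unloaded automatically when the Lib object dies
    }
    Lib(const Lib &) = delete;          // Non-copyable: exactly one owner per handle
    Lib &operator=(const Lib &) = delete;

    void *symbol(const char *name) const { return dlsym(m_handle, name); }

private:
    void *m_handle;
};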

Writing a Custom Op

When writing an Op with Custom Op, users use the concepts above to encapsulate their own kernel into a Custom Op, compile it into a dynamic link library with the build tools provided by Custom Op, and then load it into MegEngine at run time and use it.

Specifically, suppose we need to add to MegEngine an operator called MatMulScale. It performs a matrix multiplication of the two inputs, lhs and rhs, and then multiplies the result by a scalar scale.

The pseudo-code of the operator's computation is as follows:

def MatMulScale(lhs, rhs, scale):
    result = lhs.dot(rhs)
    result = result * scale
    return result

For such an operation, suppose we have already written a CUDA kernel for it and exposed the following interface function to invoke it:

void matmul_scale(const float *lhs, const float *rhs, float *result, size_t M, size_t K, size_t N, float scale);

Here lhs, rhs, and result are float pointers representing the two input Tensors and the one output Tensor of the Op; they must all point to allocated CUDA memory. M, K, and N are the dimension information of the matrices, meaning an M x K matrix multiplied by a K x N matrix. And scale is the coefficient the matrix product is multiplied by.
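
For intuition, a naive CUDA implementation behind such an interface might look like the sketch below (one thread per output element). This is only an assumed example of the kind of kernel a user might supply; it is not code from MegEngine or from the article:

__global__ void matmul_scale_kernel(const float *lhs, const float *rhs, float *result,
                                    size_t M, size_t K, size_t N, float scale) {
    size_t row = blockIdx.y * blockDim.y + threadIdx.y;
    size_t col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.f;
        for (size_t k = 0; k < K; ++k)
            acc += lhs[row * K + k] * rhs[k * N + col];
        result[row * N + col] = acc * scale;   // Fuse the scale into the matmul epilogue
    }
}

void matmul_scale(const float *lhs, const float *rhs, float *result,
                  size_t M, size_t K, size_t N, float scale) {
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    matmul_scale_kernel<<<grid, block>>>(lhs, rhs, result, M, K, N, scale);
}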

In that case, we can write the following C++ code to encapsulate it as a MegEngine Op.

void shape_infer(const std::vector<Shape> &inputs, const Param &params, std::vector<Shape> &outputs) {
    outputs[0] = {inputs[0][0], inputs[1][1]};
}

void compute(const std::vector<Tensor> &inputs, const Param &params, std::vector<Tensor> &outputs) {
    matmul_scale(inputs[0].data<float>(), inputs[1].data<float>(), outputs[0].data<float>(), ...);
}

CUSTOM_OP_REG(MatMulScale)                 // Define an Op named MatMulScale
    .add_inputs(2)                         // Two input Tensors
    .add_outputs(1)                        // One output Tensor
    .add_param("scale", 1.0f)              // A Param named scale with a default value of 1.0f
    .set_shape_infer(shape_infer)          // Set the Op's shape inference function
    .set_compute("cuda", compute);         // Set the Op's compute function

The first part of this code defines the functions, including the output Tensor attribute inference function and the compute function: the former takes the attributes of the input Tensors, such as shape, and deduces the corresponding attributes of the output Tensor, while the latter calls the CUDA kernel to perform the computation. The second part is the registration of the Op. It declares that the Op has certain input and output Tensors and a Param, and registers the pointers to the attribute inference function and compute function defined above with the Op. This completes the construction of the Custom Op.

Custom Op Generator

Custom Op makes it easy to integrate a user-written C++/CUDA kernel into MegEngine. Still, writing the kernel yourself is always a last resort. Can users be spared from writing kernels at all? MegEngine is trying to solve this problem with Custom Op Generator, a tool built on top of an AI compiler.

Custom Op Generator lets users describe their Op directly with the simple Python primitives provided by the AI compiler, without writing any C++/CUDA code. The AI compiler then automatically generates the kernel for the Op, generates the Custom Op code that encapsulates the kernel as a MegEngine Custom Op, and automatically builds and integrates it into MegEngine. Users only need to write some Python code and are spared from writing kernels themselves.

In the usual workflow, the framework builds a model for training and then hands the model to the compiler for optimization and deployment. In Custom Op Generator, the positions are reversed: the compiler comes first and the framework last, and the compiler's code generation capability is used to extend the framework. In a sense, this is a new way of combining AI compilers with frameworks.

Conclusion

As a bridge between user kernels and system Ops, Custom Op provides users with a simple, unified, framework-independent Op abstraction, and provides the system with a complete set of Op adaptation and management tools. In designing and implementing Custom Op, we analyzed the concepts and abstractions users necessarily touch when writing an Op, designed a framework-independent Op model around them, and exposed it to users through simple interfaces, thereby decoupling the user side from the system side: the Ops written by users do not need to change as the system is updated and iterated. To let the MegEngine system manage and use these Ops, Custom Op also contains management and adaptation modules that automatically adapt the user's Op to the Ops in MegEngine's dynamic and static graphs, so that a user Op behaves consistently with the native Ops and is easy for the system to use and manage. With this design, users do not need to learn about the MegEngine framework when integrating an Op; they can integrate a kernel into MegEngine and use it with only about twenty lines of code, which greatly reduces the difficulty and workload of kernel integration for algorithm users.

Original post: megengine.org.cn/blog/design…