Welcome to follow my public account [Jizhi Vision]; reply "001" to get the Google coding style guide.


This tutorial shares how to add layer-by-layer operators to PyTorch-MLU on Cambricon devices.

In PyTorch-MLU layer-by-layer mode, the basic unit of data transfer and storage between operators is the tensor. PyTorch-MLU dispatches an operator to a device according to the device attribute of its input tensor. Take the abs() operator as an example: at dispatch time, the operator call is routed to the corresponding device based on the device attribute of input_tensor.

The Catch module is decoupled from the PyTorch source code and adds MLU operators through operator registration. The steps for adding an MLU operator to Catch are as follows.

1. Register the operator

Register the operator in catch/torch_mlu/csrc/generated/aten_mlu_type_default.cpp:

.op(torch::RegisterOperators::options().schema("aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor")  // NOLINT 

  .impl_unboxedOnlyKernel<at::Tensor(const at::Tensor &, const at::Tensor &, at::Scalar), &AtenMluType::add>(at::TensorTypeId::MLUTensorId)
  .aliasAnalysis(c10::AliasAnalysisKind::FROM_SCHEMA))

2. Operator dispatch

AtenMluType and AtenMluCustomType are the operator entry points of the Catch module. The AtenMluType class mainly contains the standard operators of the framework, while the AtenMluCustomType class contains custom operators. Add the operator declaration and implementation to AtenMluType or AtenMluCustomType depending on the operator's attributes.

  • Standard operator dispatch: add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_type.cpp:
aten_mlu_type.h
static at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
aten_mlu_type.cpp
at::Tensor AtenMluType::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  return OP_DISPATCH(add, self, other, alpha);
}
  • Custom operator dispatch

For MLU-specific operators, add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_custom_type.cpp:

aten_mlu_type.h
static at::Tensor linear(const at::Tensor& input,
                         const at::Tensor& weight,
                         const at::Tensor& bias,
                         const at::Tensor& q_scale,
                         const at::Tensor& q_mode);
aten_mlu_custom_type.cpp
at::Tensor AtenMluCustomType::linear(const at::Tensor& input,
                                     const at::Tensor& weight,
                                     const at::Tensor& bias,
                                     const at::Tensor& q_scale,
                                     const at::Tensor& q_mode){
    return OP_DISPATCH(linear, input, weight, bias, q_scale, q_mode);
}

3. Modify the OpMethods base class

Calls from both AtenMluType and AtenMluCustomType are forwarded through OpMethods to the inference operators or the training operators. Add the operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/op_methods.h and catch/torch_mlu/csrc/aten/operators/op_methods.cpp. The implementation in OpMethods is the CPU implementation of the operator.

op_methods.h
virtual at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
                          at::Scalar alpha){
   auto input_cpu = self.cpu();
   auto other_cpu = other.cpu();
   auto output = at::add(input_cpu, other_cpu, alpha);
   return output.to(at::Device(at::Device::Type::MLU));
}

4. Deliver the operator

Add the inference operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml_ops.h and catch/torch_mlu/csrc/aten/operators/cnml_ops.cpp.

cnml_ops.h
at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
cnml_ops.cpp
at::Tensor CnmlOps::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  CNML_DISPATCH(add, cnml_add, self, other, alpha);  // the first argument of CNML_DISPATCH is the interface name, the second is the wrapper name, and the remaining arguments are the operator's parameters
}
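The definition of CNML_DISPATCH is not shown here. Going only by the comment above (first argument: interface name; second: wrapper name; the rest: the operator's parameters), below is a minimal sketch of what such a forwarding macro could look like; the definition is an assumption for illustration, and the real macro in the Catch sources may differ.

// Hypothetical sketch only -- not the actual definition from the Catch sources.
// op      : the OpMethods interface name (e.g. add)
// wrapper : the wrapper function to call (e.g. cnml_add)
// ...     : the operator's parameters, forwarded unchanged
#define CNML_DISPATCH(op, wrapper, ...) \
  return wrapper(__VA_ARGS__)

With a definition like this, the call in CnmlOps::add above would expand to return cnml_add(self, other, alpha), handing control to the wrapper added in step 5.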

5. Add wrapper

A wrapper is a layer of encapsulation around the operator kernel, with one wrapper per operator. Using the add operator as an example, its wrapper is added as follows:

cnml_kernel.h
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha);
add.cpp
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha_scalar){
  TORCH_CHECK(input.dim() >= 0 || other.dim() >= 0, "dimension not support");
  at::Tensor input_ = input;
  at::Tensor other_ = other;
  auto alpha_data = alpha_scalar.to<scalar_t>();
  if (alpha_data != 1) {
    // scale other_ by alpha
    other_ = cnml::ops::cnml_scale(other_, alpha_data, 0);
  }
  if (other_.dim() < 1 && other_.device().type() == c10::DeviceType::CPU) {
    auto other_scalar = other_.item();
    return cnml_add_internal(input_, other_scalar);   // call the kernel
  }
  if (input_.dim() < 1 && input_.device().type() == c10::DeviceType::CPU) {
    auto input_scalar = input_.item();
    return cnml_add_internal(other_, input_scalar);   // call the kernel
  }

  bool broadcast = input_.sizes() != other_.sizes();
  if (broadcast) {
    auto broadcast_size = at::infer_size(input.sizes(), other.sizes());
    at::Tensor broadcast1 = cnml::ops::cnml_expand(input_, broadcast_size, false);
    at::Tensor broadcast2 = cnml::ops::cnml_expand(other_, broadcast_size, false);
    return cnml_add_internal(broadcast1, broadcast2);  // call the kernel
  } else {
    return cnml_add_internal(input_, other_);  // call the kernel
  }
}

6. Add the kernel

The operator's functionality is implemented in the wrapper by calling a kernel; in this example the wrapper calls cnml_add_internal. The operator itself is implemented mainly through the CNML library interfaces: create the op, compile it, and then compute it on a queue.

Following this CNML programming flow, add the kernel function declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml/internal/cnml_internal.h and catch/torch_mlu/csrc/aten/operators/cnml/internal/add_internal.cpp.

cnml_internal.h
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2);
add_internal.cpp
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2){
  auto output = at::native::empty_like(input1);
  // prepare input cnml tensor
  auto* input1_impl = getMluTensorImpl(input1);  // get MluTensorImpl
  auto input1_cnml = input1_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(input1.dtype()));  // type adaptation: toCnmlDataType()

  auto* input2_impl = getMluTensorImpl(input2);
  auto input2_cnml = input2_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(input2.dtype()));

  // prepare output cnml tensor
  auto* output_impl = getMluTensorImpl(output);
  auto output_cnml = output_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(output.dtype()));

  // end the execution flow if not an MLU device
  CHECK_MLU_DEVICE(output);

  // set up the operator
  cnmlBaseOp_t add_op;
  TORCH_CNML_CHECK(cnmlCreateAddOp(&add_op, input1_cnml, input2_cnml, output_cnml));

  // return to JIT if the running mode is fuse
  CHECK_RETURN_TO_FUSE(add_op, output);

  // compile the op
  TORCH_CNML_CHECK(cnmlCompileBaseOp(add_op, GET_CORE_VERSION, GET_CORE_NUMBER));

  auto queue = getCurQueue();
  TORCH_CNML_CHECK(cnmlComputeAddOpForward_V4(add_op,
                                              NULL,
                                              input1_impl->raw_mutable_data(),
                                              NULL,
                                              input2_impl->raw_mutable_data(),
                                              NULL,
                                              output_impl->raw_mutable_data(),
                                              queue,
                                              NULL));
  syncQueue(queue);
  TORCH_CNML_CHECK(cnmlDestroyBaseOp(&add_op));

  return output;
}
  • Handling operators not supported by the MLU

For operations not supported by the MLU, the input data is copied to the CPU, the corresponding CPU operation is invoked, and the output is copied back to the MLU. For the implementation, see op_methods.cpp under catch/torch_mlu/csrc/aten/operators/.

op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
                          at::Scalar alpha){
  auto input_cpu = self.cpu();
  auto other_cpu = other.cpu();
  auto output = at::add(input_cpu, other_cpu, alpha);
  return output.to(at::Device(at::Device::Type::MLU));
}
  • If a newly added operator throws an exception during execution and there is no corresponding CPU operation, the call cannot be switched to the CPU (see the sketch after this list).
  • In general, the wrapper is named cnml_<operator name>, and the kernel is named cnml_<operator name>_internal.
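As flagged in the bullet about operators with no CPU counterpart, below is a minimal, hypothetical sketch of what the OpMethods entry for an MLU-specific operator (the custom linear from step 2) could do when no CPU equivalent exists: it can only raise an error instead of falling back. The body and error message are assumptions for illustration, not the actual Catch implementation.

// Hypothetical sketch only: an operator with no CPU equivalent cannot fall back.
at::Tensor OpMethods::linear(const at::Tensor& input,
                             const at::Tensor& weight,
                             const at::Tensor& bias,
                             const at::Tensor& q_scale,
                             const at::Tensor& q_mode) {
  // No CPU operation takes the quantization parameters q_scale/q_mode,
  // so the call cannot be switched to the CPU; report an error instead.
  AT_ERROR("linear with quantization parameters is not supported on the CPU");
}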

7. Operator testing

Write operator unit tests using Python's unittest module. During testing, provide the same parameters and input data, run the operator on both the MLU and the CPU, and compare the two outputs. The MLU and CPU results may differ slightly; a relative error within 2% is generally acceptable.

def test_add(self):
  # "Tensor + Tensor" mode testing
  for shape1, shape2 in [((1, 3, 224, 224), (1, 3, 224, 224)),
                         ((2, 30, 80), (2, 30, 80)),
                         ((3, 20), (3, 20)),
                         ((10,), (10,))]:
    input1_cpu = torch.rand(shape1, dtype=torch.float)
    input2_cpu = torch.rand(shape2, dtype=torch.float)
    input1_mlu = input1_cpu.to(xm.mlu_device())
    input2_mlu = input2_cpu.to(xm.mlu_device())
    # compute on the CPU
    output_cpu = input1_cpu + input2_cpu
    # compute on the MLU
    output_mlu = input1_mlu + input2_mlu
    # compare the MLU result with the CPU result; the relative error should be within 2%
    self.assertTensorsEqual(output_cpu, output_mlu.cpu(), 0.02, use_MSE=True)


The above uses the add() operator as an example of how to add a layer-by-layer operator to Cambricon PyTorch-MLU.

