Reference link – W3School

Reference link – CSDN

Code for this article – GitHub

Background

PyTorch already provides a rich set of interfaces that can be called directly, but some highly customized operations cannot be implemented efficiently in PyTorch or Python. For these, PyTorch also provides extension interfaces that can be programmed in C++ and CUDA. C++ extensions come in two main forms: built ahead of time using setuptools, or built at run time using torch.utils.cpp_extension.load(). The first method is described below; the second will be studied later.
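For comparison, here is a minimal sketch of the second, run-time approach; the source paths are borrowed from the example project described below:

from torch.utils.cpp_extension import load

# JIT-compile and load the extension at run time instead of building it ahead of time.
ball_query_cuda = load(
    name='ball_query_cuda',
    sources=['ball_query_src/api.cpp',
             'ball_query_src/ball_query.cpp',
             'ball_query_src/ball_query_cuda.cu'],
    verbose=True,
)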

Basic steps

The general steps for joint PyTorch, CUDA, and C++ programming are as follows:

  1. First, define a C++ file that declares the functions implemented in the CUDA file, performs some checks, and forwards the calls to the .cu file. This file also needs to declare the functions to be called from Python and bind them to Python using PyBind11. OpenPCDet splits this into the following steps:
    1. Define a header file that declares both the functions implemented in the .cu file and the functions implemented in the .cpp file.

    2. Define the .cpp file, whose function performs some checks and then calls the function implemented in the .cu file.

    3. The .cu file carries out the actual CUDA programming work.

    4. The API file binds the functions defined in the .cpp file with PyBind11 so that Python can call them.

  2. In setup.py, declare the module name to be compiled, the source file paths, and so on.
  3. Import the built module with import, and implement its forward and backward propagation in Python.

For example

.
├── ball_query_src
│   ├── api.cpp
│   ├── ball_query.cpp
│   ├── ball_query_cuda.cu
│   ├── ball_query_cuda.h
│   └── cuda_utils.h
├── setup.py
└── test_ball_query.py

The project’s directory structure is shown above. The api.cpp file binds the function declared in ball_query.cpp to Python using PyBind11. The content of api.cpp is as follows:

#include <torch/serialize/tensor.h>
#include <torch/extension.h>

#include "ball_query_cuda.h"


PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    // The first argument represents the name of the call in Python, the second argument represents the corresponding CPP function, and the third argument represents the description of the function
    m.def("ball_query_wrapper", &ball_query_wrapper_fast, "ball_query_wrapper_fast");
}

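TORCH_EXTENSION_NAME is a macro that the build system defines to the extension name declared in setup.py (ball_query_cuda in this project), so after compilation the binding is importable as an ordinary Python module. A quick check, as a sketch:

import ball_query_cuda

# The first argument passed to m.def() becomes an attribute of the module.
print(ball_query_cuda.ball_query_wrapper)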

Here, to make the structure of the code clearer, the author declares two functions in ball_query_cuda.h: one implemented in C++ (the wrapper bound to Python) and one implemented in CUDA. The details are as follows:

#ifndef _BALL_QUERY_GPU_H
#define _BALL_QUERY_GPU_H

#include <torch/serialize/tensor.h>
#include <vector>
#include <cuda.h>
#include <cuda_runtime_api.h>

// The function bound to PyBind11, whose main function is to call the following CUDA functions
int ball_query_wrapper_fast(int b, int n, int m, float radius, int nsample, 
	at::Tensor new_xyz_tensor, at::Tensor xyz_tensor, at::Tensor idx_tensor);

// Functions in CUDA files
void ball_query_kernel_launcher_fast(int b, int n, int m, float radius, int nsample, 
	const float *xyz, const float *new_xyz, int *idx);

#endif

After the two functions above are declared in the .h file, they are implemented in ball_query.cpp and ball_query_cuda.cu respectively. First, ball_query.cpp:

#include <torch/serialize/tensor.h>
#include <vector>
#include <THC/THC.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

#include "ball_query_cuda.h"

extern THCState *state;

// Define macros that check data types
#define CHECK_CUDA(x) do { \
    if (!x.type().is_cuda()) { \
        fprintf(stderr, "%s must be CUDA tensor at %s:%d\n", #x, __FILE__, __LINE__); \
        exit(-1); \
    } \
} while (0)
#define CHECK_CONTIGUOUS(x) do { \
    if (!x.is_contiguous()) { \
        fprintf(stderr, "%s must be contiguous tensor at %s:%d\n", #x, __FILE__, __LINE__); \
        exit(-1); \
    } \
} while (0)
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

// Complete the input data type check while calling the function defined in the CU file
int ball_query_wrapper_fast(int b, int n, int m, float radius, int nsample, 
    at::Tensor new_xyz_tensor, at::Tensor xyz_tensor, at::Tensor idx_tensor) {
    CHECK_INPUT(new_xyz_tensor);
    CHECK_INPUT(xyz_tensor);
    const float *new_xyz = new_xyz_tensor.data<float>();
    const float *xyz = xyz_tensor.data<float>();
    int *idx = idx_tensor.data<int>();

    ball_query_kernel_launcher_fast(b, n, m, radius, nsample, new_xyz, xyz, idx);
    return 1;
}

Next, ball_query_cuda.cu:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#include "ball_query_cuda.h"
#include "cuda_utils.h"


__global__ void ball_query_kernel_fast(int b, int n, int m, float radius, int nsample, 
    const float *__restrict__ new_xyz, const float *__restrict__ xyz, int *__restrict__ idx) {
    // new_xyz: (B, M, 3)
    // xyz: (B, N, 3)
    // output:
    // idx: (B, M, nsample)
    int bs_idx = blockIdx.y;
    int pt_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (bs_idx >= b || pt_idx >= m) return;

    new_xyz += bs_idx * m * 3 + pt_idx * 3;
    xyz += bs_idx * n * 3;
    idx += bs_idx * m * nsample + pt_idx * nsample;

    float radius2 = radius * radius;
    float new_x = new_xyz[0];
    float new_y = new_xyz[1];
    float new_z = new_xyz[2];

    int cnt = 0;
    for (int k = 0; k < n; ++k) {
        float x = xyz[k * 3 + 0];
        float y = xyz[k * 3 + 1];
        float z = xyz[k * 3 + 2];
        float d2 = (new_x - x) * (new_x - x) + (new_y - y) * (new_y - y) + (new_z - z) * (new_z - z);
        if (d2 < radius2){
            if (cnt == 0) {
                for (int l = 0; l < nsample; ++l) {
                    idx[l] = k;
                }
            }
            idx[cnt] = k;
            ++cnt;
            if (cnt >= nsample) break;
        }
    }
}

void ball_query_kernel_launcher_fast(int b, int n, int m, float radius, int nsample,
    const float *new_xyz, const float *xyz, int *idx) {
    // new_xyz: (B, M, 3)
    // xyz: (B, N, 3)
    // output:
    // idx: (B, M, nsample)

    cudaError_t err;

    dim3 blocks(DIVUP(m, THREADS_PER_BLOCK), b);  // blockIdx.x(col), blockIdx.y(row)
    dim3 threads(THREADS_PER_BLOCK);

    ball_query_kernel_fast<<<blocks, threads>>>(b, n, m, radius, nsample, new_xyz, xyz, idx);
    // cudaDeviceSynchronize(); // for using printf in kernel function
    err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}
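To make the kernel's semantics easier to verify, the following pure-PyTorch reference is a sketch (not part of the original project; ball_query_reference is a hypothetical helper). For each center point it collects the indices of up to nsample points within radius, padding the remaining slots with the first neighbor found, mirroring the cnt == 0 branch in the kernel above.

import torch

def ball_query_reference(radius, nsample, xyz, new_xyz):
    # xyz: (B, N, 3), new_xyz: (B, M, 3) -> idx: (B, M, nsample), int32
    B, M, _ = new_xyz.shape
    idx = torch.zeros(B, M, nsample, dtype=torch.int32, device=xyz.device)
    d2 = torch.cdist(new_xyz, xyz).pow(2)  # (B, M, N) squared pairwise distances
    for b in range(B):
        for i in range(M):
            neighbors = torch.nonzero(d2[b, i] < radius * radius).flatten()
            if neighbors.numel() > 0:
                idx[b, i, :] = neighbors[0].int()    # pad every slot with the first hit
                k = min(neighbors.numel(), nsample)
                idx[b, i, :k] = neighbors[:k].int()  # overwrite with up to nsample hits
    return idx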

Now that the core functionality of ball_query is complete, we need to compile the files above using setup.py, which is implemented as follows:

import os
import subprocess

from setuptools import find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension


def make_cuda_ext(name, module, sources):
    cuda_ext = CUDAExtension(
        name='%s.%s' % (module, name),
        sources=[os.path.join(module.split('.')[-1], src) for src in sources]
    )
    print([os.path.join(*module.split('.'), src) for src in sources])
    return cuda_ext

if __name__ == '__main__':
    setup(
        name='ballquery',
        packages=find_packages(),
        ext_modules=[
            CUDAExtension('ball_query_cuda', [
                'ball_query_src/api.cpp',
                'ball_query_src/ball_query.cpp',
                'ball_query_src/ball_query_cuda.cu',
            ])
        ],
        cmdclass={
            'build_ext': BuildExtension
        }
    )
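Note that make_cuda_ext is defined but never called in this minimal setup.py. A hypothetical call consistent with its body (the name and module values here are illustrative, not from the original project) would be:

ext = make_cuda_ext(
    name='ball_query_cuda',
    module='ball_query_src',
    sources=['api.cpp', 'ball_query.cpp', 'ball_query_cuda.cu'],
)
# Produces a CUDAExtension named 'ball_query_src.ball_query_cuda'
# whose sources resolve to ['ball_query_src/api.cpp', ...].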

At this point, building with python setup.py develop (or python setup.py install) produces a shared library from the code above. To embed it in a neural network, however, we still need to define its forward and backward propagation methods. Here we implement both in the test_ball_query.py file.

import torch
import torch.nn as nn
from torch.autograd import Function, Variable
import math

import ball_query_cuda

# Define the forward and backward propagation methods for ball query
class BallQuery(Function):
    
    @staticmethod
    def forward(ctx, radius: float, nsample: int, xyz: torch.Tensor, new_xyz: torch.Tensor) -> torch.Tensor:
        """ :param ctx: :param radius: float, radius of the balls :param nsample: int, maximum number of features in the balls :param xyz: (B, N, 3) xyz coordinates of the features :param new_xyz: (B, npoint, 3) centers of the ball query :return: idx: (B, npoint, nsample) tensor with the indicies of the features that form the query balls """
        assert new_xyz.is_contiguous()
        assert xyz.is_contiguous()

        B, N, _ = xyz.size()
        npoint = new_xyz.size(1)
        idx = torch.cuda.IntTensor(B, npoint, nsample).zero_()
        
        ball_query_cuda.ball_query_wrapper(B, N, npoint, radius, nsample, new_xyz, xyz, idx)
        return idx

    @staticmethod
    def backward(ctx, a=None):
        return None, None, None, None


ball_query = BallQuery.apply

xyz = torch.randn(2, 128, 3).cuda()
new_xyz = xyz

result = ball_query(0.8, 3, xyz, new_xyz)

print(result.shape)
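As a quick sanity check, the result can be compared against the pure-PyTorch reference sketched in the CUDA section above (ball_query_reference is that hypothetical helper, not part of the original project):

# result.shape should be torch.Size([2, 128, 3]).
# Agreement with the reference is expected up to floating-point
# ties exactly on the radius boundary.
ref = ball_query_reference(0.8, 3, xyz, new_xyz)
print(torch.equal(result, ref))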