
Calling CUDA Kernel Functions with Cython


Technical background

In a previous article on mixed programming with Cython and C, we saw that Cython lets us call C functions in a very Pythonic way. We have also previously introduced one solution for using CUDA computation in Python. In fact, there are many ways to call CUDA from Python: directly using mature frameworks such as MindSpore, PyTorch, or Jax for GPU acceleration; using Numba's CUDA just-in-time (JIT) compilation to GPU-accelerate Python functions directly; or using PyCUDA or CuPy to achieve GPU acceleration in Python. But no matter which solution is used, there is no way to get around the conversion to C and Kernel functions. Here we offer another idea, one that lets us expose a Python API while using familiar Python functions, C functions, and CUDA functions at the same time for different purposes: a Python + Cython + C/CUDA architecture.

Building Ideas

With the strong promotion of the Python language in the field of machine learning, Python is usually the preferred option for many developers. So regardless of how a project is built, we usually choose to expose a Python API to users. If you are working on a project that doesn't require much performance, you can do the calculations directly in Python. But if the performance requirements are high, then C/CUDA is the better choice. The remaining problem to be solved is how to build the interaction between the Python API and C/CUDA. There are two options here: 1. Compile the C/CUDA code directly into a *.so dynamic link library, and then load the functions of that library in Python as the interface. 2. Use C/CUDA only for the computation, compile the C/CUDA code into a *.so dynamic link library, and schedule and manage those functions from Cython.
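As a minimal sketch of the first option (the library name libhello.so and the exported symbol main are assumptions here, matching the example built later in this article), the Python side could be as simple as:

# a minimal sketch of option 1: load the compiled *.so directly in Python
import ctypes

lib = ctypes.CDLL('./libhello.so')   # load the dynamic link library
lib.main.restype = ctypes.c_int      # declare the return type explicitly
print(lib.main())                    # call the C/CUDA entry point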

From a user-facing point of view, the first option, while workable, requires modifying the C/CUDA code whenever new functional modules need to be added. If it's just a single function module, that's fine, but if it involves task scheduling and interface passing, the bar is much higher. The second option is likely to lose a little performance, but because the interface passing and scheduling are done in Cython, even developers who only know Python can use Cython as an entry point to develop the modules they need.

Case Demonstration

Here we demonstrate a simple Hello World program. First we write a CUDA file (named hello_world.cu here):

#include <stdio.h>

__global__ void HelloKernel(void){
    printf("Hello World From GPU!\n");
}

int main(){
    // launch one block with three threads, so the message prints three times
    HelloKernel<<<1,3>>>();
    // flush the device printf buffer and tear down the CUDA context
    cudaDeviceReset();
    // return a nonzero value so that the Cython caller can check for success
    return 1;
}

Then use nvcc to compile this CUDA file into a dynamic link library:

$ nvcc -shared ./hello_world.cu -Xcompiler -fPIC -o ./libhello.so

This gives us a libhello.so dynamic link library file. The -shared and -Xcompiler -fPIC options tell nvcc to build position-independent code and package it as a shared library that can be loaded at runtime. Then we load this dynamic link library in a Cython file, hello.pyx:

# cythonize -i hello.pyx
import ctypes

# load the shared library that nvcc built above
libcuda = ctypes.CDLL('./libhello.so')

cpdef int run_cuda():
    cdef int res
    # call the exported main(), which launches the kernel and returns 1
    res = libcuda.main()
    if res:
        print('Load cuda function successfully!')
    else:
        print('Failed to load cuda function.')
    return 1
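The wrapper above only calls a function with no arguments. For a shared library that exports a function taking arguments, say a hypothetical void add_on_gpu(float *data, int n), ctypes would need explicit signature declarations; a sketch under that assumption:

import ctypes
import numpy as np  # only used here to build a test array

libcuda = ctypes.CDLL('./libhello.so')
# hypothetical exported function: void add_on_gpu(float *data, int n)
libcuda.add_on_gpu.argtypes = [ctypes.POINTER(ctypes.c_float), ctypes.c_int]
libcuda.add_on_gpu.restype = None

arr = np.ones(8, dtype=np.float32)
libcuda.add_on_gpu(arr.ctypes.data_as(ctypes.POINTER(ctypes.c_float)), arr.size)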

Here we can use the cythonize command to compile the pyx file, or we can configure the build with a Python setup script. For convenience, we will just use cythonize for this demonstration:

$ cythonize -i hello.pyx
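Alternatively, if you prefer the setup-script route mentioned above, a minimal sketch of a setup.py for this example might look like this:

# setup.py -- a minimal sketch of the setup-script build route
from setuptools import setup
from Cython.Build import cythonize

setup(
    name='hello',
    ext_modules=cythonize('hello.pyx', language_level=3),
)

It would be built with python3 setup.py build_ext --inplace, producing the same extension module.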

Then you'll get a hello.c file and a hello.cpython-37m-x86_64-linux-gnu.so dynamically linked file (the exact suffix depends on your Python version and platform). At this point we have wrapped the main function of the CUDA file in a layer of Cython calls, and we can then call this Cython run_cuda() function from a Python file (named test_hello.py here):

# $ python3 test_hello.py
from hello import run_cuda
run_cuda()

Then run this python file and the output is:

$ python3 test_hello.py
Hello World From GPU!
Hello World From GPU!
Hello World From GPU!
Load cuda function successfully!

This shows that we have successfully called the CUDA Kernel function from the Python side.

Other calling methods

As mentioned earlier, we can also call this CUDA function directly from within a C program. For example, after building the libhello.so CUDA dynamic link library above, we can load it at runtime with dlopen:

#include <dlfcn.h>
#include <stdio.h>

int main()
{
    // load the CUDA shared library at runtime
    void* handle = dlopen("./libhello.so", RTLD_LAZY);
    if (handle == NULL) {
        printf("Failed to load library: %s\n", dlerror());
        return 1;
    }
    // look up the exported main symbol and cast it to a function pointer
    int (*helloFromGPU)(void);
    helloFromGPU = (int (*)(void)) dlsym(handle, "main");
    helloFromGPU();
    dlclose(handle);
    return 0;
}
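To try this loader, compile it against libdl and run it (the source file name load_hello.c is an assumption here):

$ gcc ./load_hello.c -o load_hello -ldl
$ ./load_hello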

Summary

There are many options for calling GPUs for acceleration from a Python interface, including CuPy and PyCUDA as well as the previously introduced Numba; mature deep learning frameworks such as MindSpore, PyTorch, and Jax can also be used. Here we introduced a solution for writing CUDA Kernel functions directly. In order to separate CUDA/C programming from Python programming, Cython is introduced as an intermediate layer, so that Python developers and C developers can co-develop the corresponding high-performance methods.

Copyright statement

This article was first published at: /dechinphy/p/

Author ID: DechinPhy

More original articles: /dechinphy/

Buy the blogger a coffee: /dechinphy/gallery/image/