Technical background
Cython is a superset of Python: it keeps Python's syntax while delivering performance close to that of C. A common use is to rewrite the bottleneck modules of a Python program in Cython to speed them up. I have written some earlier articles on accelerating computation with Cython. Because Cython code is translated to C during the build and then compiled into a shared library or an executable, calling C functions from Cython is natural, and the same approach can be used to call CUDA C functions directly. In this article we combine Cython with CUDA C to implement a CUDA version of a Gather function, which picks elements out of a source array according to an index array. It is equivalent to numpy's result = source[index].
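As a point of reference, the gather semantics can be written in a few lines of plain Python. This is only a CPU sketch for illustration; the helper name gather_reference is made up here and is not part of the article's code.

```python
def gather_reference(source, index):
    """CPU reference for gather: pick source[i] for every i in index."""
    return [source[i] for i in index]


source = [10.0, 20.0, 30.0, 40.0]
index = [3, 0, 0, 2]
result = gather_reference(source, index)
print(result)  # [40.0, 10.0, 10.0, 30.0]
```

The result always has the length of index, not of source, and an index may appear more than once.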
Interface header file
We define a header file, cuda_index.cuh, that specifies the interface of the C function:
#include <stdio.h>
extern "C" int Gather(float *source, int *index, float *res, int N, int M);
Here source is the source array, index is the index array, res is the result array, N is the length of index (and therefore of res), and M is the length of source.
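The interface receives M, but the kernel shown later in this article never compares the index values against it, so an out-of-range index would read invalid device memory. A minimal host-side check could look like the sketch below. It assumes the caller wants to reject bad indices before calling Gather; validate_indices is a hypothetical helper, not part of the original code.

```python
def validate_indices(index, m):
    """Raise IndexError if any entry of index falls outside [0, m).

    Returns len(index), which is the N that would be passed to Gather.
    """
    for pos, i in enumerate(index):
        if not 0 <= i < m:
            raise IndexError(f"index[{pos}] = {i} is outside [0, {m})")
    return len(index)


n = validate_indices([0, 3, 1], 4)
print(n)  # 3
```

Unlike numpy, which wraps negative indices around, this check rejects them, because the CUDA kernel would use them as raw offsets.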
Exception capture header file
Here we reuse the header file from the previous article on CUDA exception capture:
#pragma once
#include <stdio.h>
#include <stdlib.h>
#define CHECK(call) do{ \
    const cudaError_t error_code = call; \
    if (error_code != cudaSuccess){ \
        printf("CUDA Error:\n"); \
        printf(" File: %s\n", __FILE__); \
        printf(" Line: %d\n", __LINE__); \
        printf(" Error code: %d\n", error_code); \
        printf(" Error text: %s\n", cudaGetErrorString(error_code)); \
        exit(1); \
    } \
} while (0)
With this macro, we can catch errors returned by CUDA runtime calls and kernel launches: it prints the file, line, error code and error text, and then exits.
CUDA Gather Function
The CUDA implementation of the Gather function, cuda_index.cu, is fairly simple: one kernel function plus a C function that manages device memory:
// nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
#include <stdio.h>
#include "cuda_index.cuh"
#include ""  // the exception-capture header from the previous section
// Each thread copies one element: res[idx] = source[index[idx]].
void __global__ GatherKernel(float *source, int *index, float *res, int N){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N){
        res[idx] = source[index[idx]];
    }
}
extern "C" int Gather(float *source, int *index, float *res, int N, int M){
    float *source_device, *res_device;
    int *index_device;
    // Allocate device buffers: M floats for source, N floats for res, N ints for index.
    CHECK(cudaMalloc((void **)&source_device, M * sizeof(float)));
    CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&index_device, N * sizeof(int)));
    // Copy the host arrays to the device.
    CHECK(cudaMemcpy(source_device, source, M * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(res_device, res, N * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(index_device, index, N * sizeof(int), cudaMemcpyHostToDevice));
    // One thread per output element; round the grid size up so all N are covered.
    int block_size = 1024;
    int grid_size = (N + block_size - 1) / block_size;
    GatherKernel<<<grid_size, block_size>>>(source_device, index_device, res_device, N);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    // Copy the result back to the host and release the device buffers.
    CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(source_device));
    CHECK(cudaFree(index_device));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(res_device));
    CHECK(cudaDeviceReset());
    return 1;
}
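To make the launch configuration concrete, the following plain-Python sketch emulates the index arithmetic of GatherKernel on the CPU. It only illustrates how grid_size, block_size and the idx < N guard interact; emulate_gather is a hypothetical name, and the nested loops stand in for threads that the GPU would run in parallel.

```python
def emulate_gather(source, index, block_size=1024):
    """Emulate the thread/block indexing of the gather kernel with plain loops."""
    n = len(index)
    # Ceiling division, the same formula as in the host code.
    grid_size = (n + block_size - 1) // block_size
    res = [0.0] * n
    for block_idx in range(grid_size):
        for thread_idx in range(block_size):
            idx = block_idx * block_size + thread_idx
            if idx < n:  # threads past the end of the array do nothing
                res[idx] = source[index[idx]]
    return grid_size, res


grid_size, res = emulate_gather([1.0, 2.0, 3.0], [2, 2, 0, 1, 0], block_size=4)
print(grid_size, res)  # 2 [3.0, 3.0, 1.0, 2.0, 1.0]
```

With 5 indices and a block size of 4, two blocks are launched and the last three threads of the second block are masked out by the guard.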
Cython call interface
Suppose we have an array that needs to be indexed. We could of course do this directly with a ready-made AI framework, but here we use Cython to call the CUDA function directly, which in theory allows finer-grained operations on the array.
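# cythonize -i -f
import numpy as np
cimport numpy as np
cimport cython

# Declarations from the POSIX dynamic-loading header.
cdef extern from "<dlfcn.h>" nogil:
    void *dlopen(const char *, int)
    char *dlerror()
    void *dlsym(void *, const char *)
    int dlclose(void *)
    enum:
        RTLD_LAZY

# Function pointer type matching the C interface of Gather.
ctypedef int (*GatherFunc)(float *source, int *index, float *res, int N, int M) noexcept nogil

# Load the compiled CUDA library once, when the module is imported.
cdef void* handle = dlopen('/path/to/', RTLD_LAZY)

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef float[:] cuda_gather(float[:] x, int[:] idx):
    cdef:
        GatherFunc Gather
        int success
        int N = idx.shape[0]
        int M = x.shape[0]
        float[:] res = np.zeros((N, ), dtype=np.float32)
    # Look up the Gather symbol and call it on the raw data pointers.
    Gather = <GatherFunc>dlsym(handle, "Gather")
    success = Gather(&x[0], &idx[0], &res[0], N, M)
    return res

# This loop body never runs, so the handle stays open while the module is loaded.
while not True:
    dlclose(handle)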
The shared library loaded here is the compiled CUDA module, and it is safer to refer to it by an absolute path.
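One way to obtain an absolute path without typing it by hand is to resolve it from a known directory and fail early if the file is missing. The sketch below assumes the library sits in a directory you control; resolve_library and the file name libexample.so are placeholders, not names from this article.

```python
from pathlib import Path


def resolve_library(directory, name):
    """Return the absolute path of directory/name, or raise if it does not exist."""
    path = (Path(directory) / name).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"shared library not found: {path}")
    return str(path)


# Hypothetical usage; replace the directory and file name with your own:
# lib_path = resolve_library(".", "libexample.so")
```

Failing with a clear FileNotFoundError is easier to diagnose than a NULL handle returned by dlopen.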
Python calls functions
The top-level call is still made from a Python script, test_gather.py, to take advantage of Python's concise syntax and its large number of third-party interfaces:
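import numpy as np
np.random.seed(0)
from wrapper import cuda_gather
M = 1024 * 1024 * 128
N = 1024 * 1024
# Random float32 source array and random int32 indices in [0, M).
x = np.random.random((M,)).astype(np.float32)
idx = np.random.randint(0, M, (N,)).astype(np.int32)
res = np.asarray(cuda_gather(x, idx))
print(res.shape)
# Count how many elements agree with numpy's own indexing.
print((res == x[idx]).sum())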
Here wrapper is the module name of our Cython file.
Running process
With the files above in place, the workflow is as follows: first compile the CUDA module into a shared library, which the Cython wrapper loads; then compile the Cython wrapper into an extension module that Python can import; finally run the Python script. The corresponding terminal commands are:
$ nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
$ cythonize -i -f
$ python3 test_gather.py
The output is as follows:
(1048576,)
1048576
If you monitor GPU usage with nvitop, you can see the GPU memory fluctuate while the script runs. The final output matches numpy's indexing: all 1048576 elements are equal, which means our result is correct.
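The size of that fluctuation can be estimated from the three cudaMalloc calls in Gather: M floats for the source, N floats for the result and N ints for the index. Below is a back-of-the-envelope sketch, assuming 4-byte float and int as on common platforms; device_bytes is a hypothetical helper, and the estimate ignores the overhead of the CUDA context itself.

```python
def device_bytes(m, n, float_size=4, int_size=4):
    """Bytes allocated on the device: source (m floats), res (n floats), index (n ints)."""
    return m * float_size + n * float_size + n * int_size


M = 1024 * 1024 * 128
N = 1024 * 1024
total = device_bytes(M, N)
print(total / 2**20, "MiB")  # 520.0 MiB
```

Almost all of it (512 MiB) is the source array; the index and result buffers account for 4 MiB each.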
Error handling
If the NumPy-related libraries cannot be found at run time, please refer to this article for a fix.
Summary
This article uses Cython as a wrapper around a Gather operator implemented in CUDA C and then calls it from Python. The result is a relatively Pythonic way to implement and call a CUDA Gather function.
Copyright Statement
This article was first published at: /dechinphy/p/
Author ID: DechinPhy
More original articles: /dechinphy/