Location>code7788 >text

Cython and CUDA Gather

Popularity:733 ℃/2025-02-27 10:18:53

Technical background

Cython is a superset of Python. It can use Python's syntax to write performance close to C language. It can be used to rewrite the Bottleneck module encountered during Python programming to Cython to achieve acceleration. I've written some about it beforeCython Accelerated ComputingArticles. Also, because Cython will be converted to C language code during compilation, and then compiled into dynamic link library or executable file, it is natural toCalling C language in Cythonfunction. This method can also be used directlyCalling CUDA C function. In this article, we want to use Cython combined with CUDA C method to implement a CUDA version of Gather function, and output the corresponding array from an array based on the index array. Equivalent to numpyresult=source[index]

Interface header file

We define acuda_index.cuhThe header file is used to specify the C function interface form:

#include <>

extern "C" int Gather(float *source, int *index, float *res, int N, int M);

where source is the original array, index is the index array, res is the result array, N is the dimension of the index, and M is the dimension of the original array.

Exception capture header file

The previous article is used hereCUDA exception captureThe header file used in

#pragma once
#include <>

#define CHECK(call) do{const cudaError_t error_code = call; if (error_code != cudaSuccess){printf("CUDA Error:\n"); printf("    File:   %s\n", __FILE__); printf("    Line:   %d\n", __LINE__); printf("    Error code: %d\n", error_code); printf("    Error text: %s\n", cudaGetErrorString(error_code)); exit(1);}} while (0)

Through this macro, we can catch its exceptions when running the CUDA kernel function.

CUDA Gather Function

CUDA implements Gather functioncuda_index.cuIt's still relatively simple, just a simple Kernel function plus a C function that manages DeviceMemory:

// nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
#include <>
#include "cuda_index.cuh"
#include ""

void __global__ GatherKernel(float *source, int *index, float *res, int N){
    int idx =  *  + ;
    if (idx < N){
        res[idx] = source[index[idx]];
    }
}

extern "C" int Gather(float *source, int *index, float *res, int N, int M){
    float *souce_device, *res_device;
    int *index_device;
    CHECK(cudaMalloc((void **)&souce_device, M * sizeof(float)));
    CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&index_device, N * sizeof(int)));
    CHECK(cudaMemcpy(souce_device, source, M * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(res_device, res, N * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(index_device, index, N * sizeof(int), cudaMemcpyHostToDevice));
    int block_size = 1024;
    int grid_size = (N + block_size - 1) / block_size;
    GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(souce_device));
    CHECK(cudaFree(index_device));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(res_device));
    CHECK(cudaDeviceReset());
    return 1;
}

Cython call interface

Suppose we have oneForm arrays need to be indexed, of course we can also directly implement them using ready-made AI frameworks, for example(). But here we use Cython to directly connect to CUDA functions, in theory, you can do some more detailed operations on the array.

# cythonize -i -f 

import numpy as np
cimport numpy as np
cimport cython

cdef extern from "<>" nogil:
    void *dlopen(const char *, int)
    char *dlerror()
    void *dlsym(void *, const char *)
    int dlclose(void *)
    enum:
        RTLD_LAZY

ctypedef int (*GatherFunc)(float *source, int *index, float *res, int N, int M) noexcept nogil

cdef void* handle = dlopen('/path/to/', RTLD_LAZY)

@(False)
@(False)
cpdef float[:] cuda_gather(float[:] x, int[:] idx):
    cdef:
        GatherFunc Gather
        int success
        int N = [0]
        int M = [0]
        float[:] res = ((N, ), dtype=np.float32)
    Gather = <GatherFunc>dlsym(handle, "Gather")
    success = Gather(&x[0], &idx[0], &res[0], N, M)
    return res

while not True:
    dlclose(handle)

The dynamic link library used hereIt is a compiled CUDA module, and it will be safer to use absolute paths.

Python calls functions

Our top-level function is still through Python scriptstest_gather.pyTo call, with its concise syntax and a large number of third-party interfaces:

import numpy as np
(0)
from wrapper import cuda_gather

M = 1024 * 1024 * 128
N = 1024 * 1024
x = ((M,)).astype(np.float32)
idx = (0, M, (N,)).astype(np.int32)
res = (cuda_gather(x, idx))
print ()
print ((res==x[idx]).sum())

The wrapper here is the package name of our Cython file.

Running process

After editing the above related files, we need to use it in this process: first compile the CUDA related module into a dynamic link library, then use Cython to load the dynamic link library, then compile the Cython encapsulation module into a dynamic link library for Python to call, and finally execute Python tasks directly. The terminal instructions corresponding to the relevant steps are as follows:

$ nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./ 
$ cythonize -i -f  
$ python3 test_gather.py

The result of running the output is as follows:

(1048576,)
1048576

If you use nvitop to monitor the usage of GPU resources, you can see some fluctuations in GPU video memory during operation. The final output result is consistent with the index function of numpy, which means that our output result is correct.

Error processing

If there is a problem that the Numpy-related lib cannot be found during the run, please refer toThis articlePerform processing.

Summary

This article uses Cython as an encapsulation function to encapsulate a Gather operator implemented in CUDA C, and then calls it through Python. This method is used to implement and call a CUDA Gather function comparing Python.

Copyright Statement

The first link to this article is:/dechinphy/p/

Author ID: DechinPhy

More original articles:/dechinphy/

Please ask the blogger to have coffee:/dechinphy/gallery/image/