
Cython and CUDA: Add


Technical background

In the previous article, we showed how to implement a Gather operator and a BatchGather operator using Cython combined with CUDA. Here we continue with the same approach and implement a simple addition function that adds two arrays element by element on the GPU. Since array addition is purely element-wise, there is no real difference between high-dimensional and low-dimensional arrays, so we treat the inputs as one-dimensional arrays and skip any Batch processing.
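As a quick illustration (a hypothetical helper, not part of the article's code), a contiguous multi-dimensional array can be handed to the same one-dimensional operator simply by passing its flattened length as N:

#include <vector>

// Declaration of the CUDA operator defined later in this article (cuda_add.cuh)
extern "C" float Add(float *A, float *B, float *res, int N);

// Hypothetical helper: element-wise addition of two contiguous (batch, m) arrays.
// The 2D layout is irrelevant to the kernel; only the total element count matters.
float add_2d(std::vector<float> &A, std::vector<float> &B,
             std::vector<float> &res, int batch, int m) {
    return Add(A.data(), B.data(), res.data(), batch * m);
}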

Header file

First we need a CUDA header file cuda_add.cuh to define the interface of the CUDA function:

#include <cuda_runtime.h>

extern "C" float Add(float *A, float *B, float *res, int N);

Other header files, such as the one for exception handling, can be referenced from this earlier article; the one for CUDA kernel timing can be referenced from this earlier article.
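The code below relies on a CHECK error-handling macro and a GET_CUDA_TIME kernel-timing macro from those articles. For completeness, here is a minimal sketch of what such macros might look like; the file names error.cuh and record.cuh used later, as well as the statement-expression trick, are assumptions of this sketch rather than the exact code of the earlier articles:

// error.cuh (assumed name): abort with a message if a CUDA API call fails
#pragma once
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                         \
do {                                                                        \
    const cudaError_t _err = (call);                                        \
    if (_err != cudaSuccess) {                                              \
        printf("CUDA Error: %s:%d, %s\n", __FILE__, __LINE__,               \
               cudaGetErrorString(_err));                                   \
        exit(1);                                                            \
    }                                                                       \
} while (0)

// record.cuh (assumed name): time a kernel launch with CUDA events and return
// the elapsed milliseconds (uses the GCC statement-expression extension)
#define GET_CUDA_TIME(kernel_call)                                          \
({                                                                          \
    cudaEvent_t _start, _stop;                                              \
    float _elapsed = 0.0f;                                                  \
    cudaEventCreate(&_start);                                               \
    cudaEventCreate(&_stop);                                                \
    cudaEventRecord(_start);                                                \
    kernel_call;                                                            \
    cudaEventRecord(_stop);                                                 \
    cudaEventSynchronize(_stop);                                            \
    cudaEventElapsedTime(&_elapsed, _start, _stop);                         \
    cudaEventDestroy(_start);                                               \
    cudaEventDestroy(_stop);                                                \
    _elapsed;                                                               \
})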

CUDA files

The CUDA file cuda_add.cu contains the core part of the algorithm:

// nvcc -shared ./cuda_add.cu -Xcompiler -fPIC -o ./libcuadd.so
// (the output library name libcuadd.so is an assumed placeholder)
 #include <stdio.h>
 #include "cuda_add.cuh"
 #include "error.cuh"   // CHECK macro; assumed file name, see the sketch above
 #include "record.cuh"  // GET_CUDA_TIME macro; assumed file name, see the sketch above

 __global__ void AddKernel(float *A, float *B, float *res, int N) {
     int tid = blockIdx.x * blockDim.x + threadIdx.x;
     // Each thread processes multiple elements via a grid-stride loop
     int stride = gridDim.x * blockDim.x;
     for (int i = tid; i < N; i += stride) {
         res[i] = A[i] + B[i];
     }
 }

 extern "C" float Add(float *A, float *B, float *res, int N){
     float *A_device, *B_device, *res_device;
     CHECK(cudaMalloc((void **)&A_device, N * sizeof(float)));
     CHECK(cudaMalloc((void **)&B_device, N * sizeof(float)));
     CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
     CHECK(cudaMemcpy(A_device, A, N * sizeof(float), cudaMemcpyHostToDevice));
     CHECK(cudaMemcpy(B_device, B, N * sizeof(float), cudaMemcpyHostToDevice));

     int block_size, grid_size;
     cudaOccupancyMaxPotentialBlockSize(&grid_size, &block_size, AddKernel, 0, N);
     grid_size = (N + block_size - 1) / block_size;

     float timeTaken = GET_CUDA_TIME((AddKernel<<<grid_size, block_size>>>(A_device, B_device, res_device, N)));
     CHECK(cudaGetLastError());
     CHECK(cudaDeviceSynchronize());
     CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
     CHECK(cudaFree(A_device));
     CHECK(cudaFree(B_device));
     CHECK(cudaFree(res_device));
     return timeTaken;
 }

The code here was partially optimized with DeepSeek; for example, the kernel uses a for loop so that each thread processes multiple elements instead of just one. In addition, block_size is generated automatically by cudaOccupancyMaxPotentialBlockSize, which avoids some of the trouble caused by setting it manually. However, we did not use Streams for further optimization here; this is just a simple demonstration of a working algorithm.
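For reference, below is a minimal sketch (not part of the article's code) of what a Stream-based variant might look like: the arrays are split into chunks so that host-to-device copies, kernel execution, and device-to-host copies of different chunks can overlap. It reuses AddKernel and CHECK from above, and it assumes the host buffers were allocated as pinned memory (for example with cudaMallocHost); with ordinary pageable memory the asynchronous copies will not actually overlap:

extern "C" void AddStreamed(float *A, float *B, float *res, int N) {
    const int n_streams = 4;
    const int chunk = (N + n_streams - 1) / n_streams;
    float *A_d, *B_d, *res_d;
    CHECK(cudaMalloc((void **)&A_d, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&B_d, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&res_d, N * sizeof(float)));

    cudaStream_t streams[n_streams];
    for (int i = 0; i < n_streams; i++) CHECK(cudaStreamCreate(&streams[i]));

    for (int i = 0; i < n_streams; i++) {
        int offset = i * chunk;
        if (offset >= N) break;
        int count = (offset + chunk <= N) ? chunk : N - offset;
        // Copy one chunk in, launch the kernel on it, and copy it back, all on the same stream
        CHECK(cudaMemcpyAsync(A_d + offset, A + offset, count * sizeof(float),
                              cudaMemcpyHostToDevice, streams[i]));
        CHECK(cudaMemcpyAsync(B_d + offset, B + offset, count * sizeof(float),
                              cudaMemcpyHostToDevice, streams[i]));
        int block_size = 256;
        int grid_size = (count + block_size - 1) / block_size;
        AddKernel<<<grid_size, block_size, 0, streams[i]>>>(A_d + offset, B_d + offset,
                                                            res_d + offset, count);
        CHECK(cudaMemcpyAsync(res + offset, res_d + offset, count * sizeof(float),
                              cudaMemcpyDeviceToHost, streams[i]));
    }
    for (int i = 0; i < n_streams; i++) {
        CHECK(cudaStreamSynchronize(streams[i]));
        CHECK(cudaStreamDestroy(streams[i]));
    }
    CHECK(cudaFree(A_d));
    CHECK(cudaFree(B_d));
    CHECK(cudaFree(res_d));
}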

Cython interface file

Since our framework wraps the CUDA function through Cython and then calls it from Python, we also need a Cython interface file wrapper.pyx here:

# cythonize -i -f wrapper.pyx

import numpy as np
cimport numpy as np
cimport cython

cdef extern from "<dlfcn.h>" nogil:
    void *dlopen(const char *, int)
    char *dlerror()
    void *dlsym(void *, const char *)
    int dlclose(void *)
    enum:
        RTLD_LAZY

ctypedef float (*AddFunc)(float *A, float *B, float *res, int N) noexcept nogil

cdef void* handle_add = dlopen('/path/to/cuda/libcuadd.so', RTLD_LAZY)  # placeholder path; point it at the compiled shared library

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef float[:] cuda_add(float[:] x, float[:] y):
    cdef:
        AddFunc Add
        float timeTaken
        int N = x.shape[0]
        float[:] res = np.zeros((N, ), dtype=np.float32)
    Add = <AddFunc>dlsym(handle_add, "Add")
    timeTaken = Add(&x[0], &y[0], &res[0], N)
    print (timeTaken)
    return res

# Never executed: the shared library handle stays open for the module's lifetime
while not True:
    dlclose(handle_add)

Python call file

Finally, let's write a Python test case test_add.py to call the Cython-wrapped CUDA function:

import numpy as np
np.random.seed(0)
from wrapper import cuda_add

N = 1024 * 1024 * 100

x = np.random.random((N,)).astype(np.float32)
y = np.random.random((N,)).astype(np.float32)

np_res = x + y

res = np.asarray(cuda_add(x, y))
print (res.shape)
print ((res==np_res).sum())

Running this Python file prints the execution time of the CUDA kernel function and the corresponding returned result.

View GPU information

In order to gain a deeper understanding of the performance of CUDA computing, we can look at some key parameters of the GPU to reason about the theoretical limits of CUDA operations. Some CUDA versions ship with a deviceQuery sample:

$ cd /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery

This directory contains the source files for querying the local GPU configuration parameters:

$ ll
 total 44
 drwxr-xr-x 2 root root  4096 July 13 2021 ./
 drwxr-xr-x 8 root root  4096 July 13 2021 ../
 -rw-r--r-- 1 root root 12473 July 13 2021 deviceQuery.cpp
 -rw-r--r-- 1 root root 10812 July 13 2021 Makefile
 -rw-r--r-- 1 root root  1789 July 13 2021 NsightEclipse.xml
 -rw-r--r-- 1 root root   168 July 13 2021 readme.txt

These files can be compiled, but because the build hard-codes the nvcc location as /usr/local/cuda, if that path does not exist locally you may need to create a symbolic link with ln -s:

$ sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda

Then execute the compilation command:

$ sudo make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release

After compilation completes, run the resulting executable directly:

$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Quadro RTX 4000"
  CUDA Driver Version / Runtime Version          12.2 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7972 MBytes (8358723584 bytes)
  (36) Multiprocessors, ( 64) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            1545 MHz (1.54 GHz)
  Memory Clock rate:                             6501 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro RTX 4000"
  CUDA Driver Version / Runtime Version          12.2 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7974 MBytes (8361738240 bytes)
  (36) Multiprocessors, ( 64) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            1545 MHz (1.54 GHz)
  Memory Clock rate:                             6501 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 166 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Quadro RTX 4000 (GPU0) -> Quadro RTX 4000 (GPU1) : Yes
> Peer access from Quadro RTX 4000 (GPU1) -> Quadro RTX 4000 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS

Here we have printed the relevant parameters of the two GPUs. Memory Bus Width: 256-bit indicates the total bus width, and the larger the value, the better. Memory Clock rate: 6501 Mhz indicates the access rate of the video memory and is often used to estimate GPU performance, because in many cases the performance bottleneck of a GPU lies in memory transfers. GPU Max Clock rate: 1545 MHz (1.54 GHz) can be used to estimate the computation rate. Taking ordinary CUDA addition as an example, the approximate formula for the effective memory rate is:

\[\text{Effective Rate (Gbps)}=\frac{\text{Physical Frequency (MHz)}\times 2}{1000} \]

In turn, the bandwidth can be calculated:

\[\text{Bandwidth (GB/s)}=\frac{\text{Effective Rate (Gbps)}\times \text{Bus Width (bit)}}{8} \]
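Plugging in the parameters of the Quadro RTX 4000 printed above (6501 MHz memory clock, 256-bit bus), this gives:

\[\text{Effective Rate}=\frac{6501\times 2}{1000}\approx 13\ \text{Gbps},\qquad \text{Bandwidth}=\frac{13\times 256}{8}\approx 416\ \text{GB/s} \]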

Finally, estimating the upper limit of the computation rate from the bandwidth is equivalent to estimating the lower limit of the time taken by a CUDA addition:

\[\text{Calculation time (s)}=\frac{\text{Total data volume (B)}}{\text{Bandwidth (B/s)}} \]

In the actual calculation, a single addition involves four steps: reading an element of array A, reading an element of array B, performing the addition, and writing the result to the output array. That is to say, it involves 3 memory operations and 1 arithmetic operation. For the memory part of the time (assuming N=1024*1024*100):

\[T_{mem}=\frac{N\times 4\times 3}{\frac{6501\times 2}{1000}\times\frac{256}{8}\times 10^9}\approx 0.0030243\ s \]

The time is estimated to be around 3 ms (the factor of 4 here converts a single-precision floating point number into bytes). The time spent on the addition operation itself can essentially be ignored, because the instruction throughput is roughly:

\[\text{Instruction Throughput (FLOPS)}=\text{CUDA Cores}\times \text{Clock Frequency}=2304\times 1.54\times 10^{9}\approx 3.55\ \text{TFLOPS} \]

The theoretical minimum compute time is then (assuming N=1024*1024*100):

\[T_{theo}=\frac{\text{Total operations}}{\text{Instruction throughput}}=\frac{1024\times 1024\times 100}{3.55\times 10^{12}}\approx 2.95\times 10^{-5}\ s \]

The instruction part therefore takes about 0.03 ms, which is negligible compared with the roughly 3 ms spent on memory IO.

Real test

The result of running Python code is:

$ python3 test_add.py 
3.3193600177764893
(104857600,)
104857600

The measured 3.32 ms is already close to the estimated limit of about 3 ms, so it is fair to say that further optimization is difficult within this algorithmic framework. More often than not, the optimization point lies in the efficiency of memory transfers between the CPU and the GPU.
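Equivalently, the measured kernel time corresponds to an achieved device-memory bandwidth of roughly

\[\frac{N\times 4\times 3}{3.32\times 10^{-3}}\approx 379\ \text{GB/s}, \]

which is about 91% of the theoretical 416 GB/s estimated above.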

Summary

This article introduces how to implement a CUDA addition operator using CUDA and Cython, and how to use GPU parameters to estimate the performance limit of the algorithm. Actual testing shows that there is not much room left for optimizing the kernel function itself; more often than not, one can consider using Streams to optimize the data transfer between Host and Device.

Copyright Statement

This article was first published at: /dechinphy/p/

Author ID: DechinPhy

More original articles: /dechinphy/

Buy the blogger a coffee: /dechinphy/gallery/image/
