Technical background
In the previous article, we introduced how to implement a Gather operator and a BatchGather operator using Cython combined with CUDA. Here we continue with the same approach to implement a simple addition function that computes the element-wise sum of two arrays with CUDA. Since the addition acts on corresponding elements, there is no real difference between high-dimensional and low-dimensional arrays, so we treat the inputs as one-dimensional arrays and do not add any Batch processing.
Header file
First we need a CUDA header file cuda_add.cuh to define the interface of the CUDA function:
#include <cuda_runtime.h>
extern "C" float Add(float *A, float *B, float *res, int N);
Other header files, such as the one providing the exception-capture CHECK macro, can be found in an earlier article of this series; the CUDA function timing macro is likewise covered in an earlier article.
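Those two helper headers are not listed in this article, so for completeness here is a minimal sketch of what they might look like. The file names error.cuh and record.cuh, the macro bodies, and the GNU statement expression used by GET_CUDA_TIME are assumptions made for illustration, not the exact code from the earlier articles:

// error.cuh -- error-checking macro (sketch; file name and body assumed)
#pragma once
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                      \
do {                                                                     \
    const cudaError_t _err = (call);                                     \
    if (_err != cudaSuccess) {                                           \
        printf("CUDA Error: %s:%d, %s\n", __FILE__, __LINE__,            \
               cudaGetErrorString(_err));                                \
        exit(1);                                                         \
    }                                                                    \
} while (0)

// record.cuh -- event-based kernel timing in milliseconds (sketch)
// Uses a GNU statement expression so the macro evaluates to a float,
// matching float timeTaken = GET_CUDA_TIME(...) in the CUDA file below.
#define GET_CUDA_TIME(kernel_call) ({                                    \
    cudaEvent_t _start, _stop;                                           \
    float _elapsed = 0.0f;                                               \
    cudaEventCreate(&_start);                                            \
    cudaEventCreate(&_stop);                                             \
    cudaEventRecord(_start);                                             \
    kernel_call;                                                         \
    cudaEventRecord(_stop);                                              \
    cudaEventSynchronize(_stop);                                         \
    cudaEventElapsedTime(&_elapsed, _start, _stop);                      \
    cudaEventDestroy(_start);                                            \
    cudaEventDestroy(_stop);                                             \
    _elapsed;                                                            \
})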
CUDA file
The CUDA file cuda_add.cu contains the core part of the algorithm:
// nvcc -shared ./cuda_add.cu -Xcompiler -fPIC -o ./libcuadd.so   (the .so name here is just an example)
#include <stdio.h>
#include "cuda_add.cuh"
#include "error.cuh"    // CHECK macro for error handling (file name assumed)
#include "record.cuh"   // GET_CUDA_TIME macro for kernel timing (file name assumed)
__global__ void AddKernel(float *A, float *B, float *res, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread processes multiple elements (grid-stride loop)
    int stride = blockDim.x * gridDim.x;
    for (int i = tid; i < N; i += stride) {
        res[i] = A[i] + B[i];
    }
}
extern "C" float Add(float *A, float *B, float *res, int N){
float *A_device, *B_device, *res_device;
CHECK(cudaMalloc((void **)&A_device, N * sizeof(float)));
CHECK(cudaMalloc((void **)&B_device, N * sizeof(float)));
CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
CHECK(cudaMemcpy(A_device, A, N * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(B_device, B, N * sizeof(float), cudaMemcpyHostToDevice));
int block_size, grid_size;
cudaOccupancyMaxPotentialBlockSize(&grid_size, &block_size, AddKernel, 0, N);
grid_size = (N + block_size - 1) / block_size;
float timeTaken = GET_CUDA_TIME((AddKernel<<<grid_size, block_size>>>(A_device, B_device, res_device, N)));
CHECK(cudaGetLastError());
CHECK(cudaDeviceSynchronize());
CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
CHECK(cudaFree(A_device));
CHECK(cudaFree(B_device));
CHECK(cudaFree(res_device));
return timeTaken;
}
The code here was partially optimized with the help of DeepSeek: for example, the kernel uses a for loop (a grid-stride loop) so that each thread processes multiple elements instead of just one. In addition, block_size is generated automatically by cudaOccupancyMaxPotentialBlockSize, which avoids some of the trouble caused by setting it manually. We did not use streams for further optimization here; this is just a simple demonstration of a working algorithm.
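Before wiring up the Cython layer, the shared library can be sanity-checked directly from C++. The following is a small hypothetical driver, not part of the article's workflow; the library name libcuadd.so simply matches the example nvcc command above and should be replaced with whatever name you actually built:

// check_add.cpp -- standalone sanity check of the shared library (sketch)
// Build: g++ check_add.cpp -ldl -o check_add
#include <cstdio>
#include <dlfcn.h>

typedef float (*AddFunc)(float *, float *, float *, int);

int main() {
    void *handle = dlopen("./libcuadd.so", RTLD_LAZY);
    if (!handle) { printf("dlopen failed: %s\n", dlerror()); return 1; }
    AddFunc Add = (AddFunc)dlsym(handle, "Add");
    if (!Add) { printf("dlsym failed: %s\n", dlerror()); return 1; }

    const int N = 8;
    float A[N], B[N], res[N];
    for (int i = 0; i < N; ++i) { A[i] = (float)i; B[i] = 2.0f * i; }

    float ms = Add(A, B, res, N);   // Add returns the kernel time in milliseconds
    printf("kernel time: %f ms\n", ms);
    for (int i = 0; i < N; ++i) printf("%g ", res[i]);
    printf("\n");

    dlclose(handle);
    return 0;
}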
Cython interface file
Since our framework wraps the CUDA functions with Cython and then calls them from Python, we need a Cython interface file here (named wrapper.pyx, to match the import in the test script below):
# cythonize -i -f wrapper.pyx
import numpy as np
cimport numpy as np
cimport cython
cdef extern from "<dlfcn.h>" nogil:
    void *dlopen(const char *, int)
    char *dlerror()
    void *dlsym(void *, const char *)
    int dlclose(void *)
    enum:
        RTLD_LAZY
ctypedef float (*AddFunc)(float *A, float *B, float *res, int N) noexcept nogil
cdef void* handle_add = dlopen('/path/to/cuda/libcuadd.so', RTLD_LAZY)  # library name assumed; point this at the .so built by nvcc
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef float[:] cuda_add(float[:] x, float[:] y):
    cdef:
        AddFunc Add
        float timeTaken
        int N = x.shape[0]
        float[:] res = np.zeros((N, ), dtype=np.float32)
    Add = <AddFunc>dlsym(handle_add, "Add")
    timeTaken = Add(&x[0], &y[0], &res[0], N)
    print (timeTaken)
    return res
while not True:
    # never executed: keep the library handle open so cuda_add can be called repeatedly
    dlclose(handle_add)
Python calling file
Finally, let's write a Python test case test_add.py to call the Cython-wrapped CUDA function:
import numpy as np
np.random.seed(0)
from wrapper import cuda_add
N = 1024 * 1024 * 100
x = np.random.random((N,)).astype(np.float32)
y = np.random.random((N,)).astype(np.float32)
np_res = x+y
res = np.asarray(cuda_add(x, y))
print (res.shape)
print ((res==np_res).sum())
Running this Python file prints the kernel execution time together with the shape and consistency check of the returned result.
View GPU information
In order to gain a deeper understanding of CUDA computing performance, we can look at some key parameters of the GPU and reason about the theoretical limits of CUDA operations. Some CUDA installations ship with a deviceQuery sample:
$ cd /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery
This directory contains a few files from which the local GPU configuration parameters can be queried:
$ ll
total 44
drwxr-xr-x 2 root root  4096 Jul 13  2021 ./
drwxr-xr-x 8 root root  4096 Jul 13  2021 ../
-rw-r--r-- 1 root root 12473 Jul 13  2021 deviceQuery.cpp
-rw-r--r-- 1 root root 10812 Jul 13  2021 Makefile
-rw-r--r-- 1 root root  1789 Jul 13  2021 NsightEclipse.xml
-rw-r--r-- 1 root root   168 Jul 13  2021 readme.txt
These files can be compiled, but because the Makefile hard-codes the nvcc path as /usr/local/cuda, if that path does not exist locally you may need to create a soft link with ln -s:
$ sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda
Then execute the compilation command:
$ sudo make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release
After compilation completes, run the resulting executable directly:
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Quadro RTX 4000"
CUDA Driver Version / Runtime Version 12.2 / 10.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7972 MBytes (8358723584 bytes)
(36) Multiprocessors, ( 64) CUDA Cores/MP: 2304 CUDA Cores
GPU Max Clock rate: 1545 MHz (1.54 GHz)
Memory Clock rate: 6501 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro RTX 4000"
CUDA Driver Version / Runtime Version 12.2 / 10.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7974 MBytes (8361738240 bytes)
(36) Multiprocessors, ( 64) CUDA Cores/MP: 2304 CUDA Cores
GPU Max Clock rate: 1545 MHz (1.54 GHz)
Memory Clock rate: 6501 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 166 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Quadro RTX 4000 (GPU0) -> Quadro RTX 4000 (GPU1) : Yes
> Peer access from Quadro RTX 4000 (GPU1) -> Quadro RTX 4000 (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS
Here the relevant parameters of the two GPUs are printed. Memory Bus Width: 256-bit indicates the total width of the memory bus; the larger the value, the better. Memory Clock rate: 6501 Mhz indicates the access rate of the video memory and is often used to estimate GPU performance, because the bottleneck of a GPU workload is frequently the transfer between host memory and video memory. GPU Max Clock rate: 1545 MHz (1.54 GHz) can be used to estimate the arithmetic rate. Taking ordinary CUDA addition as an example, the approximate effective memory rate (the reported memory clock is doubled for the effective data rate) is:

$$\textrm{Effective Rate} = 6501\,\textrm{MHz} \times 2 = 13002\,\textrm{MHz}$$

In turn, the memory bandwidth can be calculated:

$$\textrm{Bandwidth} = 13002\,\textrm{MHz} \times \frac{256\,\textrm{bit}}{8\,\textrm{bit/Byte}} \approx 416\,\textrm{GB/s}$$

Finally, estimating the upper limit of the computation rate from the bandwidth is equivalent to estimating the lower limit of the time taken by a CUDA addition.
In the actual computation, a single addition involves four steps: reading an element of array A, reading an element of array B, adding them, and writing the result to array C. That is, it involves 3 memory operations and 1 arithmetic operation. For the memory part (assuming N=1024*1024*100), the time is roughly:

$$t_{\textrm{mem}} = \frac{3 \times N \times 4\,\textrm{Byte}}{416\,\textrm{GB/s}} = \frac{3 \times 104857600 \times 4}{416 \times 10^{9}}\,\textrm{s} \approx 3\,\textrm{ms}$$

so the memory I/O is estimated at around 3 ms (the 4 here converts a single-precision floating point number to Bytes). The addition operation itself can essentially be ignored, because the instruction throughput is roughly:

$$\textrm{Throughput} \approx 2304\,\textrm{CUDA Cores} \times 1.545\,\textrm{GHz} \approx 3.6 \times 10^{12}\,\textrm{FLOPs/s}$$
Then the theoretical minimum time for the arithmetic part is (assuming N=1024*1024*100):

$$t_{\textrm{compute}} = \frac{104857600}{3.6 \times 10^{12}\,\textrm{FLOPs/s}} \approx 0.03\,\textrm{ms}$$

The instruction part therefore takes approximately 0.03 ms, which is negligible compared with the roughly 3 ms of memory I/O.
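The same estimate can also be reproduced programmatically from cudaGetDeviceProperties. The sketch below is not part of the original workflow; in particular, the 64 CUDA cores per SM used in the throughput term is a hard-coded assumption valid for compute capability 7.5 devices like this one:

// estimate.cu -- reproduce the bandwidth / time estimates from device properties (sketch)
// Build: nvcc estimate.cu -o estimate
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate and clockRate are reported in kHz; the effective GDDR rate is doubled
    double bandwidth = prop.memoryClockRate * 1e3 * 2.0 * prop.memoryBusWidth / 8.0;  // Byte/s

    const double N = 1024.0 * 1024.0 * 100.0;
    double t_mem = 3.0 * N * sizeof(float) / bandwidth;           // read A, read B, write res

    const double cores_per_sm = 64.0;                             // assumption for compute capability 7.5
    double throughput = prop.multiProcessorCount * cores_per_sm * prop.clockRate * 1e3;  // FLOPs/s
    double t_compute = N / throughput;

    printf("Bandwidth : %.1f GB/s\n", bandwidth / 1e9);
    printf("t_mem     : %.2f ms\n", t_mem * 1e3);
    printf("t_compute : %.3f ms\n", t_compute * 1e3);
    return 0;
}

For the Quadro RTX 4000 parameters shown above, this prints roughly 416 GB/s, 3.0 ms and 0.03 ms, matching the hand calculation.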
Real test
The result of running the Python code is:
$ python3 test_add.py
3.3193600177764893
(104857600,)
104857600
The measured time of 3.32 ms is already quite close to the theoretical limit of about 3 ms, so it is difficult to optimize much further within this algorithmic framework. Most of the time, the remaining optimization opportunity lies in the efficiency of data transfer between the CPU and the GPU.
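One direction for that kind of optimization is to split the arrays into chunks and overlap the host-device copies with the kernel execution using CUDA streams. The following is a rough sketch that is not used in this article; the hypothetical AddAsync function could be appended to cuda_add.cu so that it reuses AddKernel and CHECK, and the choice of 4 streams and a fixed block size of 256 is arbitrary:

// Hypothetical stream-overlapped variant (sketch, not part of the article's code).
extern "C" void AddAsync(float *A, float *B, float *res, int N) {
    const int NUM_STREAMS = 4;
    float *A_d, *B_d, *res_d;
    CHECK(cudaMalloc((void **)&A_d, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&B_d, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&res_d, N * sizeof(float)));

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s) CHECK(cudaStreamCreate(&streams[s]));

    int chunk = (N + NUM_STREAMS - 1) / NUM_STREAMS;
    for (int s = 0; s < NUM_STREAMS; ++s) {
        int offset = s * chunk;
        int len = (N - offset < chunk) ? (N - offset) : chunk;
        if (len <= 0) break;
        // Issue the copies and the kernel for this chunk in its own stream, so the
        // copy of one chunk can overlap with the compute of another.
        CHECK(cudaMemcpyAsync(A_d + offset, A + offset, len * sizeof(float),
                              cudaMemcpyHostToDevice, streams[s]));
        CHECK(cudaMemcpyAsync(B_d + offset, B + offset, len * sizeof(float),
                              cudaMemcpyHostToDevice, streams[s]));
        int block_size = 256;
        int grid_size = (len + block_size - 1) / block_size;
        AddKernel<<<grid_size, block_size, 0, streams[s]>>>(
            A_d + offset, B_d + offset, res_d + offset, len);
        CHECK(cudaMemcpyAsync(res + offset, res_d + offset, len * sizeof(float),
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    for (int s = 0; s < NUM_STREAMS; ++s) {
        CHECK(cudaStreamSynchronize(streams[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFree(A_d));
    CHECK(cudaFree(B_d));
    CHECK(cudaFree(res_d));
}

Note that the overlap only pays off when the host arrays are pinned (allocated with cudaMallocHost or registered with cudaHostRegister); with ordinary pageable NumPy buffers the asynchronous copies effectively degrade to synchronous ones.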
Summary
This article describes how to implement a CUDA addition operator with CUDA and Cython, and shows how to use the GPU parameters to estimate the theoretical performance limit of the operation. After actual testing, there is not much room left for optimizing the kernel itself; more often, one can consider using streams to optimize the data transfer between Host and Device.
Copyright Statement
This article was first published at:/dechinphy/p/
Author ID: DechinPhy
More original articles:/dechinphy/
Buy the blogger a coffee:/dechinphy/gallery/image/