Technical background
In the previous article, we introduced itMethod to implement a simple Gather operator under the Cython+CUDA framework. Here we demonstrate the upgraded version implementation of the Gather operator - BatchGather operator. However, only a Batch dimension was added here, and no other dimensions were added, such as the Dimension dimension, so I will not consider it here for the time being.
CUDA header file
Here we keep the original Gather part and only add a BatchGather operation. The following iscuda_index.cuh
Content:
#include <>
extern "C" float Gather(float *source, int *index, float *res, int N, int M);
extern "C" float BatchGather(float *source, int *index, float *res, int N, int M, int B);
BatchGather just adds a B dimension to Gather. In addition to the header file of the CUDA operator itself, we also use the exception capture header file here.:
#pragma once
#include <>
#define CHECK(call) do{const cudaError_t error_code = call; if (error_code != cudaSuccess){printf("CUDA Error:\n"); printf(" File: %s\n", __FILE__); printf(" Line: %d\n", __LINE__); printf(" Error code: %d\n", error_code); printf(" Error text: %s\n", cudaGetErrorString(error_code)); exit(1);}} while (0)
The macros can be used to detect exceptions thrown by CUDA functions. There is also a header file used to count the runtime of CUDA functions:
#pragma once
#include <>
#include <cuda_runtime.h>
// Macro definition, used to measure the execution time of CUDA functions
#define TIME_CUDA_FUNCTION(func) \
do { \
cudaEvent_t start, stop; \
float elapsedTime; \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, NULL); \
\
func; \
\
cudaEventRecord(stop, NULL); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
printf("Time taken by function %s is: %f ms\n", #func, elapsedTime); \
\
cudaEventDestroy(start); \
cudaEventDestroy(stop); \
} while (0)
// Macro definition, used to measure the execution time of the CUDA function and return the time
#define GET_CUDA_TIME(func) \
({ \
cudaEvent_t start, stop; \
float elapsedTime = 0.0f; \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, NULL); \
\
func; \
\
cudaEventRecord(stop, NULL); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
\
cudaEventDestroy(start); \
cudaEventDestroy(stop); \
\
elapsedTime; \
})
You can choose the direct printing time, or you can choose the float value that returns the time.
CUDA files
Next is the official CUDA function contentcuda_index.cu
:
// nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
#include <>
#include "cuda_index.cuh"
#include ""
#include ""
__global__ void GatherKernel(float *source, int *index, float *res, int N){
int idx = * + ;
if (idx < N){
res[idx] = source[index[idx]];
}
}
extern "C" float Gather(float *source, int *index, float *res, int N, int M){
float *souce_device, *res_device;
int *index_device;
CHECK(cudaMalloc((void **)&souce_device, M * sizeof(float)));
CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
CHECK(cudaMalloc((void **)&index_device, N * sizeof(int)));
CHECK(cudaMemcpy(souce_device, source, M * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(res_device, res, N * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(index_device, index, N * sizeof(int), cudaMemcpyHostToDevice));
int block_size = 1024;
int grid_size = (N + block_size - 1) / block_size;
float timeTaken = GET_CUDA_TIME((GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N)));
CHECK(cudaGetLastError());
CHECK(cudaDeviceSynchronize());
CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
CHECK(cudaFree(souce_device));
CHECK(cudaFree(index_device));
CHECK(cudaDeviceSynchronize());
CHECK(cudaFree(res_device));
CHECK(cudaDeviceReset());
return timeTaken;
}
__global__ void BatchGatherKernel(float *source, int *index, float *res, int N, int M, int B){
int idx = * + ;
if (idx < N*B){
int batch_idx = idx / N;
int source_idx = batch_idx * M + index[idx];
res[idx] = source[source_idx];
}
}
extern "C" float BatchGather(float *source, int *index, float *res, int N, int M, int B){
float *souce_device, *res_device;
int *index_device;
CHECK(cudaMalloc((void **)&souce_device, B * M * sizeof(float)));
CHECK(cudaMalloc((void **)&res_device, B * N * sizeof(float)));
CHECK(cudaMalloc((void **)&index_device, B * N * sizeof(int)));
CHECK(cudaMemcpy(souce_device, source, B * M * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(res_device, res, B * N * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(index_device, index, B * N * sizeof(int), cudaMemcpyHostToDevice));
int block_size = 1024;
int grid_size = (B * N + block_size - 1) / block_size;
float timeTaken = GET_CUDA_TIME((BatchGatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N, M, B)));
CHECK(cudaGetLastError());
CHECK(cudaDeviceSynchronize());
CHECK(cudaMemcpy(res, res_device, B * N * sizeof(float), cudaMemcpyDeviceToHost));
CHECK(cudaFree(souce_device));
CHECK(cudaFree(index_device));
CHECK(cudaDeviceSynchronize());
CHECK(cudaFree(res_device));
CHECK(cudaDeviceReset());
return timeTaken;
}
Before passing it to CUDA here, we need to compress the relevant data into one-dimensional in Cython or Python, so the one passing it into the CUDA function is a one-dimensional pointer. Compared to a single Gather operation, several input meanings in BatchGather have changed, for exampleN
It represents the Index length of a single Batch.M
It represents the source array length of a single Batch.
Cython files
For a new Batch function, we need to build a new Cython call function:
# cythonize -i -f
import numpy as np
cimport numpy as np
cimport cython
cdef extern from "<>" nogil:
void *dlopen(const char *, int)
char *dlerror()
void *dlsym(void *, const char *)
int dlclose(void *)
enum:
RTLD_LAZY
ctypedef float (*GatherFunc)(float *source, int *index, float *res, int N, int M) noexcept nogil
ctypedef float (*BatchGatherFunc)(float *source, int *index, float *res, int N, int M, int B) noexcept nogil
cdef void* handle = dlopen('/path/to/', RTLD_LAZY)
@(False)
@(False)
cpdef float[:] cuda_gather(float[:] x, int[:] idx):
cdef:
GatherFunc Gather
float timeTaken
int N = [0]
int M = [0]
float[:] res = ((N, ), dtype=np.float32)
Gather = <GatherFunc>dlsym(handle, "Gather")
timeTaken = Gather(&x[0], &idx[0], &res[0], N, M)
print (timeTaken)
return res
@(False)
@(False)
cpdef float[:] batch_cuda_gather(float[:] x, int[:] idx, int B):
cdef:
BatchGatherFunc BatchGather
float timeTaken
int N = [0] // B
int M = [0] // B
float[:] res = ((B*N, ), dtype=np.float32)
BatchGather = <BatchGatherFunc>dlsym(handle, "BatchGather")
timeTaken = BatchGather(&x[0], &idx[0], &res[0], N, M, B)
print (timeTaken)
return res
while not True:
dlclose(handle)
Here we still accept one-dimensional arrays and introduce more parameters of Batch dimensions.B
, the rest are the same.
Python call files
Finally, the code on the top Python side used to calltest_gather.py
:
import numpy as np
(0)
from wrapper import batch_cuda_gather
B = 2
M = 1024 * 1024 * 128
N = 1024 * 1024
x = ((M*B,)).astype(np.float32)
idx = (0, M, (N*B,)).astype(np.int32)
np_res = ((B, N), dtype=np.float32)
for i in range(B):
np_res[i] = ((B,-1))[i][((B, -1))[i]]
np_res = np_res.reshape(-1)
res = (batch_cuda_gather(x, idx, B))
print ()
print ((res==np_res).sum())
For easy processing, when building data, we directly generate one-dimensional data in the data generation stage, and then directly call the Cython function for CUDA-related operations.
Running method
Summary
With the purpose of learning CUDA, we will follow the previous article about the implementation of Gather operators under Cython and CUDA architecture. Here we add a Batch dimension to make a simple implementation of BatchGather.
Copyright Statement
The first link to this article is:/dechinphy/p/
Author ID: DechinPhy
More original articles:/dechinphy/
Please ask the blogger to have coffee:/dechinphy/gallery/image/