Technical background
The previous article introduced inCUDA uses macros to monitor the operation of CUDA C function or Kernel function errorsquestion. By the same idea, we can write a macro to count the runtime of a function, so that we do not need to use additional tools to test the performance of the function body.
Document preparation
Because the macro changes here mainly involve the modification of CUDA header files and CUDA files, we still reuse the Cython files and Python files and exception capture macros.This articleUsed in it. The test content is to define a raw array and an index array, and output the indexed result array.
# cythonize -i -f
import numpy as np
cimport numpy as np
cimport cython
cdef extern from "<>" nogil:
void *dlopen(const char *, int)
char *dlerror()
void *dlsym(void *, const char *)
int dlclose(void *)
enum:
RTLD_LAZY
ctypedef int (*GatherFunc)(float *source, int *index, float *res, int N, int M) noexcept nogil
cdef void* handle = dlopen('/path/to/', RTLD_LAZY)
@(False)
@(False)
cpdef float[:] cuda_gather(float[:] x, int[:] idx):
cdef:
GatherFunc Gather
int success
int N = [0]
int M = [0]
float[:] res = ((N, ), dtype=np.float32)
Gather = <GatherFunc>dlsym(handle, "Gather")
success = Gather(&x[0], &idx[0], &res[0], N, M)
return res
while not True:
dlclose(handle)
test_gather.py
import numpy as np
(0)
from wrapper import cuda_gather
M = 1024 * 1024 * 128
N = 1024 * 1024
x = ((M,)).astype(np.float32)
idx = (0, M, (N,)).astype(np.int32)
res = (cuda_gather(x, idx))
print ()
print ((res==x[idx]).sum())
#pragma once
#include <>
#define CHECK(call) do{const cudaError_t error_code = call; if (error_code != cudaSuccess){printf("CUDA Error:\n"); printf(" File: %s\n", __FILE__); printf(" Line: %d\n", __LINE__); printf(" Error code: %d\n", error_code); printf(" Error text: %s\n", cudaGetErrorString(error_code)); exit(1);}} while (0)
Timing Macro
Here is a timerA header file, write a
TIME_CUDA_FUNCTION
The macro, and then call it before the function that needs to be counted in CUDA, can output the run time of the CUDA function.
#pragma once
#include <>
#include <cuda_runtime.h>
// Macro definition, used to measure the execution time of CUDA functions
#define TIME_CUDA_FUNCTION(func) \
do { \
cudaEvent_t start, stop; \
float elapsedTime; \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, NULL); \
\
func; \
\
cudaEventRecord(stop, NULL); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
printf("Time taken by function %s is: %f ms\n", #func, elapsedTime); \
\
cudaEventDestroy(start); \
cudaEventDestroy(stop); \
} while (0)
Use of timing macros
We're in CUDA filescuda_index.cu
Called inThe timing macro inside is used to count the execution time of a CUDA kernel function:
// nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
#include <>
#include "cuda_index.cuh"
#include ""
#include ""
void __global__ GatherKernel(float *source, int *index, float *res, int N){
int idx = * + ;
if (idx < N){
res[idx] = source[index[idx]];
}
}
extern "C" int Gather(float *source, int *index, float *res, int N, int M){
float *souce_device, *res_device;
int *index_device;
CHECK(cudaMalloc((void **)&souce_device, M * sizeof(float)));
CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
CHECK(cudaMalloc((void **)&index_device, N * sizeof(int)));
CHECK(cudaMemcpy(souce_device, source, M * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(res_device, res, N * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(index_device, index, N * sizeof(int), cudaMemcpyHostToDevice));
int block_size = 1024;
int grid_size = (N + block_size - 1) / block_size;
TIME_CUDA_FUNCTION((GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N)));
CHECK(cudaGetLastError());
CHECK(cudaDeviceSynchronize());
CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
CHECK(cudaFree(souce_device));
CHECK(cudaFree(index_device));
CHECK(cudaDeviceSynchronize());
CHECK(cudaFree(res_device));
CHECK(cudaDeviceReset());
return 1;
}
It should be noted thatTIME_CUDA_FUNCTION
A macro can only have one input, but when using the CUDA kernel function, it will actually be treated as two inputs, so we need to encapsulate the CUDA kernel function in brackets.
Output result
Finally, according toThis articleThe running process in this way can obtain the output result like this:
Time taken by function (GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N)) is: 0.584224 ms
(1048576,)
1048576
Here, the run time of the CUDA kernel function is correctly formatted and output.
Returns the time-consuming value
In addition to printing time-consuming values directly in CUDA, we can also modify themmacros in, let them return time-consuming values:
#pragma once
#include <>
#include <cuda_runtime.h>
// Macro definition, used to measure the execution time of CUDA functions
#define TIME_CUDA_FUNCTION(func) \
do { \
cudaEvent_t start, stop; \
float elapsedTime; \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, NULL); \
\
func; \
\
cudaEventRecord(stop, NULL); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
printf("Time taken by function %s is: %f ms\n", #func, elapsedTime); \
\
cudaEventDestroy(start); \
cudaEventDestroy(stop); \
} while (0)
// Macro definition, used to measure the execution time of the CUDA function and return the time
#define GET_CUDA_TIME(func) \
({ \
cudaEvent_t start, stop; \
float elapsedTime = 0.0f; \
cudaEventCreate(&start); \
cudaEventCreate(&stop); \
cudaEventRecord(start, NULL); \
\
func; \
\
cudaEventRecord(stop, NULL); \
cudaEventSynchronize(stop); \
cudaEventElapsedTime(&elapsedTime, start, stop); \
\
cudaEventDestroy(start); \
cudaEventDestroy(stop); \
\
elapsedTime; \
})
Modify the header filecuda_index.cuh
, because here we need to return a runtime float value, which is no longer an int type:
#include <>
extern "C" float Gather(float *source, int *index, float *res, int N, int M);
Finally, the corresponding modificationcuda_index.cu
Contents in:
// nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
#include <>
#include "cuda_index.cuh"
#include ""
#include ""
void __global__ GatherKernel(float *source, int *index, float *res, int N){
int idx = * + ;
if (idx < N){
res[idx] = source[index[idx]];
}
}
extern "C" float Gather(float *source, int *index, float *res, int N, int M){
float *souce_device, *res_device;
int *index_device;
CHECK(cudaMalloc((void **)&souce_device, M * sizeof(float)));
CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
CHECK(cudaMalloc((void **)&index_device, N * sizeof(int)));
CHECK(cudaMemcpy(souce_device, source, M * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(res_device, res, N * sizeof(float), cudaMemcpyHostToDevice));
CHECK(cudaMemcpy(index_device, index, N * sizeof(int), cudaMemcpyHostToDevice));
int block_size = 1024;
int grid_size = (N + block_size - 1) / block_size;
float timeTaken = GET_CUDA_TIME((GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N)));
CHECK(cudaGetLastError());
CHECK(cudaDeviceSynchronize());
CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
CHECK(cudaFree(souce_device));
CHECK(cudaFree(index_device));
CHECK(cudaDeviceSynchronize());
CHECK(cudaFree(res_device));
CHECK(cudaDeviceReset());
return timeTaken;
}
This will return the value of the function's time-consuming operation to the Cython file, and then in the Cython fileTime to print in:
# cythonize -i -f
import numpy as np
cimport numpy as np
cimport cython
cdef extern from "<>" nogil:
void *dlopen(const char *, int)
char *dlerror()
void *dlsym(void *, const char *)
int dlclose(void *)
enum:
RTLD_LAZY
ctypedef float (*GatherFunc)(float *source, int *index, float *res, int N, int M) noexcept nogil
cdef void* handle = dlopen('/home/dechin/projects/gitee/dechin/tests/cuda/', RTLD_LAZY)
@(False)
@(False)
cpdef float[:] cuda_gather(float[:] x, int[:] idx):
cdef:
GatherFunc Gather
float timeTaken
int N = [0]
int M = [0]
float[:] res = ((N, ), dtype=np.float32)
Gather = <GatherFunc>dlsym(handle, "Gather")
timeTaken = Gather(&x[0], &idx[0], &res[0], N, M)
print (timeTaken)
return res
while not True:
dlclose(handle)
Finally, it is called through the Python module (no changes are required), and the output result is:
0.6107839941978455
(1048576,)
1048576
The unit here is ms.
Summary
This article mainly introduces a CUDA preparation technique: use CUDA header file to write a macro specifically used for CUDA function runtime statistics, so that the runtime of the target Kernel function can be counted. You can print the corresponding value directly in CUDA, or you can pass it back to Cython or Python for printing.
Copyright Statement
The first link to this article is:/dechinphy/p/
Author ID: DechinPhy
More original articles:/dechinphy/
Please ask the blogger to have coffee:/dechinphy/gallery/image/