Location>code7788 >text

CUDA duration statistics

Popularity:934 ℃/2025-02-28 09:42:14

Technical background

The previous article introduced inCUDA uses macros to monitor the operation of CUDA C function or Kernel function errorsquestion. By the same idea, we can write a macro to count the runtime of a function, so that we do not need to use additional tools to test the performance of the function body.

Document preparation

Because the macro changes here mainly involve the modification of CUDA header files and CUDA files, we still reuse the Cython files and Python files and exception capture macros.This articleUsed in it. The test content is to define a raw array and an index array, and output the indexed result array.

# cythonize -i -f 

import numpy as np
cimport numpy as np
cimport cython

cdef extern from "<>" nogil:
    void *dlopen(const char *, int)
    char *dlerror()
    void *dlsym(void *, const char *)
    int dlclose(void *)
    enum:
        RTLD_LAZY

ctypedef int (*GatherFunc)(float *source, int *index, float *res, int N, int M) noexcept nogil

cdef void* handle = dlopen('/path/to/', RTLD_LAZY)

@(False)
@(False)
cpdef float[:] cuda_gather(float[:] x, int[:] idx):
    cdef:
        GatherFunc Gather
        int success
        int N = [0]
        int M = [0]
        float[:] res = ((N, ), dtype=np.float32)
    Gather = <GatherFunc>dlsym(handle, "Gather")
    success = Gather(&x[0], &idx[0], &res[0], N, M)
    return res

while not True:
    dlclose(handle)

test_gather.py

import numpy as np
(0)
from wrapper import cuda_gather

M = 1024 * 1024 * 128
N = 1024 * 1024
x = ((M,)).astype(np.float32)
idx = (0, M, (N,)).astype(np.int32)
res = (cuda_gather(x, idx))
print ()
print ((res==x[idx]).sum())

#pragma once
#include <>

#define CHECK(call) do{const cudaError_t error_code = call; if (error_code != cudaSuccess){printf("CUDA Error:\n"); printf("    File:   %s\n", __FILE__); printf("    Line:   %d\n", __LINE__); printf("    Error code: %d\n", error_code); printf("    Error text: %s\n", cudaGetErrorString(error_code)); exit(1);}} while (0)

Timing Macro

Here is a timerA header file, write aTIME_CUDA_FUNCTIONThe macro, and then call it before the function that needs to be counted in CUDA, can output the run time of the CUDA function.

#pragma once
 #include <>
 #include <cuda_runtime.h>

 // Macro definition, used to measure the execution time of CUDA functions
 #define TIME_CUDA_FUNCTION(func) \
     do { \
         cudaEvent_t start, stop; \
         float elapsedTime; \
         cudaEventCreate(&start); \
         cudaEventCreate(&stop); \
         cudaEventRecord(start, NULL); \
         \
         func; \
         \
         cudaEventRecord(stop, NULL); \
         cudaEventSynchronize(stop); \
         cudaEventElapsedTime(&elapsedTime, start, stop); \
         printf("Time taken by function %s is: %f ms\n", #func, elapsedTime); \
         \
         cudaEventDestroy(start); \
         cudaEventDestroy(stop); \
     } while (0)

Use of timing macros

We're in CUDA filescuda_index.cuCalled inThe timing macro inside is used to count the execution time of a CUDA kernel function:

// nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
#include <>
#include "cuda_index.cuh"
#include ""
#include ""

void __global__ GatherKernel(float *source, int *index, float *res, int N){
    int idx =  *  + ;
    if (idx < N){
        res[idx] = source[index[idx]];
    }
}

extern "C" int Gather(float *source, int *index, float *res, int N, int M){
    float *souce_device, *res_device;
    int *index_device;
    CHECK(cudaMalloc((void **)&souce_device, M * sizeof(float)));
    CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&index_device, N * sizeof(int)));
    CHECK(cudaMemcpy(souce_device, source, M * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(res_device, res, N * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(index_device, index, N * sizeof(int), cudaMemcpyHostToDevice));
    int block_size = 1024;
    int grid_size = (N + block_size - 1) / block_size;
    TIME_CUDA_FUNCTION((GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N)));
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(souce_device));
    CHECK(cudaFree(index_device));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(res_device));
    CHECK(cudaDeviceReset());
    return 1;
}

It should be noted thatTIME_CUDA_FUNCTIONA macro can only have one input, but when using the CUDA kernel function, it will actually be treated as two inputs, so we need to encapsulate the CUDA kernel function in brackets.

Output result

Finally, according toThis articleThe running process in this way can obtain the output result like this:

Time taken by function (GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N)) is: 0.584224 ms
(1048576,)
1048576

Here, the run time of the CUDA kernel function is correctly formatted and output.

Returns the time-consuming value

In addition to printing time-consuming values ​​directly in CUDA, we can also modify themmacros in, let them return time-consuming values:

#pragma once
 #include <>
 #include <cuda_runtime.h>

 // Macro definition, used to measure the execution time of CUDA functions
 #define TIME_CUDA_FUNCTION(func) \
     do { \
         cudaEvent_t start, stop; \
         float elapsedTime; \
         cudaEventCreate(&start); \
         cudaEventCreate(&stop); \
         cudaEventRecord(start, NULL); \
         \
         func; \
         \
         cudaEventRecord(stop, NULL); \
         cudaEventSynchronize(stop); \
         cudaEventElapsedTime(&elapsedTime, start, stop); \
         printf("Time taken by function %s is: %f ms\n", #func, elapsedTime); \
         \
         cudaEventDestroy(start); \
         cudaEventDestroy(stop); \
     } while (0)

 // Macro definition, used to measure the execution time of the CUDA function and return the time
 #define GET_CUDA_TIME(func) \
     ({ \
         cudaEvent_t start, stop; \
         float elapsedTime = 0.0f; \
         cudaEventCreate(&start); \
         cudaEventCreate(&stop); \
         cudaEventRecord(start, NULL); \
         \
         func; \
         \
         cudaEventRecord(stop, NULL); \
         cudaEventSynchronize(stop); \
         cudaEventElapsedTime(&elapsedTime, start, stop); \
         \
         cudaEventDestroy(start); \
         cudaEventDestroy(stop); \
         \
         elapsedTime; \
     })

Modify the header filecuda_index.cuh, because here we need to return a runtime float value, which is no longer an int type:

#include <>

extern "C" float Gather(float *source, int *index, float *res, int N, int M);

Finally, the corresponding modificationcuda_index.cuContents in:

// nvcc -shared ./cuda_index.cu -Xcompiler -fPIC -o ./
#include <>
#include "cuda_index.cuh"
#include ""
#include ""

void __global__ GatherKernel(float *source, int *index, float *res, int N){
    int idx =  *  + ;
    if (idx < N){
        res[idx] = source[index[idx]];
    }
}

extern "C" float Gather(float *source, int *index, float *res, int N, int M){
    float *souce_device, *res_device;
    int *index_device;
    CHECK(cudaMalloc((void **)&souce_device, M * sizeof(float)));
    CHECK(cudaMalloc((void **)&res_device, N * sizeof(float)));
    CHECK(cudaMalloc((void **)&index_device, N * sizeof(int)));
    CHECK(cudaMemcpy(souce_device, source, M * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(res_device, res, N * sizeof(float), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(index_device, index, N * sizeof(int), cudaMemcpyHostToDevice));
    int block_size = 1024;
    int grid_size = (N + block_size - 1) / block_size;
    float timeTaken = GET_CUDA_TIME((GatherKernel<<<grid_size, block_size>>>(souce_device, index_device, res_device, N)));
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaMemcpy(res, res_device, N * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(souce_device));
    CHECK(cudaFree(index_device));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(res_device));
    CHECK(cudaDeviceReset());
    return timeTaken;
}

This will return the value of the function's time-consuming operation to the Cython file, and then in the Cython fileTime to print in:

# cythonize -i -f 

import numpy as np
cimport numpy as np
cimport cython

cdef extern from "<>" nogil:
    void *dlopen(const char *, int)
    char *dlerror()
    void *dlsym(void *, const char *)
    int dlclose(void *)
    enum:
        RTLD_LAZY

ctypedef float (*GatherFunc)(float *source, int *index, float *res, int N, int M) noexcept nogil

cdef void* handle = dlopen('/home/dechin/projects/gitee/dechin/tests/cuda/', RTLD_LAZY)

@(False)
@(False)
cpdef float[:] cuda_gather(float[:] x, int[:] idx):
    cdef:
        GatherFunc Gather
        float timeTaken
        int N = [0]
        int M = [0]
        float[:] res = ((N, ), dtype=np.float32)
    Gather = <GatherFunc>dlsym(handle, "Gather")
    timeTaken = Gather(&x[0], &idx[0], &res[0], N, M)
    print (timeTaken)
    return res

while not True:
    dlclose(handle)

Finally, it is called through the Python module (no changes are required), and the output result is:

0.6107839941978455
(1048576,)
1048576

The unit here is ms.

Summary

This article mainly introduces a CUDA preparation technique: use CUDA header file to write a macro specifically used for CUDA function runtime statistics, so that the runtime of the target Kernel function can be counted. You can print the corresponding value directly in CUDA, or you can pass it back to Cython or Python for printing.

Copyright Statement

The first link to this article is:/dechinphy/p/

Author ID: DechinPhy

More original articles:/dechinphy/

Please ask the blogger to have coffee:/dechinphy/gallery/image/