Elegantly Implementing sleep on NVIDIA GPUs

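During testing, or in other situations, we may want the GPU to sleep for a while after it finishes a step. This can be done with clock64(), a function in the C programming interface that CUDA provides. Here is the description of clock64() from the CUDA manual: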

when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is greater than the latter since threads are time sliced.

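In other words, clock64() returns the clock cycle count of the SM the thread is running on. Sampling it at the start and at the end of a thread and taking the difference gives the total number of clock cycles the thread took to execute. This is somewhat larger than the number of cycles the thread actually spent executing instructions, because the threads on an SM are time-sliced.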
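The sampling pattern described in the quote can be sketched as follows. This is a minimal illustration rather than a precise profiler: `timed_kernel`, `measure_cycles`, and the arithmetic loop used as a workload are all hypothetical names and choices made for this example, and the compiler is free to schedule instructions around the two samples, so the counts should be read as approximate.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread samples the per-SM cycle counter at entry and exit and
// stores the difference, following the pattern in the quoted passage.
__global__ void timed_kernel(long long *elapsed, float *sink) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    long long start = clock64();

    float x = static_cast<float>(tid);
    for (int i = 0; i < 1000; ++i)   // placeholder workload
        x = x * 1.0001f + 0.5f;

    long long stop = clock64();
    sink[tid] = x;                   // keeps the loop from being optimized away
    elapsed[tid] = stop - start;     // cycles between the two samples
}

// Hypothetical host helper: runs one block of `threads` threads and
// returns the per-thread cycle counts (an empty vector on any CUDA error).
std::vector<long long> measure_cycles(int threads) {
    std::vector<long long> host(threads);
    long long *d_elapsed = nullptr;
    float *d_sink = nullptr;
    if (cudaMalloc(&d_elapsed, threads * sizeof(long long)) != cudaSuccess)
        return {};
    if (cudaMalloc(&d_sink, threads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_elapsed);
        return {};
    }
    timed_kernel<<<1, threads>>>(d_elapsed, d_sink);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), d_elapsed,
                                 threads * sizeof(long long),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_elapsed);
    cudaFree(d_sink);
    if (err != cudaSuccess) return {};
    return host;
}
```

Calling `measure_cycles(32)` from `main` returns one cycle count per thread. Threads in the same warp will typically report similar values, since they execute together.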

So, to implement a delay function on the device elegantly, we call clock64() in device code. Its prototype is long long int clock64(), and the implementation looks like this:

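#define CLOCK_RATE 1695000  /* SM clock in kHz; modify for different devices */

/* Busy-wait on the device for roughly t seconds. */
__device__ void sleep(float t) {
    long long t0 = clock64();  /* clock64() returns long long int, not clock_t */
    long long t1 = t0;
    /* CLOCK_RATE is in kHz, so CLOCK_RATE * 1000 is cycles per second. */
    while ((t1 - t0) / (CLOCK_RATE * 1000.0f) < t)
        t1 = clock64();
}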
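To check that the delay behaves as expected, one option is to call it from a trivial kernel and time the launch from the host. The sketch below is an assumption-laden example, not part of the original code: the device function is renamed `gpu_sleep` (a hypothetical name, chosen to avoid confusion with the host's `sleep()`), `sleep_kernel` and `timed_sleep` are likewise made up here, and the `CLOCK_RATE` value only matches devices that actually run at that frequency.

```cuda
#include <chrono>
#include <cuda_runtime.h>

#define CLOCK_RATE 1695000  /* kHz; assumed value, adjust for your device */

// Same busy-wait idea as the function above, under a hypothetical name.
__device__ void gpu_sleep(float seconds) {
    long long t0 = clock64();
    while ((clock64() - t0) / (CLOCK_RATE * 1000.0f) < seconds) { }
}

__global__ void sleep_kernel(float seconds) {
    gpu_sleep(seconds);
}

// Hypothetical helper: launches a single sleeping thread and returns the
// wall-clock time in seconds the host waited for it (-1.0 on a CUDA error).
double timed_sleep(float seconds) {
    cudaFree(0);  // create the CUDA context up front so it is not timed
    auto begin = std::chrono::steady_clock::now();
    sleep_kernel<<<1, 1>>>(seconds);
    cudaError_t err = cudaDeviceSynchronize();
    auto end = std::chrono::steady_clock::now();
    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double>(end - begin).count();
}
```

If `CLOCK_RATE` matches the device, `timed_sleep(0.5f)` should come back close to half a second plus launch overhead. A large mismatch usually means the constant does not match the actual clock frequency.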

The value of CLOCK_RATE in the code above can be obtained as follows:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  /* query the properties of device 0 */
int clock_rate = prop.clockRate;    /* clock frequency in kHz (an int, not clock_t) */

The clock frequency obtained this way is in kilohertz, so to get a delay measured in seconds the sleep function has to use CLOCK_RATE*1000.0f.
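The hard-coded constant can be avoided by querying the clock rate at run time and passing the delay to the kernel as a cycle count. The following is a sketch under stated assumptions: `spin_kernel`, `seconds_to_cycles`, and `gpu_sleep_for` are hypothetical names, and the clock rate is read with `cudaDeviceGetAttribute` and `cudaDevAttrClockRate`, which reports the same kilohertz value as `prop.clockRate`. Because the reported value is the device's nominal peak clock, while the actual SM clock can vary with boost and throttling, the resulting delay is only approximate.

```cuda
#include <cuda_runtime.h>

// Busy-waits until the SM cycle counter has advanced by `cycles`.
__global__ void spin_kernel(long long cycles) {
    long long t0 = clock64();
    while (clock64() - t0 < cycles) { }
}

// The clock rate is in kHz, so clock_khz * 1000 is cycles per second.
long long seconds_to_cycles(double seconds, int clock_khz) {
    return static_cast<long long>(seconds * clock_khz * 1000.0);
}

// Hypothetical helper: makes the current device spin for about `seconds`.
cudaError_t gpu_sleep_for(double seconds) {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    int clock_khz = 0;
    err = cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, device);
    if (err != cudaSuccess) return err;

    spin_kernel<<<1, 1>>>(seconds_to_cycles(seconds, clock_khz));
    err = cudaGetLastError();          // catches launch failures
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();    // waits for the delay to finish
}
```

As a worked example of the conversion, a 0.5 second delay at 1695000 kHz is 0.5 × 1695000 × 1000 = 847,500,000 cycles. Doing the conversion once on the host also keeps floating-point division out of the device loop.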


zzppq
