The massive computing capability of GPUs has made them a powerful tool for high-performance computing platforms. However, the parallel architecture of GPUs suffers performance problems under the conditional branching common in GPGPU applications such as MCML. Because GPUs lack sophisticated branch-handling hardware, the SIMT execution model must serialize divergent thread paths, inflicting serious performance degradation.
This paper introduces a mechanism for eliminating thread divergence through CUDA Streams. This software-level optimization remaps threads that would take different branch paths onto separate Streams, allowing the formerly divergent code paths to overlap and thereby improve performance.
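The idea above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes the host has already partitioned the input by branch outcome into two arrays (`dA`/`dB`), so each specialized kernel is branch-free internally and the two kernels can overlap on separate Streams instead of serializing divergent warps. All kernel and function names here are hypothetical.

```cuda
#include <cuda_runtime.h>

// Work formerly on the "then" path of the divergent branch.
__global__ void pathA(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Work formerly on the "else" path.
__global__ void pathB(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Launch each branch-free kernel on its own Stream so the GPU
// scheduler may execute them concurrently.
void launch_remapped(const float *dA, float *dOutA, int nA,
                     const float *dB, float *dOutB, int nB) {
    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    const int T = 256;
    pathA<<<(nA + T - 1) / T, T, 0, sA>>>(dA, dOutA, nA);
    pathB<<<(nB + T - 1) / T, T, 0, sB>>>(dB, dOutB, nB);

    cudaStreamSynchronize(sA);
    cudaStreamSynchronize(sB);
    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
}
```

In a single-kernel version, warps containing threads from both paths would execute the two paths back to back; with the remapping, each kernel is uniform and the Streams expose the two paths to the hardware as independent, overlappable work.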
