Authors: Suejb Memeti, Lu Li, Sabri Pllana, Joanna Kolodziej, Christoph Kessler

Abstract: Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general-purpose CPUs and accelerators (such as GPUs or the Intel Xeon Phi) that provide high performance with suitable energy-consumption characteristics. However, exploiting the available performance of heterogeneous architectures may be challenging. There are various parallel programming frameworks (such as OpenMP, OpenCL, OpenACC, and CUDA), and selecting the one that is suitable for a target context is not straightforward. In this paper, we study empirically the characteristics of OpenMP, OpenACC, OpenCL, and CUDA with respect to programming productivity, performance, and energy. To evaluate programming productivity we use our homegrown tool CodeStat, which enables us to determine the percentage of code lines that was required to parallelize the code using a specific framework. We use x-MeterPU to evaluate the energy consumption and the performance.

We have a new paper published just a few days ago on an OpenCL Monte Carlo photon simulator. From our tests, shown as the inset in Fig. 2 (attached below), we notice a huge speed gap between running the OpenCL version of our code and the CUDA version on most tested NVIDIA GPUs. The CUDA-based simulation speed is about 2x to 5x faster than the OpenCL-based simulation, except on the GTX 1050Ti, where it is 1-to-1. Compared to other papers comparing CUDA and OpenCL, the speed difference found in our study is quite high. We understand that NVIDIA's CUDA driver is more up-to-date than its OpenCL driver; however, we still feel that alone is not enough to explain the difference observed.

The other curious data point is the GTX 1050Ti. This is the only GPU on which OpenCL has speed comparable to CUDA. However, this result comes only after we enabled two control-flow-related optimizations (see the jump from the "x" to the "#" stacked bars for the 1050Ti); enabling these two code blocks makes OpenCL 1.8x faster than without them and pushes the speed to be comparable to CUDA. Very often, a large speed improvement following a small change is the result of fragile compiler predicates. We've reported similar drastic changes of speed in older CUDA drivers. However, for OpenCL on NVIDIA GPUs, we have very limited ways to tell what is happening. There is no OpenCL profiler like nvvp, and -cl-nv-verbose also does not tell much. I am curious what you think about this. Is there any tool or technique we can use to find out why the OpenCL version is slower?

njuffa replied:

The following is quite speculative, not any kind of conclusive analysis. Do you have a roofline performance model for this application? The code may be limited by memory throughput on the GTX 1050 Ti, while it is (partially) limited by computation throughput on GPUs with higher memory bandwidth.

I don't have time to dig into your code in detail, but based on a cursory glance it seems to involve some amount of transcendental functions, in particular trigonometry. NVIDIA pretty much froze OpenCL four or five years ago, while CUDA is being optimized continuously. So the heavy computational load may be (partially) related to those functions and may also contribute to the different speedups between CUDA and OpenCL. However, your specific optimizations seem to relate to thread divergence surrounding the use of the curious mcx_nextafter() function. This may be partially connected to the use of more advanced compiler technology in the CUDA toolchain, which may lead to better handling of possibly divergent branches in CUDA vs. OpenCL. Only a detailed analysis of the generated machine code would be able to confirm or refute that hand-wavy assumption. Can you get at the SASS in the OpenCL environment?

I spotted some code idioms that may not represent best practice. When I see `t = x*M_PI; r = sincos(t)`, it suggests sincospi() should be used instead of sincos() for accuracy and performance. Also, there are expressions like `sin(acos(x))` that suggest they might be replaceable by algebraic computation, unless the intermediate angle is used elsewhere. Profiler stats would also help in understanding the effects of these local code changes, but, alas, those are not available for OpenCL.
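To make those two idioms concrete, here is a minimal CUDA device-code sketch; the function and variable names are hypothetical, not taken from the actual kernels, and an OpenCL port would use sinpi()/cospi(), since standard OpenCL C has no fused sincospi():

```cuda
// Hypothetical illustration of the two idioms flagged above;
// names do not come from the actual kernels.
__device__ void sample_direction(float x, float *s, float *c, float *sin_theta)
{
    // Idiom 1: sin/cos of pi times x.
    // Instead of:
    //   float t = x * (float)M_PI;   // pi is rounded; the error grows in sincos
    //   sincosf(t, s, c);
    // use the pi-aware fused function:
    sincospif(x, s, c);              // computes sin(pi*x) and cos(pi*x)

    // Idiom 2: sin(acos(x)).
    // Instead of two transcendental calls:
    //   *sin_theta = sinf(acosf(x));
    // use the identity sin(acos(x)) = sqrt(1 - x^2), valid for |x| <= 1,
    // when the intermediate angle is not needed elsewhere:
    *sin_theta = sqrtf(fmaxf(0.0f, 1.0f - x * x));
}
```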
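On the question of getting at the SASS in the OpenCL environment: NVIDIA's OpenCL driver returns PTX text when the program "binary" is queried, and that PTX can be taken to SASS offline with the CUDA toolchain. A minimal host-side sketch, assuming the program was built for a single device and omitting error handling:

```cuda
// Dump the OpenCL program "binary" (PTX text on NVIDIA) to a file.
// Assumes `program` was built for exactly one device; error checks omitted.
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

void dump_program_binary(cl_program program, const char *path)
{
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, NULL);

    unsigned char *binary = (unsigned char *)malloc(size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(binary), &binary, NULL);

    FILE *f = fopen(path, "wb");
    fwrite(binary, 1, size, f);
    fclose(f);
    free(binary);
}
```

The dumped PTX can then be assembled and disassembled with, e.g., `ptxas -arch=sm_61 kernel.ptx -o kernel.cubin` followed by `cuobjdump -sass kernel.cubin`. Passing -cl-nv-verbose to clBuildProgram at least adds register and spill counts to the build log, even if, as noted above, it does not tell much beyond that.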
Thanks njuffa, your comments have always been helpful. We know this particular kernel is compute-bound: from the profiler output of the CUDA implementation, memory latency only accounts for 3-4% of the total latency (as opposed to 41% due to execution dependency and 23% due to instruction fetch). For both the OpenCL and CUDA versions, this kernel is largely bounded by registers. The OpenCL version uses 48-58 registers (depending on JIT options and platform) and the CUDA version uses 64+ registers. We also run the kernel with a large number of threads, so the global memory latency was effectively hidden.
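As a back-of-envelope check on "largely bounded by registers": the register file caps the number of resident threads per SM. A minimal sketch, assuming a Pascal-class SM with 65,536 32-bit registers and a 2,048-thread ceiling; real GPUs also round register allocation to a granularity, which lowers these figures:

```cuda
// Upper bound on resident threads per SM from register pressure alone.
// Assumes 65,536 registers and a 2,048-thread hardware ceiling per SM
// (Pascal-class); actual allocation granularity lowers these numbers.
int max_resident_threads(int regs_per_thread)
{
    int by_registers = 65536 / regs_per_thread;
    return by_registers < 2048 ? by_registers : 2048;
}
// max_resident_threads(48) == 1365   (OpenCL build, best case)
// max_resident_threads(58) == 1129   (OpenCL build, worst case)
// max_resident_threads(64) == 1024   (CUDA build)
```

By this estimate the CUDA build, with its higher register count, can keep fewer threads resident than the OpenCL build, so occupancy alone does not explain CUDA's advantage; that is consistent with the profiler data above showing memory latency to be a minor factor.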
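On the earlier roofline question, a sketch of the model itself: attainable throughput is bounded by min(peak compute, arithmetic intensity x memory bandwidth). The arithmetic intensity of this kernel is not reported here, so the values below are datasheet-level placeholders, not measurements:

```cuda
// Roofline bound: attainable GFLOP/s given a kernel's arithmetic
// intensity (FLOPs per byte of DRAM traffic). Peaks come from datasheets;
// the intensity for this particular kernel would have to be measured.
double roofline_gflops(double peak_gflops, double mem_bw_gbs,
                       double flops_per_byte)
{
    double memory_roof = mem_bw_gbs * flops_per_byte;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
// A GTX 1050 Ti (~2,100 GFLOP/s FP32 peak, ~112 GB/s) leaves the memory
// roof once intensity exceeds ~19 FLOP/byte; a compute-bound kernel, as
// reported above, sits on the flat compute roof on every GPU.
```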