Programming Parallel Computers 2020

Help with debugging

nvprof quick start

You can use nvprof to quickly check how long each CUDA kernel takes. For example, try this with your CP5 solution:

nvprof ./cp-benchmark 4000 4000 10

It will run the program as usual, but also print a lot of additional information; the first part of the output is relevant here:

==27273== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 94.37%  4.38112s     10  438.11ms  436.83ms  439.17ms  dotprod(int, int, float const *, float*)
  2.12%  98.621ms     10  9.8621ms  9.8375ms  9.9301ms  [CUDA memcpy HtoD]
  2.10%  97.332ms     10  9.7332ms  9.7240ms  9.7427ms  [CUDA memcpy DtoH]
  1.41%  65.331ms     10  6.5331ms  6.4901ms  6.5679ms  normalize(int, int, float*)

==27273== API calls:
...

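In this implementation, the key operations and their average times per call are the dotprod kernel (about 438 ms), copying data from host to device (about 10 ms), copying data from device to host (about 10 ms), and the normalize kernel (about 7 ms).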

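We can compare these numbers with the total running time of 465 ms: the average times per call in the profile add up to about 464 ms, so the GPU operations account for almost all of the running time, and the dotprod kernel alone dominates it.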
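The comparison can also be done mechanically. The following is a minimal sketch in Python, not part of the course material: it takes the four profile rows shown above, converts the Avg column to milliseconds, and adds them up. The helper names `to_ms` and `average_times` are made up for this illustration, and the script assumes that the 465 ms figure is the running time of one iteration, so that it can be compared with the per-call averages.

```python
import re

# Profile rows copied verbatim from the nvprof output above.
PROFILE = """\
 94.37%  4.38112s     10  438.11ms  436.83ms  439.17ms  dotprod(int, int, float const *, float*)
  2.12%  98.621ms     10  9.8621ms  9.8375ms  9.9301ms  [CUDA memcpy HtoD]
  2.10%  97.332ms     10  9.7332ms  9.7240ms  9.7427ms  [CUDA memcpy DtoH]
  1.41%  65.331ms     10  6.5331ms  6.4901ms  6.5679ms  normalize(int, int, float*)
"""

# Conversion factors from the unit suffixes used by nvprof to milliseconds.
UNITS_TO_MS = {"s": 1000.0, "ms": 1.0, "us": 1e-3, "ns": 1e-6}

# Assumption: the total running time quoted in the text is per iteration.
TOTAL_MS = 465.0


def to_ms(text):
    """Convert a time such as '4.38112s' or '9.8621ms' to milliseconds."""
    match = re.fullmatch(r"([0-9.]+)(s|ms|us|ns)", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return float(match.group(1)) * UNITS_TO_MS[match.group(2)]


def average_times(profile):
    """Return {operation name: average time per call in ms}."""
    result = {}
    for line in profile.splitlines():
        # Columns: Time(%), Time, Calls, Avg, Min, Max, Name.
        # The name may contain spaces, so split at most 6 times.
        fields = line.split(None, 6)
        if len(fields) == 7:
            result[fields[6]] = to_ms(fields[3])
    return result


averages = average_times(PROFILE)
gpu_ms = sum(averages.values())

for name, ms in averages.items():
    print(f"{ms:8.2f} ms  {name}")
print(f"{gpu_ms:8.2f} ms  sum of the averages")
print(f"{TOTAL_MS - gpu_ms:8.2f} ms  of {TOTAL_MS:.0f} ms not covered by the profile")
```

The sum comes out at roughly 464 ms, which leaves less than 1 ms of the total unexplained by kernels and memory copies. To use the script on your own measurements, replace the rows in `PROFILE` and the value of `TOTAL_MS` with the figures from your own run.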