You can use
nvprof to quickly check how long each CUDA kernel takes. For example, try this with your CP5 solution:
nvprof ./cp-benchmark 4000 4000 10
It will run the program as usual, but also print a lot of additional information; the first part of the output is relevant here:
==27273== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 94.37%  4.38112s     10  438.11ms  436.83ms  439.17ms  dotprod(int, int, float const *, float*)
  2.12%  98.621ms     10  9.8621ms  9.8375ms  9.9301ms  [CUDA memcpy HtoD]
  2.10%  97.332ms     10  9.7332ms  9.7240ms  9.7427ms  [CUDA memcpy DtoH]
  1.41%  65.331ms     10  6.5331ms  6.4901ms  6.5679ms  normalize(int, int, float*)
==27273== API calls: ...
In this implementation, the key operations and their average per-call times are: the dotprod kernel (438ms), the host-to-device and device-to-host memory copies (9.9ms and 9.7ms), and the normalize kernel (6.5ms).
We can compare these numbers with the total running time of 465ms: the averages sum to roughly 464ms, so these four operations account for almost the entire running time, and dotprod alone is responsible for about 94% of it.
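To double-check this accounting, we can sum the average per-call times from the nvprof output and see what fraction of the work each operation represents; a quick sanity check (the numbers are copied from the profiler output above):

```python
# Average per-call times reported by nvprof, in milliseconds.
times_ms = {
    "dotprod":     438.11,
    "memcpy HtoD":   9.8621,
    "memcpy DtoH":   9.7332,
    "normalize":     6.5331,
}

total = sum(times_ms.values())
print(f"profiled total: {total:.1f} ms")   # ≈ 464.2 ms
for name, t in times_ms.items():
    # Share of the profiled time spent in each operation.
    print(f"{name:>12}: {100 * t / total:5.1f} %")
```

The profiled total is within a millisecond of the measured 465ms, which tells us the wall-clock time is almost entirely explained by these four GPU operations, so optimization effort should go into the dotprod kernel.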