Programming Parallel Computers 2019

Help with debugging

Performance issues

Please always read the task-specific hints first!

Always try to understand which part of the code is the bottleneck, and why. Use performance analysis tools such as perf.

Make sure there is no other load on the machine that you use for benchmarking. Use uptime and top to check the current load and to see who is running what. To find a computer with a low load, see ppc-helpers.

Remember to run make clean after experimenting with debug builds.

Switching off hyper-threading

You can try to avoid the effects of hyper-threading by running your code with 4 threads, each bound to its own physical CPU core, e.g. as follows:

OMP_PROC_BIND=true OMP_NUM_THREADS=4 ./cp-benchmark 4000 4000 10

Naturally you can also combine this with perf:

OMP_PROC_BIND=true OMP_NUM_THREADS=4 perf stat ./cp-benchmark 4000 4000 10

Reading assembly code

The makefiles provide two ways of outputting the assembly code produced by the compiler. For example, if you want to see the compiled version of mf.cc, try the following commands:

make mf.asm1
make mf.asm2

Then open the file mf.asm1 or mf.asm2 in your text editor. Both targets try to produce reasonably readable assembly code; which of the two is easier to read varies from case to case.

Usually the assembly code is very long, and the most challenging part is finding the relevant part of it quickly.

One trick is to simply search for the relevant instruction. For example, you can search for vmulps to find all places in which you are multiplying float8_t vectors.
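A complementary way to narrow the search is to plant recognizable comments in the assembly output with inline asm statements. The sketch below assumes GCC or Clang on x86, where # starts a comment in the assembler syntax, and assumes that float8_t is defined with the compiler's vector extension; the function multiply_all and the marker texts are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// Assumption: float8_t is a vector of 8 floats defined with the GCC/Clang
// vector extension; the definition in your own code may differ.
typedef float float8_t __attribute__((vector_size(8 * sizeof(float))));

// Hypothetical kernel: element-wise product of n vectors.
void multiply_all(const float8_t* a, const float8_t* b, float8_t* out, int n) {
    assert(n >= 0);
    // This statement emits only an assembler comment; search for the
    // text "HOT LOOP BEGIN" in the generated assembly file.
    asm volatile("# HOT LOOP BEGIN");
    for (int i = 0; i < n; ++i) {
        // With AVX enabled, this multiplication is typically a vmulps.
        out[i] = a[i] * b[i];
    }
    asm volatile("# HOT LOOP END");
}
```

The compiler is still free to move or restructure code around the markers, so treat them as approximate landmarks rather than exact boundaries.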

Once you have identified which instructions are executed in the performance-critical part of your code, refer to the Instruction tables to look up the throughput and latency of those instructions. Recall that “Skylake” is the relevant CPU architecture in our case.
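As a rough illustration of how the two numbers are used, the following sketch estimates a lower bound on the running time of a sequence of identical instructions. The helper min_cycles is hypothetical, and the latency and reciprocal throughput values that you pass to it have to be looked up in the tables; the figures used in the usage note are placeholders, not values quoted from the tables.

```cpp
#include <cassert>

// Rough lower bound on the clock cycles needed for `count` instructions
// of one kind. Both latency and reciprocal_throughput are in cycles.
// If every instruction needs the result of the previous one (a dependency
// chain), latency is the limiting factor; if the instructions are
// independent of each other, reciprocal throughput is.
double min_cycles(long count, double latency, double reciprocal_throughput,
                  bool dependent_chain) {
    assert(count >= 0);
    double per_instruction = dependent_chain ? latency : reciprocal_throughput;
    return count * per_instruction;
}
```

For example, with a placeholder latency of 4 cycles and a placeholder reciprocal throughput of 0.5 cycles, min_cycles(1000, 4.0, 0.5, true) gives 4000 cycles, while min_cycles(1000, 4.0, 0.5, false) gives 500 cycles. A large gap between the two suggests that breaking up dependency chains may pay off.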

Timing: manual way

Here is a simple example that shows how you can measure how long each part of the code takes. Note that we also print out the values that we calculate, to make sure that the compiler cannot simply optimize the relevant part away. Here we assume that the calculations take so long that the time spent printing the value is negligible.

#include <sys/time.h>
#include <iostream>

// Current wall-clock time in seconds, with microsecond resolution.
static double get_time() {
    struct timeval tm;
    gettimeofday(&tm, nullptr);
    return static_cast<double>(tm.tv_sec)
        + static_cast<double>(tm.tv_usec) / 1E6;
}

static double calculate1() {...}
static double calculate2() {...}

int main() {
    double t0 = get_time();
    std::cout << calculate1() << std::endl;
    double t1 = get_time();
    std::cout << "calculate1() took " << t1 - t0 << " seconds" << std::endl;
    std::cout << calculate2() << std::endl;
    double t2 = get_time();
    std::cout << "calculate2() took " << t2 - t1 << " seconds" << std::endl;
}
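The same kind of measurement can also be sketched with the standard <chrono> header instead of gettimeofday. This is an alternative, not the approach used in the example above: std::chrono::steady_clock is monotonic, so time differences are not affected if the system clock is adjusted during the run. The names get_time_steady and example_work are hypothetical; example_work only stands in for a real calculation.

```cpp
#include <cassert>
#include <chrono>

// Seconds elapsed since the first call to this function.
double get_time_steady() {
    using clock_type = std::chrono::steady_clock;
    // The reference point is fixed on the first call.
    static const clock_type::time_point start = clock_type::now();
    std::chrono::duration<double> elapsed = clock_type::now() - start;
    return elapsed.count();
}

// Hypothetical stand-in for a real calculation: sum of 1/i for i = 1..n.
double example_work(int n) {
    assert(n >= 0);
    double sum = 0.0;
    for (int i = 1; i <= n; ++i) {
        sum += 1.0 / i;
    }
    return sum;
}
```

It is used like get_time() above: call get_time_steady() before and after the part that you want to measure, print the result of the calculation, and then print the difference of the two time stamps.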

Please remember to remove (or comment out) all debugging printouts and benchmarking code before you submit your final version for grading!

Timing: using our helper classes

In your Git repository, you will find the header file common/stopwatch.h with two helper classes that you can use for timing: ppc::stopwatch can handle up to 32 measurements, while ppc::stopwatch_dynamic can handle any number of measurements. Our makefiles are set up so that simply using #include "stopwatch.h" is enough.

Here is a full example of how to use it:

#include <iostream>
#include "stopwatch.h"

static double calculate1() {...}
static double calculate2() {...}

int main() {
    ppc::stopwatch sw;
    sw.record();
    std::cout << calculate1() << std::endl;
    sw.record();
    std::cout << calculate2() << std::endl;
    sw.record();
    sw.print();
}
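To give an idea of what a helper of this kind does, here is a minimal sketch of a fixed-capacity stopwatch. It is not the actual ppc::stopwatch from common/stopwatch.h, whose implementation and output format may differ; the class mini_stopwatch and its members size and interval are hypothetical names, and only the limit of 32 measurements and the method names record and print are taken from the description above.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>

// Hypothetical sketch, NOT the real ppc::stopwatch.
class mini_stopwatch {
public:
    static constexpr int capacity = 32;

    // Store the current time; returns false if all slots are in use.
    bool record() {
        if (count_ >= capacity) return false;
        times_[count_++] = clock_type::now();
        return true;
    }

    // Number of measurements recorded so far.
    int size() const { return count_; }

    // Seconds between measurement i and measurement i + 1.
    double interval(int i) const {
        assert(i >= 0 && i + 1 < count_);
        std::chrono::duration<double> d = times_[i + 1] - times_[i];
        return d.count();
    }

    // Print the length of each interval between consecutive measurements.
    void print(std::ostream& out = std::cout) const {
        for (int i = 0; i + 1 < count_; ++i) {
            out << "interval " << i << ": " << interval(i) << " s\n";
        }
    }

private:
    using clock_type = std::chrono::steady_clock;
    clock_type::time_point times_[capacity];
    int count_ = 0;
};
```

In this sketch, storing the time stamps in a fixed-size array means that record() never allocates memory, so the act of measuring adds very little work to the code being timed.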