Here is a question that is often very helpful to ask when doing performance engineering: how fast would our code run if all memory reads were L1 cache hits? That is, how fast would it run if we could somehow arrange things so that all memory reads are as cache-friendly as possible?
If it turns out that the performance improves dramatically, then we know that memory accesses are a bottleneck, and many memory accesses are L1 cache misses, so there is plenty of room for improvement if we could make our code more cache-friendly. On the other hand, if we do not see much improvement, then we know that no amount of tinkering with memory access patterns is going to help: either everything is already coming from L1 cache, or memory accesses are insignificant in comparison with some other bottlenecks, and we can look elsewhere.
But how do we answer this question? Let us study this in the context of our current solution, where the innermost loop looks like this:
for (int ka = 0; ka < na; ++ka) {
    float8_t y0 = vt[na*(jc * nd + 0) + ka];
    float8_t y1 = vt[na*(jc * nd + 1) + ka];
    float8_t y2 = vt[na*(jc * nd + 2) + ka];
    float8_t x0 = vd[na*(ic * nd + 0) + ka];
    float8_t x1 = vd[na*(ic * nd + 1) + ka];
    float8_t x2 = vd[na*(ic * nd + 2) + ka];
    vv[0][0] = min8(vv[0][0], x0 + y0);
    vv[0][1] = min8(vv[0][1], x0 + y1);
    vv[0][2] = min8(vv[0][2], x0 + y2);
    vv[1][0] = min8(vv[1][0], x1 + y0);
    vv[1][1] = min8(vv[1][1], x1 + y1);
    vv[1][2] = min8(vv[1][2], x1 + y2);
    vv[2][0] = min8(vv[2][0], x2 + y0);
    vv[2][1] = min8(vv[2][1], x2 + y1);
    vv[2][2] = min8(vv[2][2], x2 + y2);
}
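Here float8_t is a vector of 8 single-precision floats and min8 computes an elementwise minimum. Their definitions are not shown above, but a minimal sketch of what they might look like, using GCC vector extensions, is the following (the exact definitions in the actual solution may differ):

typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// elementwise minimum of two 8-float vectors
static inline float8_t min8(float8_t x, float8_t y) {
    return x < y ? x : y;
}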
And the corresponding assembly code looks like this:
LOOP:
leaq (%rax,%rdi), %rcx
vmovaps (%rax), %ymm2
addq $32, %rax
vmovaps (%rdx), %ymm3
vmovaps (%rcx,%r9), %ymm1
vmovaps (%rcx,%rsi), %ymm0
leaq (%rdx,%r10), %rcx
addq $32, %rdx
cmpq %r8, %rax
vmovaps (%rcx,%rbx), %ymm14
vmovaps (%rcx,%r11), %ymm13
vaddps %ymm14, %ymm2, %ymm15
vminps %ymm15, %ymm12, %ymm12
vaddps %ymm14, %ymm1, %ymm15
vaddps %ymm14, %ymm0, %ymm14
vminps %ymm15, %ymm11, %ymm11
vminps %ymm14, %ymm10, %ymm10
vaddps %ymm3, %ymm2, %ymm14
vaddps %ymm13, %ymm2, %ymm2
vminps %ymm14, %ymm9, %ymm9
vaddps %ymm3, %ymm1, %ymm14
vaddps %ymm3, %ymm0, %ymm3
vaddps %ymm13, %ymm1, %ymm1
vaddps %ymm13, %ymm0, %ymm0
vminps %ymm14, %ymm8, %ymm8
vminps %ymm3, %ymm7, %ymm7
vminps %ymm2, %ymm6, %ymm6
vminps %ymm1, %ymm5, %ymm5
vminps %ymm0, %ymm4, %ymm4
jne LOOP
There are 6 memory reads (vmovaps), and we would like to somehow, by magic, make sure that all of those are L1 cache hits. Of course achieving this without any other changes is not even physically possible, as there simply is not enough L1 cache space.
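To see the scale of the problem, here is a rough sketch of the arithmetic. It assumes n = 4000, 8-wide vectors, and an L1 data cache of roughly 32 kB per core; the cache size is a typical figure, not something measured here:

#include <cstdio>

int main() {
    long n = 4000;
    long na = (n + 7) / 8;             // float8_t vectors per row of vd or vt
    double per_pass = 6.0 * na * 32;   // bytes read by one pass of the innermost loop
    double total = 2.0 * n * n * 4;    // full vd and vt arrays together
    std::printf("one pass reads %.0f kB, all data is %.0f MB, L1 holds roughly 32 kB\n",
                per_pass / 1e3, total / 1e6);
}

Even a single pass of the innermost loop reads far more data than fits in the L1 cache, and the full input is several thousand times larger still.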
We will here resort to the following strategy: we keep the structure of the innermost loop exactly as it is, but we replace the memory reads with reads that all refer to a tiny region of memory that trivially fits in the L1 cache. The modified code no longer computes anything meaningful, but benchmarking it tells us how fast the original code could run if every memory read were an L1 cache hit.
In particular, we want to produce something where the processor is still doing 6 vector reads from the memory (vmovaps), 9 vector additions (vaddps), and 9 vector minimums (vminps), in a loop that runs the right number of times. How do we achieve that?
It turns out that some care is needed here. Let us first see what does not work…
The first, obvious idea is to just modify the code so that each iteration reads the same 6 values, like this:
for (int ka = 0; ka < na; ++ka) {
    float8_t y0 = vt[0];
    float8_t y1 = vt[1];
    float8_t y2 = vt[2];
    float8_t x0 = vd[0];
    float8_t x1 = vd[1];
    float8_t x2 = vd[2];
    vv[0][0] = min8(vv[0][0], x0 + y0);
    vv[0][1] = min8(vv[0][1], x0 + y1);
    vv[0][2] = min8(vv[0][2], x0 + y2);
    vv[1][0] = min8(vv[1][0], x1 + y0);
    vv[1][1] = min8(vv[1][1], x1 + y1);
    vv[1][2] = min8(vv[1][2], x1 + y2);
    vv[2][0] = min8(vv[2][0], x2 + y0);
    vv[2][1] = min8(vv[2][1], x2 + y1);
    vv[2][2] = min8(vv[2][2], x2 + y2);
}
However, if we compile this code and look at what the compiler generated, the news is bad: the innermost loop gets simplified to something like this:
LOOP:
incl %eax
vminps %ymm15, %ymm8, %ymm8
vminps %ymm14, %ymm7, %ymm7
vminps %ymm13, %ymm6, %ymm6
vminps 160(%rsp), %ymm5, %ymm5
vminps 128(%rsp), %ymm4, %ymm4
vminps %ymm12, %ymm3, %ymm3
vminps %ymm11, %ymm2, %ymm2
vminps %ymm10, %ymm1, %ymm1
vminps %ymm9, %ymm0, %ymm0
cmpl %eax, %r15d
jg LOOP
Here the compiler was clever enough to realize that e.g. x0, y0, and x0 + y0 do not depend at all on the loop counter ka, and hence they can be precomputed outside the loop. We are left with only a sequence of vminps operations, which is not what we want. Benchmarking this piece of code would be meaningless: it would not tell us anything about memory accesses and cache-friendliness.
Here is a somewhat ugly trick that gets the job done: we can do the memory reads through a volatile pointer. Here is an example of how to do it in practice:
volatile float8_t* dummy = vt.data();
#pragma omp parallel for
for (int ic = 0; ic < nc; ++ic) {
    for (int jc = 0; jc < nc; ++jc) {
        float8_t vv[nd][nd];
        for (int id = 0; id < nd; ++id) {
            for (int jd = 0; jd < nd; ++jd) {
                vv[id][jd] = f8infty;
            }
        }
        for (int ka = 0; ka < na; ++ka) {
            float8_t y0 = dummy[0];
            float8_t y1 = dummy[0];
            float8_t y2 = dummy[0];
            float8_t x0 = dummy[0];
            float8_t x1 = dummy[0];
            float8_t x2 = dummy[0];
            vv[0][0] = min8(vv[0][0], x0 + y0);
            vv[0][1] = min8(vv[0][1], x0 + y1);
            vv[0][2] = min8(vv[0][2], x0 + y2);
            vv[1][0] = min8(vv[1][0], x1 + y0);
            vv[1][1] = min8(vv[1][1], x1 + y1);
            vv[1][2] = min8(vv[1][2], x1 + y2);
            vv[2][0] = min8(vv[2][0], x2 + y0);
            vv[2][1] = min8(vv[2][1], x2 + y1);
            vv[2][2] = min8(vv[2][2], x2 + y2);
        }
        for (int id = 0; id < nd; ++id) {
            for (int jd = 0; jd < nd; ++jd) {
                int i = ic * nd + id;
                int j = jc * nd + jd;
                if (i < n && j < n) {
                    r[n*i + j] = hmin8(vv[id][jd]);
                }
            }
        }
    }
}
In essence, each occurrence of dummy[0] is now something that the compiler has to faithfully turn into a memory read, and the compiler cannot assume that these memory reads will produce the same value. In this case the compiler produced the following code for the innermost loop:
LOOP:
vmovaps (%rdi), %ymm2
vmovaps (%rdi), %ymm1
incl %edx
vmovaps (%rdi), %ymm0
vmovaps (%rdi), %ymm4
vmovaps (%rdi), %ymm3
vmovaps (%rdi), %ymm5
vaddps %ymm4, %ymm2, %ymm15
vminps %ymm15, %ymm14, %ymm14
vaddps %ymm4, %ymm1, %ymm15
vaddps %ymm4, %ymm0, %ymm4
vminps %ymm15, %ymm13, %ymm13
vminps %ymm4, %ymm12, %ymm12
vaddps %ymm3, %ymm2, %ymm4
vaddps %ymm5, %ymm2, %ymm2
vminps %ymm4, %ymm11, %ymm11
vaddps %ymm3, %ymm1, %ymm4
vaddps %ymm3, %ymm0, %ymm3
vaddps %ymm5, %ymm1, %ymm1
vaddps %ymm5, %ymm0, %ymm0
vminps %ymm2, %ymm8, %ymm8
vminps %ymm4, %ymm10, %ymm10
vminps %ymm3, %ymm9, %ymm9
vminps %ymm1, %ymm7, %ymm7
vminps %ymm0, %ymm6, %ymm6
cmpl %edx, %r15d
jg LOOP
Understandably, details related to pointer arithmetic are somewhat different, but the spirit of the innermost loop is now there: we do 6 vector reads from the memory (vmovaps), 9 vector additions (vaddps), and 9 vector minimums (vminps), in a loop that runs the right number of times. Furthermore, all memory reads hit the same memory address, so the CPU should have no difficulty keeping that data in the L1 cache.
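As an aside, the effect of the volatile pointer can also be illustrated in isolation. The following small sketch is not part of our solution; it merely contrasts a plain pointer, where the compiler is free to hoist the loop-invariant read out of the loop, with a volatile pointer, where the compiler has to emit one read per iteration:

// with a plain pointer, the read of p[0] is loop-invariant and may be hoisted
float sum_plain(const float* p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += p[0];
    }
    return s;
}

// with a volatile pointer, the compiler must perform the read on every iteration
float sum_volatile(const float* p, int n) {
    volatile const float* vp = p;
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += vp[0];
    }
    return s;
}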
Now we can actually benchmark this and see how fast the code would be if all memory reads hit the L1 cache. Do we get any improvements?
In this case it turns out that the running time indeed improves, to approx. 0.7 seconds, which is already close to what we would expect if the innermost loop did not have any memory reads at all.
In essence, what this means is that in our current solution the arithmetic units of the CPU are still idle part of the time, waiting for data to arrive from memory. We now have two different strategies for further improvement: either reduce the number of memory reads further, or make them more cache-efficient. We will next try out both strategies.
Above we tried out what would happen if all memory reads were L1 cache hits. It is also often insightful to compare with the other extreme: what if all memory reads were complete cache misses?
Here we can do a quick back-of-the-envelope calculation: we need to calculate 4000 × 4000 results, one pass of the innermost loop calculates 3 × 3 results, and one pass reads 6 × 4000 single-precision floating-point numbers, each 4 bytes. In total this is approx. 43 billion single-precision floating-point numbers, or approx. 171 GB of data to read from memory.
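The same estimate, written out as a small sketch that just mechanizes the arithmetic above:

#include <cstdio>

int main() {
    double n = 4000;
    double passes = (n * n) / (3 * 3);   // each pass produces a 3 × 3 block of results
    double floats = passes * 6 * n;      // total single-precision floats read
    double bytes = floats * 4;           // total bytes read from memory
    std::printf("%.0f billion floats, %.0f GB\n", floats / 1e9, bytes / 1e9);
}

This prints approx. 43 billion floats and 171 GB, matching the numbers above.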
The maximum main memory bandwidth of our CPU is 34.1 GB/s, so if all of the reads were complete cache misses, we would expect our code to take at least 5.0 seconds to run, and it is already much faster than that. Hence our current solution certainly already benefits from the cache memory hierarchy, even though we have not paid much attention to it yet. In version 7 we will engineer a solution that tries to push cache efficiency even further, but before that we will first try to reduce the total volume of memory reads.