Here is a question that is often very helpful to ask when doing performance engineering: how fast would our code run if all memory reads were L1 cache hits? That is, how fast would it run if we could somehow arrange things so that all memory reads are as cache-friendly as possible?
If it turns out that the performance improves dramatically, then we know that memory accesses are a bottleneck, and many memory accesses are L1 cache misses, so there is plenty of room for improvement if we could make our code more cache-friendly. On the other hand, if we do not see much improvement, then we know that no amount of tinkering with memory access patterns is going to help: either everything is already coming from L1 cache, or memory accesses are insignificant in comparison with some other bottlenecks, and we can look elsewhere.
But how do we answer this question? Let us study this in the context of our current solution, where the innermost loop looks like this:
for (int ka = 0; ka < na; ++ka) {
    float8_t y0 = vt[na*(jc * nd + 0) + ka];
    float8_t y1 = vt[na*(jc * nd + 1) + ka];
    float8_t y2 = vt[na*(jc * nd + 2) + ka];
    float8_t x0 = vd[na*(ic * nd + 0) + ka];
    float8_t x1 = vd[na*(ic * nd + 1) + ka];
    float8_t x2 = vd[na*(ic * nd + 2) + ka];
    vv[0][0] = min8(vv[0][0], x0 + y0);
    vv[0][1] = min8(vv[0][1], x0 + y1);
    vv[0][2] = min8(vv[0][2], x0 + y2);
    vv[1][0] = min8(vv[1][0], x1 + y0);
    vv[1][1] = min8(vv[1][1], x1 + y1);
    vv[1][2] = min8(vv[1][2], x1 + y2);
    vv[2][0] = min8(vv[2][0], x2 + y0);
    vv[2][1] = min8(vv[2][1], x2 + y1);
    vv[2][2] = min8(vv[2][2], x2 + y2);
}
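Here float8_t is a vector of 8 single-precision floats and min8 computes an elementwise minimum. Their definitions are not shown above, but a minimal sketch of what they might look like, using GCC vector extensions, is the following (the exact definitions in the actual solution may differ):

typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// elementwise minimum of two 8-float vectors
static inline float8_t min8(float8_t x, float8_t y) {
    return x < y ? x : y;
}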
And the corresponding assembly code looks like this:
LOOP:
leaq (%rax,%rdi), %rcx
vmovaps (%rax), %ymm2
addq $32, %rax
vmovaps (%rdx), %ymm3
vmovaps (%rcx,%r9), %ymm1
vmovaps (%rcx,%rsi), %ymm0
leaq (%rdx,%r10), %rcx
addq $32, %rdx
cmpq %r8, %rax
vmovaps (%rcx,%rbx), %ymm14
vmovaps (%rcx,%r11), %ymm13
vaddps %ymm14, %ymm2, %ymm15
vminps %ymm15, %ymm12, %ymm12
vaddps %ymm14, %ymm1, %ymm15
vaddps %ymm14, %ymm0, %ymm14
vminps %ymm15, %ymm11, %ymm11
vminps %ymm14, %ymm10, %ymm10
vaddps %ymm3, %ymm2, %ymm14
vaddps %ymm13, %ymm2, %ymm2
vminps %ymm14, %ymm9, %ymm9
vaddps %ymm3, %ymm1, %ymm14
vaddps %ymm3, %ymm0, %ymm3
vaddps %ymm13, %ymm1, %ymm1
vaddps %ymm13, %ymm0, %ymm0
vminps %ymm14, %ymm8, %ymm8
vminps %ymm3, %ymm7, %ymm7
vminps %ymm2, %ymm6, %ymm6
vminps %ymm1, %ymm5, %ymm5
vminps %ymm0, %ymm4, %ymm4
jne LOOP
There are 6 memory reads (vmovaps), and we would like to somehow, by magic, make sure that all of those are L1 cache hits. Of course achieving this without any other changes is not even physically possible, as there simply is not enough L1 cache space.
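To see the scale of the problem, here is a rough sketch of the arithmetic. It assumes n = 4000, 8-wide vectors, and an L1 data cache of roughly 32 kB per core; the cache size is a typical figure, not something measured here:

#include <cstdio>

int main() {
    long n = 4000;
    long na = (n + 7) / 8;             // float8_t vectors per row of vd or vt
    double per_pass = 6.0 * na * 32;   // bytes read by one pass of the innermost loop
    double total = 2.0 * n * n * 4;    // full vd and vt arrays together
    std::printf("one pass reads %.0f kB, all data is %.0f MB, L1 holds roughly 32 kB\n",
                per_pass / 1e3, total / 1e6);
}

Even a single pass of the innermost loop reads far more data than fits in the L1 cache, and the full input is several thousand times larger still.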
We will here resort to the following strategy: we keep the structure of the innermost loop exactly as it is, but we replace the memory reads with reads that all refer to a tiny region of memory that trivially fits in the L1 cache. The modified code no longer computes anything meaningful, but benchmarking it tells us how fast the original code could run if every memory read were an L1 cache hit.
In particular, we want to produce something where the processor is still doing 6 vector reads from the memory (vmovaps), 9 vector additions (vaddps), and 9 vector minimums (vminps), in a loop that runs the right number of times. How do we achieve that?
It turns out that some care is needed here. Let us first see what does not work…
The first, obvious idea is to just modify the code so that each iteration reads the same 6 values, like this:
for (int ka = 0; ka < na; ++ka) {
    float8_t y0 = vt[0];
    float8_t y1 = vt[1];
    float8_t y2 = vt[2];
    float8_t x0 = vd[0];
    float8_t x1 = vd[1];
    float8_t x2 = vd[2];
    vv[0][0] = min8(vv[0][0], x0 + y0);
    vv[0][1] = min8(vv[0][1], x0 + y1);
    vv[0][2] = min8(vv[0][2], x0 + y2);
    vv[1][0] = min8(vv[1][0], x1 + y0);
    vv[1][1] = min8(vv[1][1], x1 + y1);
    vv[1][2] = min8(vv[1][2], x1 + y2);
    vv[2][0] = min8(vv[2][0], x2 + y0);
    vv[2][1] = min8(vv[2][1], x2 + y1);
    vv[2][2] = min8(vv[2][2], x2 + y2);
}
However, if we compile this code and look at what the compiler generated, the news is bad: the innermost loop gets simplified to something like this:
LOOP:
incl %eax
vminps %ymm15, %ymm8, %ymm8
vminps %ymm14, %ymm7, %ymm7
vminps %ymm13, %ymm6, %ymm6
vminps 160(%rsp), %ymm5, %ymm5
vminps 128(%rsp), %ymm4, %ymm4
vminps %ymm12, %ymm3, %ymm3
vminps %ymm11, %ymm2, %ymm2
vminps %ymm10, %ymm1, %ymm1
vminps %ymm9, %ymm0, %ymm0
cmpl %eax, %r15d
jg LOOP
Here the compiler was clever enough to realize that e.g. x0, y0, and x0 + y0 do not depend at all on the loop counter ka, and hence they can be precomputed outside the loop. We are left with only a sequence of vminps operations, which is not what we want. Benchmarking this piece of code would be meaningless: it would not tell us anything about memory accesses and cache-friendliness.
Here is a somewhat ugly trick that gets the job done: we can do the memory reads through a volatile pointer. Here is an example of how to do it in practice:
volatile float8_t* dummy = vt.data();
#pragma omp parallel for
for (int ic = 0; ic < nc; ++ic) {
    for (int jc = 0; jc < nc; ++jc) {
        float8_t vv[nd][nd];
        for (int id = 0; id < nd; ++id) {
            for (int jd = 0; jd < nd; ++jd) {
                vv[id][jd] = f8infty;
            }
        }
        for (int ka = 0; ka < na; ++ka) {
            float8_t y0 = dummy[0];
            float8_t y1 = dummy[0];
            float8_t y2 = dummy[0];
            float8_t x0 = dummy[0];
            float8_t x1 = dummy[0];
            float8_t x2 = dummy[0];
            vv[0][0] = min8(vv[0][0], x0 + y0);
            vv[0][1] = min8(vv[0][1], x0 + y1);
            vv[0][2] = min8(vv[0][2], x0 + y2);
            vv[1][0] = min8(vv[1][0], x1 + y0);
            vv[1][1] = min8(vv[1][1], x1 + y1);
            vv[1][2] = min8(vv[1][2], x1 + y2);
            vv[2][0] = min8(vv[2][0], x2 + y0);
            vv[2][1] = min8(vv[2][1], x2 + y1);
            vv[2][2] = min8(vv[2][2], x2 + y2);
        }
        for (int id = 0; id < nd; ++id) {
            for (int jd = 0; jd < nd; ++jd) {
                int i = ic * nd + id;
                int j = jc * nd + jd;
                if (i < n && j < n) {
                    r[n*i + j] = hmin8(vv[id][jd]);
                }
            }
        }
    }
}
In essence, each occurrence of dummy[0] is now something that the compiler has to faithfully turn into a memory read, and the compiler cannot assume that these memory reads will produce the same value. In this case the compiler produced the following code for the innermost loop:
LOOP:
vmovaps (%rdi), %ymm2
vmovaps (%rdi), %ymm1
incl %edx
vmovaps (%rdi), %ymm0
vmovaps (%rdi), %ymm4
vmovaps (%rdi), %ymm3
vmovaps (%rdi), %ymm5
vaddps %ymm4, %ymm2, %ymm15
vminps %ymm15, %ymm14, %ymm14
vaddps %ymm4, %ymm1, %ymm15
vaddps %ymm4, %ymm0, %ymm4
vminps %ymm15, %ymm13, %ymm13
vminps %ymm4, %ymm12, %ymm12
vaddps %ymm3, %ymm2, %ymm4
vaddps %ymm5, %ymm2, %ymm2
vminps %ymm4, %ymm11, %ymm11
vaddps %ymm3, %ymm1, %ymm4
vaddps %ymm3, %ymm0, %ymm3
vaddps %ymm5, %ymm1, %ymm1
vaddps %ymm5, %ymm0, %ymm0
vminps %ymm2, %ymm8, %ymm8
vminps %ymm4, %ymm10, %ymm10
vminps %ymm3, %ymm9, %ymm9
vminps %ymm1, %ymm7, %ymm7
vminps %ymm0, %ymm6, %ymm6
cmpl %edx, %r15d
jg LOOP
Understandably, details related to pointer arithmetic are somewhat different, but the spirit of the innermost loop is now there: we do 6 vector reads from the memory (vmovaps), 9 vector additions (vaddps), and 9 vector minimums (vminps), in a loop that runs the right number of times. Furthermore, all memory reads hit the same memory address, so the CPU should have no difficulty keeping that data in the L1 cache.
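As an aside, the effect of the volatile pointer can also be illustrated in isolation. The following small sketch is not part of our solution; it merely contrasts a plain pointer, where the compiler is free to hoist the loop-invariant read out of the loop, with a volatile pointer, where the compiler has to emit one read per iteration:

// with a plain pointer, the read of p[0] is loop-invariant and may be hoisted
float sum_plain(const float* p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += p[0];
    }
    return s;
}

// with a volatile pointer, the compiler must perform the read on every iteration
float sum_volatile(const float* p, int n) {
    volatile const float* vp = p;
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += vp[0];
    }
    return s;
}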
Now we can actually benchmark this and see how fast the code would be if all memory reads hit the L1 cache. Do we get any improvements?
In this case it turns out that the running time indeed improves, to approx. 0.7 seconds, which is already close to what we would expect if the innermost loop did not have any memory reads at all.
In essence, what this means is that in our current solution the arithmetic units of the CPU are still idle part of the time, waiting for data to arrive from memory. We now have two different strategies for further improvement: either reduce the number of memory reads further, or make them more cache-efficient. We will next try out both strategies.
Above we tried out what would happen if all memory reads were L1 cache hits. It is also often insightful to compare with the other extreme: what if all memory reads were complete cache misses?
Here we can do a quick back-of-the-envelope calculation: we need to calculate 4000 × 4000 results, one pass of the innermost loop calculates 3 × 3 results, and one pass reads 6 × 4000 single-precision floating-point numbers, each 4 bytes. In total this is approx. 43 billion single-precision floating-point numbers, or approx. 171 GB of data to read from memory.
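The same estimate, written out as a small sketch that just mechanizes the arithmetic above:

#include <cstdio>

int main() {
    double n = 4000;
    double passes = (n * n) / (3 * 3);   // each pass produces a 3 × 3 block of results
    double floats = passes * 6 * n;      // total single-precision floats read
    double bytes = floats * 4;           // total bytes read from memory
    std::printf("%.0f billion floats, %.0f GB\n", floats / 1e9, bytes / 1e9);
}

This prints approx. 43 billion floats and 171 GB, matching the numbers above.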
The maximum main memory bandwidth of our CPU is 34.1 GB/s, so if all of the reads were complete cache misses, we would expect our code to take at least 5.0 seconds to run, and it is already much faster than that. Hence our current solution certainly already benefits from the cache memory hierarchy, even though we have not paid much attention to it yet. In version 7 we will engineer a solution that tries to push cache efficiency even further, but before that we will first try to reduce the total volume of memory reads.