Recall how we identified the relevant part of the assembly code of version 0. Let us now look at the code generated for version 1. The innermost loop is really simple this time, with only 6 instructions:
LOOP:
vmovss (%rsi,%rax,4), %xmm0          # load d[n*i + k] into %xmm0
vaddss (%rdx,%rax,4), %xmm0, %xmm0   # add t[n*j + k]
addq $1, %rax                        # increment the loop counter k
cmpl %eax, %ebx                      # compare k with the loop bound
vminss %xmm1, %xmm0, %xmm1           # keep the running minimum in %xmm1
jg LOOP                              # repeat while k is below the bound
Here are the most interesting instructions, side by side with a slightly modified version of the source code. Register %xmm0 plays the role of variable x, and register %xmm1 plays the role of variable v:
Modified C++ code | Assembly code |
---|---|
x = d[n*i + k]; | vmovss (%rsi,%rax,4), %xmm0 |
x = t[n*j + k] + x; | vaddss (%rdx,%rax,4), %xmm0, %xmm0 |
v = std::min(v, x); | vminss %xmm1, %xmm0, %xmm1 |
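For context, the following is a rough sketch of how these three statements sit inside the innermost loop of version 1; the helper function, its signature, and the initial value of v are illustrative assumptions, not the exact source code.

#include <algorithm>
#include <limits>

// Illustrative sketch only: compute one result element from the input matrix d
// and its transpose t (both n-by-n, row-major), as in version 1.
float innermost(const float* d, const float* t, int n, int i, int j) {
    float v = std::numeric_limits<float>::infinity();
    for (int k = 0; k < n; ++k) {
        float x = d[n*i + k];   // vmovss
        x = t[n*j + k] + x;     // vaddss
        v = std::min(v, x);     // vminss
    }
    return v;
}

The three statements in the table are exactly the loop body; the remaining instructions in the assembly code (addq, cmpl, jg) only implement the loop counter and the branch.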
Let us see if we can understand the results of the benchmarks based on what we see here.
We have been able to identify the key instructions that do the actual relevant arithmetic operations in this code: vaddss and vminss. Now we can look up the latency and throughput of these instructions, e.g., in Agner Fog's Instruction tables.
In the tables, the relevant section for the processor that we use is Intel Skylake. Instructions vaddss and vminss are not listed separately, but we can find addss and minss in the table. Particularly interesting are the following rows (abbreviated here):
Instruction | Latency (clock cycles) | Reciprocal throughput (clock cycles per operation) |
---|---|---|
ADDSS v,v,v | 4 | 0.5 |
MINSS v,v,v | 4 | 0.5 |
Basically, addition and minimum have the same latency, assuming all operands are already in the registers. (This is not the case in our example for the addition instruction, but let us put this detail aside for now.)
A reciprocal throughput of 0.5 clock cycles per operation is the same as a throughput of 2 operations per clock cycle. If there were no other bottlenecks in our code (all data readily available in the CPU registers, and all operations independent), we would expect to complete 2 useful operations per clock cycle. This is far from the number 0.50 that we got from our benchmarks.
It turns out that in our code there is a dependency chain that makes the code latency-limited. If we only focus on the instructions that manipulate the %xmm1 register, we will see that the CPU will execute the following sequence of operations:
...
vminss %xmm1, %xmm0, %xmm1
...
vminss %xmm1, %xmm0, %xmm1
...
vminss %xmm1, %xmm0, %xmm1
...
Each vminss operation depends on the previous vminss operation. Each operation uses the current value of register %xmm1 as input, and stores the result in register %xmm1. Hence, we must complete the previous operation first, to know the new value of %xmm1, before we can start the next operation.
The same dependency chain is also visible in the original C++ code. If we only focus on the manipulations of variable v, we will see the following sequence of operations:
...
v = std::min(v, z);
...
v = std::min(v, z);
...
v = std::min(v, z);
...
Each std::min operation depends on the result of the previous std::min operation.
In essence, in each iteration of the innermost loop we will have to wait for at least the latency of the vminss operation, which is 4 clock cycles. Put differently, in 4 clock cycles we can only perform 2 useful operations: one addition and one minimum. Even if there were no other bottlenecks, we would expect this code to achieve at most 2/4 = 0.5 useful operations per clock cycle. And this is almost exactly the number that we got in our benchmarks!
In particular, now we know that getting data from the main memory to the CPU is no longer the bottleneck. We will need to focus next on how to eliminate the sequential dependency chain in the “min” operations.
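To give a flavor of what that could look like, here is a minimal sketch that keeps several independent running minima and combines them at the end. This is only an illustration of the idea, under the simplifying assumption that n is divisible by 4, and not necessarily the approach taken in the next version.

#include <algorithm>
#include <limits>

// Sketch: break the sequential dependency chain with four independent
// accumulators, so that four "min" operations can be in flight at once.
// Assumes n is divisible by 4; purely illustrative.
float innermost_ilp(const float* d, const float* t, int n, int i, int j) {
    float v0 = std::numeric_limits<float>::infinity();
    float v1 = v0, v2 = v0, v3 = v0;
    for (int k = 0; k < n; k += 4) {
        v0 = std::min(v0, d[n*i + k + 0] + t[n*j + k + 0]);
        v1 = std::min(v1, d[n*i + k + 1] + t[n*j + k + 1]);
        v2 = std::min(v2, d[n*i + k + 2] + t[n*j + k + 2]);
        v3 = std::min(v3, d[n*i + k + 3] + t[n*j + k + 3]);
    }
    return std::min(std::min(v0, v1), std::min(v2, v3));
}

Since the four chains are independent, the 4-cycle latency of each individual minimum no longer forces the next one to wait, and the loop can move closer to the throughput limit of the instructions.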
We provide a slightly modified version on Compiler Explorer, which can be found here. To reduce the amount of boilerplate code generated, this version does not contain the OpenMP directives, and it only provides the part of step that implements the main loop, skipping over the generation of the transposed matrix.