Recall how we identified the relevant part of the assembly code of version 0. Let us now look at the code generated for version 1. The innermost loop is really simple this time, with only 6 instructions:
LOOP: vmovss (%rsi,%rax,4), %xmm0
      vaddss (%rdx,%rax,4), %xmm0, %xmm0
      addq   $1, %rax
      cmpl   %eax, %ebx
      vminss %xmm1, %xmm0, %xmm1
      jg     LOOP
Here are the most interesting instructions, side by side with a slightly modified version of the source code. Register %xmm0 plays the role of variable x, and register %xmm1 plays the role of variable v:
Modified C++ code        Assembly code
x = d[n*i + k];          vmovss (%rsi,%rax,4), %xmm0
x = t[n*j + k] + x;      vaddss (%rdx,%rax,4), %xmm0, %xmm0
v = std::min(v, x);      vminss %xmm1, %xmm0, %xmm1
Let us see if we can understand the results of the benchmarks based on what we see here.
We have been able to identify the key instructions that do the actual relevant arithmetic operations in this code: vaddss and vminss. Now we can look up the latency and throughput of these instructions e.g. in Agner Fog's Instruction tables. In the tables, the relevant section for the processor that we use is Intel Skylake. Instructions vaddss and vminss are not listed separately, but we can find addss and minss in the table. Particularly interesting are the following rows (abbreviated here; latency and reciprocal throughput are in clock cycles):

Instruction    Latency    Reciprocal throughput
addss          4          0.5
minss          4          0.5
Basically, addition and minimum have the same latency, assuming all operands are already in the registers. (This is not the case in our example for the addition instruction, but let us put this detail aside for now.)
A reciprocal throughput of 0.5 clock cycles per operation is the same as a throughput of 2 operations per clock cycle. If there were no other bottlenecks in our code (all data readily available in the CPU registers, all operations independent), we would expect to complete 2 useful operations per clock cycle. This is far from the number 0.50 that we got from our benchmarks.
It turns out that our code contains a dependency chain that makes it latency-limited. If we focus only on the instructions that manipulate register %xmm1, we see that the CPU executes the following sequence of operations:
...
vminss %xmm1, %xmm0, %xmm1
...
vminss %xmm1, %xmm0, %xmm1
...
vminss %xmm1, %xmm0, %xmm1
...
Here each vminss operation depends on the previous vminss operation: each one uses the current value of register %xmm1 as input and stores its result in register %xmm1. Hence we must complete the previous operation to know the new value of %xmm1 before we can start the next one.
The same dependency chain is also visible in the original C++ code. If we focus only on the manipulations of variable v, we see the following sequence of statements:
...
v = std::min(v, z);
...
v = std::min(v, z);
...
v = std::min(v, z);
...
Here each std::min operation depends on the result of the previous std::min operation.
In essence, in each iteration of the innermost loop we have to wait for at least the latency of the vminss operation, which is 4 clock cycles. Put differently, in 4 clock cycles we can only perform 2 useful operations: one addition and one minimum. Even if there were no other bottlenecks, we would expect from this code a performance of at most 0.5 useful operations per clock cycle. And this is almost exactly the number that we got in our benchmarks!
In particular, we now know that getting data from the main memory to the CPU is no longer the bottleneck. Next we will need to focus on how to eliminate the sequential dependency chain in the “min” operations.