Let us first check that the compiler indeed did what we wanted. Our plan was that we would read 3 + 3 vectors, do 3 × 3 pairwise vector additions, and then update 3 × 3 minimums. This is exactly what the compiler gave us:
LOOP: leaq (%rax,%rdi), %rcx vmovaps (%rax), %ymm2 addq $32, %rax vmovaps (%rdx), %ymm3 vmovaps (%rcx,%r9), %ymm1 vmovaps (%rcx,%rsi), %ymm0 leaq (%rdx,%r10), %rcx addq $32, %rdx cmpq %r8, %rax vmovaps (%rcx,%rbx), %ymm14 vmovaps (%rcx,%r11), %ymm13 vaddps %ymm14, %ymm2, %ymm15 vminps %ymm15, %ymm12, %ymm12 vaddps %ymm14, %ymm1, %ymm15 vaddps %ymm14, %ymm0, %ymm14 vminps %ymm15, %ymm11, %ymm11 vminps %ymm14, %ymm10, %ymm10 vaddps %ymm3, %ymm2, %ymm14 vaddps %ymm13, %ymm2, %ymm2 vminps %ymm14, %ymm9, %ymm9 vaddps %ymm3, %ymm1, %ymm14 vaddps %ymm3, %ymm0, %ymm3 vaddps %ymm13, %ymm1, %ymm1 vaddps %ymm13, %ymm0, %ymm0 vminps %ymm14, %ymm8, %ymm8 vminps %ymm3, %ymm7, %ymm7 vminps %ymm2, %ymm6, %ymm6 vminps %ymm1, %ymm5, %ymm5 vminps %ymm0, %ymm4, %ymm4 jne LOOP
We can count 6 vector reads from the memory (
vmovaps), 9 vector additions (
vaddps), and 9 vector minimums (
vminps). All intermediate results are kept in the vector registers (
%ymm). The only memory accesses in the innermost loop are reads.
It is also good to note that this code is using as many as 16 vector registers:
There is a little bit of room for saving some registers, but no matter what we do, we will need at least 9 registers for the minimums that we accumulate, plus some number of registers for the values that we read and want to keep around for reuse.
The CPU that we use has only got 16 vector registers. Hence in a sense the scheme that we used cannot be improved much further. If we tried to calculate a 4 × 4 block of the results by scanning 4 rows and 4 columns, we would run out of registers.
The compiler would of course still compile the program, and it would work correctly. However, the performance would not be great; some of the intermediate results would have to be kept in local variables, and there we would have additional memory reads and writes in the innermost loop, which would hurt the overall performance.
Please try it out and see what happens! Try to extend the block size to e.g. 4 × 4, benchmark, and look at the assembly code! And if 4 × 4 is indeed too much, could we find a way to calculate a 3 × 4 block efficiently?