Let us first check that the compiler indeed did what we wanted. Our plan was that we would read 3 + 3 vectors, do 3 × 3 pairwise vector additions, and then update 3 × 3 minimums. This is exactly what the compiler gave us:
LOOP:
leaq (%rax,%rdi), %rcx
vmovaps (%rax), %ymm2
addq $32, %rax
vmovaps (%rdx), %ymm3
vmovaps (%rcx,%r9), %ymm1
vmovaps (%rcx,%rsi), %ymm0
leaq (%rdx,%r10), %rcx
addq $32, %rdx
cmpq %r8, %rax
vmovaps (%rcx,%rbx), %ymm14
vmovaps (%rcx,%r11), %ymm13
vaddps %ymm14, %ymm2, %ymm15
vminps %ymm15, %ymm12, %ymm12
vaddps %ymm14, %ymm1, %ymm15
vaddps %ymm14, %ymm0, %ymm14
vminps %ymm15, %ymm11, %ymm11
vminps %ymm14, %ymm10, %ymm10
vaddps %ymm3, %ymm2, %ymm14
vaddps %ymm13, %ymm2, %ymm2
vminps %ymm14, %ymm9, %ymm9
vaddps %ymm3, %ymm1, %ymm14
vaddps %ymm3, %ymm0, %ymm3
vaddps %ymm13, %ymm1, %ymm1
vaddps %ymm13, %ymm0, %ymm0
vminps %ymm14, %ymm8, %ymm8
vminps %ymm3, %ymm7, %ymm7
vminps %ymm2, %ymm6, %ymm6
vminps %ymm1, %ymm5, %ymm5
vminps %ymm0, %ymm4, %ymm4
jne LOOP
We can count 6 vector reads from memory (vmovaps), 9 vector additions (vaddps), and 9 vector minimums (vminps). All intermediate results are kept in the vector registers (%ymm). The only memory accesses in the innermost loop are reads.
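For reference, a C++ inner loop along the following lines produces exactly this instruction mix. This is only a minimal sketch: it assumes GCC/Clang vector extensions, and all names (float8_t, min8, v_inf, vd, vt, block3x3) are illustrative rather than taken verbatim from the code discussed here.

#include <limits>

// 8 floats packed in one 256-bit vector (GCC/Clang vector extension).
typedef float float8_t __attribute__ ((vector_size (8 * sizeof(float))));

// Element-wise minimum; this typically compiles to a single vminps.
static inline float8_t min8(float8_t x, float8_t y) { return x < y ? x : y; }

constexpr float inf = std::numeric_limits<float>::infinity();
const float8_t v_inf = {inf, inf, inf, inf, inf, inf, inf, inf};

// One 3 × 3 block of results: 3 rows of vd against 3 rows of vt,
// each row consisting of nb vectors.
void block3x3(int nb, const float8_t* vd, const float8_t* vt, float8_t out[3][3]) {
    // 9 accumulators that the compiler can keep in registers for the whole loop.
    float8_t vv[3][3];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            vv[i][j] = v_inf;
    for (int kb = 0; kb < nb; ++kb) {
        // 3 + 3 vector reads
        float8_t a0 = vd[0*nb + kb], a1 = vd[1*nb + kb], a2 = vd[2*nb + kb];
        float8_t b0 = vt[0*nb + kb], b1 = vt[1*nb + kb], b2 = vt[2*nb + kb];
        // 3 × 3 additions and 3 × 3 minimum updates
        vv[0][0] = min8(vv[0][0], a0 + b0);
        vv[0][1] = min8(vv[0][1], a0 + b1);
        vv[0][2] = min8(vv[0][2], a0 + b2);
        vv[1][0] = min8(vv[1][0], a1 + b0);
        vv[1][1] = min8(vv[1][1], a1 + b1);
        vv[1][2] = min8(vv[1][2], a1 + b2);
        vv[2][0] = min8(vv[2][0], a2 + b0);
        vv[2][1] = min8(vv[2][1], a2 + b1);
        vv[2][2] = min8(vv[2][2], a2 + b2);
    }
    // Write results to memory only after the innermost loop.
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            out[i][j] = vv[i][j];
}

Keeping the accumulators in a small local array with constant indices is what allows the compiler to hold all of them in registers for the duration of the loop.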
It is also good to note that this code is using as many as 16 vector registers: %ymm4 to %ymm12 hold the 9 running minimums, %ymm0 to %ymm3, %ymm13, and %ymm14 hold the 6 vectors read from memory, and %ymm15 (together with registers whose old values are no longer needed) is reused for the intermediate sums.
There is a little bit of room for saving some registers, but no matter what we do, we will need at least 9 registers for the minimums that we accumulate, plus some number of registers for the values that we read and want to keep around for reuse.
The CPU that we use has only 16 vector registers. Hence, in a sense, the scheme that we used cannot be improved much further. If we tried to calculate a 4 × 4 block of the results by scanning 4 rows and 4 columns, we would run out of registers: the 16 accumulated minimums alone would already occupy all 16 registers, before we account for the 4 + 4 vectors that we read.
You can use Compiler Explorer to try it out:
There is not enough room to keep all intermediate results in registers, so some of them have to be spilled to the stack. This is called register spilling. Hence we have additional memory reads and writes in the innermost loop, which typically hurts the overall performance. However, if there are only a few spills, it may be possible to hide the overhead behind other computations. It may still make sense to try out implementation strategies in which the number of intermediate results is only slightly larger than the number of registers.
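For illustration, the 4 × 4 variant could look roughly like the following sketch, reusing the float8_t, min8, and v_inf definitions from the sketch above (again, all names are illustrative). The constant-bound loops are short enough that the compiler typically unrolls them fully.

// The 4 × 4 variant, reusing float8_t, min8, and v_inf from the sketch above.
void block4x4(int nb, const float8_t* vd, const float8_t* vt, float8_t out[4][4]) {
    float8_t vv[4][4];
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            vv[i][j] = v_inf;
    for (int kb = 0; kb < nb; ++kb) {
        float8_t a[4], b[4];
        for (int i = 0; i < 4; ++i) a[i] = vd[i*nb + kb];  // 4 + 4 vector reads
        for (int j = 0; j < 4; ++j) b[j] = vt[j*nb + kb];
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                vv[i][j] = min8(vv[i][j], a[i] + b[j]);    // 16 additions, 16 minimums
    }
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            out[i][j] = vv[i][j];
}

The write-backs to out happen only after the innermost loop; the extra memory traffic caused by spilling comes from the stack reads and writes that the compiler has to insert inside the loop over kb.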
What would happen if we switched the target architecture to -march=cascadelake?
The Cascade Lake architecture supports AVX-512. Apart from introducing 512-bit vector operands, AVX-512 also doubles the number of vector registers to 32. Even though switching the target architecture does not make the compiler use wider vector data types, it can exploit the additional registers without any code changes. Thus, the 4 × 4 scheme now works without any register spills.
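If we also wanted to benefit from the 512-bit operations, we would have to switch to a wider vector type ourselves. A minimal sketch, assuming the same GCC/Clang vector extensions as above (the name float16_t is illustrative):

// 16 floats packed in one 512-bit vector; with -march=cascadelake the
// compiler can map this to the %zmm registers.
typedef float float16_t __attribute__ ((vector_size (16 * sizeof(float))));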