Programming Parallel Computers

Chapter 2: Case study

Version 5: Assembly code [advanced]

The assembly code of the innermost loop is very straightforward. Only 3 instructions deal with the loop counter (addq, cmpq, and jne). There are 2 memory accesses (vmovaps), 4 operations that permute the elements (3 vpermilps and 1 vperm2f128), and then 8 + 8 useful vector operations (8 vaddps and 8 vminps).

LOOP:
           vmovaps    (%rdx,%rax), %ymm2
           vmovaps    (%r12,%rax), %ymm3
           addq       $32, %rax
           vpermilps  $177, %ymm2, %ymm0
           cmpq       %rax, %rcx
           vperm2f128 $1, %ymm3, %ymm3, %ymm13
           vaddps     %ymm2, %ymm3, %ymm15
           vpermilps  $78, %ymm3, %ymm14
           vaddps     %ymm0, %ymm3, %ymm3
           vpermilps  $78, %ymm13, %ymm1
           vminps     %ymm15, %ymm11, %ymm11
           vminps     %ymm3, %ymm7, %ymm7
           vaddps     %ymm14, %ymm2, %ymm3
           vaddps     %ymm14, %ymm0, %ymm14
           vminps     %ymm3, %ymm10, %ymm10
           vaddps     %ymm13, %ymm2, %ymm3
           vaddps     %ymm13, %ymm0, %ymm13
           vaddps     %ymm1, %ymm2, %ymm2
           vaddps     %ymm1, %ymm0, %ymm0
           vminps     %ymm14, %ymm6, %ymm6
           vminps     %ymm3, %ymm9, %ymm9
           vminps     %ymm13, %ymm5, %ymm5
           vminps     %ymm2, %ymm8, %ymm8
           vminps     %ymm0, %ymm4, %ymm4
           jne        LOOP
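
The instruction counts quoted before the listing are easy to check mechanically. The following is a minimal sketch in Python, not part of the original material: the mnemonics are copied by hand from the listing, and the grouping into categories (the `CATEGORY` table and the `tally` helper) is an assumption made here purely for illustration.

```python
from collections import Counter

# Mnemonics of the innermost loop, copied in order from the listing.
LOOP_MNEMONICS = """
vmovaps vmovaps addq vpermilps cmpq vperm2f128 vaddps vpermilps vaddps
vpermilps vminps vminps vaddps vaddps vminps vaddps vaddps vaddps vaddps
vminps vminps vminps vminps vminps jne
""".split()

# Hypothetical grouping that mirrors the prose description of the loop.
CATEGORY = {
    "vmovaps": "memory access",
    "vpermilps": "permutation",
    "vperm2f128": "permutation",
    "vaddps": "useful arithmetic",
    "vminps": "useful arithmetic",
    "addq": "loop control",
    "cmpq": "loop control",
    "jne": "loop control",
}


def tally(mnemonics):
    """Return (count per mnemonic, count per category)."""
    per_instruction = Counter(mnemonics)
    per_category = Counter(CATEGORY[m] for m in mnemonics)
    return per_instruction, per_category


if __name__ == "__main__":
    per_instruction, per_category = tally(LOOP_MNEMONICS)
    for name, count in per_category.items():
        print(f"{name:18} {count}")
    print(f"{'total':18} {len(LOOP_MNEMONICS)}")
```

Of the 25 instructions in the loop body, 16 are the vaddps and vminps operations that do the actual work; the remaining 9 are overhead.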

Analysis

Let us look more carefully at the instructions in the innermost loop and how the CPU might schedule them. Here is a table of the instructions, how many times each occurs in the loop, and the execution ports that they could use (again derived from Agner Fog's Instruction tables):

Instruction   Count   Execution ports
vaddps        8       0, 1
vminps        8       0, 1
vmovaps       2       2, 3
vpermilps     3       5
vperm2f128    1       5
addq          1       0, 1, 5, 6
cmpq          1       0, 1, 5, 6
jne           1       6

If the vaddps and vminps instructions keep execution ports 0 and 1 busy for 8 clock cycles, we can see that there are plenty of execution ports that the other instructions could use. For example, vpermilps and vperm2f128 could use port 5 for 4 cycles, vmovaps could use ports 2 and 3 for 1 cycle, and the remaining operations could use port 6 for 3 cycles. While this is a rather simplified view of the internal workings of the CPU, this suggests that there is a good reason to expect near-100% efficiency: the relevant instructions vaddps and vminps are the bottleneck here and everything else should be easy to do on the side.
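
The same back-of-the-envelope reasoning can be written down as a small calculation. The sketch below is an illustration added here, not a model of the real scheduler: it assumes that each port starts at most one instruction per clock cycle, and it ignores latencies, data dependencies, and details such as fusion of cmpq and jne. Under those assumptions, any set of ports must absorb every instruction that can run only on those ports, which gives a lower bound on the number of cycles per iteration. The names `LOOP` and `port_bound` are hypothetical.

```python
from itertools import combinations

# (count, allowed execution ports) for each instruction, as in the table.
LOOP = {
    "vaddps": (8, {0, 1}),
    "vminps": (8, {0, 1}),
    "vmovaps": (2, {2, 3}),
    "vpermilps": (3, {5}),
    "vperm2f128": (1, {5}),
    "addq": (1, {0, 1, 5, 6}),
    "cmpq": (1, {0, 1, 5, 6}),
    "jne": (1, {6}),
}


def port_bound(instructions):
    """Lower bound on cycles per iteration from port pressure alone.

    For every subset of ports, count the instructions that can only run
    on ports inside that subset and divide by the subset size. The
    largest such ratio is the bound.
    """
    ports = sorted(set().union(*(p for _, p in instructions.values())))
    best = 0.0
    for size in range(1, len(ports) + 1):
        for subset in combinations(ports, size):
            allowed = set(subset)
            confined = sum(
                count
                for count, usable in instructions.values()
                if usable <= allowed
            )
            best = max(best, confined / size)
    return best


if __name__ == "__main__":
    print("lower bound on cycles per iteration:", port_bound(LOOP))
```

For this loop the bound is determined by ports 0 and 1: 16 instructions shared by 2 ports give 8 cycles, and no other combination of ports is under more pressure. This agrees with the argument that vaddps and vminps are the bottleneck.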