The assembly code of the innermost loop is very straightforward. There are only 3 instructions related to the loop counters. There are 2 memory accesses, 4 operations that permute the elements, and then 8 + 8 useful vector operations (vminps and vaddps).
LOOP:
vmovaps (%rdx,%rax), %ymm2
vmovaps (%r12,%rax), %ymm3
addq $32, %rax
vpermilps $177, %ymm2, %ymm0
cmpq %rax, %rcx
vperm2f128 $1, %ymm3, %ymm3, %ymm13
vaddps %ymm2, %ymm3, %ymm15
vpermilps $78, %ymm3, %ymm14
vaddps %ymm0, %ymm3, %ymm3
vpermilps $78, %ymm13, %ymm1
vminps %ymm15, %ymm11, %ymm11
vminps %ymm3, %ymm7, %ymm7
vaddps %ymm14, %ymm2, %ymm3
vaddps %ymm14, %ymm0, %ymm14
vminps %ymm3, %ymm10, %ymm10
vaddps %ymm13, %ymm2, %ymm3
vaddps %ymm13, %ymm0, %ymm13
vaddps %ymm1, %ymm2, %ymm2
vaddps %ymm1, %ymm0, %ymm0
vminps %ymm14, %ymm6, %ymm6
vminps %ymm3, %ymm9, %ymm9
vminps %ymm13, %ymm5, %ymm5
vminps %ymm2, %ymm8, %ymm8
vminps %ymm0, %ymm4, %ymm4
jne LOOP
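For readers who prefer C++ over assembly, here is a rough sketch of what one round of the loop body computes, written with GCC/Clang vector extensions instead of AVX intrinsics. The names `float8_t`, `swap1`, `swap2`, `swap4`, and `min8` are my own illustration, not taken from the original code; the add/min pairings in the comments, however, are read directly off the assembly above.

```cpp
// 8 x float vector; corresponds to one ymm register.
typedef float float8_t __attribute__((vector_size(32)));

// vpermilps $177: swap elements within each pair.
static inline float8_t swap1(float8_t x) {
    float8_t r = {x[1], x[0], x[3], x[2], x[5], x[4], x[7], x[6]};
    return r;
}

// vpermilps $78: swap pairs within each 128-bit half.
static inline float8_t swap2(float8_t x) {
    float8_t r = {x[2], x[3], x[0], x[1], x[6], x[7], x[4], x[5]};
    return r;
}

// vperm2f128 $1: swap the two 128-bit halves.
static inline float8_t swap4(float8_t x) {
    float8_t r = {x[4], x[5], x[6], x[7], x[0], x[1], x[2], x[3]};
    return r;
}

// vminps: element-wise minimum.
static inline float8_t min8(float8_t x, float8_t y) {
    float8_t r;
    for (int i = 0; i < 8; ++i) r[i] = x[i] < y[i] ? x[i] : y[i];
    return r;
}

// One round of the loop body: a is the vector loaded via %rdx (ymm2),
// b the one loaded via %r12 (ymm3), and acc[0..7] stand for the eight
// accumulator registers (initialized to +infinity before the loop).
static inline void step(float8_t a, float8_t b, float8_t acc[8]) {
    float8_t a1 = swap1(a);   // ymm0
    float8_t b1 = swap4(b);   // ymm13
    float8_t b2 = swap2(b);   // ymm14
    float8_t b3 = swap2(b1);  // ymm1
    acc[0] = min8(acc[0], a  + b);   // vaddps ymm2,ymm3  -> vminps ymm11
    acc[1] = min8(acc[1], a1 + b);   // vaddps ymm0,ymm3  -> vminps ymm7
    acc[2] = min8(acc[2], a  + b2);  // vaddps ymm14,ymm2 -> vminps ymm10
    acc[3] = min8(acc[3], a1 + b2);  // vaddps ymm14,ymm0 -> vminps ymm6
    acc[4] = min8(acc[4], a  + b1);  // vaddps ymm13,ymm2 -> vminps ymm9
    acc[5] = min8(acc[5], a1 + b1);  // vaddps ymm13,ymm0 -> vminps ymm5
    acc[6] = min8(acc[6], a  + b3);  // vaddps ymm1,ymm2  -> vminps ymm8
    acc[7] = min8(acc[7], a1 + b3);  // vaddps ymm1,ymm0  -> vminps ymm4
}
```

Together, the four permutations ensure that every element of `a` meets every element of `b` in exactly one of the eight accumulators.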
Let us now study more carefully all the instructions in the innermost loop and how they might be scheduled by the CPU. Here is a table of the instructions and the execution ports that they could use (again derived from Agner Fog's Instruction tables):
Instruction | Count | Execution ports
---|---|---
vaddps | 8 | 0, 1
vminps | 8 | 0, 1
vmovaps | 2 | 2, 3
vpermilps | 3 | 5
vperm2f128 | 1 | 5
addq | 1 | 0, 1, 5, 6
cmpq | 1 | 0, 1, 5, 6
jne | 1 | 6
If the vaddps and vminps instructions keep execution ports 0 and 1 busy for 8 clock cycles, we can see that there are plenty of execution ports that the other instructions could use. For example, vpermilps and vperm2f128 could use port 5 for 4 cycles, vmovaps could use ports 2 and 3 for 1 cycle each, and the remaining three instructions could use port 6 for 3 cycles. While this is a rather simplified view of the internal workings of the CPU, it suggests that there is good reason to expect near-100% efficiency: the relevant instructions vaddps and vminps are the bottleneck here, and everything else should be easy to do on the side.
So far we have analyzed assembly code manually to find potential bottlenecks. The process can be automated to some extent with tools such as the LLVM Machine Code Analyzer, or llvm-mca for short.

llvm-mca works by simulating some aspects of the CPU core using scheduling models. It is able to simulate, for example, out-of-order execution and pipelining, register dependencies and allocation, as well as execution port scheduling and throughput. There are many important aspects it does not simulate, however, such as memory latency, cache misses, branch prediction, and data dependencies in memory. The scheduling model may also be inaccurate in some cases.
Let us try to analyze the above assembly code. We will use this command:
llvm-mca -march=x86-64 -mcpu=skylake v5.s
The assembly file you feed in should only contain the relevant part, i.e., the innermost loop. The code is assumed to repeat sequentially regardless of any jump instructions. The first part of the output is a summary:
Iterations:        100
Instructions:      2500
Total Cycles:      816
Total uOps:        2500

Dispatch Width:    6
uOps Per Cycle:    3.06
IPC:               3.06
Block RThroughput: 8.0
This tells us that 100 iterations of the loop were simulated, totalling 2500 instructions. The simulation completed in 816 clock cycles. The conclusion is similar to what we got from manual analysis: each loop iteration completes in around 8 clock cycles, which is the best efficiency we can expect.
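As a quick sanity check, the reported numbers are internally consistent; a small computation with the values copied from the summary above:

```cpp
// Values copied from the llvm-mca summary above.
constexpr double iterations   = 100;
constexpr double instructions = 2500;
constexpr double cycles       = 816;

// Cycles per loop iteration: 816 / 100 = 8.16, close to the ideal 8.
inline double cycles_per_iteration() { return cycles / iterations; }

// Instructions per cycle: 2500 / 816, which matches the reported IPC of 3.06.
inline double ipc() { return instructions / cycles; }
```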
Finally, we will request llvm-mca to plot a timeline of the execution of two loop iterations:
llvm-mca -march=x86-64 -mcpu=skylake -timeline -iterations 2 v5.s
The columns represent clock cycles (0–31), while rows correspond to instructions. We can see exactly which clock cycles an instruction is executing on (e), as determined by the scheduling model. In the timeline, D marks the cycle an instruction is dispatched, e the cycles it is executing, E the cycle its result is written back, and R the cycle it is retired; = indicates waiting in the scheduler, and - waiting to retire.
Timeline view:
                    0123456789          01
Index     0123456789          0123456789

[0,0]     DeeeeeeeER. . . . ..   vmovaps (%rdx,%rax), %ymm2
[0,1]     DeeeeeeeER. . . . ..   vmovaps (%r12,%rax), %ymm3
[0,2]     DeE------R. . . . ..   addq $32, %rax
[0,3]     D=======eER . . . ..   vpermilps $177, %ymm2, %ymm0
[0,4]     D=eE------R . . . ..   cmpq %rax, %rcx
[0,5]     D========eeeER . . . ..   vperm2f128 $1, %ymm3, %ymm3, %ymm13
[0,6]     .D======eeeeER . . . ..   vaddps %ymm2, %ymm3, %ymm15
[0,7]     .D========eE-R . . . ..   vpermilps $78, %ymm3, %ymm14
[0,8]     .D=======eeeeER. . . ..   vaddps %ymm0, %ymm3, %ymm3
[0,9]     .D==========eER. . . ..   vpermilps $78, %ymm13, %ymm1
[0,10]    .D==========eeeeER . . ..   vminps %ymm15, %ymm11, %ymm11
[0,11]    .D===========eeeeER . . ..   vminps %ymm3, %ymm7, %ymm7
[0,12]    . D========eeeeE--R . . ..   vaddps %ymm2, %ymm14, %ymm3
[0,13]    . D========eeeeE--R . . ..   vaddps %ymm0, %ymm14, %ymm14
[0,14]    . D============eeeeER . ..   vminps %ymm3, %ymm10, %ymm10
[0,15]    . D=========eeeeE---R . ..   vaddps %ymm2, %ymm13, %ymm3
[0,16]    . D==========eeeeE--R . ..   vaddps %ymm0, %ymm13, %ymm13
[0,17]    . D===========eeeeE-R . ..   vaddps %ymm1, %ymm2, %ymm2
[0,18]    . D==========eeeeE-R . ..   vaddps %ymm1, %ymm0, %ymm0
[0,19]    . D===========eeeeER . ..   vminps %ymm14, %ymm6, %ymm6
[0,20]    . D============eeeeER . ..   vminps %ymm3, %ymm9, %ymm9
[0,21]    . D=============eeeeER . ..   vminps %ymm13, %ymm5, %ymm5
[0,22]    . D==============eeeeER . ..   vminps %ymm2, %ymm8, %ymm8
[0,23]    . D==============eeeeER . ..   vminps %ymm0, %ymm4, %ymm4
[0,24]    . DeE----------------R . ..   jne LOOP
[1,0]     . DeeeeeeeE----------R . ..   vmovaps (%rdx,%rax), %ymm2
[1,1]     . DeeeeeeeE----------R . ..   vmovaps (%r12,%rax), %ymm3
[1,2]     . DeE----------------R . ..   addq $32, %rax
[1,3]     . D========eE--------R . ..   vpermilps $177, %ymm2, %ymm0
[1,4]     . D=eE---------------R . ..   cmpq %rax, %rcx
[1,5]     . D========eeeE-----R . ..   vperm2f128 $1, %ymm3, %ymm3, %ymm13
[1,6]     . D==========eeeeE--R . ..   vaddps %ymm2, %ymm3, %ymm15
[1,7]     . D=========eE------R . ..   vpermilps $78, %ymm3, %ymm14
[1,8]     . D===========eeeeE-R . ..   vaddps %ymm0, %ymm3, %ymm3
[1,9]     . D===========eE----R . ..   vpermilps $78, %ymm13, %ymm1
[1,10]    . D==============eeeeER ..   vminps %ymm15, %ymm11, %ymm11
[1,11]    . .D==============eeeeER ..   vminps %ymm3, %ymm7, %ymm7
[1,12]    . .D============eeeeE--R ..   vaddps %ymm2, %ymm14, %ymm3
[1,13]    . .D============eeeeE--R ..   vaddps %ymm0, %ymm14, %ymm14
[1,14]    . .D================eeeeER ..   vminps %ymm3, %ymm10, %ymm10
[1,15]    . .D=============eeeeE---R ..   vaddps %ymm2, %ymm13, %ymm3
[1,16]    . .D==============eeeeE--R ..   vaddps %ymm0, %ymm13, %ymm13
[1,17]    . . D==============eeeeE-R ..   vaddps %ymm1, %ymm2, %ymm2
[1,18]    . . D==============eeeeE-R ..   vaddps %ymm1, %ymm0, %ymm0
[1,19]    . . D===============eeeeER ..   vminps %ymm14, %ymm6, %ymm6
[1,20]    . . D================eeeeER..   vminps %ymm3, %ymm9, %ymm9
[1,21]    . . D=================eeeeER.   vminps %ymm13, %ymm5, %ymm5
[1,22]    . . D==================eeeeER   vminps %ymm2, %ymm8, %ymm8
[1,23]    . . D=================eeeeER   vminps %ymm0, %ymm4, %ymm4
[1,24]    . . DeE--------------------R   jne LOOP
Here is the Compiler Explorer version to play with. Some things to try:
- What happens if we do not allow AVX, e.g., by switching to the generic -march=x86-64 architecture?
- The code contains some swap definitions to experiment with. Can you make the implementation more portable by replacing the intrinsics with an appropriate GCC built-in function? If you do this correctly, the resulting assembly should be the same. Check that this works also for the generic -march=x86-64 architecture.
- You can also run llvm-mca directly in the Compiler Explorer by selecting it under "Add tool...". You may need to use the marker comments # LLVM-MCA-BEGIN and # LLVM-MCA-END to delimit the correct part of the assembly.