Programming Parallel Computers 2020

Exercise CP: correlated pairs

Individual tasks

Please see the grading tool for details on benchmark instances and time limits.

CP1: CPU baseline ★

Implement a simple sequential baseline solution. Make sure it works correctly. Do not use any form of parallelism yet.

Please do all arithmetic with double-precision floating point numbers.

Deadline week 1. Maximum points 5, after deadline 3.

CP2a: instruction-level parallelism ★

Parallelize your solution to CP1 by exploiting instruction-level parallelism. Make sure that the performance-critical operations are pipelined efficiently. Do not use any other form of parallelism yet in this exercise.

Please do all arithmetic with double-precision floating point numbers.

Deadline week 2. Maximum points 3, after deadline 2.

CP2b: multicore parallelism ★

Parallelize your solution to CP1 with the help of OpenMP and multithreading so that you are exploiting multiple CPU cores in parallel. Do not use any other form of parallelism yet in this exercise.

Please do all arithmetic with double-precision floating point numbers.

Deadline week 2. Maximum points 3, after deadline 2.

CP2c: vectorization ★

Parallelize your solution to CP1 with the help of vector operations so that you can perform multiple useful arithmetic operations with one instruction.

Do not use any other form of parallelism yet in this exercise.

Please do all arithmetic with double-precision floating point numbers.

Deadline week 3. Maximum points 3, after deadline 2.

CP3a: fast solution with doubles ★★

Using all resources that you have in the CPU, solve the task as fast as possible. You are encouraged to exploit instruction-level parallelism, multithreading, and vector instructions whenever possible, and also to optimize the memory access pattern.

Please do all arithmetic with double-precision floating point numbers.

Deadline week 3. Maximum points 5 + 1, after deadline 3 + 1. Participates in the contest.

CP3b: fast solution with floats ★★

Using all resources that you have in the CPU, solve the task as fast as possible. You are encouraged to exploit instruction-level parallelism, multithreading, and vector instructions whenever possible, and also to optimize the memory access pattern.

In this task, you are permitted to use single-precision floating point numbers.

Deadline week 3. Maximum points 5 + 1, after deadline 3 + 1. Participates in the contest.

CP4: GPU baseline ★

Implement a simple baseline solution for the GPU. Make sure it works correctly and that it is reasonably efficient. Make sure that all performance-critical parts are executed on the GPU; you can do some lightweight preprocessing and postprocessing also on the CPU.

In this task, you are permitted to use single-precision floating point numbers.

Deadline week 4. Maximum points 5, after deadline 3.

CP5: fast GPU solution ★★

Using all resources that you have in the GPU, solve the task as fast as possible.

In this task, you are permitted to use single-precision floating point numbers.

Deadline week 5. Maximum points 10 + 1, after deadline 6 + 1. Participates in the contest.

CP9a: better algorithm [challenging]

Try to use Strassen's algorithm to speed up matrix multiplications. Can you improve on your solution to CP3a or CP3b this way?

Your submission has to contain the source code of your implementation and a written report.

Deadline week 6. Maximum points 5. This is an open-ended task that is graded by the course staff.

CP9b: performance analysis [challenging]

How close are your solutions to CP3a and CP3b to the theoretical and practical capabilities of the CPU? How many floating point operations do you do in your code in total? How many floating point operations does your code do per clock cycle and per CPU core? Could it possibly do any better than that?

Write benchmark code that achieves as many floating point operations per cycle as possible with the CPU that we use. Your program does not need to do anything useful. Make sure that you use vector instructions, you do not have any memory accesses that might be a bottleneck, and there are enough opportunities for instruction-level parallelism.

Read the specifications of the CPU: how many floating point operations is it supposed to be capable of doing per clock cycle and per CPU core, under optimal conditions? Does your benchmark code agree with this?

Alternatively, you can also choose to do a similar analysis of your solution to CP5, in comparison with the theoretical and practical capabilities of the GPU.

Your submission has to contain at least a written report.

Deadline week 6. Maximum points 5. This is an open-ended task that is graded by the course staff.

CP9c: application [challenging]

Develop an application for aligning two images.

Your program reads two images, A and B, and finds the best way to align them both horizontally and vertically. The program should create multiple versions of image A with different horizontal translations, and multiple versions of image B with different vertical translations, calculate all pairwise correlations between them, and select the best fit.

Handling horizontal and vertical translation is sufficient; you do not need to handle rotation or scaling. You can use common/pngio.cc in your Git repository for reading PNG files. You can use e.g. your CP3b or CP5 as a subroutine to calculate pairwise correlations, or you can develop an algorithm specifically for this task.

Your report should demonstrate how your program works with real-life images (e.g., snap two photos with your mobile phone of the same scene, handheld). You are free to choose reasonable image sizes and how large translations are supported.

Your submission has to contain the source code of your implementation and a written report.

Deadline week 6. Maximum points 5. This is an open-ended task that is graded by the course staff.