# M2R Parallel Systems

## Parallel Systems

### General Information

These lectures take place from 9:45 to 12:45 every Monday in room H202, H102, H203, or H205. If in doubt, check the ADE website (look for PDES and then for the Parallel System lectures).

Here is a map of the campus for those who do not know where the rooms are.

The coordinator for these lectures is Arnaud Legrand. The lecturers are Bruno Raffin and Arnaud Legrand.

### Objectives

Today, parallel computing is omnipresent across a large spectrum of computing platforms. At the "microscopic" level, processor cores have used multiple functional units in concurrent and pipelined fashions for years, and multi-core chips are now commonplace, with a trend toward rapidly increasing numbers of cores per chip. At the "macroscopic" level, one can now build clusters of hundreds to thousands of individual (multi-core) computers. Such distributed-memory systems have become mainstream and affordable in the form of commodity clusters. Furthermore, advances in network technology and infrastructures have made it possible to aggregate parallel computing platforms across wide-area networks in so-called "grids." The popularization of virtualization has made it possible to consolidate workloads and resource usage in "clouds," which raises many energy and efficiency issues.

Efficiently exploiting such platforms requires a deep understanding of architecture, software, and infrastructure mechanisms, as well as of advanced algorithmic principles. The aim of this course is thus twofold. It introduces the main trends and principles in the area of high-performance computing infrastructures, illustrated by examples of the current state of the art. It also provides a rigorous yet accessible treatment of parallel algorithms, including theoretical models of parallel computation, parallel algorithm design for homogeneous and heterogeneous platforms, complexity and performance analysis, and fundamental notions of scheduling and work-stealing. These notions are always presented in connection with real applications and platforms.

### Course Organization

The course gives 6 credits (ECTS). In previous years, each student performed an individual performance evaluation study. Now that the MOSIG comprises a series of lectures on Performance Evaluation, this study no longer needs to be done solely in the context of the Parallel Systems lecture. There will be an exam at the end of the year, and an extra lecture will be devoted to studying together one of the exams from previous years.

### Program and expected schedule

Check last year's schedule to get a foretaste.

1. 26 September 2016 (09:45-12:45 @ H202): Bruno Raffin Introduction to parallel computing.
2. 3 October 2016 (09:45-12:45 @ H103): Arnaud Legrand Collective operations and parallel algorithms on a ring. The key notions are SPMD programming, speedup/efficiency, Amdahl's law, pipelining, changing granularity to improve efficiency, and searching for sequential time in parallel algorithms.
• Documents: slides PC_02_parallel_algorithms.pdf. We skipped part 1 on models but went through sections 5 to 12 ("Matrix vector product" in the "Algorithms on a ring" part).
• Expected work: review the slides on 1D algorithms. Make sure SimGrid/SMPI works on your machine (follow these instructions) and report any problem.
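As a reminder of the speedup/efficiency notions above, here is a minimal sketch (plain Python, function and parameter names are our own) of Amdahl's law, which bounds the speedup achievable when a fraction of the work remains sequential:

```python
def amdahl_speedup(serial_fraction, p):
    """Upper bound on the speedup with p processors when a fixed
    fraction of the execution time is inherently sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# Even with only 5% sequential work, the speedup is capped at 1/0.05 = 20:
for p in (2, 16, 1024):
    print(p, amdahl_speedup(0.05, p))
```

Whatever the number of processors, the speedup never exceeds the inverse of the serial fraction, which is why searching for (and removing) sequential time in parallel algorithms matters so much.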
3. 10 October 2016 (09:45-12:45 @ H103): Bruno Raffin Quick presentation of the LIG teams related to the PDES option and of typical internships. How to Efficiently Program High Performance Architectures with MPI.
• Bibliography:
• Expected work: Implement and evaluate broadcast algorithms with SMPI following these instructions.
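Before implementing the broadcast algorithms, a simple analytical model may help. The sketch below (plain Python, hypothetical names; a linear latency + bandwidth cost model is assumed) compares a plain ring broadcast with a pipelined one that cuts the message into chunks:

```python
def ring_broadcast_time(p, msg_size, latency, bandwidth, chunks=1):
    """Modeled time for broadcasting msg_size bytes along a ring of p
    nodes, with the message split into `chunks` pipelined pieces.

    Each hop costs latency + piece / bandwidth; the last piece reaches
    the last node after (p - 1) + (chunks - 1) hops."""
    piece = msg_size / chunks
    return (p - 1 + chunks - 1) * (latency + piece / bandwidth)

# A 1 MB message on an 8-node ring (100 us latency, 100 MB/s links):
print(ring_broadcast_time(8, 1e6, 1e-4, 1e8, chunks=1))
print(ring_broadcast_time(8, 1e6, 1e-4, 1e8, chunks=10))
```

Pipelining pays off for large messages; with too many chunks, however, the latency term dominates, so there is an optimal chunk count to look for in your SMPI experiments.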
4. 17 October 2016 (09:45-12:45 @ H203): Arnaud Legrand Parallel algorithms continued (1D: Matrix product, stencil, LU ; 2D: matrix product, block cyclic distributions).
• Expected work: Implement the double broadcast parallel matrix multiplication with MPI following these instructions.
5. 24 October 2016 (09:45-12:45 @ H205): Bruno Raffin End of the previous lecture… (OpenMP, Cilk)
• Expected work: Evaluate the performance of your matrix multiplication algorithm following these instructions.
6. 7 November 2016 (09:45-12:45 @ H203): Arnaud Legrand Dataflow graph representation of an execution. BSP programs. Introduction to Scheduling. We illustrate in detail the list scheduling technique and how the notion of critical path (or depth) impacts the performance of schedules.
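The list-scheduling idea discussed in this lecture can be sketched in a few lines of plain Python (a simplified variant, our own code, that greedily assigns each ready task to the earliest-available processor):

```python
import heapq

def list_schedule(costs, deps, p):
    """Greedy list scheduling of a task DAG on p identical processors.
    costs: {task: duration}; deps: {task: set of predecessor tasks}.
    Returns the makespan of the resulting schedule."""
    succ = {t: [] for t in costs}
    indeg = {t: 0 for t in costs}
    for t, preds in deps.items():
        for q in preds:
            succ[q].append(t)
            indeg[t] += 1
    done = {}                 # task -> finish time
    ready = [t for t in costs if indeg[t] == 0]
    procs = [0.0] * p         # time at which each processor becomes free
    heapq.heapify(procs)
    while ready:
        t = ready.pop(0)
        earliest = max((done[q] for q in deps.get(t, ())), default=0.0)
        free = heapq.heappop(procs)
        end = max(free, earliest) + costs[t]
        done[t] = end
        heapq.heappush(procs, end)
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return max(done.values())

# Diamond DAG: a -> {b, c} -> d; critical path = 1 + 2 + 1 = 4.
costs = {"a": 1.0, "b": 2.0, "c": 2.0, "d": 1.0}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(list_schedule(costs, deps, p=2))
```

The critical path (depth) is a lower bound on the makespan however many processors are used, and Graham's classical result shows any list schedule is within a factor $$2 - 1/p$$ of the optimum.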
7. 21 November 2016 (09:45-12:45 @ H203): Bruno Raffin Multicore programming: from parallelism-aware algorithms to parallelism-oblivious algorithms. Work, depth and work-stealing. Illustration with a real implementation of such techniques.
8. 28 November 2016 (09:45-12:45 @ H203): Bruno Raffin End of multicore programming and practical introduction to OpenMP.

Work-stealing and data locality. Sorting and merging, FFT, matrix operations. Adaptive algorithms and cascading divide & conquer: prefix computation, data compression, linear system solving.

9. 5 December 2016 (09:45-12:45 @ H203): Bruno Raffin End of the previous lecture.
10. 12 December 2016 (09:45-12:45 @ H203): Arnaud Legrand From fine-grain to coarse-grain. PRAM, sorting networks and deriving MPI-like implementations.
• Documents: PC_05_theory.pdf (you need to read and understand the whole set of slides, even those on FFT that have not been presented during the lecture; they are very similar in essence to the sorting networks and are thus a good exercise). You may also want to have a look at this more condensed set of slides on PRAM algorithms.
• Regarding the efficient parallel prefix algorithm, the Ladner-Fischer scheme to compute PREFIX $$(A[0,n[)$$ is as follows:

1. In parallel for $$i = 0, \dots, n/2 - 1$$, compute $$A[2 \times i + 1] := A[2 \times i] ⊕ A[2 \times i + 1]$$;
2. Recursively compute the prefix of the sub-array $$A_2[1, n[$$ with elements with odd index in $$A$$: $$\{A[1], A[3], \dots, A[2\times(n/2) − 1]\}$$;
3. In parallel for $$i = 1, \dots, n/2 - 1$$, compute the even prefix $$A[2\times i] := A[2\times i − 1] ⊕ A[2\times i]$$;

You can easily prove that the depth (in number of ⊕ operations) is $$2 \log_2(n)$$ and that the work is $$2n$$.
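The scheme above can be sketched directly in plain Python (a sequential simulation of the parallel steps; the function name is ours, and the array length is assumed to be a power of two):

```python
from operator import add

def ladner_fischer_prefix(a, op=add):
    """In-place inclusive prefix (scan) of a, following the three steps
    of the Ladner-Fischer scheme. Assumes len(a) is a power of two."""
    n = len(a)
    if n == 1:
        return a
    # Step 1: pairwise combine into the odd positions.
    for i in range(n // 2):
        a[2 * i + 1] = op(a[2 * i], a[2 * i + 1])
    # Step 2: recursively scan the sub-array of odd-index elements.
    odds = a[1::2]
    ladner_fischer_prefix(odds, op)
    a[1::2] = odds
    # Step 3: fix up the even positions (position 0 is already correct).
    for i in range(1, n // 2):
        a[2 * i] = op(a[2 * i - 1], a[2 * i])
    return a

print(ladner_fischer_prefix([1, 2, 3, 4]))  # prefix sums: [1, 3, 6, 10]
```

Each sequential `for` loop here simulates one parallel step: on a PRAM, steps 1 and 3 each take a single ⊕ step, which is where the $$2 \log_2(n)$$ depth comes from.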

11. 15 December 2016 (09:45-12:45): Bruno Raffin Cache and Data Structures
12. 9 January 2017 (09:45-12:45 @ H103): Arnaud Legrand Recent challenges and trends in HPC linear algebra: a recap on the whole lecture. We covered the following topics:
1. Virtualization and cloud (virtualization part of these slides)
2. A few words about exascale (beginning of the slides on exascale related to architecture diversity in these slides)
3. Synchronization-reducing algorithms: the "Synchronization-reducing algorithms" section of these slides plus the two slides from the "Programming and Application Challenges" section from these slides.
4. Auto-tuning: the "Block" size and "Compiler Optimization" sections from these slides.
5. Numerical Reproducibility: a quick look at the slides of Philippe Langlois: langlois.pdf.
13. 10 January 2017?? (?? @ ??): Arnaud Legrand Looking together at one of the previous exams.

### Bibliography

All you need to know is in the large set of slides we provide.