# M2R Parallel Systems

## Table of Contents

## Sitemap

## Parallel Systems

### General Informations

These lectures take place **from 14:00 to 17:00 every Monday in room
B002**. In doubt, check the ADE website (look for PDES and then for the
Parallel System lectures).

The coordinator for these lectures is Arnaud Legrand. The lecturers are Vincent Danjean and Arnaud Legrand .

### Objectives

Today, parallel computing is omnipresent across a large spectrum of computing platforms. At the ``microscopic'' level, processor cores have used multiple functional units in concurrent and pipelined fashions for years, and multiple-core chips are now commonplace with a trend toward rapidly increasing numbers of cores per chip. At the ``macroscopic'' level, one can now build clusters of hundreds to thousands of individual (multi-core) computers. Such distributed-memory systems have become mainstream and affordable in the form of commodity clusters. Furthermore, advances in network technology and infrastructures have made it possible to aggregate parallel computing platforms across wide-area networks in so-called ``grids.'' The popularization of virtualization has allowed to consolidate workload and resource exploitation in ``clouds'' and raise many energy and efficiency issues.

An efficient exploitation of such platforms requires a deep understanding of both architecture, software and infrastructure mechanisms and of advanced algorithmic principles. The aim of this course is thus twofold. It aims at introducing the main trends and principles in the area of high performance computing infrastructures, illustrated by examples of the current state of the art. It intends to provide a rigorous yet accessible treatment of parallel algorithms, including theoretical models of parallel computation, parallel algorithm design for homogeneous and heterogeneous platforms, complexity and performance analysis, and fundamental notions of scheduling and work-stealing. These notions will always be presented in connection with real applications and platforms.

### Course Organization

The course gives 6 credits (ECTS). In previous years, each student used to performs an individual performance evaluation study. Now that the MOSIG comprises a series of lectures on Performance Evaluation, this does not need to be done anymore solely in the context of the Parallel Systems lecture. There will be an exam at the end of the year and an extra lecture will be devoted to study together one of the exam of the previous years.

### Program and expected schedule

Check last year's schedule to get a foretaste.

`28 September 2015 (14:00 - 17:00): Arnaud Legrand`

Introduction to parallel computing. High Performance Architectures Processors (superscalar, simultaneous multi-threading, multi-core, GPU…). Symmetric MultiProcessors. OS features for cluster computing Multi-threading. Cache-aware vs. cache-oblivious algorithms.`Documents:`

slides parallel_architectures.pdf, What every programmer should know about memory, The Ninja Performance Gap.`Work to do:`

Study the slides, get more knowledge (e.g., on wikipedia) on all techniques presented in the slides (e.g., pipelining, hyperthreading, Grid computing, etc.). Do not limitate to the reading of the previous links and browse related links to get a good picture of recent evolutions. You can also have a look at the top500 and summary.To understand better the impact of caches, you should:

- Implement in C the simple three-nested-loop version of the matrix product and try to evaluate its performance for a relatively large matrix size.
- Implement the blocked version with six nested loops to check whether you can observe a significant gain.
- Execute these algorithms step by step to get a good understanding of data movements between the cache and the memory and try to evaluate their respective complexity in term of distant memory access.
- Execute these two versions of the code with valgrind and kcachegrind to get a precise evaluation of their performance in term of cache misses.

`5 October 2015 (14:00 - 17:00): Arnaud Legrand`

Simple communication models, parallel algorithms on a ring and on a grid. The key notions are speedup/efficiency, Amhdal's law, pipelining, changing granularity to improve efficiency, and searching for sequential time in parallel algorithms.`Documents:`

slides PC_02_parallel_algorithms.pdf. We read through this set until the end of section 12 ("Matrix vector product" in the "Algorithms on a ring" part). I also exemplified some points with this set of slides (slides 5 and 11).`References:`

This set of slides is huge and self-content. There is no need to dig further. Just be curious, redo the computations by yourself to make sure you truly understand how this works. The slides are here as safeguards to give you the solutions…

`12 October 2015 (14:00 - 17:00): Vincent Danjean`

How to Efficiently Program High Performance Architectures ? Classical HPC API: MPI, pthreads, openMP, CUDA, openCL, …`Documents:`

slides languages.pdf

`19 October 2015 (14:00 - 17:00): Vincent Danjean`

End of the previous lecture…`Expected work:`

Your next lecture with Vincent will be in the end of November. In the meantime, you should play with MPI on G5K: First, you may want to have a look at the following documents to know how to run an MPI code on G5K (look in the wiki and at the Cheat Sheet). Then, there are two possible options that do not illustrate the same issues. I suggest to start with the first one on sorting and later, once I have explained you the basics of parallel matrix multiplication algorithms give a try to the second one:- Practical session on performance evaluation of parallel programs. PC_par_sort.pdf, parsort-1.0.tar.gz, mpi_sort.c This session is great for understanding the problems of measurement and parallel speedup. This kind of aspects is partially adressed in the Performance evaluation lecture but with multi-threaded code so you can compare between the two approaches.
- Double broadcast parallel matrix multiplication 07_MPI_tutorial.tgz This session is great for understanding the data movements and distribution in an SPMD program and the power of collective communication operations. You'll probably need to have seen the lectures on Parallel algorithms first.

`2 November 2015 (14:00 - 17:00): Vincent Danjean`

Parallel algorithms continued (1D: Matrix product, stencil, LU ; 2D; matrix product, block cyclic distributions).`Documents:`

slides PC_02_parallel_algorithms.pdf`References:`

This set of slides is huge and self-content. There is no need to dig further.`Expected work:`

Implement the double broadcast parallel matrix multiplication with MPI.

`9 November 2015 (14:00 - 17:00): Arnaud Legrand (H104)`

Dataflow graph representation of an execution. BSP programs. Introduction to Scheduling. We illustrated in detail the list scheduling technique and how the notion of critical path (or depth) was impacting the performance of schedules.`16 November 2015 (14:00 - 17:00): Arnaud Legrand`

From fine-grain to coarse-grain. PRAM, sorting networks and application to implementation on NOWs.`Documents:`

PC_05_theory.pdf (you need to read and understand the whole set of slides, even those on FFT that have not been presented during the lecture; they are very similar in essence to the sorting networks and are thus a good exercise). You may also want to have a look at this more condensed set of slides on PRAM algorithms.

`23 November 2015 (14:00 - 17:00): Vincent Danjean`

From parallelism-aware algorithms to parallelism-oblivious algorithm. Work, depth and work-stealing. Illustration with a real implementation of such technique.`Documents:`

slides PC_07_WS.pdf and slides on Kaapi PC_07_kaapi.pdf.

`30 November 2015 (14:00 - 17:00): Vincent Danjean`

Work-stealing and data locality. Sorting and merging, FFT, matrix operations. Adaptive algorithms and cascading divide & conquer: prefix computation, data compression, linear system solving`Documents:`

slides PC_08_algo_adaptatif.pdf

`7 December 2015 (14:00 - 16:00): Arnaud Legrand`

Hype and trends: cloud computing and how Virtualization changed the Grid perspective, exascale computing and how energy issues changed the HPC perspective.`Documents:`

Cloud + Exascale slides PC_10_hype.pdf + Fault tolerance slides (local copy)- http://www.green500.org/lists/green201311
- http://www.cse.psu.edu/conferences/icpppasa2016/ (mostly the fun citations of the webpage ;).
- http://www.top500.org/lists/2013/11/
- http://www.theregister.co.uk/2012/09/05/intel_xeon_phi_coprocessor/

`14 December 2015 (10:00 - 12:00 & 13:00 - 15:00): Arnaud Legrand`

End of the previous lecture. Beginning of the new one: recent challenges and trends in HPC linear algebra: a recap on the whole lecture.`14 December 2015 (13:00 - 15:00): Arnaud Legrand`

End of the series of lectures on HPC trends. Also used slides of Philippe Langlois on numerical reproducibility (local copy)`4 January 2015 (14:00 - 17:00): Arnaud Legrand`

Looking together at 2014's exam, 2013's exam or possibly 2012's exam.

### Bibliography

All you need to know is in the **large** set of slides.