Designing and Building Applications for Extreme Scale Systems

Learn how to design and implement applications for extreme scale systems, including analyzing and understanding the performance of applications, the primary causes of poor performance and scalability, and how both the choice of algorithm and programming system impact achievable performance. The course covers multi- and many-core processors, interconnects in HPC systems, parallel I/O, and the impact of faults on program and algorithm design.

HPC Course update notes

(April 2022)

There are two major ways to look at designing parallelism into applications. One is to start from a serial algorithm and add parallelism. This is roughly how the current course is organized. This approach makes it easy to add complexity in small steps. A drawback is that it can be hard to achieve efficient, massively parallel algorithms starting from a serial algorithm.

The other is to start from a maximally parallel algorithm and aggregate the parallel threads of execution to match the available resources. The challenge here is that modern systems achieve their parallelism through many different mechanisms, and it is hard (though not impossible, by picking just one of those mechanisms) to start here.

I plan to update this course with additional material that augments, not replaces, the current serial-first approach with a parallel-first approach.

Another update that is needed in these lectures is the use of accelerators, particularly GPUs, and special memory to achieve performance. Many nodes in HPC systems now get most of their performance from accelerators. These lectures mentioned programming these systems but did not discuss it in detail, in large part because of the immaturity of the software solutions available at the time. While software for accelerators continues to evolve, the software ecosystem is more mature now and there are more good choices, including language extensions (especially OpenMP v5) and HPC libraries and GPU-enabled applications built on top of lower-level (and less standardized) GPU APIs.

Short and long term plans for an update to these lectures

In the short term, the plan is to add supplementary material on GPUs, HBM, and I/O "burst buffers". This will include programming systems as well as some material on algorithms that fit GPU parallelism.

In the long term, the description of the computer architecture and algorithms will be changed to combine both a bottom-up (starting from individual cores) and a top-down (maximal parallelism) approach, rather than working only up (or only down).

Sessions

Note: Each lecture was recorded with both a presentation and a view of the presenter. However, these were not captured in a single video. The video links on this page show the presentation and provide the audio for the lecture.
Introduction
Performance Modeling and Extreme Scale Systems
Covers what extreme scale systems are today and the scale of problems they are used to solve. Introduces basic performance modeling and its use to gain insight into the performance of applications.
Benchmarking and Sparse Matrix Vector Multiply
This session introduces some of the most important HPC benchmarks.

The second presentation analyzes a sparse matrix-vector multiply and shows how a simple performance model gives insight into application performance.
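
As a rough illustration of the kind of model this session refers to, the sketch below pairs a sparse matrix-vector multiply with a bandwidth-only time estimate. The CSR storage format, the function names, and the byte counts (8-byte values, 4-byte indices, a square matrix, x staying in cache) are assumptions made for this example, not necessarily the model used in the lecture.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        total = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * x[col_idx[k]]
        y[i] = total
    return y


def spmv_time_lower_bound(n_rows, nnz, bandwidth_bytes_per_s,
                          value_bytes=8, index_bytes=4):
    """Bandwidth-only estimate (seconds) for one SpMV on a square matrix.

    Assumes every matrix value and column index is read once, the row
    pointers are read once, and x and y are each moved once (x is assumed
    to stay in cache). Floating-point time is ignored.
    """
    bytes_moved = (nnz * (value_bytes + index_bytes)
                   + (n_rows + 1) * index_bytes
                   + 2 * n_rows * value_bytes)
    return bytes_moved / bandwidth_bytes_per_s


# Small usage example: a 3x3 matrix with 5 nonzeros.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Comparing a measured time against such an estimate suggests whether the multiply is limited by memory bandwidth or by something else.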

Cache Memory and Performance
This session adds cache memory to the performance model. The second lecture discusses some of the challenges in measuring performance.
Spatial Locality
This session discusses the importance of spatial locality and how to model the performance of a simple cache memory system.
The Cache Oblivious Approach
This session discusses a way to think about caches, and to develop algorithms and implementations, that is (mostly) independent of the specific details of the cache. The second presentation covers some other features of caches that can impact performance.
More on How the Processor Works
This session discusses a simple execution model and introduces the issue of aliasing of memory.

The second lecture revisits the dense matrix-matrix multiply operation using the ideas developed in this class.

Instruction Execution and Pipelining
This session covers how modern processors execute instructions in a series of steps and how this affects performance.
Vectors
This session introduces vectors in modern processors and discusses some of their features and challenges.
Moore's Law and Speedup
This session discusses Moore's Law, Dennard Scaling, and some limits on speedup of applications.
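One well-known limit on speedup is Amdahl's law, used here as an assumed example of the limits this session mentions. The sketch below computes the predicted speedup when a fixed fraction of the run time cannot be parallelized; the function name and parameters are illustrative only.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Speedup predicted by Amdahl's law.

    serial_fraction: fraction of the single-processor run time that
                     cannot be parallelized (between 0 and 1).
    n_procs:         number of processors applied to the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


# With 5% serial work, speedup approaches but never exceeds 1/0.05 = 20.
for p in (10, 100, 1000):
    print(p, amdahl_speedup(0.05, p))
```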
Threads
This session introduces threads and thread parallelism, including some advantages and disadvantages of shared memory.
OpenMP Basics
This session introduces OpenMP, the most widely used approach for threaded programming for scientific applications, and how to use OpenMP to parallelize loops.
OpenMP and MAXLOC
This session discusses critical sections and atomic operations in loops in OpenMP, with an emphasis on understanding the performance implications of different approaches.
OpenMP and General Synchronization
This session discusses more general loop parallelism in OpenMP, using a linked list as an example in discussing thread locks and performance.
Distributed Memory Parallelism
An overview of parallelism, from single processors to massively parallel distributed memory systems. Interconnects and bisection bandwidth.
Parallel Programming Models and Systems
Illustrates different ways to program the same simple loop, using different parallel programming systems. Emphasizes the role of the data decomposition in expressing parallelism.
MPI Basics
An introduction to the Message Passing Interface (MPI), including simple send and receive.
MPI Point to Point Communication
Understanding MPI point to point communication, including the effects of the MPI implementation.
Strategies for Designing Parallel Applications
An introduction to several ways to think about the design of a parallel application, with a halo exchange as an example.
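For readers unfamiliar with the term, the sketch below emulates a one-dimensional halo exchange serially in plain Python. It is not MPI code: each "process" is just a list with one ghost cell on each side, and the function names, the fixed zero boundary, and the three-point average are all assumptions chosen for illustration.

```python
def split_with_halos(u, n_parts):
    """Split u into n_parts contiguous blocks, each padded with one
    ghost cell per side. Assumes len(u) is divisible by n_parts."""
    size = len(u) // n_parts
    return [[0.0] + u[p * size:(p + 1) * size] + [0.0]
            for p in range(n_parts)]


def exchange_halos(blocks):
    """Fill ghost cells from neighbouring blocks (non-periodic).

    The outermost ghost cells are left at 0.0, acting as a fixed boundary.
    """
    for p in range(len(blocks) - 1):
        left, right = blocks[p], blocks[p + 1]
        left[-1] = right[1]   # right neighbour's first interior cell
        right[0] = left[-2]   # left neighbour's last interior cell


def smooth(block):
    """Three-point average over the interior cells of a padded block."""
    return [(block[i - 1] + block[i] + block[i + 1]) / 3.0
            for i in range(1, len(block) - 1)]


u = [float(i) for i in range(1, 9)]
blocks = split_with_halos(u, 2)
exchange_halos(blocks)
decomposed = [v for b in blocks for v in smooth(b)]
serial = smooth([0.0] + u + [0.0])  # same stencil on the undivided array
```

After the exchange, each block has the neighbour data its stencil needs, so the decomposed update matches the update on the undivided array.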
Performance Models
More on Efficient Halo Exchange
This session uses halo exchange to explain the importance of deferring synchronization. It also discusses the use of MPI datatypes to avoid extra memory motion.
MPI Process Topologies
This session discusses virtual and actual (physical) topologies and how to use MPI to influence the layout of processes on the parallel computer.
Collective Communication in MPI
This session discusses the rich collective communication and computation features available in MPI.
More on Collective Communication
This session discusses some performance considerations of collective operations, including an example of when using an optimal collective routine gives poorer performance than simpler choices for communication.
Introduction to Parallel I/O
This session introduces some of the issues in effectively using I/O with massively parallel computers.
Introduction to MPI I/O
An introduction to the parallel I/O features provided by MPI and how they compare to POSIX I/O.
More on MPI I/O Performance
This session covers more on MPI I/O, including performance optimizations.
One-Sided Communication in MPI
This session introduces a different model of parallel computing, one-sided communication, and describes how this model is represented in MPI.
More on One-Sided Communication
This session covers passive target synchronization and understanding the MPI RMA memory models.
MPI, Hybrid Programming, and Shared Memory
An introduction to using the MPI+X hybrid programming approach. MPI with threads and the different thread levels are covered. Also introduced is the MPI-3 shared-memory feature, sometimes called the MPI+MPI programming approach.
New Features of MPI-3
This session provides a summary of the new features in MPI-3.