Designing and Building Applications for Extreme Scale Systems
Learn how to design and implement applications for extreme scale
systems, including analyzing and understanding the performance of
applications, the primary causes of poor performance and scalability,
and how both the choice of algorithm and programming system impact
achievable performance. The course covers multi- and many-core
processors, interconnects in HPC systems, parallel I/O, and the impact
of faults on program and algorithm design.
HPC Course update notes
There are two major ways to look at designing parallelism into
applications. One is to start from a serial algorithm and add
parallelism. This is roughly how the current course is organized. This
approach makes it easy to add complexity in small steps. A drawback
is that it can be hard to achieve efficient, massively parallel
algorithms starting from a serial algorithm.
The other is to start from a maximally parallel algorithm and aggregate
the parallel threads of execution to match the available
resources. The challenge here is that modern systems achieve their
parallelism through many different mechanisms, and it is hard (though
not impossible, by picking just one of those mechanisms) to start
I plan to update this course with additional material that augments,
not replaces, the current serial-first approach with a parallel-first
Another update that is needed in these lectures is the use of
accelerators, particularly GPUs, and special memory to achieve
performance. Many nodes in HPC systems now get most of their
performance from accelerators. These lectures mentioned but did not
discuss programming these systems in detail, in large part because of
the immaturity of the software solutions available at the time. While
software for accelerators continues to evolve, the software ecosystem
is more mature now and there are more good choices, including language
extensions (especially OpenMP v5) and HPC libraries and GPU-enabled
applications built on top of lower level (and less standardized) GPU
Short- and long-term plans for an update to these lectures
In the short term, the plan is to add supplementary material on GPUs,
HBM, and I/O "burst buffers". This will include programming systems as
well as some material on algorithms that fit GPU parallelism.
In the long term, the description of the computer architecture and
algorithms will be changed to combine both a bottom-up (starting from
individual cores) and a top-down (maximal parallelism) approach, rather
than working only up (or only down).
Note: Each lecture was recorded with both a presentation and a view of the
presenter. However, these were not captured in a single video.
The video links on this page show the presentation and provide the audio for
- Performance Modeling and Extreme Scale Systems
Covers what extreme scale systems are today and the scale of problems they are used to solve. Introduces basic performance modeling and its use to gain insight into the performance of applications.
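As a rough illustration of the kind of basic performance model meant here, the sketch below estimates the run time of a kernel as the larger of its arithmetic time and its memory-traffic time. The function name `model_time` and its parameters are hypothetical, and this "limited by the slower resource" form is only one common simple model, not necessarily the one used in the lecture.

```c
/* Hypothetical simple model: a kernel is limited either by arithmetic
   or by memory traffic, whichever takes longer.
     flops      - floating-point operations performed
     bytes      - bytes moved to or from memory
     peak_flops - assumed peak arithmetic rate, in flop/s
     peak_bw    - assumed sustained memory bandwidth, in bytes/s */
double model_time(double flops, double bytes,
                  double peak_flops, double peak_bw)
{
    double t_compute = flops / peak_flops;
    double t_memory  = bytes / peak_bw;
    return (t_compute > t_memory) ? t_compute : t_memory;
}
```

Comparing the estimate with a measured time suggests whether a kernel is closer to being compute-bound or memory-bound on a given machine.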
- Benchmarking and Sparse Matrix Vector Multiply
This session introduces some of the most important HPC benchmarks.
The second presentation analyzes a sparse matrix-vector multiply and
shows how a simple performance model gives insight into application
- Cache Memory and Performance
This session adds cache memory to the performance model.
The second lecture discusses some of the challenges in measuring performance.
- Spatial Locality
This session discusses the importance of spatial locality and how to
model the performance of a simple cache memory system.
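A small sketch of why spatial locality matters: the two functions below compute the same sum over a row-major matrix, but traverse memory in different orders. The function names are hypothetical; the relative cost depends on the matrix size and the cache, so no timings are claimed here.

```c
#include <stddef.h>

/* Both functions sum an n-by-n matrix stored in row-major order,
   so element (i,j) lives at a[i*n + j]. */

/* Inner loop visits consecutive addresses (unit stride), so every
   element of each cache line that is loaded gets used. */
double sum_unit_stride(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Inner loop jumps n elements between accesses. For large n each
   access may touch a different cache line, wasting most of it. */
double sum_stride_n(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}
```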
- The Cache Oblivious Approach
This session discusses a way of thinking about caches and of developing
algorithms and implementations that is (mostly) independent of the
specific details of the cache. The second presentation covers some
other features of caches that can impact performance.
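One way to picture the cache-oblivious idea is a recursive matrix transpose, sketched below: the problem is split in half along its longer dimension until the pieces are small, so at some level of the recursion the pieces fit in cache, whatever the cache size is. This is a standard textbook illustration and is assumed here, not taken from the lecture. The names and the small base-case cutoff of 8 are hypothetical; the cutoff only limits recursion overhead and is not tuned to any cache.

```c
#include <stddef.h>

/* Transpose a rows-by-cols block of a (row-major, leading dimension
   lda) into b (row-major, leading dimension ldb):
   b[j*ldb + i] = a[i*lda + j]. */
void transpose_rec(const double *a, double *b,
                   size_t rows, size_t cols, size_t lda, size_t ldb)
{
    if (rows <= 8 && cols <= 8) {
        /* Base case: small block, transpose directly. */
        for (size_t i = 0; i < rows; i++)
            for (size_t j = 0; j < cols; j++)
                b[j * ldb + i] = a[i * lda + j];
    } else if (rows >= cols) {
        /* Split the rows; the lower rows of a become the right-hand
           columns of b. */
        size_t h = rows / 2;
        transpose_rec(a, b, h, cols, lda, ldb);
        transpose_rec(a + h * lda, b + h, rows - h, cols, lda, ldb);
    } else {
        /* Split the columns; the right-hand columns of a become the
           lower rows of b. */
        size_t h = cols / 2;
        transpose_rec(a, b, rows, h, lda, ldb);
        transpose_rec(a + h, b + h * ldb, rows, cols - h, lda, ldb);
    }
}

/* Transpose an n_rows-by-n_cols matrix a into the n_cols-by-n_rows
   matrix b. The two arrays must not overlap. */
void transpose(const double *a, double *b, size_t n_rows, size_t n_cols)
{
    transpose_rec(a, b, n_rows, n_cols, n_cols, n_rows);
}
```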
- More on How the Processor Works
This session discusses a simple execution model and introduces the
issue of aliasing of memory.
The second lecture revisits the dense matrix-matrix multiply operation
using the ideas developed in this class.
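The aliasing issue mentioned in this session can be sketched with two versions of the same loop. In the first, the compiler must assume that the pointers may refer to overlapping memory; in the second, the C99 `restrict` qualifier promises that they do not. The function names are hypothetical, and whether a particular compiler generates faster code for the second version is not guaranteed.

```c
#include <stddef.h>

/* y, x, and alpha might overlap, so a store to y[i] could change
   *alpha or a later x[i]. The compiler may have to reload *alpha on
   every iteration and may be unable to vectorize the loop. */
void scale_add_may_alias(double *y, const double *x,
                         const double *alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += *alpha * x[i];
}

/* restrict tells the compiler that, within this function, the data
   reached through each pointer is not accessed through the others.
   Calling this with overlapping arguments is undefined behavior. */
void scale_add_restrict(double *restrict y, const double *restrict x,
                        const double *restrict alpha, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += *alpha * x[i];
}
```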
- Instruction Execution and Pipelining
This session covers how modern processors execute instructions in a
series of steps and how it affects performance.
This session introduces vectors in modern processors and discusses
some of their features and challenges.
- Moore's Law and Speedup
This session discusses Moore's Law, Dennard Scaling, and some limits
on speedup of applications.
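One classic limit on speedup is Amdahl's law; that it is among the limits covered in this session is an assumption. If a fraction f of the run time is inherently serial, the speedup on p processors is at most 1 / (f + (1 - f) / p). The function name below is hypothetical.

```c
/* Amdahl's law: upper bound on speedup when a fraction
   serial_fraction (between 0 and 1) of the work cannot be
   parallelized and the rest is divided evenly among nprocs
   processors. */
double amdahl_speedup(double serial_fraction, int nprocs)
{
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nprocs);
}
```

With a serial fraction of 0.1, for example, the bound never reaches 10 no matter how many processors are used.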
This session introduces threads and thread parallelism, including some
advantages and disadvantages of shared memory.
- OpenMP Basics
This session introduces OpenMP, the most widely used approach for
threaded programming for scientific applications, and how to use
OpenMP to parallelize loops.
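A minimal sketch of loop parallelization with OpenMP, using two hypothetical routines chosen for illustration. The iterations of the first loop are independent, so a `parallel for` directive is enough; the second accumulates into a shared variable and needs a `reduction` clause. An OpenMP-enabled compiler flag (for example `-fopenmp` with GCC) is assumed; without it the directives are ignored and the loops run serially with the same result.

```c
/* y = y + a*x. Each iteration touches a different y[i], so the
   iterations can be divided among threads with no further changes. */
void daxpy(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Dot product. Every iteration updates sum, so each thread keeps a
   private partial sum that OpenMP combines at the end of the loop. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```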
- OpenMP and MAXLOC
This session discusses critical sections and atomic operations in
loops in OpenMP, with an emphasis on understanding the performance
implications of different approaches.
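One possible approach to MAXLOC, sketched under the assumption that it resembles the ones compared in the lecture: each thread finds the maximum over its own share of the loop without any synchronization, and a critical section is entered only once per thread to merge the results. Putting the critical section inside the loop body would also be correct but would serialize every iteration. The function name is hypothetical.

```c
/* Return the index of the first occurrence of the largest element of
   v[0..n-1] and store its value in *maxval. Requires n >= 1. */
int maxloc(int n, const double *v, double *maxval)
{
    double gmax = v[0];   /* shared result: value */
    int    gloc = 0;      /* shared result: location */

    #pragma omp parallel
    {
        /* Thread-private running maximum; no synchronization needed. */
        double lmax = v[0];
        int    lloc = 0;

        #pragma omp for nowait
        for (int i = 1; i < n; i++) {
            if (v[i] > lmax) {
                lmax = v[i];
                lloc = i;
            }
        }

        /* One short critical section per thread to merge results.
           Ties are broken in favor of the smaller index. */
        #pragma omp critical
        {
            if (lmax > gmax || (lmax == gmax && lloc < gloc)) {
                gmax = lmax;
                gloc = lloc;
            }
        }
    }

    *maxval = gmax;
    return gloc;
}
```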
- OpenMP and General Synchronization
This session discusses more general loop parallelism in OpenMP, using
a linked list as an example in discussing thread locks and
- Distributed Memory Parallelism
An overview of parallelism, from single processors to massively
parallel distributed memory systems. Interconnects and bisection
- Parallel Programming Models and Systems
Illustrates different ways to program the same simple loop, using
different parallel programming systems. Emphasizes the role of the
data decomposition in expressing parallelism.
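To make the role of the data decomposition concrete, the sketch below computes which contiguous range of loop iterations each process would own under a block distribution. The block distribution is just one common choice and is assumed here for illustration; the function name is hypothetical.

```c
/* Block decomposition of the iterations 0..n-1 over nprocs processes.
   Process `rank` owns the half-open range [*lo, *hi). When n is not
   divisible by nprocs, the first (n % nprocs) ranks get one extra
   iteration each, so sizes differ by at most one. */
void block_range(int n, int nprocs, int rank, int *lo, int *hi)
{
    int base = n / nprocs;
    int rem  = n % nprocs;
    /* Number of lower-numbered ranks that received an extra item. */
    int extra_before = (rank < rem) ? rank : rem;

    *lo = rank * base + extra_before;
    *hi = *lo + base + ((rank < rem) ? 1 : 0);
}
```

Each process then runs the original loop body only for `i` from `lo` to `hi - 1`, which is the same pattern regardless of which programming system carries it out.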
- MPI Basics
An introduction to the Message Passing Interface (MPI), including
simple send and receive.
- MPI Point to Point Communication
Understanding MPI point to point communication, including the effects
of the MPI implementation.
- Strategies for Designing Parallel Applications
An introduction to several ways to think about the design of a
parallel application, with a halo exchange as an example.
- Performance Models
- More on Efficient Halo Exchange
This session uses halo exchange to explain the importance of deferring
synchronization. It also discusses the use of MPI datatypes to avoid
extra memory motion.
- MPI Process Topologies
This session discusses virtual and actual (physical) topologies and
how to use MPI to influence the layout of processes on the parallel
- Collective Communication in MPI
This session discusses the rich collective communication and
computation features available in MPI.
- More on Collective Communication
This session discusses some performance considerations of collective
operations, including an example of when using an optimal collective
routine gives poorer performance than simpler choices for
- Introduction to Parallel I/O
This session introduces some of the issues in effectively using I/O
with massively parallel computers.
- Introduction to MPI I/O
An introduction to the parallel I/O features provided by MPI and how
they compare to POSIX I/O.
- More on MPI I/O Performance
This session covers more on MPI I/O, including performance optimizations.
- One-Sided Communication in MPI
This session introduces a different model of parallel computing,
one-sided communication, and describes how this model is represented
- More on One-Sided Communication
This session covers passive target synchronization and understanding
the MPI RMA memory models.
- MPI, Hybrid Programming, and Shared Memory
An introduction to using the MPI+X hybrid programming approach. MPI
with threads and the different thread levels are covered. Also
introduced is the MPI-3 shared-memory feature, sometimes called the
MPI+MPI programming approach.
- New Features of MPI-3
This session provides a summary of the new features in MPI-3.