Designing and Building Applications for Extreme Scale Systems

Learn how to design and implement applications for extreme scale systems, including how to analyze and understand application performance, the primary causes of poor performance and scalability, and how both the choice of algorithm and the programming system affect achievable performance. The course covers multi- and many-core processors, interconnects in HPC systems, parallel I/O, and the impact of faults on program and algorithm design.

Sessions

Note: Each lecture was recorded; unfortunately, the College of Engineering, which had possession of the videos, no longer makes them available and they may be lost.
Introduction
Performance Modeling and Extreme Scale Systems
Covers what extreme scale systems are today and the scale of problems they are used to solve. Introduces basic performance modeling and its use to gain insight into the performance of applications.
Benchmarking and Sparse Matrix Vector Multiply
This session introduces some of the most important HPC benchmarks.

The second presentation analyzes a sparse matrix-vector multiply and shows how a simple performance model gives insight into application performance.
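
To make this concrete, a minimal sparse matrix-vector multiply in compressed sparse row (CSR) format might look like the sketch below (the matrix and array names are illustrative, not taken from the course materials). Each matrix entry is read once and used once, so a simple model predicts that the kernel is limited by memory bandwidth rather than by floating-point rate.

```c
#include <stdio.h>

/* y = A*x for an n x n sparse matrix A stored in CSR format:
   rowptr[i]..rowptr[i+1]-1 index the nonzeros of row i in val[] and colind[]. */
void spmv_csr(int n, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[colind[j]];
        y[i] = sum;
    }
}

int main(void)
{
    /* Tiny 3x3 example matrix (values chosen only for illustration). */
    int    rowptr[] = {0, 2, 3, 5};
    int    colind[] = {0, 2, 1, 0, 2};
    double val[]    = {4.0, 1.0, 3.0, 2.0, 5.0};
    double x[]      = {1.0, 1.0, 1.0};
    double y[3];

    spmv_csr(3, rowptr, colind, val, x, y);
    printf("y = %.1f %.1f %.1f\n", y[0], y[1], y[2]);
    return 0;
}
```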

Cache Memory and Performance
This session adds cache memory to the performance model. The second lecture discusses some of the challenges in measuring performance.
Spatial Locality
This session discusses the importance of spatial locality and how to model the performance of a simple cache memory system.
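
As an illustration (not from the course materials), the two loops below compute the same sum but traverse the array in different orders; in C, the row-wise traversal touches consecutive memory locations and benefits from spatial locality, while the column-wise traversal strides through memory and is typically much slower for large arrays. The array size is illustrative.

```c
#include <stdio.h>

#define N 2048

static double a[N][N];

/* Good spatial locality: C stores a[i][j] row by row, so the inner loop
   walks through consecutive memory locations. */
double sum_rowwise(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: the inner loop strides by N doubles, so each
   access touches a different cache line when N is large. */
double sum_columnwise(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    printf("row-wise sum    = %g\n", sum_rowwise());
    printf("column-wise sum = %g\n", sum_columnwise());
    return 0;
}
```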
The Cache Oblivious Approach
This session discusses a way to think about caches, and to develop algorithms and implementations, that is (mostly) independent of the specific details of the cache. The second presentation covers some other features of caches that can impact performance.
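
Below is a minimal sketch of the cache-oblivious idea for dense matrix-matrix multiply, assuming global arrays and an illustrative matrix order and cutoff (none of this is taken from the course materials): the recursion keeps halving the largest dimension, so at some depth the subproblems fit in every level of the cache even though no cache size appears in the code.

```c
#include <stdio.h>
#include <string.h>

#define N 64          /* matrix order; illustrative only        */
#define CUTOFF 16     /* switch to the simple kernel below this */

static double A[N][N], B[N][N], C[N][N];

/* Multiply the (ni x nk) block of A at (i0,k0) by the (nk x nj) block of
   B at (k0,j0), accumulating into the (ni x nj) block of C at (i0,j0). */
static void matmul_rec(int i0, int j0, int k0, int ni, int nj, int nk)
{
    if (ni <= CUTOFF && nj <= CUTOFF && nk <= CUTOFF) {
        for (int i = 0; i < ni; i++)
            for (int k = 0; k < nk; k++)
                for (int j = 0; j < nj; j++)
                    C[i0 + i][j0 + j] += A[i0 + i][k0 + k] * B[k0 + k][j0 + j];
        return;
    }
    /* Split the largest dimension in half; the recursion eventually
       produces subproblems that fit in every level of the cache. */
    if (ni >= nj && ni >= nk) {
        matmul_rec(i0, j0, k0, ni / 2, nj, nk);
        matmul_rec(i0 + ni / 2, j0, k0, ni - ni / 2, nj, nk);
    } else if (nj >= nk) {
        matmul_rec(i0, j0, k0, ni, nj / 2, nk);
        matmul_rec(i0, j0 + nj / 2, k0, ni, nj - nj / 2, nk);
    } else {
        matmul_rec(i0, j0, k0, ni, nj, nk / 2);
        matmul_rec(i0, j0, k0 + nk / 2, ni, nj, nk - nk / 2);
    }
}

int main(void)
{
    /* A = identity, B = all ones, so C should come out all ones. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (i == j) ? 1.0 : 0.0;
            B[i][j] = 1.0;
        }
    memset(C, 0, sizeof C);
    matmul_rec(0, 0, 0, N, N, N);
    printf("C[0][0] = %.1f, C[N-1][N-1] = %.1f\n", C[0][0], C[N - 1][N - 1]);
    return 0;
}
```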
More on How the Processor Works
This session discusses a simple execution model and introduces the issue of aliasing of memory.

The second lecture revisits the dense matrix-matrix multiply operation using the ideas developed in this class.
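
One of those ideas is aliasing: unless told otherwise, a C compiler must assume the output array may overlap the inputs and generate conservative code. The sketch below (names, sizes, and layout are illustrative, not from the course materials) uses the restrict qualifier to assert that the arrays do not overlap.

```c
#include <stdio.h>

/* C = C + A*B for n x n row-major matrices.  The restrict qualifiers
   assert that a, b, and c do not overlap, so the compiler need not
   reload a and b after every store to c. */
void matmul(int n, const double *restrict a,
            const double *restrict b, double *restrict c)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

int main(void)
{
    /* 2 x 2 check: A = [[1,2],[3,4]], B = identity, so C should equal A. */
    double a[] = {1, 2, 3, 4}, b[] = {1, 0, 0, 1}, c[] = {0, 0, 0, 0};
    matmul(2, a, b, c);
    printf("C = %g %g %g %g\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```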

Instruction Execution and Pipelining
This session covers how modern processors execute instructions in a series of pipelined steps and how this affects performance.
Vectors
This session introduces vectors in modern processors and discusses some of their features and challenges.
Moore's Law and Speedup
This session discusses Moore's Law, Dennard Scaling, and some limits on speedup of applications.
Threads
This session introduces threads and thread parallelism, including some advantages and disadvantages of shared memory.
OpenMP Basics
This session introduces OpenMP, the most widely used approach for threaded programming for scientific applications, and how to use OpenMP to parallelize loops.
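
A minimal example of the kind of loop parallelism this session covers (the work in the loop is illustrative, not from the course slides); compile with an OpenMP option such as -fopenmp.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    /* Each thread gets a chunk of iterations; the reduction clause gives
       every thread a private partial sum that is combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;
        b[i] = a[i] + 1.0;
        sum += b[i];
    }

    printf("threads available: %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}
```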
OpenMP and MAXLOC
This session discusses critical sections and atomic operations in loops in OpenMP, with an emphasis on understanding the performance implications of different approaches.
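
A hedged sketch of the pattern this session examines, with illustrative data: each thread first finds its own candidate maximum, so the critical section is executed only once per thread rather than once per element.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    for (int i = 0; i < N; i++)
        a[i] = (double)((i * 37) % N);   /* arbitrary test data */

    double maxval = -1.0;
    int    maxloc = -1;

    #pragma omp parallel
    {
        /* Each thread finds its own candidate first ... */
        double mymax = -1.0;
        int    myloc = -1;
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            if (a[i] > mymax) { mymax = a[i]; myloc = i; }

        /* ... and only the final comparison needs mutual exclusion. */
        #pragma omp critical
        if (mymax > maxval) { maxval = mymax; maxloc = myloc; }
    }

    printf("max %.1f at index %d\n", maxval, maxloc);
    return 0;
}
```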
OpenMP and General Synchronization
This session discusses more general loop parallelism in OpenMP, using a linked list as an example in discussing thread locks and performance.
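
As an illustration (not from the course materials), an OpenMP lock can protect insertions into a shared linked list while the rest of each iteration runs in parallel; the filter and list contents here are arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

typedef struct node {
    int          value;
    struct node *next;
} node_t;

int main(void)
{
    node_t    *head = NULL;
    omp_lock_t listlock;
    omp_init_lock(&listlock);

    /* The work on each element is independent; only the list insertion
       must be serialized, so the lock is held as briefly as possible. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        if (i % 7 == 0) {                 /* arbitrary filter */
            node_t *n = malloc(sizeof *n);
            n->value = i;
            omp_set_lock(&listlock);
            n->next = head;
            head = n;
            omp_unset_lock(&listlock);
        }
    }
    omp_destroy_lock(&listlock);

    int count = 0;
    for (node_t *p = head; p; ) {
        node_t *next = p->next;
        count++;
        free(p);
        p = next;
    }
    printf("list has %d nodes\n", count);
    return 0;
}
```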
Distributed Memory Parallelism
An overview of parallelism, from single processors to massively parallel distributed memory systems, including interconnects and bisection bandwidth.
Parallel Programming Models and Systems
Illustrates different ways to program the same simple loop, using different parallel programming systems. Emphasizes the role of the data decomposition in expressing parallelism.
MPI Basics
An introduction to the Message Passing Interface (MPI), including simple send and receive.
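
A minimal MPI program with one send and one matching receive (the buffer contents and tag are illustrative, not from the course materials):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        double buf[4] = {0.0};
        if (rank == 0) {
            for (int i = 0; i < 4; i++) buf[i] = i + 1.0;
            /* Send 4 doubles to rank 1 with tag 99. */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive 4 doubles from rank 0 with the same tag. */
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %g %g %g %g\n",
                   buf[0], buf[1], buf[2], buf[3]);
        }
    }
    MPI_Finalize();
    return 0;
}
```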
MPI Point to Point Communication
Understanding MPI point-to-point communication, including the effects of the MPI implementation.
Strategies for Designing Parallel Applications
An introduction to several ways to think about the design of a parallel application, with a halo exchange as an example.
Performance Models
More on Efficient Halo Exchange
This session uses halo exchange to explain the importance of deferring synchronization. It also discusses the use of MPI datatypes to avoid extra memory motion.
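
Below is a sketch of a one-dimensional halo exchange along these lines, with an illustrative local size (not taken from the course materials): all receives and sends are posted before any wait, so no neighbor pair synchronizes prematurely.

```c
#include <stdio.h>
#include <mpi.h>

#define NLOCAL 8   /* interior cells per process; illustrative */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NLOCAL+1] are ghost cells filled from the neighbors. */
    double u[NLOCAL + 2];
    for (int i = 1; i <= NLOCAL; i++)
        u[i] = rank;                       /* arbitrary data */
    u[0] = u[NLOCAL + 1] = -1.0;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Post all receives and sends first, then wait: the transfers can
       proceed (and overlap) in any order.  For strided data such as a
       column of a 2-D array, an MPI_Type_vector datatype lets MPI move
       the data without an extra user-level packing copy. */
    MPI_Request req[4];
    MPI_Irecv(&u[0],          1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],          1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[NLOCAL],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    printf("rank %d ghosts: left=%g right=%g\n", rank, u[0], u[NLOCAL + 1]);
    MPI_Finalize();
    return 0;
}
```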
MPI Process Topologies
This session discusses virtual and actual (physical) topologies and how to use MPI to influence the layout of processes on the parallel computer.
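
A minimal example of creating a virtual 2-D process topology and finding neighbors (illustrative, not from the course materials); MPI_Dims_create picks the decomposition, and reorder = 1 lets the implementation map ranks onto the machine.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI pick a balanced 2-D decomposition of the processes. */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    /* reorder = 1 allows the implementation to place ranks so that grid
       neighbors are close together on the physical machine. */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &rank);

    int coords[2], up, down, left, right;
    MPI_Cart_coords(cart, rank, 2, coords);
    MPI_Cart_shift(cart, 0, 1, &up, &down);     /* neighbors in dim 0 */
    MPI_Cart_shift(cart, 1, 1, &left, &right);  /* neighbors in dim 1 */

    printf("rank %d at (%d,%d): up=%d down=%d left=%d right=%d\n",
           rank, coords[0], coords[1], up, down, left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```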
Collective Communication in MPI
This session discusses the rich collective communication and computation features available in MPI.
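
Two of the most commonly used collectives, sketched with illustrative data (not from the course materials): a global reduction that returns the result everywhere, and a broadcast from one process.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes a local partial sum ... */
    double local = rank + 1.0;
    double total;

    /* ... and MPI_Allreduce combines them and returns the result to every
       process; the implementation chooses the reduction algorithm. */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %g\n", size, total);

    /* Broadcast is another common collective: rank 0 shares a value. */
    int nsteps = (rank == 0) ? 100 : 0;
    MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d sees nsteps = %d\n", rank, nsteps);

    MPI_Finalize();
    return 0;
}
```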
More on Collective Communication
This session discusses some performance considerations for collective operations, including an example in which an "optimal" collective routine gives poorer performance than simpler communication choices.
Introduction to Parallel I/O
This session introduces some of the issues in effectively using I/O with massively parallel computers.
Introduction to MPI I/O
An introduction to the parallel I/O features provided by MPI and how they compare to POSIX I/O.
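
A minimal collective write with MPI I/O, assuming an illustrative file name and block size (not taken from the course materials): every process opens the same file and writes its own block at an explicit offset, and the collective (_all) form lets the library combine the requests into large, well-formed I/O.

```c
#include <stdio.h>
#include <mpi.h>

#define NLOCAL 4   /* values written by each process; illustrative */

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[NLOCAL];
    for (int i = 0; i < NLOCAL; i++)
        buf[i] = rank * NLOCAL + i;

    /* All processes open the same file collectively ... */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* ... and each writes its contiguous block at its own offset. */
    MPI_Offset offset = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, NLOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```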
More on MPI I/O Performance
This session covers more on MPI I/O, including performance optimizations.
One-Sided Communication in MPI
This session introduces a different model of parallel computing, one-sided communication, and describes how this model is represented in MPI.
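
A minimal sketch of one-sided communication with fence synchronization (the value written and the target rank are illustrative, not from the course materials): the target exposes memory in a window and makes no receive call at all.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process exposes one double in a window; the origin of the Put
       does all the work, with no matching receive at the target. */
    double winbuf = -1.0;
    MPI_Win win;
    MPI_Win_create(&winbuf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* start an access epoch */
    if (rank == 0 && size > 1) {
        double val = 42.0;
        /* Write val into the window of rank 1 at displacement 0. */
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                 /* complete the epoch    */

    if (rank == 1)
        printf("rank 1 window now holds %g\n", winbuf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```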
More on One-Sided Communication
This session covers passive target synchronization and understanding the MPI RMA memory models.
MPI, Hybrid Programming, and Shared Memory
An introduction to using the MPI+X hybrid programming approach. MPI with threads and the different thread levels are covered. Also introduced is the MPI-3 shared-memory feature, sometimes called the MPI+MPI programming approach.
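
A minimal MPI+OpenMP sketch (the thread level and the work in the loop are illustrative, not from the course materials): the program requests MPI_THREAD_FUNNELED, does threaded work within each process, and lets only the main thread communicate.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    /* Ask for MPI_THREAD_FUNNELED: many threads run, but only the main
       thread makes MPI calls.  The implementation reports the level it
       can actually provide. */
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0;                     /* per-process threaded work */

    /* Only the main thread communicates, consistent with FUNNELED. */
    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("provided thread level %d, total = %g\n", provided, total);

    MPI_Finalize();
    return 0;
}
```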
New Features of MPI-3
This session provides a summary of the new features in MPI-3.