Paul Eller
 Hassan Eslami
 Samah Karim
 Tarun Prabhu

Performance Modeling and Analysis for HPC

Background Papers

Other Performance Analysis Resources

Here are additional resources for performance analysis of computational science applications. This is a very eccentric and incomplete list; recommendations for additions are welcome.


These papers provide a quick reference to some of the issues of computer architecture, particularly memory performance, that influence the performance of computational science applications.
These papers consider some aspects of algorithms that must perform well on real hardware, as opposed to idealized hardware.

Sparse matrix-vector multiply is analyzed in this paper: Achieving high sustained performance in an unstructured mesh CFD application (SC1999).
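To make the kernel concrete, here is a minimal sketch of sparse matrix-vector multiply in compressed sparse row (CSR) form, with a rough count of memory traffic per floating-point operation. It is not taken from the paper. The function names, the square-matrix assumption, the 8-byte values and 4-byte indices, and the assumption that the vectors are each moved through memory only once are all illustrative choices.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # The nonzeros of row i occupy values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y[i] = acc
    return y


def csr_bytes_per_flop(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Rough memory traffic per flop for a square CSR matrix.

    Assumption (optimistic): x is read once and y is written once,
    that is, every reuse of x hits in cache.
    """
    traffic = nnz * (value_bytes + index_bytes)  # values and col_idx
    traffic += (n_rows + 1) * index_bytes        # row_ptr
    traffic += 2 * n_rows * value_bytes          # read x, write y
    flops = 2 * nnz                              # one multiply and one add per nonzero
    return traffic / flops


# 3 x 3 example: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_matvec(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Even under the optimistic assumption, the kernel moves several bytes for every flop, which is why its sustained performance tends to be limited by memory bandwidth rather than by peak floating-point rate.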

In parallel computing, the assignment of processes to processors can be an important step in achieving good performance. A good paper with results from a recent supercomputer is Topology Mapping for Blue Gene/L Supercomputer.

Sometimes the algorithm must compensate for a problem in the architecture. Section 2 in Extending Stability Beyond CPU Millennium discusses how a molecular dynamics application worked around errors in the L1 cache by modifying the algorithm.

Changes to the data structures used in an algorithm may be critical for performance. Two examples are Sparsity: Optimization Framework for Sparse Matrix Kernels, which optimizes for register and cache usage, and Optimizing Sparse Data Structures for Matrix-vector Multiply, which optimizes for prefetch support in processor hardware.
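As an illustration of a data structure change of this kind, the sketch below stores a sparse matrix as small dense blocks (block CSR), so that one column index is stored per block instead of per nonzero and the inner loops work on a fixed-size dense block. This is a generic register-blocking sketch, not the implementation from either paper. The function name, the 2 x 2 default block size, and the flat row-major block layout are assumptions made for the example.

```python
def bcsr_matvec(block_row_ptr, block_col_idx, blocks, x, r=2, c=2):
    """Compute y = A @ x for A stored as dense r x c blocks (block CSR).

    blocks[k] is a flat, row-major list of r * c values; explicit zeros
    inside a block are stored and multiplied like any other entry.
    """
    n_block_rows = len(block_row_ptr) - 1
    y = [0.0] * (n_block_rows * r)
    for bi in range(n_block_rows):
        for k in range(block_row_ptr[bi], block_row_ptr[bi + 1]):
            bj = block_col_idx[k]  # one index per block, not per nonzero
            block = blocks[k]
            for ii in range(r):
                acc = 0.0
                for jj in range(c):
                    acc += block[ii * c + jj] * x[bj * c + jj]
                y[bi * r + ii] += acc
    return y


# 4 x 4 example built from three 2 x 2 blocks:
#   [[1, 2, 0, 0],
#    [3, 4, 0, 0],
#    [0, 1, 5, 0],
#    [0, 0, 0, 6]]
block_row_ptr = [0, 1, 3]
block_col_idx = [0, 0, 1]
blocks = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 0.0], [5.0, 0.0, 0.0, 6.0]]
y_blocked = bcsr_matvec(block_row_ptr, block_col_idx, blocks, [1.0, 1.0, 1.0, 1.0])
```

The trade-off is visible in the example: blocks that are not completely full store and multiply explicit zeros, so blocking pays off only when the saved index traffic and better register use outweigh that fill.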

Programming Models

Many programming models have been proposed to address various goals. These papers look at some of the issues in programming for high performance while retaining productivity.

Code Transformations

Performance analysis can identify performance shortfalls in applications. To remedy those shortfalls, it may be necessary to change the data structures or algorithms, as described in some of the papers above. In some cases, however, what is required is a transformation of the code to make better use of computational resources such as the memory hierarchy or processor instructions. This section will add references to both papers and tools that aid in transforming codes.
  • Stencil optimizations
  • Optimizations for short vectors (commodity processor "vector" instructions).
  • Loop optimizations
  • Others
  • (Meta)Tools and language annotations for applying transformation tools
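As one small example of a loop transformation aimed at the memory hierarchy, the sketch below tiles (blocks) the loops of a matrix transpose so that each tile of the input and output is touched while it is still likely to be in cache. It is a generic illustration, not drawn from any of the references. The function names and the tile size are assumptions, and in interpreted Python the code shows only the loop structure; a measurable benefit would be expected in a compiled language.

```python
def transpose_naive(a, n):
    """Transpose an n x n row-major matrix stored as a flat list."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[j * n + i] = a[i * n + j]  # stride-n writes into out
    return out


def transpose_tiled(a, n, tile=4):
    """Same result as transpose_naive, with both loops tiled."""
    out = [0.0] * (n * n)
    for ii in range(0, n, tile):          # loop over tile rows
        for jj in range(0, n, tile):      # loop over tile columns
            # Work inside one tile; min() handles edge tiles when
            # n is not a multiple of the tile size.
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j * n + i] = a[i * n + j]
    return out


n = 5
a = [float(v) for v in range(n * n)]
same = transpose_tiled(a, n, tile=2) == transpose_naive(a, n)
```

The transformation changes only the order in which the iterations run, not the result, which is what makes it a candidate for automation by the kinds of tools this section is meant to collect.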

Performance Modeling

A discussion of performance modeling is in the presentation Analytical Performance Modeling and Simulation for Blue Waters.

A promising direction for parallel codes, particularly ones that use either message-passing or very structured shared or remote memory access, is the use of formal methods to reason about the performance properties of applications. Early results in this direction include the identification of unnecessary barriers in MPI codes, described in A Formal Approach to Detect Functionally Irrelevant Barriers in MPI Programs.

Case Studies and Examples

Performance analysis can help understand the difference between the measured performance of an implementation and the performance that should be achievable if all parts of the application (including the system and runtime libraries)

Further Reading

The book Performance Optimization of Numerically Intensive Codes is a good place to find examples of performance optimization strategies applied to code that you can run yourself, as well as examples of why this course also discusses programming models.

For a thorough discussion of single-node architecture, see the classic Computer Architecture: A Quantitative Approach.

LLCbench is a collection of performance benchmarks that may be useful in applying some of the ideas developed in this class.

Computer Science Department University of Illinois Urbana-Champaign