Performance Modeling and Analysis for HPC
Background Papers
- Bit Reversal on Uniprocessors, Alan H. Karp, SIAM Review, Vol 38, No. 1, March 1996. A nice discussion of some of the issues in understanding the performance of a simple algorithm on real hardware.
- Achieving high sustained performance in an unstructured mesh CFD application, W. Anderson, W. Gropp, D. Kaushik, D. Keyes, and B. Smith, Proceedings of SC99, 1999. A simple performance analysis of sparse matrix vector product; argues for achievable performance as the correct goal, rather than peak.
- Cache Oblivious Algorithms, M. Frigo, C. Leiserson, H. Prokop, S. Ramachandran, Proceedings of the 40th Annual Symposium on the Foundations of Computer Science, 1999. Provides a simplified cache model that does not need details of the cache parameters.
- How Well Can Simple Metrics Represent the Performance of HPC Applications, L. Carrington, M. Laurenzano, A. Snavely, R. Campbell, Jr., L. Davis, Proceedings of SC05, 2005. A study of different microbenchmarks and how predictive they are. Shows that you really need to know something about your workload or about the kind of operations you perform in an application.
- Roofline: an insightful visual performance model for multicore architectures, S. Williams, A. Waterman, and D. Patterson, CACM, Vol 52, No. 4, April 2009.
Other Performance Analysis Resources
Here are additional resources for performance analysis of computational science applications. This is a very eccentric and incomplete list; recommendations for additions are welcome.Papers
- Architecture
-
These papers provide a quick reference to some of the issues of
computer architecture, particularly memory performance, that influence
the performance of computational science applications.
- Bit reversal on uniprocessors (Alan Karp, SIAM Review, 1996)
- Experimental Analysis of Algorithms (Catherine McGeoch, Notices of the American Mathematical Society, March 2001)
- Reflections on the Memory Wall (Sally McKee, ACM Conference on Computing Frontiers, 2004)
- Algorithms
- These papers consider some aspects of algorithms that must perform
well on real hardware, as opposed to idealized hardware.
- The Landscape of Parallel Computing Research: A View from Berkeley, particularly Section 3. (Tech report EECS-2006-183, 2006)
- Cache-Oblivious Algorithms (Algorithms for Memory Hierarchies, LNCS 2625, pages 193-212, Springer Verlag)
- Is Cache-Oblivious DGEMM Viable provides a different perspective on cache oblivious programs for what is probably the best-case for the cache-oblivious approach. This is a shorter version of An experimental comparison of cache-oblivious and cache-conscious programs
An analysis of sparse matrix-vector multiply is presented in this paper: Achieving high sustained performance in an unstructured mesh CFD application (SC1999)
In parallel computing, the assignment of processes to processors can be an important step in achieving good performance. A good paper with results from a recent supercomputer is Topology Mapping for Blue Gene/L Supercomputer
Sometimes the algorithm must compensate for a problem in the architecture. Section 2 in Extending Stability Beyond CPU Millennium discusses how a molecular dynamics application worked around errors in the L1 cache by modifying the algorithm.
Changes to the data structures used in a algorithm may be critical for performance. One is Sparsity: Optimization Framework for Sparse Matrix Kernels, which optimizes for register and cache usage, and and Optimizing Sparse Data Structures for Matrix-vector Multiply, which optimizes for prefetch support in processor hardware.
- Programming Models
- Many programming models have been proposed to address various goals. These papers look at some of the issues in programming for high performance while retaining productivity.
- Code Transformations
-
Performance analysis can identify performance shortfalls in
applications. To remedy those shortfalls, it may be necessary to
change the datastructures or algorithms, as described in some of the
papers above. However, in some cases, what is required is
transformation of the code to make better use of the computational
resources, such as the memory hierarchy or processor instructions.
This section will add references for both papers and tools that aid in
transforming codes.
- Stencil optimizations
- Optimizations for short vectors (commodity processor "vector" instructions).
- Loop optimizations
- Others
- (Meta)Tools and language annotations for applying transformation tools
- Performance Modeling
-
An discussion of performance modeling is in the presentation Analytical Performance Modeling and Simulation for Blue Waters.
A promising direction for parallel codes, particularly ones that use either message-passing or very structured shared or remote memory access, is the use of formal methods to reason about the performance properties of applications. Early results in this direction include the identification of unnecessary barriers in MPI codes, A Formal Approach to Detect Functionally Irrelevant Barriers in MPI Programs .
- Case Studies and Examples
-
Performance analysis can help understand the difference between the measured performance of an implementation and the performance that should be ahievable if all parts of the application (including the system and runtime libraries)
Further Reading
The book Performance Optimization of Numerically Intensive Codes is a good place to find examples of performance optmization strategies applied to code that you can go and run, as well as examples of why this course is also discussing programming models.
For a through discussion of the single node architecture, see the classic Computer Architecture: A Quantitative Approach
LLCbench is a collection of performance benchmarks that may be useful in applying some of the ideas developed in this class.