These are my current projects for which I am looking for students. Contact me by email if you are interested in one of these.

MPI Process Mapping

Description
How MPI processes are assigned to cores, sockets, and nodes can impact the amount of data that moves between each level of that hierarchy - and thus the scalability and performance of an application.
Project
This project looks at ways to improve that mapping of MPI processes and thus provide better performance and scalability with little to no work by the application programmer. The first step is a more complete and extensible implementation of Cartesian process grids, as described in "Using Node and Socket Information to Implement MPI Cartesian Topologies", referenced below. The extensions include adding NUMA (non-uniform memory access) regions to the determination of the MPI process-to-hardware mapping.
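
To make the mapping question concrete, here is a minimal sketch (not the implementation from the paper) of how a program can discover which ranks share a node with MPI_Comm_split_type and then build a 2-D Cartesian communicator; the project would replace the implementation's default reordering with an explicit node-, socket-, and NUMA-aware assignment.

    /* Minimal sketch: discover how many ranks share a node, then create a
     * 2-D Cartesian communicator.  A node-aware implementation would use the
     * node-local size to choose a process-to-coordinate assignment that keeps
     * Cartesian neighbors on the same node where possible. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Ranks that share a node (shared memory) end up in nodecomm */
        MPI_Comm nodecomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);
        int node_size;
        MPI_Comm_size(nodecomm, &node_size);

        /* Let MPI pick the grid shape, then create the Cartesian communicator.
         * reorder=1 permits the implementation to remap ranks; the project
         * would supply an explicit node/socket/NUMA-aware mapping instead. */
        int dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Dims_create(world_size, 2, dims);
        MPI_Comm cartcomm;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cartcomm);

        if (world_rank == 0)
            printf("grid %d x %d, %d ranks per node\n",
                   dims[0], dims[1], node_size);

        MPI_Comm_free(&cartcomm);
        MPI_Comm_free(&nodecomm);
        MPI_Finalize();
        return 0;
    }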

This approach will be tested on some standard benchmarks as well as applications such as nek5000 and the related nekbone benchmark. A paper focusing on the performance improvements in applications will be submitted to the EuroMPI conference.

References
There are numerous papers looking at MPI process mapping, especially for general mappings of processes to an interconnection network. These are just a sampling, but will provide a good starting point. The first paper is the most relevant to this project.
Using Node and Socket Information to Implement MPI Cartesian Topologies
William Gropp, Parallel Computing, Extended version of EuroMPI'18 paper.
Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test
William Gropp, Luke Olson, Philipp Samfass, in proceedings of EuroMPI 2016, pp. 41-50.
Generic Topology Mapping Strategies for Large-scale Parallel Architectures
Torsten Hoefler and Marc Snir, Proceedings of the 2011 ACM International Conference on Supercomputing (ICS'11), presented in Tucson, AZ, pages 75--85, ACM, ISBN: 978-1-4503-0102-2, Jun. 2011.
Netloc: A Tool for Topology-Aware Process Mapping
Cyril Bordage, Clement Foyer, and Brice Goglin, Euro-Par 2017: Parallel Processing Workshops, pp. 157-166.

Data Movement Optimizations for I/O and Data Management

Description
Modern systems have complex, hierarchical data storage, and applications rarely make good use of these capabilities. This project is part of a larger effort to improve tools for managing data for science applications.
Project
This project seeks to take advantage of knowledge about the way in which the application accesses data to improve performance. As noted in the paper by Luu et al. below, a mismatch between the design of the I/O system and the application's access pattern can cause orders-of-magnitude loss in performance. The project builds on an existing system, the Proactive Data Containers (PDC) project. There are three parts to this project:
  1. Accelerate data access through proactive data movement and reorganization. Given information about the access patterns, data movement efficiency can be improved using storage-system aware data placement. This part will create a resource graph of data access and take advantage of methods for mapping such graphs to resources (and see the MPI process mapping project above for similar issues).
  2. Increasing scalability with tunable consistency. POSIX I/O semantics enforce strict ordering and atomic consistency. This limits scalability and performance, and can require complex and fragile software to attempt to optimize performance while preserving the POSIX requirements. While POSIX I/O is frequently listed as a requirement for HPC systems, in fact few if any HPC applications require the strong consistency guaranteed by POSIX. This part will develop methods and tools both to analyze applications for their consistency requirements and to develop software APIs for tools such as PDC (see references below) to express weaker consistency models (a hypothetical sketch of such an API follows this list). Experiments can take advantage of the non-POSIX I/O available on NCSA's Delta supercomputer.
  3. Maintaining fault tolerance and resilience. With the increasing complexity of both software and hardware in HPC systems, planning for failures is increasingly important. This part looks at different tradeoffs in performance and resilience, and explores ways for the user or application developer to select different levels of fault tolerance.
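
As a purely hypothetical illustration of part 2 (the names and types below are invented for this sketch and are not the PDC API; see the PDC documentation referenced below for the real interface), an application-facing consistency hint might look like:

    /* Hypothetical sketch only -- not the PDC API.  Illustrates the kind of
     * interface part 2 could develop: the application states the consistency
     * it actually needs so the runtime can avoid POSIX-style strict ordering. */
    typedef enum {
        CONSISTENCY_POSIX,      /* strict ordering and atomic visibility       */
        CONSISTENCY_COMMIT,     /* writes visible only after an explicit commit */
        CONSISTENCY_SESSION     /* writes visible to others at session close   */
    } consistency_t;

    /* Hypothetical calls: associate a consistency level with a container,
     * and mark the points where weaker models require synchronization. */
    int container_set_consistency(int container_id, consistency_t level);
    int container_commit(int container_id);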
References
Proactive Data Containers (PDC)
Project web page.
PDC Documentation
PDC documentation, including installation, API, and examples.
PDC Papers
Publications about PDC
A Multiplatform Study of I/O Behavior on Petascale Supercomputers
Huong Luu, Marianne Winslett, William Gropp, Robert B. Ross, Philip H. Carns, Kevin Harms, Prabhat, Surendra Byna, and Yushu Yao, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2015, Portland, OR, USA, June 15-19, 2015, Thilo Kielmann, Dean Hildebrand, and Michela Taufer, 33–44, 2015.

Programming Tools for Heterogeneous Systems

Description

Most HPC systems are heterogeneous at some level - that is, they contain different types of processing elements. Programming these systems remains a challenge. While there has been much research in this area, there are few working systems.

There are different degrees of heterogeneity. Here are the ones that are important for this project:

  1. Within a processor, in a way that may require special effort from the application developer. Vector instructions are an example here; these often require special compiler flags and/or directives in the source code, along with care in writing vectorizable code (see the sketch after this list).
  2. Within a node. Most often, this means using an accelerator such as one or more GPUs. However, special memory systems, such as user-accessible HBM or "burst buffers", can also be considered heterogeneous components at the node level.
  3. Between nodes. This applies when nodes have different processors, accelerators, or other developer-visible configurations. For example, NCSA's Delta supercomputer has 4 major node types: dual CPU nodes (124), 4xA100 GPU nodes (100), 8xA100 GPU nodes (5), and 4xA40 GPU nodes (100). There is also one 8xMI100 GPU node. The recently announced DeltaAI system will add yet another node type, with 4xH100 GPU nodes.
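
For item 1, here is a small sketch of what that "special effort" can mean in practice: a vectorizable loop with an OpenMP simd directive, compiled with flags such as gcc -O3 -march=native -fopenmp-simd (flag names vary by compiler).

    /* Example for item 1: a loop written so the compiler can vectorize it.
     * The restrict qualifiers and the simd directive tell the compiler that
     * the accesses do not alias and that the loop is safe to vectorize. */
    #include <stddef.h>

    void axpy(size_t n, double a, const double *restrict x, double *restrict y)
    {
    #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }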

Despite some claims, heterogeneous programming remains difficult at all levels. Even where it works well, the solutions are often "brittle" - small changes in the requirements or algorithm may break the approach. However, for the purposes of this project, the focus is on heterogeneity between nodes. In this case, the major issues are both the pragmatic ones of managing different executables and the challenge of efficient scheduling of operations onto nodes with different capabilities.

Project

This project starts with an evaluation of current methods, especially for multi-GPU nodes, by applying them to a few key benchmarks: HPL (used for the Top500 ranking), HPCG (which is more representative than HPL of many science applications), and perhaps an ML benchmark. This will enable the use of "evergreen" (frequently updated) parallel hardware for challenging applications. These results will be gathered into a paper and submitted to a conference.

Building on the results of this study, gaps in current solutions will be identified and addressed.

References
There are many papers, tools, and systems for programming heterogeneous systems.
Evolution of Programming Approaches for High-Performance Heterogeneous Systems
Jacob Lambert, 2020. A nice review of approaches for heterogeneous computing.
HPL
The High Performance Linpack (HPL) benchmark solves a dense system of linear equations, and is used to rank systems on the Top500 list.
StarPU: A Unified Runtime System for Heterogeneous Multicore Architectures
StarPU is an example of a system focused on the scheduling of tasks on heterogeneous systems.
OpenACC
OpenACC was created to support programming accelerators.
OpenMP
OpenMP standard; OpenMP provides programming support for multi-core processors and for accelerators.

Code Transformations for Performance

Description
The reality is that compilers often don't have enough information to produce the best performing code. This project uses code transformations to improve performance of applications, particularly for codes limited by memory performance.
Project
The goal of this project is to address issues that are impeding adoption of code transformation approaches to creating better-performing code. The first part of this project is to install the current version of ICE/LOCUS (see the papers by Teixeira et al. below), a tool designed to manage both the application of code transformations and autotuning of the generated code.
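
As a generic illustration of the kind of transformation such a tool manages (this is not ICE/LOCUS syntax), consider cache blocking of a memory-bound loop nest:

    /* Generic illustration: cache blocking of a loop nest.  The original
     * traverses B column-by-column with poor locality; the blocked version
     * works on BSIZE x BSIZE tiles that fit in cache.  N is assumed to be
     * a multiple of BSIZE to keep the sketch short. */
    #define N 2048
    #define BSIZE 64

    void transpose(double A[N][N], const double B[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = B[j][i];
    }

    void transpose_blocked(double A[N][N], const double B[N][N])
    {
        for (int ii = 0; ii < N; ii += BSIZE)
            for (int jj = 0; jj < N; jj += BSIZE)
                for (int i = ii; i < ii + BSIZE; i++)
                    for (int j = jj; j < jj + BSIZE; j++)
                        A[i][j] = B[j][i];
    }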

Once the ICE/LOCUS system is installed and operational, it will be applied to several benchmarks and to applications at NCSA. The results will be evaluated to determine the next steps. Such steps might include:

  • Creating new code transformations to meet the needs of applications
  • Exploring the use of AI/ML for new transformations, for optimization of autotuning, for generation of test cases, and for documentation of the transformations applied
  • Enhancing the code analysis framework to better support code transformations, particularly inter-procedural analysis (including across separate files)
  • Application to multicore processors and accelerators, which often require different approaches to tuning
References
Center for Exascale-enabled Scramjet Design (CEESD)
This project includes a significant effort in generating code from a high-level representation, and includes work on lower-level code generation.
A DSL for Performance Orchestration
Thiago Santos Faria Xavier Teixeira, David Padua, and William Gropp, 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017, Portland, OR, USA, September 9-13, 2017, 372, 2017.
Managing code transformations for better performance portability
Thiago SFX Teixeira, William Gropp, and David Padua, The International Journal of High Performance Computing Applications, 33, 6, 1290–1306, 2019.
loopy GitHub page
Loopy is a Python-based tool for implementing code transformations on loops for CPUs and GPUs.
Moya—A JIT Compiler for HPC
Prabhu, Tarun and Gropp, William, Programming and Performance Visualization Tools, Bhatele, Abhinav, Boehme, David, Levine, Joshua A., Malony, Allen D., and Schulz, Martin, 56–73, 2019.

MPI Datatypes for Programmed Data Movement

Description
MPI datatypes provide a language for describing data movement, particularly for noncontiguous data. This project seeks to improve the performance of applications that make use of MPI datatypes, especially applications using GPUs and multicore CPUs.
Project

Create a new implementation of the datatype engine for MPI, and integrate it into the MPICH implementation of MPI. All of this needs to be driven by some representative benchmarks. Torsten Hoefler's ddtbench is a good starting point, but there are others, and we need GPU-aware benchmarks as well.
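
For context, the fragment below is a standard textbook-style example (not project code) of describing noncontiguous data with an MPI datatype; the datatype engine is the part of the MPI library that turns such a description into actual packing and copying.

    /* Standard example: describe one column of a row-major NROWS x NCOLS
     * matrix as a strided (noncontiguous) MPI datatype, then send it. */
    #include <mpi.h>

    #define NROWS 4
    #define NCOLS 8

    void send_column(double matrix[NROWS][NCOLS], int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column;
        /* NROWS blocks of 1 double, separated by a stride of NCOLS doubles */
        MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        MPI_Send(&matrix[0][col], 1, column, dest, 0, comm);
        MPI_Type_free(&column);
    }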

Some of the issues in the current datatype engine in MPICH:

  1. Performance in MPICH is sometimes inferior to that of Open MPI and to straightforward user code.
  2. Yaksa enormously increases the size of the MPICH library and potentially of executables.
  3. Datatypes that do not map easily onto the triple-loop model of Yaksa may not perform well.
  4. Yaksa does provide good performance for moving data to and from GPUs when the MPI datatype matches the Yaksa model; this performance benefit needs to be preserved.

Proposed approach
MPI Datatypes are converted into an internal intermediate representation. This representation preserves information about loops and other common aggregate operations. A starting point for this IR could be that used in DAME.
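
The structure below is a hypothetical sketch of such an IR node (it is not the DAME representation; the field names are invented for illustration). The key point is that loop structure is kept explicit rather than flattened into a list of offset/length pairs.

    /* Hypothetical sketch of an IR node -- not the DAME representation.
     * Loop structure (count/blocklength/stride) stays explicit so that
     * later optimizations can reason about it. */
    #include <stddef.h>

    typedef enum { IR_CONTIG, IR_VECTOR, IR_INDEXED, IR_STRUCT } ir_kind_t;

    typedef struct ir_node {
        ir_kind_t kind;
        size_t count;               /* number of blocks / loop trip count */
        size_t blocklen;            /* elements per block                 */
        ptrdiff_t stride;           /* bytes between blocks (IR_VECTOR)   */
        const ptrdiff_t *offsets;   /* per-block offsets (IR_INDEXED)     */
        struct ir_node *child;      /* nested datatype, if any            */
    } ir_node_t;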

Next, this IR is optimized, using familiar compiler techniques. The transformations to consider include:

  • Loop fusion (this is good for a vector of contiguous types, reducing a double loop to a single loop; see the sketch after this list)
  • Strength reduction (e.g., index to stride. Could be combined with loop splitting if most but not all indexed entries have constant stride to the next)
  • Constant propagation (e.g., for offsets)
  • Loop interchange and cache blocking (good for regular but non-unit stride access)
  • Precompiled data movement patterns (DAME provides simple loops; 2- or 3-level nesting could be considered).
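
The sketch below illustrates the loop-fusion case from the first bullet, under the simplifying assumption of a vector of contiguous double-precision blocks: when the stride equals the block length, the double loop collapses into a single contiguous copy.

    /* Sketch of the loop-fusion case: a vector of contiguous blocks whose
     * stride equals the block length is really one contiguous region, so
     * the double loop collapses to a single copy. */
    #include <string.h>

    /* General vector pack: count blocks of blocklen doubles, stride apart */
    void pack_vector(double *dst, const double *src,
                     size_t count, size_t blocklen, size_t stride)
    {
        for (size_t i = 0; i < count; i++)
            for (size_t j = 0; j < blocklen; j++)
                dst[i * blocklen + j] = src[i * stride + j];
    }

    /* After the optimization: when stride == blocklen the data is contiguous */
    void pack_vector_fused(double *dst, const double *src,
                           size_t count, size_t blocklen)
    {
        memcpy(dst, src, count * blocklen * sizeof(double));
    }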

A new optimization to consider for GPUs is a variant of data sieving - for noncontiguous data, estimate whether performance is better reading a block and discarding the unneeded data or transferring the data piecemeal. This can be used for transfers both into and out of the GPU. A refinement is to pick an approach but then monitor performance to refine the model, and potentially switch approaches.
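
A minimal sketch of that decision, with placeholder cost parameters that would in practice be measured on the actual GPU and bus and refined at runtime:

    /* Hypothetical cost model for the data-sieving choice described above:
     * transfer one large block and discard unneeded bytes, or move each
     * needed piece separately.  The constants are placeholders. */
    #include <stdbool.h>
    #include <stddef.h>

    #define PER_TRANSFER_OVERHEAD_US 10.0     /* placeholder launch/setup cost */
    #define BANDWIDTH_BYTES_PER_US   12000.0  /* placeholder transfer rate     */

    static bool use_block_transfer(size_t npieces, size_t piece_bytes,
                                   size_t block_bytes)
    {
        double piecemeal = npieces * (PER_TRANSFER_OVERHEAD_US +
                                      piece_bytes / BANDWIDTH_BYTES_PER_US);
        double block = PER_TRANSFER_OVERHEAD_US +
                       block_bytes / BANDWIDTH_BYTES_PER_US;
        return block < piecemeal;
    }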

Another interesting optimization that could apply to both collective communication and collective I/O is to optimize for multiple processes on a single node (or NUMA domain), particularly to stripe across multiple NICs.

Finally, execution of this IR needs to work well on GPUs as well as CPUs. To accomplish this, execution of the IR should try to invoke fast kernels as much as possible. These should probably include some kind of nested loop (as in Yaksa). Note that DAME included compiled versions of code for the innermost loops.

Things to keep in mind:

  1. It must be possible to segment the copies, in the case where an intermediate buffer is used and that buffer is too small to hold all of the data. Care needs to be taken that this feature doesn't impact performance when the data fits within a single segment or into a single user buffer (a minimal sketch of segmented packing follows this list).
  2. Years ago, part of MPICH (if I recall correctly) assumed flattened datatypes; DAME has a better representation, but flattening the DAME representation discarded all of the benefit of DAME. For this new approach to fit easily into MPICH, there needs to be a clear API for copies under datatypes, and all parts of MPICH will need to correctly implement that API.
  3. There is a point of diminishing return in directly compiling support for nested loops, particularly if loop optimizations such as loop interchange are supported. A study should look at relative value for each level supported.
  4. Compiling a loop or series of loops may not give the best performance, if the loop parameters are not known at compile time. E.g., whether or not to use vector instructions can depend on specific values of the loop parameters.
  5. JIT is tempting but was rarely valuable when DAME was built, because of very high overhead in using compiler-based tools for JIT (e.g., LLVM) and the complexity of code generation on modern architectures (if you want to roll your own). However, this may have changed, and should be investigated.
  6. MPI requires that it be possible to recover the construction of the MPI datatype. This may not be the same as the IR representation, particularly after any optimizations. Thus, it is necessary to maintain a separate description (note that an extension could allow the user to indicate that they will not query the datatype, eliminating the need for maintaining this information).
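
For item 1, here is a minimal sketch of segmented packing (illustrative only; the function and parameter names are invented) for a simple strided layout, where position records how far through the datatype the previous call got, in the spirit of MPI_Pack:

    /* Pack at most seg_bytes of a strided source (count blocks of blocklen
     * bytes, stride apart) into buf, resuming at *position bytes of packed
     * data; returns the number of bytes packed in this call so a bounded
     * intermediate buffer can be reused across calls. */
    #include <stddef.h>
    #include <string.h>

    size_t pack_segment(char *buf, size_t seg_bytes,
                        const char *src, size_t count, size_t blocklen,
                        size_t stride, size_t *position)
    {
        size_t packed = 0;
        size_t start_block = *position / blocklen;
        size_t start_off = *position % blocklen;

        for (size_t i = start_block; i < count && packed < seg_bytes; i++) {
            size_t off = (i == start_block) ? start_off : 0;
            size_t n = blocklen - off;
            if (n > seg_bytes - packed)
                n = seg_bytes - packed;
            memcpy(buf + packed, src + i * stride + off, n);
            packed += n;
        }
        *position += packed;
        return packed;
    }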
References
DAME: Runtime-compilation for data movement
Tarun Prabhu and William Gropp, The International Journal of High Performance Computing Applications, 32, 5, 760-774, 2018.
DDTBench: Micro-Applications for Communication Data Access Patterns and MPI Datatypes
Benchmark web page, includes link to paper in EuroMPI'12.
Yaksa Web Page
Project page for the Yaksa datatype engine currently used in MPICH
Processing MPI Datatypes Outside MPI
Robert Ross, Robert Latham, William Gropp, Ewing Lusk, and Rajeev Thakur; Outstanding Paper at EuroPVM/MPI'09, Helsinki, Finland.
Implementing Fast and Reusable Datatype Processing
Robert Ross, Neill Miller, and William Gropp. EuroPVM/MPI'03.
Computer Science Department
University of Illinois Urbana-Champaign