Papers

A list of publications is also available on Google Scholar.

The list of papers here is kept as up-to-date as possible, and the entries are listed in reverse chronological order.

Fault Recovery Methods for Asynchronous Linear Solvers

Evan Coleman, Erik Jensen, and Masha Sosonkina
International Journal of Parallel Programming (2021) (pdf)
Abstract/Overview This study seeks to understand the soft error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. A theoretical investigation into the performance of the asynchronous iterative methods is presented and used to motivate several fault recovery methods for asynchronous linear solvers. The numerical experiments utilize a hybrid-parallel implementation where the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP, and a series of runs are conducted to measure both the impact of soft faults and the effectiveness of the recovery methods. Trials are run to compare two models for simulating the occurrence of a fault as well as techniques for recovering from the effects of a fault. The results show that the proposed strategies can effectively recover from the impact of a fault and that the numerical model for simulating soft faults consistently produces fault effects that enable the investigation and tuning of recovery techniques in action.
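The recovery idea described in this abstract can be illustrated with a minimal single-threaded sketch: a Jacobi sweep with a numerical soft-fault injection, a crude residual-based detector, and rollback to a periodic checkpoint. This is my own hedged illustration, not the paper's hybrid MPI/OpenMP implementation; the function name, thresholds, and fault parameters are all illustrative assumptions.

```python
import numpy as np

def jacobi_with_recovery(A, b, x0, tol=1e-8, max_iter=5000,
                         fault_iter=100, checkpoint_every=25):
    """Jacobi iteration with a simulated soft fault and rollback recovery.

    A numerical fault model perturbs one solution component at `fault_iter`;
    recovery rolls back to the most recent periodic checkpoint.
    """
    D = np.diag(A)
    R = A - np.diagflat(D)
    x = x0.copy()
    checkpoint = x.copy()
    for k in range(max_iter):
        if k % checkpoint_every == 0:
            checkpoint = x.copy()          # periodic (partial) checkpoint
        x = (b - R @ x) / D                # one Jacobi sweep
        if k == fault_iter:                # numerical soft-fault model:
            x[0] += 1e6                    # large perturbation of one component
        r = np.linalg.norm(b - A @ x)
        if r > 1e3 * np.linalg.norm(b):    # crude residual-based fault detector
            x = checkpoint.copy()          # recover by rolling back
        elif r < tol:
            return x, k
    return x, max_iter
```

In a real asynchronous solver the checkpoint and detection would be per-process; the sketch only shows the rollback mechanics.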

Implementing Asynchronous Linear Solvers Using Non-uniform Distributions

Erik Jensen, Evan Coleman, and Masha Sosonkina
Journal of Simulation Engineering (2020) (pdf)
Abstract/Overview Asynchronous iterative methods present a mechanism to improve the performance of algorithms for highly parallel computational platforms by removing the overhead associated with synchronization among computing elements. This paper considers a class of asynchronous iterative linear system solvers that employ randomization to determine the component update orders, specifically focusing on the effects of drawing the order from non-uniform distributions. Results from shared-memory experiments with a two-dimensional finite-difference discrete Laplacian problem show that using distributions favoring the selection of components with a larger contribution to the residual may lead to faster convergence than selecting uniformly. Multiple implementations of the randomized asynchronous linear system solvers are considered and tested with various distributions and parameters. In the best case of parameter choices, average times for the normal and exponential distributions were, respectively, 13.3% and 17.3% faster than the performance with a uniform distribution, and were able to converge in approximately 10% fewer iterations than traditional stationary solvers.
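The residual-weighted selection studied here can be sketched as a sequential model of the shared-memory solver: each step relaxes one component, drawn with probability proportional to the magnitude of its residual entry. The function name and the specific weighting are my illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def randomized_relaxation(A, b, x0, num_updates=5000, rng=None):
    """Sequential model of a randomized asynchronous solver: each step
    relaxes one component, chosen from a non-uniform distribution that
    favors components with a larger contribution to the residual."""
    rng = np.random.default_rng(rng)
    x = x0.copy()
    n = len(b)
    for _ in range(num_updates):
        r = b - A @ x
        w = np.abs(r) + 1e-300             # weight by residual magnitude
        i = rng.choice(n, p=w / w.sum())   # non-uniform component selection
        x[i] += r[i] / A[i, i]             # relax component i
    return x
```

Swapping the weights `w` for a constant vector recovers the uniform baseline the paper compares against.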

Resilience for Asynchronous Iterative Methods for Sparse Linear Systems

Evan Coleman
PhD Dissertation, Old Dominion University (2019) (pdf)
Abstract/Overview

Large scale simulations are used in a variety of application areas in science and engineering to help forward the progress of innovation. Many of these simulations spend the vast majority of their computational time attempting to solve large systems of linear equations, typically arising from discretizations of partial differential equations used to mathematically model various phenomena. The algorithms used to solve these problems are typically iterative in nature, and making efficient use of computational time on High Performance Computing (HPC) clusters involves constantly improving these iterative algorithms. Future HPC platforms are expected to encounter three main problem areas: scalability of code, reliability of hardware, and energy efficiency of the platform. The HPC resources expected to run these large programs are planned to consist of billions of processing units, drawn from traditional multicore processors as well as a variety of hardware accelerators.

This growth in parallelism gives rise to all three problem areas. Previously, work on algorithm development has focused primarily on creating fault tolerance mechanisms for traditional iterative solvers. Recent work has begun to revisit using asynchronous methods for solving large scale applications, and this dissertation presents research into fault tolerance for fine-grained methods that are asynchronous in nature. Classical convergence results for asynchronous methods are revisited and modified to account for the possible occurrence of a fault, and a variety of techniques for recovery from the effects of a fault are proposed. Examples of how these techniques can be used are shown for various algorithms, including an analysis of a fine-grained algorithm for computing incomplete factorizations. Lastly, numerous modeling and simulation tools for the further construction of iterative algorithms for HPC applications are developed, including numerical models for simulating faults and a simulation framework that can be used to extrapolate the performance of algorithms towards future HPC systems.

Enhancing Asynchronous Linear Solvers through Randomization

Evan Coleman, Erik Jensen, and Masha Sosonkina
Proceedings of the 2019 Spring Simulation Multiconference (2019) (pdf)
Abstract/Overview Asynchronous iterative methods present a mechanism to improve the performance of parallel algorithms for highly parallel computational platforms by removing the overhead associated with synchronization among computing elements. This paper considers a class of asynchronous iterative linear system solvers that employ randomization to determine the component update orders, focusing specifically on the effects of non-uniform distributions. Results show that using distributions favoring the selection of components with a larger residual may lead to faster convergence than selecting uniformly. In particular, in the best case of parameter choices, average times for the normal and exponential distributions were, respectively, 13.3% and 17.3% better than the performance with a uniform distribution.

Predictive Modeling of the Performance of Asynchronous Iterative Methods

Erik Jensen, Evan Coleman, and Masha Sosonkina
Journal of Supercomputing (2019) (pdf)
Abstract/Overview Asynchronous algorithms may increase the performance of parallel applications on large-scale HPC platforms due to decreased dependence among processing elements. This work investigates strategies for implementing asynchronous hybrid parallel MPI-OpenMP iterative solvers. Seven different implementations are considered, and results show that balancing communication and computation so as to equalize the number of iterations across processing elements benefits both performance and solution quality. A predictive performance model that uses kernel density estimation to model the underlying probability density function of the collected data is then developed to optimize execution parameters for a given problem. For the majority of iteration executions, the performance model matches the empirical data to within about 6%. The different hybrid parallel implementations are examined further to find optimal parametric distributions whose parameters can be tuned to the problem at hand. The generalized extreme value distribution was selected based on a combination of quantitative and qualitative comparisons, and for most of the iteration executions, the model matches the data to within about 6.1%. Results from the parametric distribution model are examined along with results of the model on related problems, and possible further extensions to the predictive model are discussed.
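The kernel-density-estimation ingredient of the performance model can be sketched in a few lines: fit a Gaussian KDE to observed per-iteration timings so the resulting density can be sampled or integrated when predicting performance. This is a generic numpy-only sketch under my own assumptions (Silverman's rule-of-thumb bandwidth, illustrative names), not the paper's model.

```python
import numpy as np

def gaussian_kde_pdf(samples, grid, bandwidth=None):
    """Gaussian kernel density estimate of per-iteration execution times,
    evaluated on `grid`. Defaults to Silverman's rule-of-thumb bandwidth."""
    samples = np.asarray(samples, dtype=float)
    if bandwidth is None:                       # Silverman's rule of thumb
        bandwidth = 1.06 * samples.std() * len(samples) ** (-1 / 5)
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)
```

The paper's parametric alternative would replace this nonparametric density with a fitted generalized extreme value distribution.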

Soft-Fault Resilience for Fine-Grained Parallel Incomplete Factorizations

Evan Coleman and Masha Sosonkina
Naval Surface Warfare Center, Dahlgren Division Technical Report TR-18/176 (2018) (pdf)
Abstract/Overview This document describes methodologies and algorithmic modifications to the fine-grained incomplete factorization that provide the algorithm with resilience to the occurrence of soft faults. Several variants of the original algorithm are proposed, and both symmetric and nonsymmetric problems are considered.

Impacts of Three Soft-Fault Models on Hybrid Parallel Asynchronous Iterative Methods

Evan Coleman, Erik Jensen, and Masha Sosonkina
30th International Symposium on Computer Architecture and High Performance Computing (2018) (pdf)
Abstract/Overview This study seeks to understand the soft error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. The implementations make use of hybrid parallelism, where the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP. A series of experiments is conducted to measure the impact of an undetected soft fault on an asynchronous iterative method, and to compare and contrast several techniques for simulating the occurrence of a fault and then recovering from its effects. The data show that the two numerical soft-fault models tested here produce adverse behavior more consistently than a "bit-flip" model, making them better suited for testing a variety of recovery strategies, such as those based on partial checkpointing.
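The contrast between a "bit-flip" model and a numerical perturbation model can be sketched as follows. Both functions are my own standalone illustrations (the names and signatures are not from the paper): the bit-flip corrupts the IEEE-754 representation of a value, with an effect that varies wildly with the chosen bit, while the numerical model adds a perturbation of a chosen magnitude, giving the more consistent, tunable fault effects the abstract refers to.

```python
import struct
import numpy as np

def bitflip_fault(x, bit):
    """"Bit-flip" model: flip one bit (0-63) in the IEEE-754 double `x`."""
    (u,) = struct.unpack('<Q', struct.pack('<d', float(x)))
    (y,) = struct.unpack('<d', struct.pack('<Q', u ^ (1 << bit)))
    return y

def perturbation_fault(x, magnitude, rng=None):
    """Numerical model: add a random perturbation of controlled magnitude."""
    rng = np.random.default_rng(rng)
    return x + magnitude * rng.standard_normal()
```

Flipping an exponent bit (e.g., bit 52) changes a value by orders of magnitude, while flipping a low mantissa bit is nearly invisible; the perturbation model avoids this inconsistency.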

Simulation Framework for Asynchronous Iterative Methods

Evan Coleman, Erik Jensen, and Masha Sosonkina
Journal of Simulation Engineering (2018) (pdf)
Abstract/Overview As high-performance computing (HPC) platforms progress towards exascale, computational methods must be revamped to successfully leverage them. In particular, (1) asynchronous methods become of great importance because synchronization becomes prohibitively expensive, and (2) resilience of computations must be achieved, e.g., by using checkpointing selectively, which may otherwise become prohibitively expensive due to the sheer scale of the computing environment. In this work, a simulation framework for asynchronous iterative methods is proposed and tested on an HPC accelerator (shared-memory) architecture. The design proposed here offers a lightweight alternative to existing computational frameworks and allows for easy experimentation with various relaxation iterative techniques, solution updating schemes, and predicted performance. The simulation framework is implemented in MATLAB® using function handles, which offers a modular and easily extensible design. A case study using the simulation framework is presented to examine the efficacy of different checkpointing schemes for asynchronous relaxation methods.
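The function-handle design can be mimicked in Python with plain callables: the component update rule and the update-order scheme are plugged into a fixed driver, so different relaxation techniques and selection schemes can be swapped without touching the driver. This is a hedged analogue of the MATLAB framework, not a port of it; all names here are illustrative.

```python
import numpy as np

def async_simulation(update_fn, select_fn, x0, num_steps):
    """Drive an iteration where the per-component update rule and the
    update order are supplied as callables (mirroring the framework's
    use of MATLAB function handles for modularity)."""
    x = np.array(x0, dtype=float)
    for step in range(num_steps):
        for i in select_fn(step, x):       # components updated this "sweep"
            x[i] = update_fn(i, x)         # in-place update (Gauss-Seidel-like)
    return x

# Example: plug in a relaxation update for a small tridiagonal system.
A = np.array([[4., -1, 0], [-1, 4, -1], [0, -1, 4]])
b = np.array([1., 2., 3.])
update = lambda i, x: (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
order = lambda step, x: range(len(b))      # sequential sweep; randomize to taste
x = async_simulation(update, order, np.zeros(3), num_steps=100)
```

A checkpointing study like the paper's case study would wrap `update_fn` with save/restore logic, again without modifying the driver.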

Using Modeling to Improve the Performance of Asynchronous Jacobi

Erik Jensen, Evan Coleman, and Masha Sosonkina
Proceedings of the 24th annual International Conference on Parallel and Distributed Processing Techniques and Applications (2018) (pdf)
Abstract/Overview Asynchronous algorithms may increase performance of parallel applications. This work investigates strategies for implementing asynchronous hybrid parallel MPI-OpenMP Jacobi solvers, and generates a predictive performance model that suggests solver parameters that are well-suited to a specific problem. Results show that making efforts to equalize the number of iterations for all processing elements benefits performance and solution quality.

Convergence and Resilience of the Fault Tolerant Variants of the Fine-Grained Parallel Incomplete LU Factorization for Non-Symmetric Problems

Evan Coleman and Masha Sosonkina
Proceedings of the 2018 Spring Simulation Multiconference (2018) (pdf)
Abstract/Overview This paper presents an investigation into the convergence of the fine-grained parallel algorithm for computing an incomplete LU factorization for non-symmetric and indefinite matrices. The fine-grained parallel incomplete LU factorization is a nonlinear fixed point iteration and convergence has not been extensively studied for problems that are not symmetric positive definite. This work investigates the convergence of the algorithm for these more difficult problems and additionally investigates how the occurrence of a computing fault may impact the convergence of the algorithm for the problems being studied. The results obtained suggest that this class of problems presents challenges for the fine-grained parallel incomplete LU factorization (less than 30% of configurations converge naturally), and that while the occurrence of a fault can cause significant negative effects, the simple algorithmic change advocated here can completely ameliorate the effects of a fault.

Self-Stabilizing Fine-Grained Parallel Incomplete LU Factorization

Evan Coleman and Masha Sosonkina
Sustainable Computing: Informatics and Systems (2018) (pdf)
Abstract/Overview This paper presents an investigation into the use of various mechanisms for improving the resilience of the fine-grained parallel algorithm for computing an incomplete LU factorization. These include various approaches to checkpointing as well as a study into the feasibility of using a self-stabilizing periodic correction step. Results concerning convergence of all of the self-stabilizing variants of the algorithm with respect to the occurrence of faults, and the impact of any sub-optimality in the produced incomplete L and U factors in Krylov subspace solvers are given. Numerical tests show that the simple algorithmic changes suggested here can ensure convergence of the fine-grained parallel incomplete factorization, and improve the performance of the resulting factors as preconditioners in Krylov subspace solvers in the presence of transient soft faults.
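The fixed-point iteration at the heart of this line of work can be sketched densely in the style of Chow and Patel's fine-grained ILU, under the simplifying assumption of a full sparsity pattern (so the fixed point is the exact LU factorization). Every (i, j) update depends only on the previous sweep's values, which is what makes the algorithm fine-grained; in the self-stabilizing variant, an occasional extra sweep acts as the periodic correction step. Names are illustrative.

```python
import numpy as np

def finegrained_ilu_sweep(A, L, U):
    """One Jacobi-style sweep of the fine-grained fixed-point iteration
    for an incomplete LU factorization (dense sketch, full pattern).
    L stays unit lower triangular; U stays upper triangular."""
    n = A.shape[0]
    Lnew, Unew = L.copy(), U.copy()
    for i in range(n):
        for j in range(n):
            m = min(i, j)
            s = L[i, :m] @ U[:m, j]        # partial inner product from old sweep
            if i > j:
                Lnew[i, j] = (A[i, j] - s) / U[j, j]
            else:
                Unew[i, j] = A[i, j] - s
    return Lnew, Unew
```

A fault injected into one entry of L or U only perturbs the current iterate, and subsequent sweeps (the self-stabilizing correction) drive the factors back toward the fixed point.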

A comparison of soft-fault error models in the parallel preconditioned flexible GMRES

Evan Coleman, Aygul Jamal, Marc Baboulin, Amal Khabou, and Masha Sosonkina
Proceedings of the 12th International Conference on Parallel Processing and Applied Mathematics (2018) (pdf)
Abstract/Overview The effect of two soft fault error models on the convergence of the parallel flexible GMRES (FGMRES) iterative method solving an elliptic PDE problem on a regular grid is evaluated. We consider two types of preconditioners: an incomplete LU factorization with dual threshold (ILUT), and an algebraic recursive multilevel solver (ARMS) combined with random butterfly transformation (RBT). The experiments quantify the difference between the two soft fault error models considered in this study and compare their potential impact on the convergence.

Fault Tolerant Variants of the Fine-Grained Parallel Incomplete LU Factorization

Evan Coleman, Masha Sosonkina, and Edmond Chow
Proceedings of the 2017 Spring Simulation Multiconference (2017) (pdf)
Abstract/Overview This paper presents an investigation into fault tolerance for the fine-grained parallel algorithm for computing an incomplete LU factorization. Results concerning the convergence of the algorithm with respect to the occurrence of faults, and the impact of any sub-optimality in the produced incomplete factors in Krylov subspace solvers, are given. Numerical tests show that the simple algorithmic changes suggested here can ensure convergence of the fine-grained parallel incomplete factorization, and improve the performance of the resulting factors as preconditioners in Krylov subspace solvers if faults do occur.

A Comparison and Analysis of Soft-Fault Error Models using FGMRES

Evan Coleman and Masha Sosonkina
Proceedings of the 6th annual Virginia Modeling, Simulation, and Analysis Center Capstone Conference (2016) (pdf)
Abstract/Overview A comparison of the impact of soft fault errors, as given by two distinct soft fault error models, on the GMRES iterative method with flexible preconditioning (called FGMRES) is performed. The effect of the two soft fault error models is evaluated on the convergence of FGMRES in solving an elliptic PDE problem on a regular grid. Two types of preconditioners are explored, featuring an incomplete LU factorization and an algebraic recursive multilevel solver (ARMS). The experiments conducted in this study quantify the difference between soft fault error modeling approaches and provide a means to compare the potential impact of soft fault errors of various types.

Evaluating a Persistent Soft Fault Model on Preconditioned Iterative Methods

Evan Coleman and Masha Sosonkina
Proceedings of the 22nd annual International Conference on Parallel and Distributed Processing Techniques and Applications (2016) (pdf)
Abstract/Overview The impact of soft fault errors on the GMRES iterative method with flexible preconditioning (called FGMRES) is explored. In particular, a new method for simulating soft fault errors is implemented directly in FGMRES, and the effect of error magnitude and timing on FGMRES convergence is evaluated in solving an elliptic PDE problem on a regular grid. Two types of preconditioners are explored, featuring an incomplete LU factorization and an algebraic recursive multilevel solver (ARMS). The experiments confirmed the intuition that, in general, injecting perturbation-based faults at the matrix-vector operation stage had a greater impact on the convergence rate than doing so at the preconditioning application stage, resulting in more cases in which the iterative solver failed. In addition, several cases of better convergence under faults in the preconditioning operation were observed and analyzed.