Biblio

Filters: Author is Levy, Scott
2022-04-01
Marts, W. Pepper, Dosanjh, Matthew G. F., Levy, Scott, Schonbein, Whit, Grant, Ryan E., Bridges, Patrick G..  2021.  MiniMod: A Modular Miniapplication Benchmarking Framework for HPC. 2021 IEEE International Conference on Cluster Computing (CLUSTER). :12–22.
The HPC application community has proposed many new application communication structures, middleware interfaces, and communication models to improve HPC application performance. Modifying proxy applications is the standard practice for evaluating these novel methodologies. Currently, this requires creating a new version of the proxy application for each combination of approaches being tested. In this article, we present a modular proxy-application framework, MiniMod, that enables evaluation of a combination of independently written computation kernels, data transfer logic, communication access, and threading libraries. MiniMod is designed to allow rapid development of individual modules which can be combined at runtime. Through MiniMod, developers need only a single implementation to evaluate application impact under a variety of scenarios. We demonstrate the flexibility of MiniMod's design by using it to implement versions of a heat diffusion kernel and the miniFE finite element proxy application, along with a variety of communication, granularity, and threading modules. We examine how changing communication libraries, communication granularities, and threading approaches impacts these applications on an HPC system. These experiments demonstrate that MiniMod can rapidly improve the ability to assess new middleware techniques for scientific computing applications and next-generation hardware platforms.
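The following is a minimal, hypothetical sketch in C of the composition pattern the abstract describes: a computation kernel written once against a small communication interface, with the backend module selected at runtime. The interface and module names are illustrative assumptions, not MiniMod's actual API, and only the communication dimension is shown; a framework of this kind would expose analogous interfaces for threading and data-transfer granularity.

/* Hypothetical sketch of runtime module composition (not MiniMod's API). */
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Communication module interface: how a kernel exchanges halo data. */
typedef struct {
    const char *name;
    void (*exchange)(double *halo, int n);
} comm_module;

/* Two interchangeable communication backends. */
static void exchange_blocking(double *halo, int n)
{
    (void)halo;
    printf("blocking exchange of %d halo values\n", n);
}

static void exchange_partitioned(double *halo, int n)
{
    (void)halo;
    printf("partitioned exchange of %d halo values\n", n);
}

static const comm_module comm_modules[] = {
    { "blocking",    exchange_blocking },
    { "partitioned", exchange_partitioned },
};

/* A computation kernel written once against the comm_module interface. */
static void heat_step(const comm_module *comm, double *halo, int n)
{
    comm->exchange(halo, n);  /* data transfer delegated to the chosen module */
    /* stencil update over the local domain would follow here */
}

int main(int argc, char **argv)
{
    /* The communication module is chosen at runtime, e.g. from the command line. */
    const char *choice = (argc > 1) ? argv[1] : "blocking";
    double halo[8] = { 0.0 };

    for (size_t i = 0; i < sizeof comm_modules / sizeof comm_modules[0]; i++) {
        if (strcmp(comm_modules[i].name, choice) == 0) {
            heat_step(&comm_modules[i], halo, 8);
            return 0;
        }
    }
    fprintf(stderr, "unknown communication module: %s\n", choice);
    return 1;
}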
2017-04-24
Levy, Scott, Ferreira, Kurt B..  2016.  An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart. Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale. :35–42.

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases) the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases when the failure distribution has relatively little impact.
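For background (a standard result in the checkpoint/restart literature, not a contribution of this paper): under the exponential-failure assumption referenced in the abstract, the optimal checkpoint interval is commonly approximated by Young's first-order formula,

\[
  \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M},
\]

where $\delta$ is the time to commit a checkpoint and $M$ is the system's mean time between failures. The paper's analysis speaks to how much error using this exponential assumption introduces when failures actually follow a different distribution.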