Slice Swarms for HPC Application Resilience

Title: Slice Swarms for HPC Application Resilience
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Alazzawe, A., Kant, K.
Conference Name: 2017 Fifth International Symposium on Computing and Networking (CANDAR)
Publisher: IEEE
ISBN Number: 978-1-5386-2087-8
Keywords: Artificial neural networks, dynamic energy, energy budget, energy conservation, Energy efficiency, exascale systems, Handheld computers, High performance computing, HPC application resilience, HPC applications, HPC systems, large scale computations, Neural Network Resilience, parallel processing, pubcrawl, resilience, Resilience HPC Silent Errors Energy, resilience methods, resilience techniques, Resiliency, slice swarms, software fault tolerance, System recovery, transient error detection, transient error recovery
Abstract

Resilience in High Performance Computing (HPC) is a constraining factor for bringing applications to the upcoming exascale systems. Resilience techniques must be able to scale to handle the increasing number of expected errors in an energy-efficient manner. Since the purpose of running applications on HPC systems is to perform large-scale computations as quickly as possible, resilience methods should not add a large delay to the application's time to completion. In this paper we introduce a novel technique to detect and recover from transient errors in HPC applications. One feature of our technique is that the energy budget allocated to resilience can be adjusted depending on the operator's resilience needs. For example, on synthetic data, the technique can detect about 50% of transient errors while using only 20% of the dynamic energy required for running the application. For a 60% energy budget, an application that uses 10k cores and takes 128 hours to run will require only 10% longer to complete.

URL: https://ieeexplore.ieee.org/document/8345404/
DOI: 10.1109/CANDAR.2017.107
Citation Key: alazzawe_slice_2017