Slice Swarms for HPC Application Resilience

Submitted by grigby1 on Thu, 06/07/2018 - 3:06pm

Title	Slice Swarms for HPC Application Resilience
Publication Type	Conference Paper
Year of Publication	2017
Authors	Alazzawe, A., Kant, K.
Conference Name	2017 Fifth International Symposium on Computing and Networking (CANDAR)
Publisher	IEEE
ISBN Number	978-1-5386-2087-8
Keywords	Artificial neural networks, dynamic energy, energy budget, energy conservation, Energy efficiency, exascale systems, Handheld computers, High performance computing, HPC application resilience, HPC applications, HPC systems, large scale computations, Neural Network Resilience, parallel processing, pubcrawl, resilience, Resilience HPC Silent Errors Energy, resilience methods, resilience techniques, Resiliency, slice swarms, software fault tolerance, System recovery, transient error detection, transient error recovery
Abstract	Resilience in High Performance Computing (HPC) is a constraining factor for bringing applications to the upcoming exascale systems. Resilience techniques must be able to scale to handle the increasing number of expected errors in an energy-efficient manner. Since the purpose of running applications on HPC systems is to perform large-scale computations as quickly as possible, resilience methods should not add a large delay to the time to completion of the application. In this paper we introduce a novel technique to detect and recover from transient errors in HPC applications. One of the features of our technique is that the energy budget allocated to resilience can be adjusted depending on the operator's resilience needs. For example, on synthetic data, the technique can detect about 50% of transient errors while using only 20% of the dynamic energy required for running the application. For a 60% energy budget, an application that uses 10k cores and takes 128 hours to run will require only 10% longer to complete.
URL	https://ieeexplore.ieee.org/document/8345404/
DOI	10.1109/CANDAR.2017.107
Citation Key	alazzawe_slice_2017

Groups:

Science of Security VO