Slice Swarms for HPC Application Resilience
Title | Slice Swarms for HPC Application Resilience |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Alazzawe, A., Kant, K. |
Conference Name | 2017 Fifth International Symposium on Computing and Networking (CANDAR) |
Publisher | IEEE |
ISBN Number | 978-1-5386-2087-8 |
Keywords | Artificial neural networks, dynamic energy, energy budget, energy conservation, Energy efficiency, exascale systems, Handheld computers, High performance computing, HPC application resilience, HPC applications, HPC systems, large scale computations, Neural Network Resilience, parallel processing, pubcrawl, resilience, Resilience HPC Silent Errors Energy, resilience methods, resilience techniques, Resiliency, slice swarms, software fault tolerance, System recovery, transient error detection, transient error recovery |
Abstract | Resilience in High Performance Computing (HPC) is a constraining factor for bringing applications to the upcoming exascale systems. Resilience techniques must scale to handle the increasing number of expected errors in an energy-efficient manner. Since the purpose of running applications on HPC systems is to perform large-scale computations as quickly as possible, resilience methods should not add a large delay to the application's time to completion. In this paper we introduce a novel technique to detect and recover from transient errors in HPC applications. One feature of our technique is that the energy budget allocated to resilience can be adjusted depending on the operator's resilience needs. For example, on synthetic data, the technique can detect about 50% of transient errors while using only 20% of the dynamic energy required for running the application. For a 60% energy budget, an application that uses 10k cores and takes 128 hours to run will take only 10% longer to complete. |
URL | https://ieeexplore.ieee.org/document/8345404/ |
DOI | 10.1109/CANDAR.2017.107 |
Citation Key | alazzawe_slice_2017 |