An Algorithmic Approach to Error Localization and Partial Recomputation for Low-Overhead Fault Tolerance

Submitted by awhitesell on Thu, 02/02/2017 - 1:36pm

Title	An Algorithmic Approach to Error Localization and Partial Recomputation for Low-Overhead Fault Tolerance
Publication Type	Conference Paper
Year of Publication	2013
Authors	Joseph Sloan (University of Illinois at Urbana-Champaign); Rakesh Kumar (University of Illinois at Urbana-Champaign); Greg Bronevetsky (Lawrence Livermore National Laboratory)
Conference Name	43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2013)
Date Published	06/2013
Publisher	IEEE Computer Society
Conference Location	Budapest, Hungary
Keywords	algorithmic error correction, error localization, NSA SoS Lablets Materials, numerical methods, partial recomputation, science of security, Secure Platforms via Stochastic Computing, sparse linear algebra, UIUC
Abstract	The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback based schemes. Unfortunately, such schemes are primarily limited to rare error event scenarios as the overheads of such schemes become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight for this approach is that even under high error scenarios, even if the result of an algorithm is erroneous, most of it is correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our novel resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm, and demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60% with parallel experiments up to 100 nodes.
Citation Key	node-31836
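
The abstract describes localizing errors in a matrix-vector product and recomputing only the affected part. The sketch below illustrates that general idea with a simple block-checksum scheme. It is not the algorithm from the paper, whose details are not given in this record. All function names, the block size, and the tolerance are assumptions made for illustration. The sketch also assumes the checksum computation itself is error-free and that errors within a block do not cancel each other out.

```python
# Illustrative sketch only: block-checksum error localization for y = A x,
# followed by recomputation of just the suspect row blocks.
# Not the paper's method; every name here is hypothetical.

def matvec_rows(A, x, rows):
    """Return {row index: dot(A[row], x)} for the requested rows only."""
    return {i: sum(a * b for a, b in zip(A[i], x)) for i in rows}

def block_checksums(A, block_size):
    """Precompute, for each block of rows, the column-wise sum of those rows."""
    n_cols = len(A[0])
    sums = []
    for start in range(0, len(A), block_size):
        block = A[start:start + block_size]
        sums.append([sum(row[j] for row in block) for j in range(n_cols)])
    return sums

def localize_and_repair(A, x, y, checksums, block_size, tol=1e-9):
    """Flag blocks whose output sum disagrees with the checksum, then recompute them."""
    repaired = list(y)
    bad_blocks = []
    for b, cs in enumerate(checksums):
        start = b * block_size
        stop = min(start + block_size, len(A))
        # (sum of rows in block) . x should equal the sum of y over the block.
        expected = sum(c * xi for c, xi in zip(cs, x))
        observed = sum(y[start:stop])
        if abs(expected - observed) > tol * max(1.0, abs(expected)):
            bad_blocks.append(b)
            # Partial recomputation: only the rows of the flagged block.
            for i, v in matvec_rows(A, x, range(start, stop)).items():
                repaired[i] = v
    return repaired, bad_blocks

# Usage: inject one wrong entry into an otherwise correct product.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
faulty_y = [3, 7, 99, 15]          # entry 2 should be 11
checks = block_checksums(A, block_size=2)
fixed_y, flagged = localize_and_repair(A, x, faulty_y, checks, block_size=2)
```

In this usage example only the second row block is recomputed, which is the point of localization: the cost of correction scales with the size of the corrupted region rather than with the whole product.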

Attachment	Taxonomy	Kind	Size
An Algorithmic Approach to Error Localization and Partial Re-computation for Low-Overhead Fault Tolerance.pdf	Science of Security, algorithmic error correction, error localization, numerical methods, partial recomputation, sparse linear algebra, UIUC, Secure Platforms via Stochastic Computing, NSA SoS Lablets Materials	PDF document	384.78 KB
