In-Situ Mitigation of Silent Data Corruption in PDE Solvers
Title | In-Situ Mitigation of Silent Data Corruption in PDE Solvers |
Publication Type | Conference Paper |
Year of Publication | 2016 |
Authors | Salloum, Maher; Mayo, Jackson R.; Armstrong, Robert C. |
Conference Name | Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-4349-7 |
Keywords | algorithm-based fault tolerance, composability, cyber physical systems, False Data Detection, Human Behavior, numerical methods, partial differential equation, pubcrawl, resilience, Resiliency, silent data corruption |
Abstract | We present algorithmic techniques for parallel PDE solvers that leverage numerical smoothness properties of physics simulation to detect and correct silent data corruption within local computations. We initially model such silent hardware errors (which are of concern for extreme scale) via injected DRAM bit flips. Our mitigation approach generalizes previously developed "robust stencils" and uses modified linear algebra operations that spatially interpolate to replace large outlier values. Prototype implementations for 1D hyperbolic and 3D elliptic solvers, tested on up to 2048 cores, show that this error mitigation enables tolerating orders of magnitude higher bit-flip rates. The runtime overhead of the approach generally decreases with greater solver scale and complexity, becoming no more than a few percent in some cases. A key advantage is that silent data corruption can be handled transparently with data in cache, reducing the cost of false-positive detections compared to rollback approaches. |
URL | http://doi.acm.org/10.1145/2909428.2909433 |
DOI | 10.1145/2909428.2909433 |
Citation Key | salloum_-situ_2016 |
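
Illustrative sketch (not from the paper): the abstract describes injecting DRAM bit flips and replacing large outlier values by spatial interpolation. The Python below is a minimal toy version of that idea on a 1D grid, under stated assumptions. The helper names `flip_bit` and `repair_outliers`, the fixed `threshold`, and the use of a neighbor median as the interpolated estimate are all hypothetical choices made for this example; the authors' robust stencils and modified linear algebra operations are likely to differ.

```python
import math
import statistics
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE-754 double encoding flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def repair_outliers(u, threshold, radius=2):
    """Detect and replace outliers in a smooth 1D field (assumed approach).

    For each point, a robust estimate is formed as the median of the finite
    neighbors within `radius` grid points. A point is treated as corrupted
    if it is non-finite or differs from that estimate by more than
    `threshold`, and is then overwritten with the estimate.
    Returns (repaired_copy, indices_of_repaired_points).
    """
    fixed = list(u)
    repaired = []
    for i in range(len(u)):
        lo, hi = max(0, i - radius), min(len(u), i + radius + 1)
        # Neighbors are read from the original data so that the result
        # does not depend on the order in which points are visited.
        neighbors = [u[j] for j in range(lo, hi) if j != i and math.isfinite(u[j])]
        if not neighbors:
            continue  # nothing trustworthy to interpolate from
        estimate = statistics.median(neighbors)
        if not math.isfinite(u[i]) or abs(u[i] - estimate) > threshold:
            fixed[i] = estimate
            repaired.append(i)
    return fixed, repaired


if __name__ == "__main__":
    # Smooth test field: one period of a sine wave on 32 grid points.
    clean = [math.sin(2 * math.pi * i / 32) for i in range(32)]
    corrupted = list(clean)
    # Flip the top exponent bit of one value to mimic a DRAM bit flip.
    corrupted[10] = flip_bit(clean[10], 62)
    fixed, repaired = repair_outliers(corrupted, threshold=0.5)
    print("repaired indices:", repaired)
    print("error after repair:", abs(fixed[10] - clean[10]))
```

The median is used here because a single corrupted neighbor cannot drag it far, so healthy points next to a flipped value are not falsely flagged. A flip in a low-order mantissa bit would fall below the threshold and pass undetected, which is consistent with the abstract's focus on large outlier values.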