Biblio

Filters: Keyword is Secure Platforms via Stochastic Computing
2017-02-17
Biplab Deka, University of Illinois at Urbana-Champaign, Alex A. Birklykke, Aalborg University, Henry Duwe, University of Illinois at Urbana-Champaign, Vikash K. Mansinghka, Massachusetts Institute of Technology, Rakesh Kumar, University of Illinois at Urbana-Champaign.  2014.  Markov Chain Algorithms: A Template for Building Future Robust Low-power Systems. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

Although computational systems are looking towards post-CMOS devices in the pursuit of lower power, the expected inherent unreliability of such devices makes it difficult to design robust systems without additional power overheads for guaranteeing robustness. As such, algorithmic structures with an inherent ability to tolerate computational errors are of significant interest. We propose to cast applications as stochastic algorithms based on Markov chains (MCs), as such algorithms are both sufficiently general and tolerant to transition errors. We show with four example applications (Boolean satisfiability, sorting, low-density parity-check decoding and clustering) how applications can be cast as MC algorithms. Using algorithmic fault injection techniques, we demonstrate the robustness of these implementations to transition errors with high error rates. Based on these results, we make a case for using MCs as an algorithmic template for future robust low-power systems.
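
To make the idea of an MC algorithm with transition errors concrete, here is a minimal sketch of sorting cast as a random walk over permutations. It is an illustration only and is not taken from the paper: the names mc_sort, inversions and error_rate are hypothetical, and the fault model (flipping the swap decision with a fixed probability) is an assumption that may differ from the fault injection the authors used.

```python
import random


def inversions(seq):
    """Count out-of-order pairs; 0 means the sequence is sorted."""
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def mc_sort(values, steps, error_rate=0.0, seed=0):
    """Random-walk sort: each step inspects one random adjacent pair.

    error_rate is the probability that a step takes the wrong decision,
    which stands in for an injected transition error (an assumption of
    this sketch, not the paper's fault model).
    """
    rng = random.Random(seed)
    state = list(values)
    if len(state) < 2:
        return state
    for _ in range(steps):
        i = rng.randrange(len(state) - 1)
        should_swap = state[i] > state[i + 1]
        if rng.random() < error_rate:
            should_swap = not should_swap  # injected fault flips the decision
        if should_swap:
            state[i], state[i + 1] = state[i + 1], state[i]
    return state


if __name__ == "__main__":
    data = list(range(8, 0, -1))
    for rate in (0.0, 0.05, 0.2):
        result = mc_sort(data, steps=5000, error_rate=rate, seed=1)
        print(rate, result, inversions(result))
```

Without faults the chain is absorbed in the sorted state. With faults it keeps moving, but each wrong swap is likely to be undone by a later correct step, so the state tends to stay near sorted order; this is the kind of error tolerance the abstract describes.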

2017-02-02
Joseph Sloan, University of Illinois at Urbana-Champaign, Rakesh Kumar, University of Illinois at Urbana-Champaign, Greg Bronevetsky, Lawrence Livermore National Laboratory.  2013.  An Algorithmic Approach to Error Localization and Partial Recomputation for Low-Overhead Fault Tolerance. 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2013).

The increasing size and complexity of massively parallel systems (e.g. HPC systems) are making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel fault tolerance approaches are increasingly needed. Prior fault tolerance approaches often rely on checkpoint-rollback based schemes. Unfortunately, such schemes are primarily limited to rare error event scenarios, as their overheads become prohibitive if faults are common. In this paper, we propose a novel approach for algorithmic correction of faulty application outputs. The key insight for this approach is that, even in high-error scenarios where the result of an algorithm is erroneous, most of that result is still correct. Instead of simply rolling back to the most recent checkpoint and repeating the entire segment of computation, our novel resilience approach uses algorithmic error localization and partial recomputation to efficiently correct the corrupted results. We evaluate our approach in the specific algorithmic scenario of linear algebra operations, focusing on matrix-vector multiplication (MVM) and iterative linear solvers. We develop a novel technique for localizing errors in MVM and show how to achieve partial recomputation within this algorithm. We demonstrate that this approach both improves the performance of the Conjugate Gradient solver in high error scenarios by 3x-4x and increases the probability that it completes successfully by up to 60%, with parallel experiments up to 100 nodes.
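
The combination of error localization and partial recomputation can be illustrated with a generic block-checksum check for y = A x. This is a minimal sketch assuming a standard checksum-style test per row block; it is not the localization technique developed in the paper, and the names block_checksums, localize_and_repair and tol are hypothetical.

```python
def matvec_rows(A, x, rows):
    """Compute the selected rows of A @ x and return them keyed by row index."""
    return {r: sum(a * b for a, b in zip(A[r], x)) for r in rows}


def block_checksums(A, block_size):
    """Column-wise sum of each row block, computed once ahead of time."""
    blocks = []
    for start in range(0, len(A), block_size):
        rows = range(start, min(start + block_size, len(A)))
        checksum = [sum(A[r][c] for r in rows) for c in range(len(A[0]))]
        blocks.append((rows, checksum))
    return blocks


def localize_and_repair(A, x, y, blocks, tol=1e-9):
    """Recompute only the row blocks whose checksum disagrees with y.

    For a correct block, sum(y[rows]) equals checksum . x, so a mismatch
    localizes the error to that block (assumed check, not the paper's).
    """
    repaired = list(y)
    bad_blocks = []
    for index, (rows, checksum) in enumerate(blocks):
        expected = sum(c * xi for c, xi in zip(checksum, x))
        observed = sum(y[r] for r in rows)
        if abs(expected - observed) > tol * max(1.0, abs(expected)):
            bad_blocks.append(index)
            for r, value in matvec_rows(A, x, rows).items():
                repaired[r] = value  # partial recomputation of this block only
    return repaired, bad_blocks


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
    x = [1, 0, -1]
    y = [-2, -2, 40, -2]  # row 2 corrupted; the correct value is -2
    print(localize_and_repair(A, x, y, block_checksums(A, block_size=2)))
```

Only the block containing the corrupted entry is recomputed, so the repair cost scales with the number of faulty blocks and not with the size of the whole product. A simple checksum like this one misses errors that cancel within a block, and smaller blocks localize more precisely at the cost of more checksum work.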