Visible to the public Biblio

Filters: Keyword is error recovery  [Clear All Filters]
2022-10-28
Ponader, Jonathan, Thomas, Kyle, Kundu, Sandip, Solihin, Yan.  2021.  MILR: Mathematically Induced Layer Recovery for Plaintext Space Error Correction of CNNs. 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). :75–87.
The increased use of Convolutional Neural Networks (CNN) in mission-critical systems has increased the need for robust and resilient networks in the face of both naturally occurring faults as well as security attacks. The lack of robustness and resiliency can lead to unreliable inference results. Current methods that address CNN robustness require hardware modification, network modification, or network duplication. This paper proposes MILR a software-based CNN error detection and error correction system that enables recovery from single and multi-bit errors. The recovery capabilities are based on mathematical relationships between the inputs, outputs, and parameters(weights) of the layers; exploiting these relationships allows the recovery of erroneous parameters (iveights) throughout a layer and the network. MILR is suitable for plaintext-space error correction (PSEC) given its ability to correct whole-weight and even whole-layer errors in CNNs.
2017-05-16
Gensh, Rem, Romanovsky, Alexander, Yakovlev, Alex.  2016.  On Structuring Holistic Fault Tolerance. Proceedings of the 15th International Conference on Modularity. :130–133.

Computer systems are developed taking into account that they should be easily maintained in the future. It is one of the main requirements for the sound architectural design. The existing approaches to introducing fault tolerance rely on recursive system structuring out of functional components – this typically results in non-optimal fault tolerance. The paper proposes a vision of structuring complex many-core systems by introducing a special component supporting system-wide fault tolerance coordination. The component acts as a central module making decisions about fault tolerance strategies to be implemented by individual system components depending on the performance and energy requirements specified as system operating modes.