Biblio
Filters: Keyword is soft error [Clear All Filters]
Evaluating the Impact of Hardware Faults on Program Execution in a Microkernel Environment. 2022 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). :149–152.
.
2022. Safety-critical systems require resiliency against both cyberattacks and environmental faults. Researches have shown that microkernels can isolate components and limit the capabilities of would-be attackers by confining the attack in the component that it is initiated in. This limits the propagation of faults to sensitive components in the system. Nonetheless, the isolation mechanism in microkernels is not fully investigated for its resiliency against hardware faults. This paper investigates whether microkernels provide protection against hardware faults and, if so, to what extent quantitatively. This work is part of an effort in establishing an overlap between security and reliability with the goal of maximizing both while minimizing their impact on performance. In this work, transient faults are emulated on the seL4 microkernel and Linux kernel using debugger-induced bit flips across random timestamps in benchmark applications. Results show differences in the frequency and final outcome of fault to error manifestation in the seL4 environment compared to the Linux environment, including a reduction in silent data corruptions.
Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. :8:1–8:12.
.
2017. Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high-performance and energy efficiency. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and the DNN accelerator's architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. In this paper, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems.