A Lightweight Error-Resiliency Mechanism for Deep Neural Networks

Submitted by grigby1 on Fri, 11/18/2022 - 5:43pm

Title	A Lightweight Error-Resiliency Mechanism for Deep Neural Networks
Publication Type	Conference Paper
Year of Publication	2021
Authors	Goldstein, Brunno F., Ferreira, Victor C., Srinivasan, Sudarshan, Das, Dipankar, Nery, Alexandre S., Kundu, Sandip, França, Felipe M. G.
Conference Name	2021 22nd International Symposium on Quality Electronic Design (ISQED)
Keywords	Bit error rate, error resiliency, Hardware, Neural Network Accelerators, neural network resiliency, Neural networks, Power supplies, pubcrawl, Quantization (signal), Real-time Systems, reliability, Reliability engineering, resilience, Resiliency
Abstract	In recent years, Deep Neural Networks (DNNs) have made inroads into a number of applications involving pattern recognition - from facial recognition to self-driving cars. Some of these applications, such as self-driving cars, have real-time requirements, and specialized DNN hardware accelerators help meet them. Since DNN execution time is dominated by convolution, Multiply-and-Accumulate (MAC) units are at the heart of these accelerators. As hardware accelerators push performance limits under strict power constraints, reliability is often compromised. In particular, power-constrained DNN accelerators are more vulnerable to transient and intermittent hardware faults due to particle hits, manufacturing variations, and fluctuations in power supply voltage and temperature. Methods such as hardware replication have been used to deal with these reliability problems in the past. Unfortunately, the duplication approach is untenable in a power-constrained environment. This paper introduces a low-cost error-resiliency scheme that targets MAC units employed in conventional DNN accelerators. We evaluate the reliability improvements from the proposed architecture using a set of 6 CNNs over varying bit error rates (BER) and demonstrate that our proposed solution can achieve more than 99% fault coverage with a 5-bit arithmetic code, complying with the ASIL-D level of the ISO 26262 standard with negligible area and power overhead. Additionally, we evaluate the proposed detection mechanism coupled with a word masking correction scheme, demonstrating no loss of accuracy up to a BER of 10^-2.
DOI	10.1109/ISQED51717.2021.9424287
Citation Key	goldstein_lightweight_2021
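
Illustrative sketch (not taken from the paper): the abstract describes a 5-bit arithmetic code that protects MAC units, paired with word masking on detection, but this record does not say which code is used. The Python below assumes a residue code modulo 31 (2^5 - 1) and assumes that "word masking" means replacing a flagged result with zero. The names MODULUS, residue, checked_mac, and masked_mac are hypothetical, and the fault_bit parameter exists only to inject a bit flip for demonstration.

```python
MODULUS = 31  # assumption: 2**5 - 1, so the check value fits in 5 bits


def residue(value: int) -> int:
    """Return the 5-bit check value of an integer (always in 0..30)."""
    return value % MODULUS


def checked_mac(weights, activations, fault_bit=None):
    """Multiply-accumulate with a parallel residue check path.

    Returns (accumulator, error_detected). If fault_bit is given, that bit
    of the accumulator is flipped to emulate a transient hardware fault.
    """
    acc = 0    # main data path
    check = 0  # narrow check path, carries residues only
    for w, x in zip(weights, activations):
        acc += w * x
        check = (check + residue(w) * residue(x)) % MODULUS

    if fault_bit is not None:
        acc ^= 1 << fault_bit  # fault hits the data path, not the check path

    # A fault-free accumulator always matches the independently computed check.
    error_detected = residue(acc) != check
    return acc, error_detected


def masked_mac(weights, activations, fault_bit=None):
    """Assumed word-masking policy: output zero when an error is flagged."""
    acc, error_detected = checked_mac(weights, activations, fault_bit)
    return 0 if error_detected else acc


if __name__ == "__main__":
    w = [3, -5, 7, 2]
    x = [4, 6, -1, 9]
    print(checked_mac(w, x))               # fault-free run
    print(checked_mac(w, x, fault_bit=6))  # single bit flip in the accumulator
    print(masked_mac(w, x, fault_bit=6))   # flagged word is masked
```

Under this assumed code, a single bit flip changes the accumulator by plus or minus a power of two, which is never a multiple of 31, so every single-bit fault in the accumulator is flagged. Multi-bit faults can alias to the same residue, which is one reason coverage figures such as the one in the abstract are stated as a percentage rather than as a guarantee.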

Groups:

Science of Security VO