Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications

Submitted by grigby1 on Thu, 06/07/2018 - 3:05pm

Title	Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications
Publication Type	Conference Paper
Year of Publication	2017
Authors	Li, Guanpeng; Hari, Siva Kumar Sastry; Sullivan, Michael; Tsai, Timothy; Pattabiraman, Karthik; Emer, Joel; Keckler, Stephen W.
Conference Name	Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher	ACM
Conference Location	New York, NY, USA
ISBN Number	978-1-4503-5114-0
Keywords	Deep Learning, Neural Network Resilience, pubcrawl, reliability, resilience, Resiliency, silent data corruption, soft error
Abstract	Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high performance and energy efficiency. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and the DNN accelerator's architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. In this paper, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems.
URL	http://doi.acm.org/10.1145/3126908.3126964
DOI	10.1145/3126908.3126964
Citation Key	li_understanding_2017
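
Illustrative note (not from the paper): the abstract states that error resilience depends on data types and values, and that TMR is a traditional protection method. The sketch below is a minimal illustration of both ideas under stated assumptions: it models a soft error as a single bit flip in an IEEE 754 float32 word and shows a bitwise majority vote over three redundant copies. The function names `flip_bit_float32` and `tmr_vote` are hypothetical, and this is not the authors' fault-injection methodology or either of their proposed protection techniques.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    Hypothetical helper used only to illustrate a single-bit soft error.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the chosen bit and reinterpret the result as a float32 again.
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return corrupted


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    weight = 0.5
    # The same kind of fault has very different effects depending on
    # which bit of the stored value it hits.
    for bit in (0, 22, 23, 30, 31):
        print(f"bit {bit:2d}: {weight} -> {flip_bit_float32(weight, bit)!r}")

    # One corrupted copy out of three is outvoted by the other two.
    good = 0b1010
    bad = good ^ 0b1000
    print(tmr_vote(good, good, bad) == good)
```

In this sketch, a flip in a low mantissa bit barely changes the value, while a flip in a high exponent bit changes it by many orders of magnitude. That is one simple way to see why the bit position and the numeric format of a value can matter for resilience.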
