Title | Respipe: Resilient Model-Distributed DNN Training at Edge Networks |
Publication Type | Conference Paper |
Year of Publication | 2021 |
Authors | Li, Pengzhen, Koyuncu, Erdem, Seferoglu, Hulya |
Conference Name | ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Keywords | deep neural networks (DNN), delays, distributed databases, distributed training, edge networks, neural network resiliency, resilience, signal processing, smartphones, speech processing, training |
Abstract | The traditional approach to distributed deep neural network (DNN) training is data-distributed learning, which partitions and distributes data to workers. Although this approach has good convergence properties, it incurs a high communication cost, which particularly strains edge systems and increases delay. An emerging alternative is model-distributed learning, where the training model itself is distributed across workers. Model-distributed learning is a promising approach to reducing communication and storage costs, which is crucial for edge systems. In this paper, we design ResPipe, a novel resilient model-distributed DNN training mechanism that tolerates delayed or failed workers. We analyze the communication cost of ResPipe and demonstrate the trade-off between resiliency and communication cost. We implement ResPipe in a real testbed consisting of Android-based smartphones and show that it improves the convergence rate and accuracy of training for convolutional neural networks (CNNs). |
DOI | 10.1109/ICASSP39728.2021.9413553 |
Citation Key | li_respipe_2021 |