Biblio

Filters: Keyword is distributed training
2022-11-18
Li, Pengzhen, Koyuncu, Erdem, Seferoglu, Hulya.  2021.  ResPipe: Resilient Model-Distributed DNN Training at Edge Networks. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :3660–3664.
The traditional approach to distributed deep neural network (DNN) training is data-distributed learning, which partitions and distributes data to workers. Although this approach has good convergence properties, it has high communication cost, which strains edge systems in particular and increases delay. An emerging approach is model-distributed learning, where a training model is distributed across workers. Model-distributed learning is a promising approach to reduce communication and storage costs, which is crucial for edge systems. In this paper, we design ResPipe, a novel resilient model-distributed DNN training mechanism against delayed/failed workers. We analyze the communication cost of ResPipe and demonstrate the trade-off between resiliency and communication cost. We implement ResPipe in a real testbed consisting of Android-based smartphones, and show that it improves the convergence rate and accuracy of training for convolutional neural networks (CNNs).
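The abstract's core idea, model-distributed (pipeline) training in which activations rather than raw data cross worker boundaries, can be illustrated with a minimal sketch. The code below is a generic pipeline split in PyTorch, not the ResPipe mechanism itself; the three-stage partition, layer sizes, and micro-batch count are illustrative assumptions.

```python
# Minimal sketch of model-distributed (pipeline) DNN training.
# NOT the ResPipe mechanism: stages that would live on different edge
# workers are simulated locally, and only activations (not raw data)
# would cross worker boundaries.
import torch
import torch.nn as nn

# Hypothetical 3-stage split of a small CNN; stage boundaries are illustrative.
stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Flatten(), nn.Linear(16 * 7 * 7, 10)),
])
opt = torch.optim.SGD(stages.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_step(batch_x, batch_y, micro_batches=4):
    """Pipeline one mini-batch as several micro-batches through the stages."""
    opt.zero_grad()
    for xs, ys in zip(batch_x.chunk(micro_batches), batch_y.chunk(micro_batches)):
        act = xs
        for stage in stages:   # each stage would run on a different worker
            act = stage(act)   # only activations are passed downstream
        loss = loss_fn(act, ys) / micro_batches
        loss.backward()        # gradients flow back stage by stage
    opt.step()

# Example with random MNIST-shaped data (28x28 grayscale images).
x = torch.randn(32, 1, 28, 28)
y = torch.randint(0, 10, (32,))
train_step(x, y)
```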
2022-11-08
HeydariGorji, Ali, Rezaei, Siavash, Torabzadehkashi, Mahdi, Bobarshad, Hossein, Alves, Vladimir, Chou, Pai H..  2020.  HyperTune: Dynamic Hyperparameter Tuning for Efficient Distribution of DNN Training Over Heterogeneous Systems. 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). :1–8.
Distributed training is a novel approach to accelerating the training of Deep Neural Networks (DNN), but common training libraries fall short of addressing the distributed nature of heterogeneous processors or interruption by other workloads on the shared processing nodes. This paper describes distributed training of DNN on computational storage devices (CSD), which are NAND flash-based, high-capacity data storage devices with internal processing engines. A CSD-based distributed architecture incorporates the advantages of federated learning in terms of performance scalability, resiliency, and data privacy by eliminating the unnecessary data movement between the storage device and the host processor. The paper also describes Stannis, a DNN training framework that improves on the shortcomings of existing distributed training frameworks by dynamically tuning the training hyperparameters in heterogeneous systems to maintain the maximum overall processing speed, in terms of processed images per second, and energy efficiency. Experimental results on image classification training benchmarks show up to 3.1x improvement in performance and 2.45x reduction in energy consumption when using Stannis plus CSD compared to generic systems.
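The dynamic-tuning idea described in the abstract, adjusting per-worker hyperparameters so that heterogeneous nodes stay fully utilized, can be sketched in a few lines. The code below is not the Stannis framework; the function name, the proportional batch-size rule, and the throughput numbers are assumptions for illustration only.

```python
# Minimal sketch (not Stannis itself) of rebalancing per-worker batch sizes
# in heterogeneous distributed training: each worker's share of the global
# batch is set in proportion to its measured throughput, so faster nodes
# process more images per step.
from typing import Dict

def rebalance_batch_sizes(throughput_img_per_s: Dict[str, float],
                          global_batch: int,
                          min_batch: int = 8) -> Dict[str, int]:
    """Split a global batch across workers proportionally to measured speed."""
    total = sum(throughput_img_per_s.values())
    return {w: max(min_batch, round(global_batch * t / total))
            for w, t in throughput_img_per_s.items()}

# Hypothetical cluster: a host GPU node plus two slower computational
# storage devices (CSDs); the throughputs are made-up numbers.
measured = {"host-gpu": 900.0, "csd-0": 150.0, "csd-1": 120.0}
print(rebalance_batch_sizes(measured, global_batch=1024))
# e.g. {'host-gpu': 788, 'csd-0': 131, 'csd-1': 105}
```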