Quantifying the impact of network congestion on application performance and network metrics

Title: Quantifying the impact of network congestion on application performance and network metrics
Publication Type: Conference Paper
Year of Publication: 2020
Authors: Zhang, Y., Groves, T., Cook, B., Wright, N. J., Coskun, A. K.
Conference Name: 2020 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords: application performance, Aries network, computer networks, Conferences, Correlation, Degradation, dragonfly topology, HPC, HPC network architecture, intensive MPI operations, Measurement, message passing, modern high-performance computing systems, network architecture, network congestion, network counters, network metrics, packet transmission statistics, parallel processing, Processor scheduling, pubcrawl, Resiliency, Scalability, telecommunication network routing, telecommunication network topology, Topology, work factor metrics
Abstract: In modern high-performance computing (HPC) systems, network congestion is an important factor that contributes to performance degradation. However, how network congestion impacts application performance is not fully understood. Because the Aries network, a recent HPC network architecture featuring a dragonfly topology, is equipped with network counters that measure packet transmission statistics on each router, these network metrics can potentially be used to understand network performance. In this work, through experiments on a large HPC system, we quantify the impact of network congestion on the performance of various applications in terms of execution time, and we correlate application performance with network metrics. Our results demonstrate diverse impacts of network congestion: while applications with intensive MPI operations (such as HACC and MILC) suffer an increase of more than 40% in execution time under network congestion, applications with less intensive MPI operations (such as Graph500 and HPCG) are mostly unaffected. We also demonstrate that a stall-to-flit ratio metric derived from Aries network counters is positively correlated with performance degradation and that this metric can therefore serve as an indicator of network congestion in HPC systems.
DOI: 10.1109/CLUSTER49012.2020.00026
Citation Key: zhang_quantifying_2020
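
The abstract describes correlating a stall-to-flit ratio with performance degradation. The following is a minimal sketch of that kind of analysis, not the authors' implementation. It assumes the ratio is a simple quotient of a stall count and a flit count collected over the same interval, and that degradation is the relative increase in execution time. The function names, parameter names, and sample values are hypothetical, and the record does not name the specific Aries counters involved.

```python
from math import sqrt


def stall_to_flit_ratio(stalls, flits):
    """Quotient of a stall count and a flit count from the same interval.

    Assumption: both values are already aggregated over the routers of interest.
    """
    if flits <= 0:
        raise ValueError("flits must be positive")
    return stalls / flits


def slowdown(baseline_seconds, congested_seconds):
    """Relative increase in execution time, e.g. 0.4 for a 40% extension."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (congested_seconds - baseline_seconds) / baseline_seconds


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values for illustration only; these are NOT measurements
    # from the paper. Each tuple is (stalls, flits, baseline_s, congested_s).
    runs = [
        (1_000, 10_000, 100.0, 101.0),
        (8_000, 10_000, 100.0, 118.0),
        (20_000, 10_000, 100.0, 142.0),
    ]
    ratios = [stall_to_flit_ratio(s, f) for s, f, _, _ in runs]
    slowdowns = [slowdown(b, c) for _, _, b, c in runs]
    print("Pearson r:", pearson(ratios, slowdowns))
```

A coefficient near 1 on real measurements would be consistent with the positive correlation the abstract reports. The threshold at which the ratio signals congestion is not stated in this record.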