Visible to the public Biblio

Filters: Author is Olivier, Stephen L.  [Clear All Filters]
2022-02-22
Olivier, Stephen L., Ellingwood, Nathan D., Berry, Jonathan, Dunlavy, Daniel M..  2021.  Performance Portability of an SpMV Kernel Across Scientific Computing and Data Science Applications. 2021 IEEE High Performance Extreme Computing Conference (HPEC). :1—8.
Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding workloads. For scientific computing applications, the massive volume of code and diversity of hardware platforms at supercomputing centers has motivated a strong effort toward performance portability. This property of a program, denoting its ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel programming model and which features of the programming model are used. In this paper, we evaluate performance portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we find that many struggle to consistently achieve performance improvements using the GPU compared to simple one-line OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and sometimes even matches that of vendor-provided GPU math libraries.
2018-04-11
Bhalachandra, Sridutt, Porterfield, Allan, Olivier, Stephen L., Prins, Jan F., Fowler, Robert J..  2017.  Improving Energy Efficiency in Memory-Constrained Applications Using Core-Specific Power Control. Proceedings of the 5th International Workshop on Energy Efficient Supercomputing. :6:1–6:8.

Power is increasingly the limiting factor in High Performance Computing (HPC) at Exascale and will continue to influence future advancements in supercomputing. Recent processors equipped with on-board hardware counters allow real time monitoring of operating conditions such as energy and temperature, in addition to performance measures such as instructions retired and memory accesses. An experimental memory study presented on modern CPU architectures, Intel Sandybridge and Haswell, identifies a metric, TORo\_core, that detects bandwidth saturation and increased latency. TORo-Core is used to construct a dynamic policy applied at coarse and fine-grained levels to modulate per-core power controls on Haswell machines. The coarse and fine-grained application of dynamic policy shows best energy savings of 32.1% and 19.5% with a 2% slowdown in both cases. On average for six MPI applications, the fine-grained dynamic policy speeds execution by 1% while the coarse-grained application results in a 3% slowdown. Energy savings through frequency reduction not only provide cost advantages, they also reduce resource contention and create additional thermal headroom for non-throttled cores improving performance.