Performance challenges for heterogeneous distributed tensor decompositions
| Field | Value |
| --- | --- |
| Title | Performance challenges for heterogeneous distributed tensor decompositions |
| Publication Type | Conference Paper |
| Year of Publication | 2017 |
| Authors | Rolinger, T. B., Simon, T. A., Krieger, C. D. |
| Conference Name | 2017 IEEE High Performance Extreme Computing Conference (HPEC) |
| Date Published | September 2017 |
| Publisher | IEEE |
| ISBN Number | 978-1-5386-3472-1 |
| Keywords | alternating least squares fitting, canonical decomposition, compositionality, CP-ALS, cuSPARSE library, decomposition, DFacTo, distributed memory, distributed memory systems, GPU, graphics processing units, heterogeneous distributed tensor decompositions, large-scale data analytics, Least squares approximations, Libraries, math kernel library, Matrix decomposition, matrix multiplication, message passing, Metrics, MPI, multi-threading, multidimensional arrays, OpenMP threads, parallel factorization, parallel processing, parallel programming, parallelization, performance evaluation, pubcrawl, ReFacTo, Signal processing algorithms, Sparse matrices, sparse matrix-vector multiplications, SpMV, Tensile stress, tensor decomposition, tensors |
| Abstract | Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8x faster than ReFacTo when using cuSPARSE. |
| URL | http://ieeexplore.ieee.org/document/8091023/ |
| DOI | 10.1109/HPEC.2017.8091023 |
| Citation Key | rolinger_performance_2017 |
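
For context beyond the record itself: the "critical routine" the abstract refers to is the matricized-tensor times Khatri-Rao product (MTTKRP) inside CP-ALS. A brief sketch of the third-order model and update in standard notation (not drawn from the paper's own text):

```latex
% CP model: approximate a third-order tensor by R rank-one terms.
\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad \mathcal{X} \in \mathbb{R}^{I \times J \times K}.

% One ALS sweep updates each factor matrix in turn; for the first mode,
% with \mathbf{X}_{(1)} the mode-1 unfolding, \odot the Khatri-Rao
% product, \ast the Hadamard product, and \dagger the pseudoinverse:
\mathbf{A} \leftarrow \mathbf{X}_{(1)}\,(\mathbf{C} \odot \mathbf{B})\,
\bigl(\mathbf{C}^{\top}\mathbf{C} \ast \mathbf{B}^{\top}\mathbf{B}\bigr)^{\dagger}.
```

DFacTo's reformulation, which ReFacTo inherits, evaluates the MTTKRP term \(\mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})\) as a series of SpMVs over unfoldings of the sparse tensor, which is why the abstract's performance comparison comes down to cuSPARSE SpMV versus MKL SpMV.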
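
To make the kernel concrete, here is a minimal, library-free C sketch of the CSR sparse matrix-vector multiply that routines such as MKL's `mkl_sparse_d_mv` and cuSPARSE's SpMV functions implement in optimized form; the struct name and the toy matrix below are illustrative and not taken from ReFacTo.

```c
#include <stdio.h>

/* Minimal CSR sparse matrix-vector multiply: y = A * x.
 * DFacTo/ReFacTo invoke tuned versions of this kernel (via MKL or
 * cuSPARSE) repeatedly in every CP-ALS iteration. */
typedef struct {
    int nrows;
    const int *rowptr;   /* nrows+1 offsets into colidx/vals */
    const int *colidx;   /* column index of each stored nonzero */
    const double *vals;  /* value of each stored nonzero */
} csr_t;

static void spmv_csr(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->nrows; i++) {
        double acc = 0.0;
        /* Accumulate only the stored nonzeros of row i. */
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            acc += A->vals[k] * x[A->colidx[k]];
        y[i] = acc;
    }
}

int main(void) {
    /* Toy 3x3 sparse matrix in CSR form:
     *   [ 2 0 1 ]
     *   [ 0 3 0 ]
     *   [ 4 0 5 ]                                  */
    const int rowptr[]  = {0, 2, 3, 5};
    const int colidx[]  = {0, 2, 1, 0, 2};
    const double vals[] = {2, 1, 3, 4, 5};
    const csr_t A = {3, rowptr, colidx, vals};

    const double x[] = {1, 1, 1};
    double y[3];
    spmv_csr(&A, x, y);

    printf("y = [%g %g %g]\n", y[0], y[1], y[2]); /* prints y = [3 3 9] */
    return 0;
}
```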