Visible to the public Biblio

Filters: Keyword is parallelization  [Clear All Filters]
2022-10-20
Wang, Jingyi, Chiang, Nai-Yuan, Petra, Cosmin G..  2021.  An asynchronous distributed-memory optimization solver for two-stage stochastic programming problems. 2021 20th International Symposium on Parallel and Distributed Computing (ISPDC). :33—40.
We present a scalable optimization algorithm and its parallel implementation for two-stage stochastic programming problems of large-scale, particularly the security constrained optimal power flow models routinely used in electrical power grid operations. Such problems can be prohibitively expensive to solve on industrial scale with the traditional methods or in serial. The algorithm decomposes the problem into first-stage and second-stage optimization subproblems which are then scheduled asynchronously for efficient evaluation in parallel. Asynchronous evaluations are crucial in achieving good balancing and parallel efficiency because the second-stage optimization subproblems have highly varying execution times. The algorithm employs simple local second-order approximations of the second-stage optimal value functions together with exact first- and second-order derivatives for the first-stage subproblems to accelerate convergence. To reduce the number of the evaluations of computationally expensive second-stage subproblems required by line search, we devised a flexible mechanism for controlling the step size that can be tuned to improve performance for individual class of problems. The algorithm is implemented in C++ using MPI non-blocking calls to overlap computations with communication and boost parallel efficiency. Numerical experiments of the algorithm are conducted on Summit and Lassen supercomputers at Oak Ridge and Lawrence Livermore National Laboratories and scaling results show good parallel efficiency.
2019-12-05
Leißa, Roland, Boesche, Klaas, Hack, Sebastian, Pérard-Gayot, Arsène, Membarth, Richard, Slusallek, Philipp, Müller, André, Schmidt, Bertil.  2018.  AnyDSL: A Partial Evaluation Framework for Programming High-Performance Libraries. Proc. ACM Program. Lang.. 2:119:1-119:30.

This paper advocates programming high-performance code using partial evaluation. We present a clean-slate programming system with a simple, annotation-based, online partial evaluator that operates on a CPS-style intermediate representation. Our system exposes code generation for accelerators (vectorization/parallelization for CPUs and GPUs) via compiler-known higher-order functions that can be subjected to partial evaluation. This way, generic implementations can be instantiated with target-specific code at compile time. In our experimental evaluation we present three extensive case studies from image processing, ray tracing, and genome sequence alignment. We demonstrate that using partial evaluation, we obtain high-performance implementations for CPUs and GPUs from one language and one code base in a generic way. The performance of our codes is mostly within 10%, often closer to the performance of multi man-year, industry-grade, manually-optimized expert codes that are considered to be among the top contenders in their fields.

2019-09-26
Tang, Yiming, Khatchadourian, Raffi, Bagherzadeh, Mehdi, Ahmed, Syed.  2018.  Towards Safe Refactoring for Intelligent Parallelization of Java 8 Streams. Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings. :206-207.

The Java 8 Stream API sets forth a promising new programming model that incorporates functional-like, MapReduce-style features into a mainstream programming language. However, using streams correctly and efficiently may involve subtle considerations. In this poster, we present our ongoing work and preliminary results towards an automated refactoring approach that assists developers in writing optimal stream code. The approach, based on ordering and typestate analysis, determines when it is safe and advantageous to convert streams to parallel and optimize parallel streams.

2018-06-20
Searles, R., Xu, L., Killian, W., Vanderbruggen, T., Forren, T., Howe, J., Pearson, Z., Shannon, C., Simmons, J., Cavazos, J..  2017.  Parallelization of Machine Learning Applied to Call Graphs of Binaries for Malware Detection. 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). :69–77.

Malicious applications have become increasingly numerous. This demands adaptive, learning-based techniques for constructing malware detection engines, instead of the traditional manual-based strategies. Prior work in learning-based malware detection engines primarily focuses on dynamic trace analysis and byte-level n-grams. Our approach in this paper differs in that we use compiler intermediate representations, i.e., the callgraph representation of binaries. Using graph-based program representations for learning provides structure of the program, which can be used to learn more advanced patterns. We use the Shortest Path Graph Kernel (SPGK) to identify similarities between call graphs extracted from binaries. The output similarity matrix is fed into a Support Vector Machine (SVM) algorithm to construct highly-accurate models to predict whether a binary is malicious or not. However, SPGK is computationally expensive due to the size of the input graphs. Therefore, we evaluate different parallelization methods for CPUs and GPUs to speed up this kernel, allowing us to continuously construct up-to-date models in a timely manner. Our hybrid implementation, which leverages both CPU and GPU, yields the best performance, achieving up to a 14.2x improvement over our already optimized OpenMP version. We compared our generated graph-based models to previously state-of-the-art feature vector 2-gram and 3-gram models on a dataset consisting of over 22,000 binaries. We show that our classification accuracy using graphs is over 19% higher than either n-gram model and gives a false positive rate (FPR) of less than 0.1%. We are also able to consider large call graphs and dataset sizes because of the reduced execution time of our parallelized SPGK implementation.

2018-02-21
Grgić, K., Kovačevic, Z., Čik, V. K..  2017.  Performance analysis of symmetric block cryptosystems on Android platform. 2017 International Conference on Smart Systems and Technologies (SST). :155–159.

The symmetric block ciphers, which represent a core element for building cryptographic communications systems and protocols, are used in providing message confidentiality, authentication and integrity. Various limitations in hardware and software resources, especially in terminal devices used in mobile communications, affect the selection of appropriate cryptosystem and its parameters. In this paper, an implementation of three symmetric ciphers (DES, 3DES, AES) used in different operating modes are analyzed on Android platform. The cryptosystems' performance is analyzed in different scenarios using several variable parameters: cipher, key size, plaintext size and number of threads. Also, the influence of parallelization supported by multi-core CPUs on cryptosystem performance is analyzed. Finally, some conclusions about the parameter selection for optimal efficiency are given.

2017-12-28
Rolinger, T. B., Simon, T. A., Krieger, C. D..  2017.  Performance challenges for heterogeneous distributed tensor decompositions. 2017 IEEE High Performance Extreme Computing Conference (HPEC). :1–7.

Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DeFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.

2015-04-30
Smith, S., Woodward, C., Liang Min, Chaoyang Jing, Del Rosso, A..  2014.  On-line transient stability analysis using high performance computing. Innovative Smart Grid Technologies Conference (ISGT), 2014 IEEE PES. :1-5.

In this paper, parallelization and high performance computing are utilized to enable ultrafast transient stability analysis that can be used in a real-time environment to quickly perform “what-if” simulations involving system dynamics phenomena. EPRI's Extended Transient Midterm Simulation Program (ETMSP) is modified and enhanced for this work. The contingency analysis is scaled for large-scale contingency analysis using Message Passing Interface (MPI) based parallelization. Simulations of thousands of contingencies on a high performance computing machine are performed, and results show that parallelization over contingencies with MPI provides good scalability and computational gains. Different ways to reduce the Input/Output (I/O) bottleneck are explored, and findings indicate that architecting a machine with a larger local disk and maintaining a local file system significantly improve the scaling results. Thread-parallelization of the sparse linear solve is explored also through use of the SuperLU_MT library.