Biblio
This work seeks to advance the state of the art in HPC I/O performance analysis and interpretation. In particular, we demonstrate effective techniques to: (1) model output performance in the presence of I/O interference from production loads; (2) build features from write patterns and key parameters of the system architecture and configuration; (3) employ suitable machine learning algorithms to improve model accuracy. We train models with five popular regression algorithms and conduct experiments on two distinct production HPC platforms. We find that the lasso and random forest models predict output performance with high accuracy on both target systems. We also explore the use of the models to guide adaptation in I/O middleware systems, and show potential for improvements of at least 15% from model-guided adaptation on 70% of samples, with improvements of up to 10x on some samples, on both target systems.
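As a rough illustration of the modeling step this abstract describes, the sketch below trains lasso and random forest regressors with scikit-learn on synthetic data; the feature names and the bandwidth formula are hypothetical stand-ins, not the paper's actual feature set.

```python
# Sketch: predicting write bandwidth from I/O-pattern and platform features.
# Features and target are synthetic placeholders; the paper builds its real
# features from write patterns and system architecture/configuration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(1, 128, n),      # hypothetical feature: stripe count
    rng.uniform(1, 1024, n),      # hypothetical feature: write size (GiB)
    rng.integers(0, 50, n),       # hypothetical feature: concurrent jobs
])
# Synthetic "write bandwidth" target: degrades as interference grows.
y = 5000.0 / (1.0 + X[:, 2]) + 10.0 * np.log1p(X[:, 1]) + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "R^2 on held-out samples:", round(model.score(X_te, y_te), 3))
```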
Science gateways open the possibility of reproducible science because they integrate reusable techniques, data and workflow management systems, security mechanisms, and high performance computing (HPC). We introduce BioinfoPortal, a science gateway that integrates a suite of bioinformatics applications using HPC and data management resources provided by the Brazilian National HPC System (SINAPAD). BioinfoPortal follows the Software as a Service (SaaS) model, and its web server is freely available for academic use. The goal of this paper is to describe the science gateway and its usage, addressing the challenges of designing a multiuser computational platform for parallel/distributed execution of large-scale bioinformatics applications on Brazilian HPC resources. We also present a study of the performance and scalability of selected bioinformatics applications executed in the HPC environments, and perform machine learning analyses to predict the HPC allocation/usage features that best serve the bioinformatics applications run via BioinfoPortal.
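The machine learning analysis mentioned at the end could, for instance, be framed as learning which allocation suits a job; the sketch below is a hypothetical illustration of that framing (the features, labels, and core-count choices are invented for the example, not taken from BioinfoPortal).

```python
# Sketch: learning which HPC allocation best fits a job, framed as
# classification. All feature/label names here are hypothetical stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.uniform(0.1, 50, n),   # hypothetical feature: input size (GB)
    rng.integers(1, 5, n),     # hypothetical feature: application id
])
# Hypothetical label: best-performing core count among {16, 32, 64}.
y = np.where(X[:, 0] > 25, 64, np.where(X[:, 0] > 5, 32, 16))

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("suggested cores for a 30 GB run of app 2:", clf.predict([[30.0, 2]])[0])
```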
This paper presents a computational platform for dynamic security assessment (DSA) of large electricity grids, developed as part of the iTesla project. It leverages High Performance Computing (HPC) to analyze large power systems, with many scenarios and possible contingencies, thus paving the way for pan-European operational stability analysis. The results of the DSA are summarized by decision trees built over 11 stability indicators. The platform's workflow and parallel implementation architecture are described in detail, including the way commercial tools are integrated into a plug-in architecture. A case study of the French grid is presented, with over 8000 scenarios and 1980 contingencies. Performance data from the case study (using 10,000 parallel cores) is analyzed, including task timings and data flows. Finally, the generated decision trees are compared with test data to quantify the functional performance of the DSA platform.
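A minimal sketch of the decision-tree summarization idea, assuming scikit-learn and synthetic data; the 11 indicator columns mirror the abstract, but the indicator values, the secure/insecure labels, and the tree depth are placeholders.

```python
# Sketch: summarizing DSA outcomes with a decision tree over stability
# indicators. Indicator values and labels below are synthetic placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
n_scenarios, n_indicators = 800, 11   # the platform uses 11 stability indicators
X = rng.normal(0, 1, (n_scenarios, n_indicators))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # synthetic secure/insecure label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"indicator_{i}" for i in range(n_indicators)]))
```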
Cloud computing is the major paradigm in today's IT world, offering security management, high performance, flexibility, and scalability. Customers valuing these features can benefit further if they use a cloud environment built on an HPC fabric architecture. However, security remains a major concern, on the hardware side as well as the software side. Multiple studies show that malicious users can affect regular customers through the hardware if they are co-located on the same physical system. Solving these security concerns at the HPC fabric level would therefore give fabric vendors a clear lead in this area. In this paper, we propose an autonomic HPC fabric architecture that leverages both resilient computing capabilities and adaptive anomaly analysis for improved security.
In this paper we document our approach to service discovery and configuration of Apache Hadoop and Spark frameworks with dynamic resource allocations in a batch-oriented Advanced Research Computing (ARC) High Performance Computing (HPC) environment. ARC efforts have produced a wide variety of HPC architectures. A common HPC architectural pattern is multi-node compute clusters with low-latency, high-performance interconnect fabrics and shared central storage. This pattern enables processing of workloads with high data co-dependency, frequently solved with message passing interface (MPI) programming models and executed as batch jobs. Unfortunately, many HPC programming paradigms are not well suited to big data workloads, which are often easily separable. Our approach lowers the barriers to entry of HPC environments by enabling end users to run Apache Hadoop and Spark frameworks, which support big data programming paradigms appropriate for separable workloads, in batch-oriented HPC environments.
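One common pattern for this kind of integration (a sketch under assumptions, not necessarily the paper's mechanism) is to resolve the hosts the scheduler allocated to a batch job and then start a standalone Spark master and workers on them; SLURM_JOB_NODELIST and `scontrol show hostnames` are standard Slurm facilities, and 7077 is Spark's default master port.

```python
# Sketch: discovering the hosts Slurm allocated to this batch job so a
# standalone Spark cluster can be launched inside the allocation.
import os
import subprocess

nodelist = os.environ["SLURM_JOB_NODELIST"]          # e.g. "node[001-004]"
hosts = subprocess.run(
    ["scontrol", "show", "hostnames", nodelist],     # expands the compressed list
    capture_output=True, text=True, check=True,
).stdout.split()

master, workers = hosts[0], hosts[1:]
print(f"spark://{master}:7077 with {len(workers)} workers: {workers}")
# A launcher would then start the Spark master on `master` and one worker per
# entry of `workers` (via srun or ssh), pointing each at the master URL.
```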
Resilience in High Performance Computing (HPC) is a constraining factor for bringing applications to the upcoming exascale systems. Resilience techniques must scale to handle the increasing number of expected errors in an energy-efficient manner. Since the purpose of running applications on HPC systems is to perform large-scale computations as quickly as possible, resilience methods should not add a large delay to the application's time to completion. In this paper we introduce a novel technique to detect and recover from transient errors in HPC applications. One feature of our technique is that the energy budget allocated to resilience can be adjusted to the operator's resilience needs. For example, on synthetic data, the technique detects about 50% of transient errors while using only 20% of the dynamic energy required to run the application. With a 60% energy budget, an application that uses 10k cores and takes 128 hours to run requires only 10% longer (roughly 141 hours) to complete.
Graphics processing units (GPUs) have been applied successfully in many scientific computing realms due to their superior floating-point performance and memory bandwidth, and they have great potential in power system applications. N-1 static security analysis (SSA) is a candidate application in which massive numbers of alternating current power flow (ACPF) problems need to be solved. However, when existing GPU-accelerated algorithms are applied to the N-1 SSA problem, the degree of parallelism is limited, because prior research has focused on accelerating the solution of a single ACPF. This paper therefore proposes a GPU-accelerated solution that creates an additional layer of parallelism among batch ACPFs and consequently achieves a much higher level of overall parallelism. First, this paper establishes two basic principles for determining well-designed GPU algorithms, through which the limitation of the GPU-accelerated sequential-ACPF solution is demonstrated. Next, being the first of its kind, this paper proposes a novel GPU-accelerated batch-QR solver, which packages a massive number of QR tasks into a new larger-scale problem, achieving a higher level of parallelism and better-coalesced memory accesses. To further improve the efficiency of solving SSA, a GPU-accelerated batch Jacobian-matrix generation and contingency screening step is developed and carefully optimized. Lastly, the complete process of the proposed GPU-accelerated batch-ACPF solution for SSA is presented. Case studies on an 8503-bus system show that a dramatic reduction in computation time is achieved compared with all reported GPU-accelerated methods. In comparison to a UMFPACK-library-based single-CPU counterpart using an Intel Xeon E5-2620, the proposed GPU-accelerated SSA framework using an NVIDIA K20C achieves a speedup of up to 57.6x. It even achieves a 4x speedup compared to one of the fastest multi-core CPU parallel computing solutions, which uses the KLU library. The proposed batch-solving method is practically very promising and lays a critical foundation for many other power system applications that need to deal with massive subtasks, such as Monte Carlo simulation and probabilistic power flow.
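The batching idea at the heart of the proposed solver can be illustrated on the CPU with NumPy's stacked-array QR (NumPy >= 1.22); this is only a sketch of the batching structure, with random matrices standing in for contingency Jacobians, and it captures none of the custom GPU kernel or coalesced-memory work the paper contributes.

```python
# Sketch: batch QR over many contingency "Jacobians" at once. One stacked
# call factorizes the whole batch, mirroring the batch-solver idea.
import numpy as np

rng = np.random.default_rng(3)
batch, n = 1980, 64                     # one random matrix per contingency
jacobians = rng.normal(0.0, 1.0, (batch, n, n))

Q, R = np.linalg.qr(jacobians)          # stacked QR: shape (batch, n, n) each
b = rng.normal(0.0, 1.0, (batch, n))

# Solve J x = b per contingency via R x = Q^T b (a dedicated triangular
# solve would be used in practice; np.linalg.solve keeps the sketch short).
Qt_b = np.swapaxes(Q, -1, -2) @ b[..., None]
x = np.linalg.solve(R, Qt_b)[..., 0]

residual = np.abs(jacobians @ x[..., None] - b[..., None]).max()
print("max residual over the batch:", residual)
```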
We propose that, to address the growing problems with complexity and data volumes in HPC security, we need to refactor how we look at data by creating tools that not only select data but also analyze and represent it in a manner well suited to intuitive analysis. We propose a set of rules describing what this means, and provide a number of production-quality tools that represent our current best effort at implementing these ideas.
In this paper, parallelization and high performance computing are utilized to enable ultrafast transient stability analysis that can be used in a real-time environment to quickly perform “what-if” simulations involving system dynamics phenomena. EPRI's Extended Transient Midterm Simulation Program (ETMSP) is modified and enhanced for this work. The contingency analysis is scaled to large-scale contingency analysis using Message Passing Interface (MPI) based parallelization. Simulations of thousands of contingencies on a high performance computing machine are performed, and the results show that parallelizing over contingencies with MPI provides good scalability and computational gains. Different ways to reduce the Input/Output (I/O) bottleneck are explored, and the findings indicate that architecting a machine with a larger local disk and maintaining a local file system significantly improves the scaling results. Thread parallelization of the sparse linear solve is also explored using the SuperLU_MT library.
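A minimal mpi4py sketch of the parallelization-over-contingencies pattern the paper uses; run_contingency is a hypothetical stand-in for one ETMSP transient stability simulation.

```python
# Sketch: MPI parallelization over contingencies. Each rank takes a slice of
# the contingency list, simulates it independently, and rank 0 gathers results.
from mpi4py import MPI

def run_contingency(c):
    """Hypothetical placeholder for one transient stability simulation."""
    return (c, f"contingency {c} simulated")

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

contingencies = list(range(1000))            # global contingency list
mine = contingencies[rank::size]             # cyclic distribution across ranks
local_results = [run_contingency(c) for c in mine]

gathered = comm.gather(local_results, root=0)   # collect at rank 0
if rank == 0:
    results = [r for part in gathered for r in part]
    print(f"{len(results)} contingencies simulated on {size} ranks")
```

Launched with, e.g., `mpiexec -n 64 python contingencies.py`, each rank runs its slice with no inter-rank communication until the final gather, which is why parallelizing over contingencies scales well.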