Visible to the public Biblio

Filters: Keyword is HPC  [Clear All Filters]
2023-02-02
Debnath, Jayanta K., Xie, Derock.  2022.  CVSS-based Vulnerability and Risk Assessment for High Performance Computing Networks. 2022 IEEE International Systems Conference (SysCon). :1–8.
Common Vulnerability Scoring System (CVSS) is intended to capture the key characteristics of a vulnerability and correspondingly produce a numerical score to indicate the severity. Important efforts are conducted for building a CVSS stochastic model in order to provide a high-level risk assessment to better support cybersecurity decision-making. However, these efforts consider nothing regarding HPC (High-Performance Computing) networks using a Science Demilitary Zone (DMZ) architecture that has special design principles to facilitate data transition, analysis, and store through in a broadband backbone. In this paper, an HPCvul (CVSS-based vulnerability and risk assessment) approach is proposed for HPC networks in order to provide an understanding of the ongoing awareness of the HPC security situation under a dynamic cybersecurity environment. For such a purpose, HPCvul advocates the standardization of the collected security-related data from the network to achieve data portability. HPCvul adopts an attack graph to model the likelihood of successful exploitation of a vulnerability. It is able to merge multiple attack graphs from different HPC subnets to yield a full picture of a large HPC network. Substantial results are presented in this work to demonstrate HPCvul design and its performance.
2022-04-01
Akram, Ayaz, Giannakou, Anna, Akella, Venkatesh, Lowe-Power, Jason, Peisert, Sean.  2021.  Performance Analysis of Scientific Computing Workloads on General Purpose TEEs. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). :1066–1076.
Scientific computing sometimes involves computation on sensitive data. Depending on the data and the execution environment, the HPC (high-performance computing) user or data provider may require confidentiality and/or integrity guarantees. To study the applicability of hardware-based trusted execution environments (TEEs) to enable secure scientific computing, we deeply analyze the performance impact of general purpose TEEs, AMD SEV, and Intel SGX, for diverse HPC benchmarks including traditional scientific computing, machine learning, graph analytics, and emerging scientific computing workloads. We observe three main findings: 1) SEV requires careful memory placement on large scale NUMA machines (1×-3.4× slowdown without and 1×-1.15× slowdown with NUMA aware placement), 2) virtualization-a prerequisite for SEV- results in performance degradation for workloads with irregular memory accesses and large working sets (1×-4× slowdown compared to native execution for graph applications) and 3) SGX is inappropriate for HPC given its limited secure memory size and inflexible programming model (1.2×-126× slowdown over unsecure execution). Finally, we discuss forthcoming new TEE designs and their potential impact on scientific computing.
2022-03-23
Matellán, Vicente, Rodríguez-Lera, Francisco-J., Guerrero-Higueras, Ángel-M., Rico, Francisco-Martín, Ginés, Jonatan.  2021.  The Role of Cybersecurity and HPC in the Explainability of Autonomous Robots Behavior. 2021 IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO). :1–5.
Autonomous robots are increasingly widespread in our society. These robots need to be safe, reliable, respectful of privacy, not manipulable by external agents, and capable of offering explanations of their behavior in order to be accountable and acceptable in our societies. Companies offering robotic services will need to provide mechanisms to address these issues using High Performance Computing (HPC) facilities, where logs and off-line forensic analysis could be addressed if required, but these solutions are still not available in software development frameworks for robots. The aim of this paper is to discuss the implications and interactions among cybersecurity, safety, and explainability with the goal of making autonomous robots more trustworthy.
2021-03-01
Zhang, Y., Groves, T., Cook, B., Wright, N. J., Coskun, A. K..  2020.  Quantifying the impact of network congestion on application performance and network metrics. 2020 IEEE International Conference on Cluster Computing (CLUSTER). :162–168.
In modern high-performance computing (HPC) systems, network congestion is an important factor that contributes to performance degradation. However, how network congestion impacts application performance is not fully understood. As Aries network, a recent HPC network architecture featuring a dragonfly topology, is equipped with network counters measuring packet transmission statistics on each router, these network metrics can potentially be utilized to understand network performance. In this work, by experiments on a large HPC system, we quantify the impact of network congestion on various applications' performance in terms of execution time, and we correlate application performance with network metrics. Our results demonstrate diverse impacts of network congestion: while applications with intensive MPI operations (such as HACC and MILC) suffer from more than 40% extension in their execution times under network congestion, applications with less intensive MPI operations (such as Graph500 and HPCG) are mostly not affected. We also demonstrate that a stall-to-flit ratio metric derived from Aries network counters is positively correlated with performance degradation and, thus, this metric can serve as an indicator of network congestion in HPC systems.
2020-12-02
Gliksberg, J., Capra, A., Louvet, A., García, P. J., Sohier, D..  2019.  High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract). 2019 IEEE Symposium on High-Performance Interconnects (HOTI). :9—12.
Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of HPC systems. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure. It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete re-routing of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software (OpenSM) first for routing execution time to show feasibility at scale, and then for congestion risk under degradation to demonstrate robustness. The latter comparison is done using static analysis of routing tables under random permutation (RP), shift permutation (SP) and all-to-all (A2A) traffic patterns. Results for Dmodc show A2A and RP congestion risks similar under heavy degradation as the most stable algorithms compared, and near-optimal SP congestion risk up to 1% of random degradation.
2020-10-30
Basu, Kanad, Elnaggar, Rana, Chakrabarty, Krishnendu, Karri, Ramesh.  2019.  PREEMPT: PReempting Malware by Examining Embedded Processor Traces. 2019 56th ACM/IEEE Design Automation Conference (DAC). :1—6.

Anti-virus software (AVS) tools are used to detect Malware in a system. However, software-based AVS are vulnerable to attacks. A malicious entity can exploit these vulnerabilities to subvert the AVS. Recently, hardware components such as Hardware Performance Counters (HPC) have been used for Malware detection. In this paper, we propose PREEMPT, a zero overhead, high-accuracy and low-latency technique to detect Malware by re-purposing the embedded trace buffer (ETB), a debug hardware component available in most modern processors. The ETB is used for post-silicon validation and debug and allows us to control and monitor the internal activities of a chip, beyond what is provided by the Input/Output pins. PREEMPT combines these hardware-level observations with machine learning-based classifiers to preempt Malware before it can cause damage. There are many benefits of re-using the ETB for Malware detection. It is difficult to hack into hardware compared to software, and hence, PREEMPT is more robust against attacks than AVS. PREEMPT does not incur performance penalties. Finally, PREEMPT has a high True Positive value of 94% and maintains a low False Positive value of 2%.

2020-03-16
Iuhasz, Gabriel, Petcu, Dana.  2019.  Perspectives on Anomaly and Event Detection in Exascale Systems. 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). :225–229.
The design and implementation of exascale system is nowadays an important challenge. Such a system is expected to combine HPC with Big Data methods and technologies to allow the execution of scientific workloads which are not tractable at this present time. In this paper we focus on an event and anomaly detection framework which is crucial in giving a global overview of a exascale system (which in turn is necessary for the successful implementation and exploitation of the system). We propose an architecture for such a framework and show how it can be used to handle failures during job execution.
Udod, Kyryll, Kushnarenko, Volodymyr, Wesner, Stefan, Svjatnyj, Volodymyr.  2019.  Preservation System for Scientific Experiments in High Performance Computing: Challenges and Proposed Concept. 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS). 2:809–813.
Continuously growing amount of research experiments using High Performance Computing (HPC) leads to the questions of research data management and in particular how to preserve a scientific experiment including all related data for long term for its future reproduction. This paper covers some challenges and possible solutions related to the preservation of scientific experiments on HPC systems and represents a concept of the preservation system for HPC computations. Storage of the experiment itself with some related data is not only enough for its future reproduction, especially in the long term. For that case preservation of the whole experiment's environment (operating system, used libraries, environment variables, input data, etc.) via containerization technology (e.g. using Docker, Singularity) is proposed. This approach allows to preserve the entire environment, but is not always possible on every HPC system because of security issues. And it also leaves a question, how to deal with commercial software that was used within the experiment. As a possible solution we propose to run a preservation process outside of the computing system on the web-server and to replace all commercial software inside the created experiment's image with open source analogues that should allow future reproduction of the experiment without any legal issues. The prototype of such a system was developed, the paper provides the scheme of the system, its main features and describes the first experimental results and further research steps.
2019-03-22
Dooley, Rion, Brandt, Steven R., Fonner, John.  2018.  The Agave Platform: An Open, Science-as-a-Service Platform for Digital Science. Proceedings of the Practice and Experience on Advanced Research Computing. :28:1-28:8.

The Agave Platform first appeared in 2011 as a pilot project for the iPlant Collaborative [11]. In its first two years, Foundation saw over 40% growth per month, supporting 1000+ clients, 600+ applications, 4 HPC systems at 3 centers across the US. It also gained users outside of plant biology. To better serve the needs of the general open science community, we rewrote Foundation as a scalable, cloud native application and named it the Agave Platform. In this paper we present the Agave Platform, a Science-as-a-Service (ScaaS) platform for reproducible science. We provide a brief history and technical overview of the project, and highlight three case studies leveraging the platform to create synergistic value for their users.

2017-12-12
Suh, Y. K., Ma, J..  2017.  SuperMan: A Novel System for Storing and Retrieving Scientific-Simulation Provenance for Efficient Job Executions on Computing Clusters. 2017 IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS*W). :283–288.

Compute-intensive simulations typically charge substantial workloads on an online simulation platform backed by limited computing clusters and storage resources. Some (or most) of the simulations initiated by users may accompany input parameters/files that have been already provided by other (or same) users in the past. Unfortunately, these duplicate simulations may aggravate the performance of the platform by drastic consumption of the limited resources shared by a number of users on the platform. To minimize or avoid conducting repeated simulations, we present a novel system, called SUPERMAN (SimUlation ProvEnance Recycling MANager) that can record simulation provenances and recycle the results of past simulations. This system presents a great opportunity to not only reutilize existing results but also perform various analytics helpful for those who are not familiar with the platform. The system also offers interoperability across other systems by collecting the provenances in a standardized format. In our simulated experiments we found that over half of past computing jobs could be answered without actual executions by our system.

2017-12-04
Hwang, T..  2017.  NSF GENI cloud enabled architecture for distributed scientific computing. 2017 IEEE Aerospace Conference. :1–8.

GENI (Global Environment for Network Innovations) is a National Science Foundation (NSF) funded program which provides a virtual laboratory for networking and distributed systems research and education. It is well suited for exploring networks at a scale, thereby promoting innovations in network science, security, services and applications. GENI allows researchers obtain compute resources from locations around the United States, connect compute resources using 100G Internet2 L2 service, install custom software or even custom operating systems on these compute resources, control how network switches in their experiment handle traffic flows, and run their own L3 and above protocols. GENI architecture incorporates cloud federation. With the federation, cloud resources can be federated and/or community of clouds can be formed. The heart of federation is user identity and an ability to “advertise” cloud resources into community including compute, storage, and networking. GENI administrators can carve out what resources are available to the community and hence a portion of GENI resources are reserved for internal consumption. GENI architecture also provides “stitching” of compute and storage resources researchers request. This provides L2 network domain over Internet2's 100G network. And researchers can run their Software Defined Networking (SDN) controllers on the provisioned L2 network domain for a complete control of networking traffic. This capability is useful for large science data transfer (bypassing security devices for high throughput). Renaissance Computing Institute (RENCI), a research institute in the state of North Carolina, has developed ORCA (Open Resource Control Architecture), a GENI control framework. ORCA is a distributed resource orchestration system to serve science experiments. ORCA provides compute resources as virtual machines and as well as baremetals. ORCA based GENI ra- k was designed to serve both High Throughput Computing (HTC) and High Performance Computing (HPC) type of computes. Although, GENI is primarily used in various universities and research entities today, GENI architecture can be leveraged in the commercial, aerospace and government settings. This paper will go over the architecture of GENI and discuss the GENI architecture for scientific computing experiments.

Johnston, B., Lee, B., Angove, L., Rendell, A..  2017.  Embedded Accelerators for Scientific High-Performance Computing: An Energy Study of OpenCL Gaussian Elimination Workloads. 2017 46th International Conference on Parallel Processing Workshops (ICPPW). :59–68.

Energy efficient High-Performance Computing (HPC) is becoming increasingly important. Recent ventures into this space have introduced an unlikely candidate to achieve exascale scientific computing hardware with a small energy footprint. ARM processors and embedded GPU accelerators originally developed for energy efficiency in mobile devices, where battery life is critical, are being repurposed and deployed in the next generation of supercomputers. Unfortunately, the performance of executing scientific workloads on many of these devices is largely unknown, yet the bulk of computation required in high-performance supercomputers is scientific. We present an analysis of one such scientific code, in the form of Gaussian Elimination, and evaluate both execution time and energy used on a range of embedded accelerator SoCs. These include three ARM CPUs and two mobile GPUs. Understanding how these low power devices perform on scientific workloads will be critical in the selection of appropriate hardware for these supercomputers, for how can we estimate the performance of tens of thousands of these chips if the performance of one is largely unknown?

2017-05-18
Tan, Li, Chen, Zizhong, Song, Shuaiwen Leon.  2015.  Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology. ACM Trans. Archit. Code Optim.. 12:35:1–35:27.

Ever-growing performance of supercomputers nowadays brings demanding requirements of energy efficiency and resilience, due to rapidly expanding size and duration in use of the large-scale computing systems. Many application/architecture-dependent parameters that determine energy efficiency and resilience individually have causal effects with each other, which directly affect the trade-offs among performance, energy efficiency and resilience at scale. To enable high-efficiency management for large-scale High-Performance Computing (HPC) systems nowadays, quantitatively understanding the entangled effects among performance, energy efficiency, and resilience is thus required. While previous work focuses on exploring energy-saving and resilience-enhancing opportunities separately, little has been done to theoretically and empirically investigate the interplay between energy efficiency and resilience at scale. In this article, by extending the Amdahl’s Law and the Karp-Flatt Metric, taking resilience into consideration, we quantitatively model the integrated energy efficiency in terms of performance per Watt and showcase the trade-offs among typical HPC parameters, such as number of cores, frequency/voltage, and failure rates. Experimental results for a wide spectrum of HPC benchmarks on two HPC systems show that the proposed models are accurate in extrapolating resilience-aware performance and energy efficiency, and capable of capturing the interplay among various energy-saving and resilience factors. Moreover, the models can help find the optimal HPC configuration for the highest integrated energy efficiency, in the presence of failures and applied resilience techniques.