Biblio

Filters: Keyword is high-performance computing
2023-05-12
Li, Shushan, Wang, Meng, Zhang, Hong.  2022.  Deadlock Detection for MPI Programs Based on Refined Match-sets. 2022 IEEE International Conference on Cluster Computing (CLUSTER). :82–93.

Deadlock is one of the critical problems in the Message Passing Interface (MPI). At present, most techniques for detecting MPI deadlocks rely on exhausting all execution paths of a program, which is extremely inefficient. Moreover, as the number of wildcard receive events and processes grows, the number of execution paths increases exponentially, further worsening the situation. To alleviate this problem, we propose a deadlock detection approach called SAMPI that is based on match-sets and avoids exploring execution paths. In this approach, a match detection rule is employed to form rough match-sets based on the Lazy Lamport Clocks Protocol. We then design three refining algorithms, based on the non-overtaking rule and the MPI communication mechanism, to refine the match-sets. Finally, deadlocks are detected by analyzing the refined match-sets. We evaluated SAMPI on 15 different programs; the experimental results show that it is highly efficient at detecting deadlocks in MPI programs, especially in handling programs with many interleavings.

ISSN: 2168-9253
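
To make the wildcard-matching problem concrete, here is a minimal mpi4py sketch (an illustration, not code from the paper): run on three processes, it deadlocks only under the interleaving in which rank 0's wildcard receive consumes rank 1's message first. This match-dependent behavior is exactly what match-set analysis aims to capture without enumerating execution paths.

```python
# deadlock_wildcard.py -- run with: mpiexec -n 3 python deadlock_wildcard.py
# Hypothetical example; assumes mpi4py is installed. Whether this program
# deadlocks depends on which send the wildcard receive matches.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    msg1 = comm.recv(source=MPI.ANY_SOURCE)  # may match rank 1 OR rank 2
    msg2 = comm.recv(source=1)               # blocks forever if the wildcard
                                             # already consumed rank 1's message
elif rank in (1, 2):
    comm.send(rank, dest=0)
```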

2023-02-03
Suzumura, Toyotaro, Sugiki, Akiyoshi, Takizawa, Hiroyuki, Imakura, Akira, Nakamura, Hiroshi, Taura, Kenjiro, Kudoh, Tomohiro, Hanawa, Toshihiro, Sekiya, Yuji, Kobayashi, Hiroki et al..  2022.  mdx: A Cloud Platform for Supporting Data Science and Cross-Disciplinary Research Collaborations. 2022 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). :1–7.
The growing amount of data and advances in data science have created a need for a new kind of cloud platform that provides users with flexibility, strong security, and the ability to couple with supercomputers and edge devices through high-performance networks. We have built such a nationwide cloud platform, called "mdx", to meet this need. The mdx platform's virtualization service, jointly operated by 9 national universities and 2 national research institutes in Japan, launched in 2021, and more features are in development. Currently, mdx is used by researchers in a wide variety of domains, including materials informatics, geospatial information science, life science, astronomical science, economics, social science, and computer science. This paper provides an overview of the mdx platform, details the motivation for its development, reports its current status, and outlines its future plans.
2021-09-16
Al-Jody, Taha, Holmes, Violeta, Antoniades, Alexandros, Kazkouzeh, Yazan.  2020.  Bearicade: Secure Access Gateway to High Performance Computing Systems. 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). :1420–1427.
Cyber security is becoming a vital part of many information technologies and computing systems. Increasingly, High-Performance Computing (HPC) systems are used in scientific research, academia and industry, running applications specifically designed to take advantage of their parallel nature. Current research into HPC systems focuses on improvements in software development, parallel algorithms and computer systems architecture; however, there has been no significant effort to develop common HPC security standards. Security of HPC resources is often an add-on to varied existing institutional policies that do not take into account the additional requirements of HPC security. Moreover, the terminals or portals used to access HPC resources are frequently insecure or are used on unprotected networks. In this paper we present Bearicade, a data-driven Security Orchestration, Automation and Response (SOAR) system. Bearicade collects data from HPC systems and their users, enabling Machine Learning based solutions to current security issues in HPC systems. System security is achieved through monitoring, analysis and interpretation of data such as user activity, server requests, devices used and geographic locations. Anomalies in user behaviour are detected using machine learning algorithms and made visible to system administrators to help mitigate threats. The system was tested on a university campus grid system by administrators and users. Two case studies, anomaly detection of user behaviour and classification of malicious Linux terminal commands, demonstrated machine learning approaches to identifying potential security threats using Bearicade's data. The results showed that detailed information is provided to HPC administrators to detect possible security attacks and act promptly.
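
The abstract does not name the specific models Bearicade uses, but a minimal sketch of the kind of user-behaviour anomaly detection it describes might look like the following, using scikit-learn's IsolationForest on hypothetical per-user features (login hour, jobs submitted, distinct source IPs):

```python
# Illustrative sketch only: Bearicade's actual models and features are not
# described in the abstract; the feature set below is hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "normal" activity: daytime logins, modest job counts, 1-2 IPs.
normal = np.column_stack([
    rng.normal(13, 2, 500),   # login hour
    rng.poisson(20, 500),     # jobs submitted per day
    rng.integers(1, 3, 500),  # distinct source IPs
])
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)

suspect = np.array([[3.0, 400, 9]])  # 3 a.m., 400 jobs, 9 IPs
print(model.predict(suspect))        # -1 is expected: flagged as anomalous
```
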
2021-09-01
Gegan, Ross, Mao, Christina, Ghosal, Dipak, Bishop, Matt, Peisert, Sean.  2020.  Anomaly Detection for Science DMZs Using System Performance Data. 2020 International Conference on Computing, Networking and Communications (ICNC). :492–496.
Science DMZs are specialized networks that enable large-scale distributed scientific research, providing efficient and guaranteed performance while transferring large amounts of data at high rates. The high-speed performance of a Science DMZ is made viable by data transfer nodes (DTNs), which are therefore a critical point of failure. DTNs are usually monitored with network intrusion detection systems (NIDS). However, NIDS do not consider system performance data, such as network I/O interrupts and context switches, which can also be useful in revealing anomalous system performance arising from external network-based attacks or insider attacks. In this paper, we demonstrate how system performance metrics can be applied to securing a DTN in a Science DMZ network. Specifically, we evaluate the effectiveness of system performance data in detecting TCP-SYN flood attacks on a DTN, using DBSCAN (a density-based clustering algorithm) for anomaly detection. Our results demonstrate that system interrupts and context switches can be used to successfully detect TCP-SYN floods, suggesting that system performance data could be effective in detecting a variety of attacks not easily detected through network monitoring alone.
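
A minimal sketch of the detection step, on synthetic data with assumed parameter choices: per-sample (interrupt, context-switch) counts are scaled and clustered with DBSCAN, and points that fall in no dense cluster (label -1) are treated as anomalies such as a TCP-SYN flood burst.

```python
# Sketch of DBSCAN-based anomaly detection on system performance counters.
# Data is synthetic and eps/min_samples are illustrative, not the paper's.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
baseline = rng.normal([20_000, 5_000], [2_000, 500], size=(300, 2))
syn_flood = rng.normal([90_000, 30_000], [5_000, 2_000], size=(10, 2))
samples = np.vstack([baseline, syn_flood])  # (interrupts/s, ctx switches/s)

X = StandardScaler().fit_transform(samples)
# min_samples exceeds the burst length, so a short flood cannot form its
# own dense cluster and its samples are labelled -1 (noise/anomaly).
labels = DBSCAN(eps=0.5, min_samples=15).fit_predict(X)
print("anomalous sample indices:", np.where(labels == -1)[0])
```
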
2019-10-28
Trunov, Artem S., Voronova, Lilia I., Voronov, Vyacheslav I., Ayrapetov, Dmitriy P..  2018.  Container Cluster Model Development for Legacy Applications Integration in Scientific Software System. 2018 IEEE International Conference "Quality Management, Transport and Information Security, Information Technologies" (IT QM IS). :815–819.
A feature of modern scientific information systems is their integration with computing applications, providing distributed computer simulation and intelligent processing of Big Data using high-performance computing. These software systems often include legacy applications written in different programming languages, with non-standardized interfaces. To solve the application-integration problem, containerization systems are used, allowing the environment to be configured and the software system deployed in the shortest time. However, no such systems exist for computer simulation systems with a large number of nodes. The article addresses the task of combining containers into a cluster and integrating legacy applications to manage the distributed software system MD-SLAG-MELT v.14, which supports high-performance computing and visualization of computer experiment results. Testing results of the container cluster, including an automatic load-sharing module for the MD-SLAG-MELT v.14 system, are given.
2017-12-12
Dai, D., Chen, Y., Carns, P., Jenkins, J., Ross, R..  2017.  Lightweight Provenance Service for High-Performance Computing. 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). :117–129.

Provenance describes detailed information about the history of a piece of data, capturing the relationships among elements such as users, processes, jobs, and workflows that contributed to its existence. Provenance is key to supporting many data management functionalities that are increasingly important in operations such as identifying data sources, parameters, or assumptions behind a given result; auditing data usage; or understanding details about how inputs are transformed into outputs. Despite its importance, however, provenance support is largely underdeveloped in highly parallel architectures and systems. One major challenge is the demanding requirement of providing provenance service in situ: the need to remain lightweight and always on often conflicts with the need to be transparent and to offer an accurate catalog of details regarding the applications and systems. To tackle this challenge, we introduce a lightweight provenance service, called LPS, for high-performance computing (HPC) systems. LPS leverages a kernel instrumentation mechanism to achieve transparency, and introduces representative execution and flexible granularity to capture comprehensive provenance with controllable overhead. Extensive evaluations and use cases have confirmed its efficiency and usability. We believe that LPS can be integrated into current and future HPC systems to support a variety of data management needs.
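
LPS captures provenance in situ through kernel instrumentation; purely as an illustration of the underlying data model, the hypothetical sketch below records user/process/file relationships as a directed graph and answers a lineage query over it:

```python
# Rough illustration of a provenance data model only; LPS gathers these
# relationships transparently in the kernel, not from user code like this.
import networkx as nx

prov = nx.DiGraph()

def record(process, user, inputs, outputs):
    """Record one execution: input -> process -> output edges."""
    prov.add_node(process, kind="process", user=user)
    for f in inputs:
        prov.add_edge(f, process, relation="read")
    for f in outputs:
        prov.add_edge(process, f, relation="wrote")

record("sim.exe#42", "alice", ["mesh.in", "params.cfg"], ["result.h5"])
record("plot.py#43", "alice", ["result.h5"], ["fig1.png"])

# Lineage query: every artifact and process upstream of fig1.png.
print(nx.ancestors(prov, "fig1.png"))
```
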

2017-12-04
Hwang, T..  2017.  NSF GENI cloud enabled architecture for distributed scientific computing. 2017 IEEE Aerospace Conference. :1–8.

GENI (Global Environment for Network Innovations) is a National Science Foundation (NSF) funded program which provides a virtual laboratory for networking and distributed systems research and education. It is well suited for exploring networks at scale, thereby promoting innovations in network science, security, services and applications. GENI allows researchers to obtain compute resources from locations around the United States, connect them using 100G Internet2 L2 service, install custom software or even custom operating systems on these resources, control how network switches in their experiment handle traffic flows, and run their own L3 and above protocols. The GENI architecture incorporates cloud federation: cloud resources can be federated and communities of clouds can be formed. The heart of federation is user identity and the ability to "advertise" cloud resources, including compute, storage, and networking, to the community. GENI administrators can carve out which resources are available to the community, so a portion of GENI resources is reserved for internal consumption. The GENI architecture also provides "stitching" of the compute and storage resources researchers request, creating an L2 network domain over Internet2's 100G network; researchers can then run their own Software Defined Networking (SDN) controllers on the provisioned L2 domain for complete control of network traffic. This capability is useful for large science data transfers (bypassing security devices for high throughput). The Renaissance Computing Institute (RENCI), a research institute in the state of North Carolina, has developed ORCA (Open Resource Control Architecture), a GENI control framework. ORCA is a distributed resource orchestration system that serves science experiments, providing compute resources as virtual machines as well as bare metal. The ORCA-based GENI rack was designed to serve both High Throughput Computing (HTC) and High Performance Computing (HPC) workloads. Although GENI is primarily used in universities and research entities today, its architecture can be leveraged in commercial, aerospace and government settings. This paper goes over the GENI architecture and discusses its use for scientific computing experiments.

Johnston, B., Lee, B., Angove, L., Rendell, A..  2017.  Embedded Accelerators for Scientific High-Performance Computing: An Energy Study of OpenCL Gaussian Elimination Workloads. 2017 46th International Conference on Parallel Processing Workshops (ICPPW). :59–68.

Energy-efficient High-Performance Computing (HPC) is becoming increasingly important. Recent ventures into this space have introduced an unlikely candidate for achieving exascale scientific computing hardware with a small energy footprint: ARM processors and embedded GPU accelerators, originally developed for energy efficiency in mobile devices where battery life is critical, are being repurposed and deployed in the next generation of supercomputers. Unfortunately, the performance of executing scientific workloads on many of these devices is largely unknown, yet the bulk of computation required in high-performance supercomputers is scientific. We present an analysis of one such scientific code, Gaussian Elimination, and evaluate both execution time and energy used on a range of embedded accelerator SoCs, including three ARM CPUs and two mobile GPUs. Understanding how these low-power devices perform on scientific workloads will be critical in selecting appropriate hardware for these supercomputers: how can we estimate the performance of tens of thousands of these chips if the performance of one is largely unknown?
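
For reference, the computational kernel the paper benchmarks in OpenCL is ordinary Gaussian elimination; a plain NumPy restatement of the algorithm (not the paper's OpenCL code) looks like this:

```python
# Reference NumPy version of the Gaussian elimination workload:
# solves Ax = b by forward elimination with partial pivoting,
# followed by back substitution.
import numpy as np

def gaussian_eliminate(A, b):
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))          # partial pivot row
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]  # swap rows k and p
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                   # back substitution
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(gaussian_eliminate(A, b))  # matches np.linalg.solve(A, b)
```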

2017-09-01
Liu, Ning, Haider, Adnan, Jin, Dong, Sun, Xian-He.  2017.  Modeling and Simulation of Extreme-Scale Fat-Tree Networks for HPC Systems and Data Centers. ACM Transactions on Modeling and Computer Simulation (TOMACS). 27(July 2017):2.

As parallel and distributed systems evolve toward extreme scale (high-performance computing systems involve millions of cores and billion-way parallelism, and high-capacity storage systems require efficient access to petabytes or exabytes of data), many new challenges arise in designing and deploying next-generation interconnection networks in these systems. Fat-tree networks have been widely used in both data centers and high-performance computing (HPC) systems in the past decades and are promising candidates for next-generation extreme-scale networks. In this article, we present FatTreeSim, a simulation framework that supports modeling and simulation of extreme-scale fat-tree networks, with the goal of understanding the design constraints of next-generation HPC and distributed systems and aiding the design and performance optimization of the applications running on these systems. We have systematically experimented with FatTreeSim on Emulab and Blue Gene/Q and analyzed its scalability and fidelity under various network configurations. On the Blue Gene/Q Mira, FatTreeSim achieves a peak performance of 305 million events per second using 16,384 cores. Finally, we have applied FatTreeSim to simulate several large-scale Hadoop YARN applications to demonstrate its usability.
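
As a worked example of the scales involved, assuming the standard k-ary fat-tree construction built from k-port switches (k pods, (k/2)^2 core switches, k^3/4 hosts at full bisection bandwidth; the paper's exact configurations may differ), host count grows cubically in the switch radix:

```python
# Capacity of a standard k-ary fat-tree built from k-port switches.
def fat_tree(k):
    assert k % 2 == 0, "k-ary fat-trees require an even switch radix"
    return {
        "pods": k,
        "edge + aggregation switches": k * k,  # k per pod
        "core switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,
    }

for k in (16, 48, 96):
    print(k, fat_tree(k))
# k = 96 already yields 221,184 hosts: the extreme scale in question.
```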

2017-04-24
Laguna, Ignacio, Schulz, Martin, Richards, David F., Calhoun, Jon, Olson, Luke.  2016.  IPAS: Intelligent Protection Against Silent Output Corruption in Scientific Applications. Proceedings of the 2016 International Symposium on Code Generation and Optimization. :227–238.

This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually affects the output of scientific codes; we call these silent output corruption (SOC) errors. Applications therefore require duplication only for code that, when affected by a fault, yields SOC. We use machine learning to identify the instructions that must be protected to avoid SOC and, using a compiler, protect only those vulnerable instructions by duplication, significantly reducing the overhead introduced by instruction duplication. In our experiments with five workloads, IPAS reduces the percentage of SOC by up to 90% with a slowdown between 1.04x and 1.35x, which corresponds to as much as 47% less slowdown than state-of-the-art instruction duplication techniques.
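
IPAS duplicates individual instructions inside the compiler, guided by a learned model; the toy Python sketch below (an analogy, not the paper's mechanism) conveys the underlying detect-by-redundancy idea at function granularity: execute twice, compare, and treat a mismatch as corruption.

```python
# Conceptual sketch of detection by duplication. IPAS applies this idea
# per instruction, and only to instructions its ML model deems vulnerable,
# which is where the overhead savings come from.
def protected(fn):
    """Run fn twice and compare; a mismatch signals silent corruption."""
    def wrapper(*args):
        first, second = fn(*args), fn(*args)
        if first != second:
            raise RuntimeError("silent data corruption detected")
        return first
    return wrapper

@protected
def stencil_step(cell, left, right):
    return 0.25 * left + 0.5 * cell + 0.25 * right

print(stencil_step(1.0, 0.0, 2.0))  # passes; a bit-flip in one run would not
```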

Levy, Scott, Ferreira, Kurt B..  2016.  An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart. Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale. :35–42.

Fault tolerance is a key challenge in building the first exascale system. To understand the potential impact of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process, and failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases), the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance for the process of analyzing failure data in the context of coordinated checkpoint/restart; specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those where it has relatively little impact.
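
For context on what "optimal checkpoint interval" means here: the classic first-order rule of thumb is Young's approximation, tau = sqrt(2 * C * M) for checkpoint cost C and mean time between failures M, derived under exactly the exponential-failure assumption this paper probes. The paper's models are richer; this is just the baseline arithmetic.

```python
# Young's first-order approximation of the optimal checkpoint interval,
# assuming exponentially distributed failures (the assumption whose
# robustness the paper examines). Example numbers are illustrative.
from math import sqrt

def young_interval(checkpoint_cost_s, mtbf_s):
    return sqrt(2 * checkpoint_cost_s * mtbf_s)

C = 300            # checkpoint cost: 5 minutes
mtbf = 24 * 3600   # system MTBF: one day
print(f"optimal interval ~ {young_interval(C, mtbf) / 3600:.1f} hours")  # ~2.0
```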