Bibliography
Deadlock is one of the critical problems in Message Passing Interface (MPI) programs. At present, most techniques for detecting MPI deadlocks rely on exhaustively exploring all execution paths of a program, which is extremely inefficient. In addition, as the number of wildcard receive events and processes increases, the number of execution paths grows exponentially, further worsening the situation. To alleviate this problem, we propose a deadlock detection approach called SAMPI, which is based on match-sets and avoids exploring execution paths. In this approach, a match detection rule based on the Lazy Lamport Clocks Protocol is employed to form rough match-sets. We then design three refinement algorithms, based on the non-overtaking rule and the MPI communication mechanism, to refine the match-sets. Finally, deadlocks are detected by analyzing the refined match-sets. We evaluated SAMPI on 15 different programs, and the experimental results show that SAMPI is highly efficient in detecting deadlocks in MPI programs, especially for programs with many interleavings.
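As a rough illustration of the match-set idea only (not SAMPI's actual rules or its Lazy Lamport Clocks machinery), the Python sketch below pairs each receive with the sends it could match and flags receives whose candidate set is empty; the operation format and wildcard handling are simplified assumptions.

```python
# Minimal sketch of the match-set idea: pair each receive with the set of
# sends it could match, then flag receives with an empty candidate set as
# potential deadlocks. The refinement steps described in the abstract
# (non-overtaking rule, MPI semantics) are omitted here.

ANY_SOURCE = -1  # stands in for MPI_ANY_SOURCE (wildcard receive)

def build_match_sets(sends, recvs):
    """sends: list of (src, dst, tag); recvs: list of (dst, expected_src, tag)."""
    match_sets = {}
    for i, (dst, src, tag) in enumerate(recvs):
        candidates = [
            s for s in sends
            if s[1] == dst and s[2] == tag and src in (ANY_SOURCE, s[0])
        ]
        match_sets[i] = candidates
    return match_sets

def potential_deadlocks(match_sets):
    """A receive with no candidate send can never complete."""
    return [r for r, cands in match_sets.items() if not cands]

if __name__ == "__main__":
    sends = [(0, 1, 7)]                      # rank 0 sends to rank 1 with tag 7
    recvs = [(1, ANY_SOURCE, 7), (1, 2, 9)]  # second receive has no matching send
    print(potential_deadlocks(build_match_sets(sends, recvs)))  # -> [1]
```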
Provenance describes detailed information about the history of a piece of data, including the relationships among elements such as users, processes, jobs, and workflows that contribute to the existence of the data. Provenance is key to supporting many increasingly important data management functionalities, such as identifying the data sources, parameters, or assumptions behind a given result; auditing data usage; or understanding how inputs are transformed into outputs. Despite its importance, however, provenance support is largely underdeveloped in highly parallel architectures and systems. One major challenge is the demanding requirement of providing provenance services in situ: the need to remain lightweight and always on often conflicts with the need to be transparent and to offer an accurate catalog of details about the applications and systems. To tackle this challenge, we introduce a lightweight provenance service, called LPS, for high-performance computing (HPC) systems. LPS leverages a kernel instrumentation mechanism to achieve transparency, and it introduces representative execution and flexible granularity to capture comprehensive provenance with controllable overhead. Extensive evaluations and use cases confirm its efficiency and usability. We believe that LPS can be integrated into current and future HPC systems to support a variety of data management needs.
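The Python sketch below illustrates the kind of lineage information a provenance service makes queryable, tracing an output back to the inputs that produced it; the record format and function names are illustrative assumptions, not LPS's actual schema or interface.

```python
# Illustrative sketch of provenance lineage (not LPS's schema): each record
# links a process to the files it read and wrote, so the origin of any
# output can be traced backwards through the graph.

from collections import namedtuple

Record = namedtuple("Record", ["process", "inputs", "outputs"])

def trace_origins(records, artifact, seen=None):
    """Walk the lineage graph backwards from an output artifact."""
    seen = seen if seen is not None else set()
    for rec in records:
        if artifact in rec.outputs:
            for src in rec.inputs:
                if src not in seen:
                    seen.add(src)
                    trace_origins(records, src, seen)
    return seen

records = [
    Record("preprocess", ["raw.dat"], ["clean.dat"]),
    Record("simulate",   ["clean.dat", "params.cfg"], ["result.h5"]),
]
print(trace_origins(records, "result.h5"))  # {'clean.dat', 'raw.dat', 'params.cfg'}
```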
GENI (Global Environment for Network Innovations) is a National Science Foundation (NSF) funded program that provides a virtual laboratory for networking and distributed systems research and education. It is well suited for exploring networks at scale, thereby promoting innovation in network science, security, services, and applications. GENI allows researchers to obtain compute resources from locations around the United States, connect those resources using Internet2's 100G L2 service, install custom software or even custom operating systems on them, control how the network switches in their experiment handle traffic flows, and run their own L3-and-above protocols. The GENI architecture incorporates cloud federation: cloud resources can be federated, and communities of clouds can be formed. At the heart of federation are user identity and the ability to "advertise" cloud resources, including compute, storage, and networking, to the community. GENI administrators can carve out which resources are made available to the community, so a portion of GENI resources can be reserved for internal consumption. The GENI architecture also provides "stitching" of the compute and storage resources that researchers request, yielding an L2 network domain over Internet2's 100G network. Researchers can then run their own Software Defined Networking (SDN) controllers on the provisioned L2 domain for complete control of network traffic. This capability is useful for large science data transfers (bypassing security devices for high throughput). The Renaissance Computing Institute (RENCI), a research institute in the state of North Carolina, has developed ORCA (Open Resource Control Architecture), a GENI control framework. ORCA is a distributed resource orchestration system that serves science experiments, providing compute resources both as virtual machines and as bare-metal servers. The ORCA-based GENI rack was designed to serve both High Throughput Computing (HTC) and High Performance Computing (HPC) workloads. Although GENI is primarily used by universities and research institutions today, its architecture can also be leveraged in commercial, aerospace, and government settings. This paper describes the GENI architecture and discusses how it can be used for scientific computing experiments.
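The sketch below models the experiment workflow described above (reserve compute at multiple sites, stitch an L2 link between them, attach an SDN controller) as plain Python objects; the class and method names are hypothetical abstractions and do not correspond to the real GENI aggregate manager API or RSpec format.

```python
# Hypothetical abstraction of a GENI-style experiment: a slice holding nodes
# at different sites and a stitched L2 link controlled by an SDN controller.
# Names and structure are illustrative only, not a real GENI interface.

from dataclasses import dataclass, field

@dataclass
class Slice:
    name: str
    nodes: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def add_node(self, site, image="ubuntu"):
        node = {"site": site, "image": image}
        self.nodes.append(node)
        return node

    def stitch_l2(self, a, b, controller=None):
        # Models a stitched L2 path between two sites, with an optional SDN controller.
        self.links.append({"endpoints": (a["site"], b["site"]),
                           "layer": "L2", "controller": controller})

exp = Slice("data-transfer-test")
src = exp.add_node("RENCI")
dst = exp.add_node("Utah")
exp.stitch_l2(src, dst, controller="tcp:10.0.0.1:6633")
print(exp.links)
```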
Energy-efficient High-Performance Computing (HPC) is becoming increasingly important. Recent ventures into this space have introduced an unlikely candidate for achieving exascale scientific computing with a small energy footprint: ARM processors and embedded GPU accelerators, originally developed for energy efficiency in mobile devices where battery life is critical, are being repurposed and deployed in the next generation of supercomputers. Unfortunately, the performance of executing scientific workloads on many of these devices is largely unknown, yet the bulk of the computation required in high-performance supercomputers is scientific. We present an analysis of one such scientific code, Gaussian Elimination, and evaluate both execution time and energy consumption on a range of embedded accelerator SoCs, including three ARM CPUs and two mobile GPUs. Understanding how these low-power devices perform on scientific workloads will be critical in selecting appropriate hardware for these supercomputers: how can we estimate the performance of tens of thousands of these chips if the performance of one is largely unknown?
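For reference, a plain Gaussian elimination kernel of the kind benchmarked in such studies is sketched below in Python/NumPy; the paper's implementations, problem sizes, and energy instrumentation are not reproduced here, so this only illustrates the computation and its timing.

```python
# Gaussian elimination with partial pivoting, timed as a simple reference
# kernel. Energy measurement would require external meters or on-board
# sensors and is not shown.

import time
import numpy as np

def gaussian_eliminate(A, b):
    """Solve Ax = b by forward elimination with partial pivoting and back substitution."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))       # partial pivoting
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                # back substitution
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

n = 256
A = np.random.rand(n, n) + n * np.eye(n)          # well-conditioned test matrix
b = np.random.rand(n)
t0 = time.perf_counter()
x = gaussian_eliminate(A, b)
print(f"{n}x{n} solve in {time.perf_counter() - t0:.3f}s, "
      f"residual {np.linalg.norm(A @ x - b):.2e}")
```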
As parallel and distributed systems evolve toward extreme scale, with high-performance computing systems involving millions of cores and billion-way parallelism and high-capacity storage systems requiring efficient access to petabytes or exabytes of data, many new challenges arise in designing and deploying the next-generation interconnection networks of these systems. Fat-tree networks have been widely used in both data centers and high-performance computing (HPC) systems over the past decades and are promising candidates for next-generation extreme-scale networks. In this article, we present FatTreeSim, a simulation framework that supports modeling and simulation of extreme-scale fat-tree networks, with the goal of understanding the design constraints of next-generation HPC and distributed systems and aiding the design and performance optimization of the applications running on these systems. We have systematically experimented with FatTreeSim on Emulab and Blue Gene/Q and analyzed its scalability and fidelity under various network configurations. On the Blue Gene/Q Mira, FatTreeSim achieves a peak performance of 305 million events per second using 16,384 cores. Finally, we have applied FatTreeSim to simulate several large-scale Hadoop YARN applications to demonstrate its usability.
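As background, the standard sizing formulas for a k-ary fat-tree built from k-port switches (k pods, (k/2)^2 core switches, k^3/4 hosts) show why simulation at this scale needs a parallel discrete-event engine; the Python below is plain arithmetic based on those well-known formulas, not FatTreeSim code.

```python
# Standard k-ary fat-tree dimensions (Al-Fares-style topology) for k-port switches.

def fat_tree_dimensions(k):
    """k = number of ports per switch (must be even)."""
    assert k % 2 == 0, "fat-tree requires an even switch radix"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,
    }

# A 48-port fat-tree already reaches 27,648 hosts and 576 core switches,
# which is why extreme-scale studies rely on simulation.
print(fat_tree_dimensions(48))
```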
This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually affects the output of scientific codes; we call these silent output corruption (SOC) errors. Applications therefore require duplication only for code that, when affected by a fault, yields SOC. We use machine learning to identify the instructions that must be protected to avoid SOC and, using a compiler, protect only those vulnerable instructions by duplication, thus significantly reducing the overhead introduced by instruction duplication. In our experiments with five workloads, IPAS reduces the percentage of SOC by up to 90% with a slowdown that ranges between 1.04x and 1.35x, which corresponds to as much as 47% less slowdown than state-of-the-art instruction duplication techniques.
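The sketch below conveys the duplicate-and-compare idea behind selective instruction duplication; IPAS operates at the compiler level on individual instructions, so this decorator-based Python analogy and its "vulnerable" labels are illustrative assumptions only, not the paper's implementation.

```python
# Conceptual sketch of selective duplication: only computations predicted to
# cause silent output corruption are executed twice and compared.

def duplicate_and_check(fn):
    """Run the protected computation twice; a mismatch signals a soft error."""
    def wrapper(*args, **kwargs):
        first = fn(*args, **kwargs)
        second = fn(*args, **kwargs)
        if first != second:
            raise RuntimeError("silent data corruption detected in " + fn.__name__)
        return first
    return wrapper

@duplicate_and_check
def dot(a, b):                      # marked "vulnerable" by a hypothetical classifier
    return sum(x * y for x, y in zip(a, b))

def scale(v, s):                    # left unprotected: faults here assumed to be masked
    return [x * s for x in v]

print(dot(scale([1, 2, 3], 2.0), [4, 5, 6]))   # 64.0
```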
Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing, and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases), the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance for the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those where it has relatively little impact.
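For context, the sketch below implements the common first-order checkpoint/restart model: Young's approximation of the optimal checkpoint interval, which assumes exponentially distributed failures, together with a first-order expected time-to-solution estimate. The constants are illustrative and this is a textbook model, not the paper's own.

```python
# Young's first-order checkpoint model: tau ~ sqrt(2 * C * MTBF), plus a
# simple expected time-to-solution estimate under exponential failures.

import math

def optimal_interval(checkpoint_cost, mtbf):
    """Young's approximation of the optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def expected_time_to_solution(work, checkpoint_cost, restart_cost, mtbf, tau):
    """First-order expected runtime with periodic checkpoints."""
    base = work + (work / tau) * checkpoint_cost        # useful work + checkpoint overhead
    failures = base / mtbf                              # expected number of failures
    rework = failures * (restart_cost + tau / 2.0)      # restart plus half a segment lost
    return base + rework

C, R = 600.0, 300.0                    # checkpoint and restart costs (s), illustrative
mtbf = 24 * 3600.0                     # one failure per day on average
work = 7 * 24 * 3600.0                 # one week of useful computation
tau = optimal_interval(C, mtbf)
print(f"optimal interval ~ {tau / 3600:.2f} h, "
      f"time to solution ~ {expected_time_to_solution(work, C, R, mtbf, tau) / 3600:.1f} h")
```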