Biblio

Filters: Keyword is scientific information systems
2020-12-11
Xie, J., Zhang, M., Ma, Y.  2019.  Using Format Migration and Preservation Metadata to Support Digital Preservation of Scientific Data. 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS). :1–6.

With the development of e-Science and data-intensive scientific discovery, it is necessary to ensure that scientific data remain available for the long term, with the goal that valuable scientific data can be discovered and re-used for downstream investigations, either alone or in combination with newly generated data. As such, the preservation of scientific data ensures not only that experiments can be reproduced and verified, but also that other scientists can raise new questions from the data to promote research and innovation. In this paper, we focus on two main problems of digital preservation: format migration and preservation metadata. Format migration includes both format verification and object transformation. The system architecture for format migration and preservation metadata is presented, mapping rules for object transformation are analyzed, data fixity, integrity, authenticity, and digital signatures are discussed, and a detailed example is shown.
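The paper itself does not publish code, but the fixity step it discusses is easy to illustrate. Below is a minimal Python sketch assuming a caller-supplied format converter; the function names and the shape of the metadata record are hypothetical, loosely modeled on PREMIS-style preservation events:

import hashlib
from pathlib import Path

def sha256_fixity(path: Path) -> str:
    # Compute a SHA-256 digest to serve as the object's fixity value.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def migrate_with_fixity(source: Path, target: Path, transform) -> dict:
    # Record source fixity, run the format transformation (hypothetical,
    # caller-supplied), then record target fixity, so both digests can be
    # stored as preservation metadata alongside the migration event.
    source_digest = sha256_fixity(source)
    transform(source, target)
    return {
        "event": "format-migration",
        "source": {"path": str(source), "sha256": source_digest},
        "target": {"path": str(target), "sha256": sha256_fixity(target)},
    }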

2020-03-30
Souza, Renan, Azevedo, Leonardo, Lourenço, Vítor, Soares, Elton, Thiago, Raphael, Brandão, Rafael, Civitarese, Daniel, Brazil, Emilio, Moreno, Marcio, Valduriez, Patrick et al.  2019.  Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering. 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS). :1–10.
Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of existing provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows of the lifecycle while keeping the provenance capture overhead low. To address this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE and to allow provenance queries with a standard vocabulary. We show a practical use in a real case from the oil and gas (O&G) industry, along with its evaluation using 239,616 CUDA cores in parallel.
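PROV-ML itself is specified in the paper; as a hedged sketch of the W3C PROV vocabulary it builds on, the following uses the third-party Python prov package (the namespace, entity, activity, and agent names are hypothetical):

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/mlprov/")  # hypothetical namespace

raw_data = doc.entity("ex:raw-domain-dataset")   # domain data entity
model = doc.entity("ex:trained-model-v1")        # ML model entity
training = doc.activity("ex:training-run-42")    # one lifecycle activity
scientist = doc.agent("ex:cse-scientist")

doc.used(training, raw_data)                 # the run consumed the dataset
doc.wasGeneratedBy(model, training)          # and produced the model
doc.wasAssociatedWith(training, scientist)   # under a scientist's control

print(doc.get_provn())  # serialize in PROV-N notation
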
2020-03-16
Zhou, Yaqiu, Ren, Yongmao, Zhou, Xu, Yang, Wanghong, Qin, Yifang.  2019.  A Scientific Data Traffic Scheduling Algorithm Based on Software-Defined Networking. 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). :62–67.
Compared to ordinary Internet applications, the transfer of scientific data flows often places higher demands on network performance, and network security devices and systems often reduce the efficiency of scientific data transfer. As a new type of network architecture, Software-Defined Networking (SDN) decouples the data plane from the control plane; its programmability allows users to customize network transfer paths and makes the network more intelligent. The Science DMZ model is a private network for transferring scientific data flows that can improve performance while still ensuring network security. This paper combines SDN with the Science DMZ and designs and implements an SDN-based traffic scheduling algorithm that takes link load into account. In addition to distinguishing scientific data flows from common data flows, the algorithm further distinguishes the scientific data flows of different applications and schedules them differently according to specific link states. Experimental results show that the algorithm can effectively improve the transmission performance of scientific data flows.
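The paper's algorithm is not reproduced in the abstract; the following is a minimal Python sketch of load-aware path selection in the same spirit, where the flow classification and the bottleneck-load metric are assumptions:

def pick_path(flow_class: str, candidate_paths: list, link_load: dict) -> list:
    # Ordinary flows keep the default (first) path; scientific flows are
    # steered onto the candidate path with the least-loaded bottleneck link.
    if flow_class != "scientific":
        return candidate_paths[0]

    def bottleneck(path):
        # Load of the most congested link along the path.
        return max(link_load[(a, b)] for a, b in zip(path, path[1:]))

    return min(candidate_paths, key=bottleneck)

# Example: two paths between s1 and s4; the scheduler avoids the loaded link.
load = {("s1", "s2"): 0.9, ("s2", "s4"): 0.2, ("s1", "s3"): 0.3, ("s3", "s4"): 0.4}
print(pick_path("scientific", [["s1", "s2", "s4"], ["s1", "s3", "s4"]], load))
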
2019-10-28
Huang, Jingwei.  2018.  From Big Data to Knowledge: Issues of Provenance, Trust, and Scientific Computing Integrity. 2018 IEEE International Conference on Big Data (Big Data). :2197–2205.
This paper addresses the nature of data and knowledge, the relation between them, the variety of views that characterizes Big Data (data may come from many different sources reflecting different viewpoints), and the associated essential issues of data provenance, knowledge provenance, scientific computing integrity, and trust in the data science process. In the move toward data-intensive science and engineering, it is of paramount importance to ensure Scientific Computing Integrity (SCI). A failure of SCI may be caused by malicious attacks, natural environmental changes, mistakes by scientists, operational errors, faults of supporting systems, faults of processes, and errors in the data or theories on which a piece of research relies. The complexity of scientific workflows and large provenance graphs, as well as the various causes of SCI failures, make ensuring SCI extremely difficult. Provenance and trust play a critical role in evaluating SCI. This paper reports our progress in building a model for provenance-based trust reasoning about SCI.
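The paper develops its own trust model; the Python sketch below only illustrates the general idea of propagating trust along a provenance graph, with a multiplicative combination rule that is an assumption rather than the paper's:

def derived_trust(graph: dict, node: str, base_trust: dict) -> float:
    # graph maps each artifact to the list of artifacts it was derived from;
    # leaves take their score from base_trust, derived nodes combine sources.
    sources = graph.get(node, [])
    if not sources:
        return base_trust.get(node, 1.0)
    trust = 1.0
    for s in sources:
        trust *= derived_trust(graph, s, base_trust)
    return trust

# Example: a result derived from two datasets inherits their combined trust.
g = {"result": ["dataset-a", "dataset-b"]}
print(derived_trust(g, "result", {"dataset-a": 0.9, "dataset-b": 0.8}))  # 0.72
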
Trunov, Artem S., Voronova, Lilia I., Voronov, Vyacheslav I., Ayrapetov, Dmitriy P.  2018.  Container Cluster Model Development for Legacy Applications Integration in Scientific Software System. 2018 IEEE International Conference "Quality Management, Transport and Information Security, Information Technologies" (IT QM IS). :815–819.
A feature of modern scientific information systems is their integration with computing applications that provide distributed computer simulation and intelligent processing of Big Data using high-efficiency computing. These software systems often include legacy applications written in different programming languages, with non-standardized interfaces. To solve the problem of application integration, containerization systems are used, which allow an environment to be configured in the shortest time to deploy a software system. However, no such systems exist for computer simulation systems with a large number of nodes. The article considers the task of combining containers into a cluster and integrating legacy applications to manage the distributed software system MD-SLAG-MELT v.14, which supports high-performance computing and visualization of computer experiment results. Testing results for the container cluster, including the automatic load-sharing module for the MD-SLAG-MELT v.14 system, are given.
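As a rough illustration only (not the article's cluster model), the Docker SDK for Python can launch one container per legacy application on a shared network; the image names, container names, and network name here are hypothetical:

import docker

client = docker.from_env()

# One container per legacy application, attached to a common cluster network
# so the applications can reach each other despite non-standard interfaces.
legacy_apps = {
    "md-simulator": "legacy/md-simulator:14",
    "melt-visualizer": "legacy/melt-visualizer:14",
}
for name, image in legacy_apps.items():
    client.containers.run(image, name=name, network="sim-cluster", detach=True)
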
2019-09-23
Chen, W., Liang, X., Li, J., Qin, H., Mu, Y., Wang, J.  2018.  Blockchain Based Provenance Sharing of Scientific Workflows. 2018 IEEE International Conference on Big Data (Big Data). :3814–3820.
In a research community, sharing the provenance of scientific workflows can enhance distributed research cooperation, verification of experiment reproducibility, and the reuse of prior experiments. Because scientists in such a community are often loosely affiliated and geographically distributed, traditional centralized provenance sharing architectures have shown disadvantages in trustworthiness, reliability, and efficiency; they also make it difficult to protect the rights and interests of data providers. All of this has largely hindered the willingness of distributed scientists to share their workflow provenance. Given the advantages of blockchain in decentralization, trustworthiness, and high reliability, we propose an approach to sharing scientific workflow provenance based on blockchain in a research community. To make the approach more practical, provenance is handled on-chain while the original data is delivered off-chain. A block structure supporting efficient provenance storage and retrieval is designed, and an algorithm for scientists to search workflow segments in the provenance, as well as an algorithm for backtracking experiments, are provided to enhance the sharing of experiment results and to save computing resources and time by avoiding repeated experiments as far as possible. Analyses show that the approach is efficient and effective.
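The paper designs its own block structure; as a minimal hedged sketch of the on-chain/off-chain split it describes, a provenance block might carry only a locator and digest of the off-chain data (all field names below are assumptions):

import hashlib
import json
from dataclasses import dataclass, asdict

def digest(payload: dict) -> str:
    # Deterministic SHA-256 over a canonical JSON encoding.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

@dataclass
class ProvenanceBlock:
    prev_hash: str      # hash link to the previous block in the chain
    workflow_id: str    # which scientific workflow this step belongs to
    step_name: str      # the workflow segment being recorded
    data_uri: str       # original data stays off-chain; only its locator...
    data_hash: str      # ...and digest go on-chain, for integrity checking

    def block_hash(self) -> str:
        return digest(asdict(self))
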
2018-02-21
Lim, H., Ni, A., Kim, D., Ko, Y. B.  2017.  Named data networking testbed for scientific data. 2017 2nd International Conference on Computer and Communication Systems (ICCCS). :65–69.

Named Data Networking (NDN) is a clean-slate future Internet architecture. NDN provides intelligent data retrieval using the principles of name-based symmetrical forwarding of Interest/Data packets and in-network caching. The continually increasing demand for rapid dissemination of large-scale scientific data is driving the use of NDN in data-intensive science experiments. In this paper, we establish an intercontinental NDN testbed. On the testbed, an NDN-based application targeting climate science as an example data-intensive science application is designed and implemented, with features that differentiate it from previous studies. We experimentally justify the use of NDN for climate science on the intercontinental network through performance comparisons between classical delivery techniques and NDN-based climate data delivery.
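As a toy Python sketch of the two NDN principles named in the abstract, a router-side content store satisfies repeated Interests locally instead of re-forwarding them; the hierarchical naming scheme shown is an assumption, not the testbed's:

content_store = {}  # per-router in-network cache, keyed by content name

def climate_name(model: str, variable: str, year: int, segment: int) -> str:
    # Hierarchical, NDN-style name for one segment of a climate dataset.
    return f"/climate/{model}/{variable}/{year}/seg={segment}"

def on_interest(name: str, forward):
    # Serve Data from the cache when possible (in-network caching);
    # otherwise forward the Interest upstream and cache the returned Data
    # on the reverse path (symmetrical Interest/Data exchange).
    if name in content_store:
        return content_store[name]
    data = forward(name)
    content_store[name] = data
    return data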

2017-12-12
Suh, Y. K., Ma, J.  2017.  SuperMan: A Novel System for Storing and Retrieving Scientific-Simulation Provenance for Efficient Job Executions on Computing Clusters. 2017 IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS*W). :283–288.

Compute-intensive simulations typically place substantial workloads on an online simulation platform backed by limited computing clusters and storage resources. Some (or most) of the simulations initiated by users may be accompanied by input parameters/files that have already been provided by other (or the same) users in the past. Unfortunately, these duplicate simulations can degrade the performance of the platform by drastically consuming the limited resources shared by the many users on the platform. To minimize or avoid conducting repeated simulations, we present a novel system, called SUPERMAN (SimUlation ProvEnance Recycling MANager), that records simulation provenance and recycles the results of past simulations. The system presents a great opportunity not only to reutilize existing results but also to perform various analytics helpful to those who are not familiar with the platform. It also offers interoperability with other systems by collecting provenance in a standardized format. In our simulated experiments we found that over half of past computing jobs could be answered by our system without actual execution.
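The recycling idea amounts to memoizing job results by a fingerprint of their inputs. A minimal Python sketch follows; the fingerprint scheme and in-memory store are assumptions standing in for SUPERMAN's provenance store:

import hashlib
import json

provenance_store = {}  # input fingerprint -> recorded simulation result

def fingerprint(params: dict, input_file_digests: dict) -> str:
    # Canonical hash over parameters and input-file digests, so that two
    # jobs with identical inputs map to the same provenance record.
    blob = json.dumps({"params": params, "files": input_file_digests},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_or_recycle(params, input_file_digests, simulate):
    key = fingerprint(params, input_file_digests)
    if key in provenance_store:
        return provenance_store[key]          # duplicate job: answer from provenance
    provenance_store[key] = simulate(params)  # new job: execute and record
    return provenance_store[key]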

Stephan, E., Raju, B., Elsethagen, T., Pouchard, L., Gamboa, C.  2017.  A scientific data provenance harvester for distributed applications. 2017 New York Scientific Data Summit (NYSDS). :1–9.

Data provenance gives scientists a way to observe how experimental data originates, conveys process history, and explains influential factors such as experimental rationale and the associated environmental conditions derived from system metrics measured at runtime. The US Department of Energy Office of Science Integrated end-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD) project has developed a provenance harvester capable of collecting observations from the file-based evidence typically produced by distributed applications. To achieve this, file-based evidence is extracted and transformed into an intermediate data format, inspired in part by the W3C CSV on the Web recommendations, called the Harvester Provenance Application Interface (HAPI) syntax. This syntax provides a general means to pre-stage provenance into messages that are both human readable and capable of being written to a provenance store, the Provenance Environment (ProvEn). HAPI is being applied to harvest provenance from climate ensemble runs for the Accelerated Climate Modeling for Energy (ACME) project, funded under the U.S. Department of Energy's Office of Biological and Environmental Research (BER) Earth System Modeling (ESM) program. ACME informally provides provenance in a native form through configuration files, directory structures, and log files that contain success/failure indicators, code traces, and performance measurements. Because of its generic format, HAPI is also being applied to harvest tabular job-management provenance from Belle II DIRAC scheduler relational database tables, as well as from other scientific applications that log provenance-related information.
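HAPI's actual syntax is defined by the project; as a hedged Python sketch of harvesting the kind of file-based evidence listed above, the following turns log lines into row-shaped provenance messages (the log pattern and field names are invented for illustration):

import re
from pathlib import Path

LOG_PATTERN = re.compile(r"(?P<timestamp>\S+)\s+step=(?P<step>\w+)\s+status=(?P<status>\w+)")

def harvest_log(log_path: Path) -> list:
    # Extract success/failure indicators from a run log into tabular,
    # CSV-on-the-Web-style rows ready to be staged to a provenance store.
    rows = []
    for line in log_path.read_text().splitlines():
        m = LOG_PATTERN.match(line)
        if m:
            rows.append(m.groupdict())
    return rows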

2017-12-04
Hwang, T.  2017.  NSF GENI cloud enabled architecture for distributed scientific computing. 2017 IEEE Aerospace Conference. :1–8.

GENI (Global Environment for Network Innovations) is a National Science Foundation (NSF) funded program which provides a virtual laboratory for networking and distributed systems research and education. It is well suited to exploring networks at scale, thereby promoting innovations in network science, security, services, and applications. GENI allows researchers to obtain compute resources from locations around the United States, connect them using 100G Internet2 L2 service, install custom software or even custom operating systems on these resources, control how network switches in their experiment handle traffic flows, and run their own L3-and-above protocols. The GENI architecture incorporates cloud federation: cloud resources can be federated and/or a community of clouds can be formed. At the heart of federation are user identity and the ability to “advertise” cloud resources into the community, including compute, storage, and networking. GENI administrators can carve out which resources are available to the community, so a portion of GENI resources can be reserved for internal consumption. The GENI architecture also provides “stitching” of the compute and storage resources researchers request, yielding an L2 network domain over Internet2's 100G network, and researchers can run their own Software Defined Networking (SDN) controllers on the provisioned L2 network domain for complete control of network traffic. This capability is useful for large science data transfers (bypassing security devices for high throughput). The Renaissance Computing Institute (RENCI), a research institute in the state of North Carolina, has developed ORCA (Open Resource Control Architecture), a GENI control framework. ORCA is a distributed resource orchestration system that serves science experiments; it provides compute resources as virtual machines as well as bare metal. The ORCA-based GENI rack was designed to serve both High Throughput Computing (HTC) and High Performance Computing (HPC) types of computing. Although GENI is primarily used by universities and research entities today, the GENI architecture can be leveraged in commercial, aerospace, and government settings. This paper goes over the architecture of GENI and discusses it for scientific computing experiments.
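As a hedged sketch of the kind of SDN controller a researcher might run on a provisioned GENI L2 domain, the following Ryu (OpenFlow 1.3) application steers bulk-transfer traffic out a dedicated high-throughput switch port; the TCP service port and output port numbers are assumptions:

from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class BulkTransferSteering(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_ready(self, ev):
        dp = ev.msg.datapath
        parser = dp.ofproto_parser
        # Match TCP traffic to a (hypothetical) data-transfer service port
        # and send it straight out port 2, bypassing the default path.
        match = parser.OFPMatch(eth_type=0x0800, ip_proto=6, tcp_dst=4433)
        actions = [parser.OFPActionOutput(2)]
        inst = [parser.OFPInstructionActions(
            dp.ofproto.OFPIT_APPLY_ACTIONS, actions)]
        dp.send_msg(parser.OFPFlowMod(
            datapath=dp, priority=100, match=match, instructions=inst))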