A scientific data provenance harvester for distributed applications

Title: A scientific data provenance harvester for distributed applications
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Stephan, E., Raju, B., Elsethagen, T., Pouchard, L., Gamboa, C.
Conference Name: 2017 New York Scientific Data Summit (NYSDS)
Date Published: August
Keywords: Accelerated Climate Modeling, Accelerated Climate Modeling for Energy project, Atmospheric measurements, Atmospheric modeling, Biological and Environmental Research, Cognitive science, composability, Data analysis, Data models, distributed applications, Earth System Modeling program, electronic data interchange, Energy project, environmental factors, experimental rationale, Extreme Scientific Workflows project, file based evidence, HAPI, Harvester Provenance Application Interface syntax, harvesting, Human Behavior, Internet, Metrics, Performance, performance measurements, prestage provenance, process history, ProvEn, Provenance, Provenance Environment, provenance related information, provenance store, pubcrawl, relational databases, Resiliency, Science Integrated end-to-end Performance Prediction, scientific applications, scientific data provenance harvester, scientific information systems, Syntactics, tabular job management provenance, US Department of Energy Office of Science
Abstract

Data provenance provides a way for scientists to observe how experimental data originates, conveys process history, and explains influential factors such as experimental rationale and associated environmental factors from system metrics measured at runtime. The U.S. Department of Energy Office of Science Integrated end-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD) project has developed a provenance harvester that can collect observations from the file-based evidence typically produced by distributed applications. To achieve this, file-based evidence is extracted and transformed into an intermediate data format, called the Harvester Provenance Application Interface (HAPI) syntax, that is inspired in part by the W3C CSV on the Web recommendations. This syntax provides a general means to pre-stage provenance into messages that are both human readable and capable of being written to a provenance store, the Provenance Environment (ProvEn). HAPI is being applied to harvest provenance from climate ensemble runs for the Accelerated Climate Modeling for Energy (ACME) project, funded under the U.S. Department of Energy's Office of Biological and Environmental Research (BER) Earth System Modeling (ESM) program. ACME informally provides provenance in a native form through configuration files, directory structures, and log files that contain success/failure indicators, code traces, and performance measurements. Because of its generic format, HAPI is also being applied to harvest tabular job management provenance from Belle II DIRAC scheduler relational database tables, as well as from other scientific applications that log provenance-related information.

DOI: 10.1109/NYSDS.2017.8085041
Citation Key: stephan_scientific_2017