Biblio

Filters: Keyword is data provenance
2022-02-25
Sebastian-Cardenas, D., Gourisetti, S., Mylrea, M., Moralez, A., Day, G., Tatireddy, V., Allwardt, C., Singh, R., Bishop, R., Kaur, K. et al.  2021.  Digital data provenance for the power grid based on a Keyless Infrastructure Security Solution. 2021 Resilience Week (RWS). :1–10.
In this work a data provenance system for grid-oriented applications is presented. The proposed Keyless Infrastructure Security Solution (KISS) provides mechanisms to store and maintain digital data fingerprints that can later be used to validate and assert data provenance using a time-based, hash tree mechanism. The developed solution has been designed to satisfy the stringent requirements of the modern power grid, including execution time and storage requirements. Its applicability has been tested using a lab-scale, proof-of-concept deployment that secures an energy management system against the attack sequence observed in the 2016 Ukrainian power grid cyberattack. The results demonstrate strong potential for enabling data provenance in a wide array of applications, including speed-sensitive applications such as those found in control room environments.
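
To make the mechanism above concrete, here is a minimal sketch (in Python, not the authors' KISS implementation) of how a per-time-window hash tree can fingerprint grid records so that any single record's provenance can later be verified against a stored root. All record contents and names are illustrative.

```python
# Minimal sketch (not the actual KISS code): fingerprinting grid data with a
# per-time-window Merkle tree so a record's presence can later be verified
# without storing the records themselves. All names here are illustrative.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold a list of leaf hashes up to a single root hash."""
    level = leaves or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def audit_path(leaves, index):
    """Collect the sibling hashes needed to re-derive the root from one leaf."""
    path, level = [], list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index + 1 if index % 2 == 0 else index - 1
        path.append((level[sibling], index % 2 == 0))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify(leaf, path, root):
    node = leaf
    for sibling, leaf_is_left in path:
        node = h(node + sibling) if leaf_is_left else h(sibling + node)
    return node == root

# One time window of (hypothetical) EMS measurements:
records = [b"bus7:voltage=1.02pu:t=1001", b"bus7:voltage=1.01pu:t=1002",
           b"breaker3:state=closed:t=1003"]
leaves = [h(r) for r in records]
root = merkle_root(leaves)                 # only the root is published/stored
proof = audit_path(leaves, 1)
assert verify(leaves[1], proof, root)      # provenance check for record 1
```
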
Phua, Thye Way, Patros, Panos, Kumar, Vimal.  2021.  Towards Embedding Data Provenance in Files. 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC). :1319–1325.
Data provenance (keeping track of who did what, where, when and how) offers various attractive use cases for distributed systems, such as intrusion detection, forensic analysis and secure information dependability. This potential, however, can only be realized if provenance is accessible to its primary stakeholders: the end-users. Existing provenance systems are designed in an `all-or-nothing' fashion, making provenance inaccessible, difficult to extract and, crucially, not controlled by its key stakeholders. To mitigate this, we propose that provenance be separated into system, data-specific and file-metadata provenance. Furthermore, we express data-specific provenance as changes at a fine-grained level, or provenance-per-change, recorded alongside its source. We show that with the use of delta-encoding, provenance-per-change is viable, asserting that our proposed architecture is effectively realizable.
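
As a rough illustration of the provenance-per-change idea, the following hedged sketch records each file change as a delta (via Python's difflib) together with who/when/how metadata; the record layout and names are our own, not the paper's.

```python
# Hedged sketch of "provenance-per-change" via delta encoding: instead of an
# all-or-nothing provenance store, each saved version carries a small delta
# plus who/when/how metadata. Record layout and names are illustrative.
import difflib, time

def make_change_record(old: str, new: str, actor: str, tool: str) -> dict:
    delta = list(difflib.ndiff(old.splitlines(keepends=True),
                               new.splitlines(keepends=True)))
    return {"who": actor, "when": time.time(), "how": tool, "delta": delta}

def latest_version(records: list) -> str:
    """Each ndiff delta fully encodes its step's result: restore side 2."""
    return "".join(difflib.restore(records[-1]["delta"], 2))

v0 = "title: draft\nbody: hello\n"
v1 = "title: draft\nbody: hello world\n"
history = [make_change_record(v0, v1, actor="alice", tool="editor-1.3")]
assert latest_version(history) == v1
for line in history[0]["delta"]:
    print(line, end="")      # the provenance-per-change carried with the file
```
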
2021-11-29
Zhang, Song, Li, Yang, Li, Gaoyang, Yu, Han, Hao, Baozhong, Song, Jinwei, Fan, Jingang.  2020.  An Improved Data Provenance Framework Integrating Blockchain and PROV Model. 2020 International Conference on Computer Science and Management Technology (ICCSMT). :323–327.
Data tracing is an important topic in the era of the digital economy, when data are considered one of the core factors in economic activities. However, current data traceability systems often fail to obtain public trust due to their centralization and opaqueness. Blockchain possesses natural technical features such as resistance to data tampering, anonymity, and encryption security, and shows great potential for improving the credibility of data tracing. In this paper, we propose a blockchain-PROV-based multi-center data provenance solution in which the PROV model standardizes data record storage and provenance on the blockchain automatically and intelligently. The solution improves the transparency and credibility of the provenance data, supporting the efficient control and open sharing of data assets.
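
The following toy sketch (our construction, not the authors' system) shows one way PROV-style records (entity, activity, agent) can be chained by hashes so that each provenance assertion becomes tamper-evident, which is the core of the blockchain + PROV pairing described above.

```python
# Illustrative sketch only: pairing PROV-style records (entity/activity/agent)
# with a hash chain so each provenance assertion is tamper-evident. This is a
# toy stand-in for the paper's blockchain + PROV integration.
import hashlib, json

class ProvChain:
    def __init__(self):
        self.blocks = []

    def record(self, entity, activity, agent):
        prev = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        body = {"entity": entity, "wasGeneratedBy": activity,
                "wasAssociatedWith": agent, "prev": prev}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.blocks.append({"body": body, "hash": digest})

    def verify(self):
        prev = "0" * 64
        for blk in self.blocks:
            if blk["body"]["prev"] != prev:
                return False
            recomputed = hashlib.sha256(
                json.dumps(blk["body"], sort_keys=True).encode()).hexdigest()
            if recomputed != blk["hash"]:
                return False
            prev = blk["hash"]
        return True

chain = ProvChain()
chain.record("dataset:v2", "clean-and-merge", "analyst:wei")
chain.record("report:q3", "aggregate", "service:etl")
assert chain.verify()      # any edit to a stored record breaks the chain
```
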
2020-03-30
Scherzinger, Stefanie, Seifert, Christin, Wiese, Lena.  2019.  The Best of Both Worlds: Challenges in Linking Provenance and Explainability in Distributed Machine Learning. 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). :1620–1629.
Machine learning experts prefer to think of their input as a single, homogeneous, and consistent data set. However, when analyzing large volumes of data, the entire data set may not be manageable on a single server, but must be stored on a distributed file system instead. Moreover, with the pressing demand to deliver explainable models, the experts may no longer focus on the machine learning algorithms in isolation, but must take into account the distributed nature of the data stored, as well as the impact of any data pre-processing steps upstream in their data analysis pipeline. In this paper, we make the point that even basic transformations during data preparation can impact the model learned, and that this is exacerbated in a distributed setting. We then sketch our vision of end-to-end explainability of the model learned, taking the pre-processing into account. In particular, we point out the potential of linking the contributions of research on data provenance with the efforts on explainability in machine learning. In doing so, we highlight pitfalls we may experience in a distributed system on the way to generating more holistic explanations for our machine learning models.
Narendra, Nanjangud C., Shukla, Anshu, Nayak, Sambit, Jagadish, Asha, Kalkur, Rachana.  2019.  Genoma: Distributed Provenance as a Service for IoT-based Systems. 2019 IEEE 5th World Forum on Internet of Things (WF-IoT). :755–760.
One of the key aspects of IoT-based systems, which we believe has not been getting the attention it deserves, is provenance. Provenance refers to those actions that record the usage of data in the system, along with the rationale for said usage. Historically, most provenance methods in distributed systems have been tightly coupled with those of the underlying data processing frameworks in such systems. However, in this paper, we argue that IoT provenance requires a different treatment, given the heterogeneity and dynamism of IoT-based systems. In particular, provenance in IoT-based systems should be decoupled as far as possible from the underlying data processing substrates. To that end, in this paper, we present Genoma, our ongoing work on a system for provenance-as-a-service in IoT-based systems. By "provenance-as-a-service" we mean the following: distributed provenance across IoT devices, edge and cloud; and agnostic of the underlying data processing substrate. Genoma comprises a set of services that act together to provide useful provenance information to users across the system. We also show how we are realizing Genoma via an implementation prototype built on Apache Atlas and Tinkergraph, through which we are investigating several key research issues in distributed IoT provenance.
Kim, Sejin, Oh, Jisun, Kim, Yoonhee.  2019.  Data Provenance for Experiment Management of Scientific Applications on GPU. 2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS). :1–4.
Graphics Processing Units (GPUs) are increasingly used by multi-purpose applications to exploit highly parallel computation. As memory virtualization methods in GPU nodes do not efficiently handle the diverse memory usage patterns of these applications, successful execution depends on exclusive and limited use of physical memory in GPU environments. It is therefore important to predict changes in the pattern of GPU memory usage during the runtime execution of an application. Data provenance, extracted from application characteristics, GPU runtime environments, inputs, and execution patterns obtained through runtime monitoring, is defined to support application management: setting runtime configurations, predicting experimental results, and utilizing resources alongside co-located applications. In this paper, we define the data provenance of an application on GPUs and manage the data by profiling the execution of CUDA scientific applications. Data provenance management helps to predict the execution patterns of similar experiments and to plan efficient resource configurations.
Tabassum, Anika, Nady, Anannya Islam, Rezwanul Huq, Mohammad.  2019.  Mathematical Formulation and Implementation of Query Inversion Techniques in RDBMS for Tracking Data Provenance. 2019 7th International Conference on Information and Communication Technology (ICoICT). :1–6.
Nowadays, massive amounts of data are produced from different sources, and many applications process these data to discover insights. Sometimes we may get unexpected results from these applications, and it is not feasible to manually trace back to the data's origin to find the source of errors. To avoid this problem, data must be accompanied by the context of how they are processed and analyzed. In particular, data-intensive applications like e-Science always require transparency and, therefore, we need to understand how data has been processed and transformed. In this paper, we propose a mathematical formulation and implementation of query inversion techniques to trace the provenance of data in a relational database management system (RDBMS). We build mathematical formulations of inverse queries for most of the relational algebra operations and show the formula for join operations in this paper. We then implement these inversion techniques, and our experiments show that the proposed inverse queries can successfully trace back to the original data, i.e., find data provenance.
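
To illustrate the flavor of query inversion, here is a hedged toy example (not the paper's actual formulas): given a tuple produced by a selection plus join, we recover the source tuples that contributed to it. Table contents and names are invented.

```python
# A toy rendering of query inversion for provenance (not the paper's exact
# formulas): given a tuple in the output of a select/join pipeline, recover
# the source tuples that produced it. Table contents are made up.
orders    = [{"oid": 1, "cid": 10, "total": 99}, {"oid": 2, "cid": 11, "total": 250}]
customers = [{"cid": 10, "name": "Ada"}, {"cid": 11, "name": "Bob"}]

def join_select(orders, customers, min_total):
    return [{**o, **c} for o in orders for c in customers
            if o["cid"] == c["cid"] and o["total"] >= min_total]

def inverse(output_tuple):
    """Invert the join on the shared key: split an output tuple back into the
    source tuples whose attributes it carries."""
    o_src = [o for o in orders if all(output_tuple[k] == v for k, v in o.items())]
    c_src = [c for c in customers if all(output_tuple[k] == v for k, v in c.items())]
    return o_src, c_src

result = join_select(orders, customers, min_total=100)
print(result)              # [{'oid': 2, 'cid': 11, 'total': 250, 'name': 'Bob'}]
print(inverse(result[0]))  # provenance: ([order 2], [customer 11])
```
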
2019-11-25
Cui, Hongyan, Chen, Zunming, Xi, Yu, Chen, Hao, Hao, Jiawang.  2019.  IoT Data Management and Lineage Traceability: A Blockchain-based Solution. 2019 IEEE/CIC International Conference on Communications Workshops in China (ICCC Workshops). :239–244.

The Internet of Things is stepping out of its infancy into full maturity, requiring massive data processing and storage. Unfortunately, because of the unique characteristics of resource constraints, short-range communication, and self-organization in IoT, it often resorts to the cloud or fog nodes for outsourced computation and storage, which has brought about a series of novel and challenging security and privacy threats. For this reason, one of the critical challenges of having numerous IoT devices is the capacity to manage them and their data. A specific concern is from which devices or edge clouds to accept join requests or interaction requests. This paper discusses a design concept for an IoT data management platform, along with a data management and lineage traceability implementation of the platform based on blockchain and smart contracts, which addresses two major challenges: how to implement effective data management and enable rational interoperability for trusted groups of linked Things; and how to settle conflicts between untrusted IoT devices and their requests while taking security and privacy preservation into account. Experimental results show that the system scales well, with the loss of computing and communication performance remaining within an acceptable range, and that it effectively defends against unauthorized access while empowering data provenance and transparency. This verifies the feasibility and efficiency of the design concept in providing private, fine-grained, and integrity-preserving data management over IoT devices through the blockchain-based data management platform.

2019-10-28
Huang, Jingwei.  2018.  From Big Data to Knowledge: Issues of Provenance, Trust, and Scientific Computing Integrity. 2018 IEEE International Conference on Big Data (Big Data). :2197–2205.
This paper addresses the nature of data and knowledge, the relation between them, the variety of views as a characteristic of Big Data (data may come from many different sources and reflect different viewpoints), and the associated essential issues of data provenance, knowledge provenance, scientific computing integrity, and trust in the data science process. In the movement toward data-intensive science and engineering, it is of paramount importance to ensure Scientific Computing Integrity (SCI). A failure of SCI may be caused by malicious attacks, natural environmental changes, faults of scientists, operational mistakes, faults of supporting systems, faults of processes, and errors in the data or theories on which a piece of research relies. The complexity of scientific workflows and large provenance graphs, as well as the various causes of SCI failures, make ensuring SCI extremely difficult. Provenance and trust play a critical role in evaluating SCI. This paper reports our progress in building a model for provenance-based trust reasoning about SCI.
2019-09-23
Ramijak, Dusan, Pal, Amitangshu, Kant, Krishna.  2018.  Pattern Mining Based Compression of IoT Data. Proceedings of the Workshop Program of the 19th International Conference on Distributed Computing and Networking. :12:1–12:6.
The increasing proliferation of Internet of Things (IoT) devices and systems results in large amounts of highly heterogeneous data to be collected. Although at least some of the collected sensor data is often consumed by the real-time decision making and control of the IoT system, that is not the only use of such data. Invariably, the collected data is stored, perhaps in some filtered or downselected fashion, so that it can be used for a variety of lower-frequency operations. It is expected that in a smart city environment with numerous IoT deployments, the volume of such data can become enormous. Therefore, mechanisms for lossy data compression that provide a trade-off between compression ratio and data usefulness for offline statistical analysis become necessary. In this paper, we discuss several simple pattern mining based compression strategies for multi-attribute IoT data streams. For each method, we evaluate its compressibility vs. the level of similarity between the original and compressed time series in the context of a home energy management system.
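
The sketch below gives a hedged, minimal rendition of pattern-mining-based lossy compression for a multi-attribute stream: readings are quantized and the most frequent quantized tuples are replaced by short dictionary codes, trading reconstruction error for size. All parameters and data are invented.

```python
# A hedged, minimal illustration of pattern-mining-based lossy compression for
# a multi-attribute IoT stream: quantize each reading, replace the most
# frequent quantized tuples with short dictionary codes. Parameters invented.
from collections import Counter

stream = [(20.1, 55.2), (20.2, 55.1), (25.7, 40.3), (20.1, 55.3), (25.6, 40.1)]

def quantize(reading, step=0.5):
    return tuple(round(v / step) * step for v in reading)

patterns = Counter(quantize(r) for r in stream)
codebook = {pat: i for i, (pat, _) in enumerate(patterns.most_common(2))}

compressed = [codebook.get(quantize(r), quantize(r)) for r in stream]
inverse_book = {i: pat for pat, i in codebook.items()}
restored = [inverse_book.get(tok, tok) for tok in compressed]

# Trade-off: smaller representation vs. per-attribute quantization error.
err = max(abs(a - b) for orig, rec in zip(stream, restored)
          for a, b in zip(orig, rec))
print(compressed)        # frequent tuples now appear as small integer codes
print(f"max reconstruction error: {err:.2f}")
```
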
Kalokyri, Varvara, Borgida, Alexander, Marian, Amélie.  2018.  YourDigitalSelf: A Personal Digital Trace Integration Tool. Proceedings of the 27th ACM International Conference on Information and Knowledge Management. :1963–1966.
Personal information is typically fragmented across multiple, heterogeneous, distributed sources and saved as small, heterogeneous data objects, or traces. The DigitalSelf project at Rutgers University focuses on developing tools and techniques to manage (organize, search, summarize, make inferences on and personalize) such heterogeneous collections of personal digital traces. We propose to demonstrate YourDigitalSelf, a mobile phone-based personal information organization application developed as part of the DigitalSelf project. The demonstration will use a sample user data set to show how several disparate data traces can be integrated and combined to create personal narratives, or coherent episodes, of the user's activities. Conference attendees will be given the option to install YourDigitalSelf on their own devices to interact with their own data.
Whittaker, Michael, Teodoropol, Cristina, Alvaro, Peter, Hellerstein, Joseph M..  2018.  Debugging Distributed Systems with Why-Across-Time Provenance. Proceedings of the ACM Symposium on Cloud Computing. :333–346.
Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event's causal history is an over-approximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input. This enables system developers to reason about the causes of events in real-world distributed systems. We observe that automatically extracting the wat-provenance of a state machine is often infeasible. Fortunately, many distributed systems components have simple interfaces from which a developer can directly specify wat-provenance using a technique we call wat-provenance specifications. Leveraging the theoretical foundations of wat-provenance, we implement a prototype distributed debugging framework called Watermelon.
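
The paper's notion of a wat-provenance specification can be illustrated with a toy key-value store (our code, not Watermelon's): the developer states directly that the cause of a get(k) reply is the most recent prior set(k, v), rather than the over-approximating causal history.

```python
# Sketch of a "wat-provenance specification" in the paper's spirit: rather
# than extracting provenance from a state machine automatically, the developer
# states which inputs explain an output. For a key-value store, the cause of
# a get(k) reply is the latest prior set(k, v). Toy code, our naming.
class KVStore:
    def __init__(self):
        self.data, self.log = {}, []            # log = ordered input trace

    def set(self, k, v):
        self.log.append(("set", k, v))
        self.data[k] = v

    def get(self, k):
        self.log.append(("get", k))
        return self.data.get(k)

def wat_provenance(store, k):
    """Provenance spec for get(k): the most recent set(k, _) in the trace."""
    for i in range(len(store.log) - 1, -1, -1):
        op = store.log[i]
        if op[0] == "set" and op[1] == k:
            return [(i, op)]
        # earlier sets to k and unrelated ops are *not* causes, unlike the
        # over-approximating causal history, which would include everything
    return []

s = KVStore()
s.set("x", 1); s.set("y", 7); s.set("x", 2)
assert s.get("x") == 2
print(wat_provenance(s, "x"))   # [(2, ('set', 'x', 2))]
```
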
Yazici, I. M., Karabulut, E., Aktas, M. S..  2018.  A Data Provenance Visualization Approach. 2018 14th International Conference on Semantics, Knowledge and Grids (SKG). :84–91.
In recent years, data provenance has created an emerging requirement for technologies that enable end users to access, evaluate, and act on the provenance of data. In the era of Big Data, the amount of data created by corporations around the world has grown each year. For example, in both the social media and e-Science domains, data is growing at an unprecedented rate. As the data has grown rapidly, information on the origin and lifecycle of the data has also grown. In turn, this requires technologies that enable the clarification and interpretation of data through the use of data provenance. This study proposes methodologies for the visualization of provenance data compatible with the W3C PROV-O specification. The visualizations are done by summarization and comparison of the data provenance. We facilitated the testing of these methodologies by providing a prototype that extends an existing open-source visualization tool. We discuss the usability of the proposed methodologies with an experimental study; our initial results show that the proposed approach is usable, and its processing overhead is negligible.
Suriarachchi, I., Withana, S., Plale, B..  2018.  Big Provenance Stream Processing for Data Intensive Computations. 2018 IEEE 14th International Conference on e-Science (e-Science). :245–255.
In the business and research landscape of today, data analysis consumes public and proprietary data from numerous sources, and utilizes any one or more of popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out-of-order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
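
As a hedged illustration of what reducing fine-grained provenance while preserving backward provenance can mean, the toy sketch below collapses edges through transient intermediates into direct source-to-output edges; it operates on an unordered set of edges, loosely mirroring the out-of-order resilience goal. Node names are invented.

```python
# A toy reduction in the spirit of the paper's goal (our own simplification):
# collapse fine-grained provenance edges through transient intermediates so
# that backward reachability between sources and sinks is preserved.
from collections import defaultdict

edges = [("src1", "tmp1"), ("tmp1", "tmp2"), ("src2", "tmp2"), ("tmp2", "out1")]
transient = {"tmp1", "tmp2"}               # intermediates we may discard

parents = defaultdict(set)
for u, v in edges:                          # order-insensitive: a set of edges
    parents[v].add(u)

def sources(node):
    """Backward provenance: the non-transient ancestors of `node`."""
    found, stack = set(), [node]
    while stack:
        for p in parents[stack.pop()]:
            if p in transient:
                stack.append(p)
            else:
                found.add(p)
    return found

reduced = {(s, "out1") for s in sources("out1")}
print(sorted(reduced))   # [('src1', 'out1'), ('src2', 'out1')] — same reachability
```
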
2019-06-28
Gulzar, Muhammad Ali.  2018.  Interactive and Automated Debugging for Big Data Analytics. Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings. :509–511.

An abundance of data in many disciplines of science, engineering, national security, health care, and business has led to the emerging field of Big Data Analytics, which runs in cloud computing environments. To process massive quantities of data in the cloud, developers leverage Data-Intensive Scalable Computing (DISC) systems such as Google's MapReduce, Hadoop, and Spark. Currently, developers do not have easy means to debug DISC applications. The use of cloud computing makes application development feel more like batch jobs, and the nature of debugging is therefore post-mortem. Developers of big data applications write code that implements a data processing pipeline and test it on their local workstation with small sample data downloaded from a TB-scale data warehouse. They cross their fingers and hope that the program works in the expensive production cloud. When a job fails or they get a suspicious result, data scientists spend hours guessing at the source of the error, digging through post-mortem logs. In such cases, the data scientists may want to pinpoint the root cause of errors by investigating a subset of the corresponding input records. The vision of my work is to provide interactive, real-time, and automated debugging services for big data processing programs in modern DISC systems with minimal performance impact. My work investigates the following research questions in the context of big data analytics: (1) What are the necessary debugging primitives for interactive big data processing? (2) What scalable fault localization algorithms are needed to help the user localize and characterize the root causes of errors? (3) How can we improve testing efficiency during iterative development of DISC applications by reasoning about the semantics of dataflow operators and the user-defined functions used inside them in tandem? To answer these questions, we synthesize and innovate ideas from software engineering, big data systems, and program analysis, and coordinate innovations across the software stack from the user-facing API all the way down to the systems infrastructure.

2018-10-15
Benjamin E. Ujcich, University of Illinois at Urbana-Champaign, Samuel Jero, MIT Lincoln Laboratory, Anne Edmundson, Princeton University, Qi Wang, University of Illinois at Urbana-Champaign, Richard Skowyra, MIT Lincoln Laboratory, James Landry, MIT Lincoln Laboratory, Adam Bates, University of Illinois at Urbana-Champaign, William H. Sanders, University of Illinois at Urbana-Champaign, Cristina Nita-Rotaru, Northeastern University, Hamed Okhravi, MIT Lincoln Laboratory.  2018.  Cross-App Poisoning in Software-Defined Networking. 2018 ACM Conference on Computer and Communications Security.

Software-defined networking (SDN) continues to grow in popularity because of its programmable and extensible control plane realized through network applications (apps). However, apps introduce significant security challenges that can systemically disrupt network operations, since apps must access or modify data in a shared control plane state. If our understanding of how such data propagate within the control plane is inadequate, apps can co-opt other apps, causing them to poison the control plane’s integrity. 

We present a class of SDN control plane integrity attacks that we call cross-app poisoning (CAP), in which an unprivileged app manipulates the shared control plane state to trick a privileged app into taking actions on its behalf. We demonstrate how role-based access control (RBAC) schemes are insufficient for preventing such attacks because they neither track information flow nor enforce information flow control (IFC). We also present a defense, ProvSDN, that uses data provenance to track information flow and serves as an online reference monitor to prevent CAP attacks. We implement ProvSDN on the ONOS SDN controller and demonstrate that information flow can be tracked with low-latency overheads.
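
A minimal sketch of the provenance-based IFC idea (ours, not ProvSDN's actual API): track which apps contributed to each shared control-plane object and deny a privileged app's use of data whose derivation history includes a less privileged app. Privilege labels and object names are invented.

```python
# Hedged sketch of provenance-based information flow control in the spirit of
# ProvSDN (not its actual API): track which app generated each control-plane
# object and block a privileged app from acting on data derived from a less
# privileged app's writes. Privilege labels and objects are invented.
PRIV = {"firewall-app": 2, "lb-app": 1, "stats-app": 0}

provenance = {}          # object -> set of apps in its derivation history

def write(app, obj, derived_from=()):
    taint = {app}
    for src in derived_from:
        taint |= provenance.get(src, set())
    provenance[obj] = taint

def may_use(app, obj):
    """CAP guard: deny if any app in obj's history is less privileged."""
    return all(PRIV[src] >= PRIV[app] for src in provenance.get(obj, set()))

write("stats-app", "host-table")                      # low-priv app writes
write("lb-app", "path-choice", derived_from=["host-table"])
print(may_use("firewall-app", "path-choice"))         # False: tainted flow
print(may_use("stats-app", "host-table"))             # True
```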

2018-05-24
Tosh, D. K., Shetty, S., Liang, X., Kamhoua, C. A., Kwiat, K. A., Njilla, L..  2017.  Security Implications of Blockchain Cloud with Analysis of Block Withholding Attack. 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). :458–467.

The blockchain technology has emerged as an attractive solution to address performance and security issues in distributed systems. Blockchain's public and distributed peer-to-peer ledger capability benefits cloud computing services that require functions such as assured data provenance, auditing, management of digital assets, and distributed consensus. Blockchain's underlying consensus mechanism makes it possible to build a tamper-proof environment, where transactions on any digital assets are verified by a set of authentic participants or miners. With the use of strong cryptographic methods, blocks of transactions are chained together to make the records immutable. However, achieving consensus demands computational power from the miners in exchange for a handsome reward. Therefore, greedy miners always try to exploit the system by augmenting their mining power. In this paper, we first discuss blockchain's capability to provide assured data provenance in the cloud and present vulnerabilities in the blockchain cloud. We model the block withholding (BWH) attack in a blockchain cloud considering distinct pool reward mechanisms. A BWH attack provides a rogue miner ample resources in the blockchain cloud for disrupting honest miners' mining efforts, which we verified through simulations.
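
The following toy payout model (our own construction, not the paper's exact formulation) shows the basic BWH mechanics: an attacker diverts part of its hash power into a victim pool, earns share-based payouts while withholding full solutions, and thereby taxes the honest pool members.

```python
# Toy model (our construction, not the paper's) of a block withholding
# attacker who diverts part of its hash power into a victim pool, submits
# shares for reward but discards full solutions, hurting the pool.
def rewards(attacker_power, infiltration, pool_power, total=1.0):
    """Expected per-round reward fractions under proportional pool payout."""
    infiltrating = attacker_power * infiltration   # power sent into the pool
    solo = attacker_power - infiltrating           # attacker's honest mining
    effective = total - infiltrating               # withheld power finds no blocks
    pool_win = pool_power / effective              # pool's share of valid blocks
    # the pool pays by shares, so the infiltrator taxes the pool's winnings:
    attacker = solo / effective + pool_win * infiltrating / (pool_power + infiltrating)
    pool_honest = pool_win * pool_power / (pool_power + infiltrating)
    return attacker, pool_honest

base, _ = rewards(0.2, 0.0, 0.3)
bwh, victim = rewards(0.2, 0.5, 0.3)
print(f"attacker: {base:.3f} -> {bwh:.3f}, honest pool members: {victim:.3f}")
```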

2018-05-09
Gulzar, Muhammad Ali, Interlandi, Matteo, Han, Xueyuan, Li, Mingda, Condie, Tyson, Kim, Miryung.  2017.  Automated Debugging in Data-Intensive Scalable Computing. Proceedings of the 2017 Symposium on Cloud Computing. :520–534.

Developing Big Data Analytics workloads often involves trial and error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g., program crash, outlier results, etc.) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BigSift is a new faulty data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BigSift redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BigSift improves the accuracy of fault localizability by several orders of magnitude (~10³ to 10⁷×) compared to Titian data provenance, and improves performance by up to 66× compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BigSift is able to localize fault-inducing data within 62% of the original job running time.
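
For intuition, here is a classic ddmin-style bisection sketch (the general technique BigSift builds on and compares against, not BigSift itself): shrink an input dataset to a small subset on which a test oracle still fails. The oracle and data are hypothetical.

```python
# Minimal delta-debugging sketch in the spirit of BigSift's problem statement
# (classic ddmin-style bisection, not BigSift itself): shrink an input
# dataset to a small subset that still makes the test oracle fail.
def failing_subset(records, oracle):
    """Repeatedly split the data and keep any half on which `oracle` fails."""
    while len(records) > 1:
        mid = len(records) // 2
        left, right = records[:mid], records[mid:]
        if oracle(left):
            records = left
        elif oracle(right):
            records = right
        else:
            break        # fault needs records from both halves; stop here
    return records

# Hypothetical job: fails when any row has a negative reading.
oracle = lambda rows: any(r["reading"] < 0 for r in rows)
data = [{"reading": v} for v in [3, 8, 2, -5, 7, 1]]
print(failing_subset(data, oracle))   # -> [{'reading': -5}]
```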

2018-03-05
Pasquier, Thomas, Han, Xueyuan, Goldstein, Mark, Moyer, Thomas, Eyers, David, Seltzer, Margo, Bacon, Jean.  2017.  Practical Whole-System Provenance Capture. Proceedings of the 2017 Symposium on Cloud Computing. :405–418.

Data provenance describes how data came to be in its present form. It includes data sources and the transformations that have been applied to them. Data provenance has many uses, from forensics and security to aiding the reproducibility of scientific experiments. We present CamFlow, a whole-system provenance capture mechanism that integrates easily into a PaaS offering. While there have been several prior whole-system provenance systems that captured a comprehensive, systemic and ubiquitous record of a system's behavior, none have been widely adopted. They either A) impose too much overhead, B) are designed for long-outdated kernel releases and are hard to port to current systems, C) generate too much data, or D) are designed for a single system. CamFlow addresses these shortcomings by: 1) leveraging the latest kernel design advances to achieve efficiency; 2) using a self-contained, easily maintainable implementation relying on a Linux Security Module, NetFilter, and other existing kernel facilities; 3) providing a mechanism to tailor the captured provenance data to the needs of the application; and 4) making it easy to integrate provenance across distributed systems. The provenance we capture is streamed and consumed by tenant-built auditor applications. We illustrate the usability of our implementation by describing three such applications: demonstrating compliance with data regulations; performing fault/intrusion detection; and implementing data loss prevention. We also show how CamFlow can be leveraged to capture meaningful provenance without modifying existing applications.

McDonald, J. T., Manikyam, R., Glisson, W. B., Andel, T. R., Gu, Y. X..  2017.  Enhanced Operating System Protection to Support Digital Forensic Investigations. 2017 IEEE Trustcom/BigDataSE/ICESS. :650–659.

Digital forensic investigators today are faced with numerous problems when recovering footprints of criminal activity that involve the use of computer systems. Investigators need the ability to recover evidence in a forensically sound manner, even when criminals actively work to alter the integrity, veracity, and provenance of data, applications and software that are used to support illicit activities. In many ways, operating systems (OS) can be strengthened from a technological viewpoint to support verifiable, accurate, and consistent recovery of system data when needed for forensic collection efforts. In this paper, we extend the ideas for forensic-friendly OS design by proposing the use of a practical form of computing on encrypted data (CED) and computing with encrypted functions (CEF) which builds upon prior work on component encryption (in circuits) and white-box cryptography (in software). We conduct experiments on sample programs to provide analysis of the approach based on security and efficiency, illustrating how component encryption can strengthen key OS functions and improve tamper-resistance to anti-forensic activities. We analyze the tradeoff space for use of the algorithm in a holistic approach that provides additional security and comparable properties to fully homomorphic encryption (FHE).

2018-02-02
Moyer, T., Chadha, K., Cunningham, R., Schear, N., Smith, W., Bates, A., Butler, K., Capobianco, F., Jaeger, T., Cable, P..  2016.  Leveraging Data Provenance to Enhance Cyber Resilience. 2016 IEEE Cybersecurity Development (SecDev). :107–114.

Building secure systems used to mean ensuring a secure perimeter, but that is no longer the case. Today's systems are ill-equipped to deal with attackers that are able to pierce perimeter defenses. Data provenance is a critical technology in building resilient systems that will allow systems to recover from attackers that manage to overcome the "hard-shell" defenses. In this paper, we provide background information on data provenance, details on provenance collection, analysis, and storage techniques and challenges. Data provenance is situated to address the challenging problem of allowing a system to "fight-through" an attack, and we help to identify necessary work to ensure that future systems are resilient.

2018-01-10
Aman, Muhammad Naveed, Chua, Kee Chaing, Sikdar, Biplab.  2017.  Secure Data Provenance for the Internet of Things. Proceedings of the 3rd ACM International Workshop on IoT Privacy, Trust, and Security. :11–14.

The vision of smart environments, systems, and services is driven by the development of the Internet of Things (IoT). IoT devices produce large amounts of data, and this data is used to make critical decisions in many systems. The data produced by these devices has to satisfy various security-related requirements in order to be useful in practical scenarios. One of these requirements is data provenance, which allows a user to trust the data regarding its origin and location. The low cost of many IoT devices and the fact that they may be deployed in unprotected spaces require security protocols to be efficient and secure against physical attacks. This paper proposes a lightweight protocol for data provenance in the IoT. The proposed protocol uses physical unclonable functions (PUFs) to provide physical security and uniquely identify an IoT device. Moreover, wireless channel characteristics are used to uniquely identify a wireless link between an IoT device and a server/user. A brief security and performance analysis is presented to give a preliminary validation of the protocol.
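
A purely illustrative sketch of a PUF-based provenance check (our simplification, not the paper's protocol): the server pre-collects challenge-response pairs, and each data report is bound to a fresh challenge and a wireless-link fingerprint. An HMAC stands in for the physical PUF.

```python
# Purely illustrative protocol sketch (our simplification, not the paper's
# message flow): the server stores challenge-response pairs for a device's
# PUF; a fresh challenge plus a wireless-link fingerprint ties each reading
# to one device on one link. An HMAC stands in for the physical PUF.
import hmac, hashlib, os

DEVICE_SECRET = os.urandom(16)        # stands in for unclonable hardware

def puf(challenge: bytes) -> bytes:
    return hmac.new(DEVICE_SECRET, challenge, hashlib.sha256).digest()

# Enrollment: server pre-collects challenge-response pairs in a safe setting.
crp_table = {}
for _ in range(4):
    c = os.urandom(8)
    crp_table[c] = puf(c)

def device_report(challenge, data, link_fingerprint):
    tag = hmac.new(puf(challenge), data + link_fingerprint, hashlib.sha256)
    return data, tag.digest()

def server_verify(challenge, data, link_fingerprint, tag):
    expected = hmac.new(crp_table.pop(challenge),     # each CRP used once
                        data + link_fingerprint, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

c = next(iter(crp_table))
link = b"rssi-profile-42"                    # stand-in channel fingerprint
data, tag = device_report(c, b"temp=21.5", link)
print(server_verify(c, data, link, tag))     # True: provenance established
```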

2017-12-12
Miller, J. A., Peng, H., Cotterell, M. E..  2017.  Adding Support for Theory in Open Science Big Data. 2017 IEEE World Congress on Services (SERVICES). :71–75.

Open Science Big Data is emerging as an important area of research and software development. Although there are several high quality frameworks for Big Data, additional capabilities are needed for Open Science Big Data. These include data provenance, citable reusable data, data sources providing links to research literature, relationships to other data and theories, transparent analysis/reproducibility, data privacy, new optimizations/advanced algorithms, data curation, data storage and transfer. An important part of science is explanation of results, ideally leading to theory formation. In this paper, we examine means for supporting the use of theory in big data analytics as well as using big data to assist in theory formation. One approach is to fit data in a way that is compatible with some theory, existing or new. Functional Data Analysis allows precise fitting of data as well as penalties for lack of smoothness or even departure from theoretical expectations. This paper discusses principal differential analysis and related techniques for fitting data where, for example, a time-based process is governed by an ordinary differential equation. Automation in theory formation is also considered. Case studies in the fields of computational economics and finance are considered.
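
As a small worked example of the principal-differential-analysis idea (ours, not the paper's code), the sketch below estimates the parameter of a governing ODE x' = -λx from noisy samples and measures how far the data departs from the fitted theory.

```python
# A small worked example of the idea behind principal differential analysis
# (ours, not the paper's code): find the ODE parameter that best explains the
# data, i.e. fit x'(t) ≈ -lam * x(t), then judge the fit against the theory.
import numpy as np

t = np.linspace(0.0, 2.0, 21)
x = 5.0 * np.exp(-1.3 * t) + np.random.default_rng(0).normal(0, 0.02, t.size)

dx = np.gradient(x, t)                 # numerical derivative of the sample
lam = -np.sum(dx * x) / np.sum(x * x)  # least squares for x' = -lam * x

theory = x[0] * np.exp(-lam * t)       # trajectory implied by the fitted ODE
penalty = np.mean((x - theory) ** 2)   # departure-from-theory term
print(f"estimated lambda: {lam:.3f} (true 1.3), theory penalty: {penalty:.5f}")
```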

Davis, D. B., Featherston, J., Fukuda, M., Asuncion, H. U..  2017.  Data Provenance for Multi-Agent Models. 2017 IEEE 13th International Conference on e-Science (e-Science). :39–48.

Multi-agent simulations are useful for exploring collective patterns of individual behavior in social, biological, economic, network, and physical systems. However, there is no provenance support for multi-agent models (MAMs) in a distributed setting. To this end, we introduce ProvMASS, a novel approach to capture provenance of MAMs in a distributed memory by combining inter-process identification, lightweight coordination of in-memory provenance storage, and adaptive provenance capture. ProvMASS is built on top of the Multi-Agent Spatial Simulation (MASS) library, a framework that combines multi-agent systems with large-scale fine-grained agent-based models, or MAMs. Unlike other environments supporting MAMs, MASS parallelizes simulations with distributed memory, where agents and spatial data are shared application resources. We evaluate our approach with provenance queries to support three use cases and performance measures. Initial results indicate that our approach can support various provenance queries for MAMs at reasonable performance overhead.

2017-05-16
Herschel, Melanie, Hlawatsch, Marcel.  2016.  Provenance: On and Behind the Screens. Proceedings of the 2016 International Conference on Management of Data. :2213–2217.

Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance meta-data have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Often, users interested in provenance are not necessarily experts in databases or query languages, as they are typically domain experts of the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios, it is difficult to use solely queries for analyzing and exploring provenance data. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we will present some fundamental concepts of visualization before we discuss possible visualizations for provenance.