Bibliography
The Internet of Things is stepping out of its infancy into full maturity, requiring massive data processing and storage. Unfortunately, because IoT devices are resource constrained, communicate over short ranges, and self-organize, they typically rely on cloud or fog nodes for outsourced computation and storage, which introduces a series of novel and challenging security and privacy threats. A critical challenge of operating numerous IoT devices is therefore the capacity to manage both the devices and their data; a specific concern is deciding from which devices or edge clouds to accept join or interaction requests. This paper discusses a design concept for an IoT data management platform, along with a data management and lineage traceability implementation based on blockchain and smart contracts, which addresses two major challenges: how to implement effective data management and enable meaningful interoperability for trusted groups of linked Things, and how to resolve conflicts between untrusted IoT devices and their requests while preserving security and privacy. Experimental results show that the system scales well, keeping the loss of computing and communication performance within an acceptable range, and effectively defends against unauthorized access while supporting data provenance and transparency. This verifies the feasibility and efficiency of the design concept for providing private, fine-grained, and integrity-preserving data management over IoT devices through the blockchain-based data management platform.
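For concreteness, the following minimal Python sketch (an assumption-laden illustration, not the paper's implementation) shows the kind of ledger-backed data management the abstract describes: join and access requests are checked against a trusted group, and every decision is appended to a hash-chained, append-only record that can later serve as lineage.

```python
import hashlib, json, time

class ProvenanceLedger:
    """Minimal append-only, hash-chained ledger recording device join and
    data-access decisions; a stand-in for the blockchain/smart-contract layer."""
    def __init__(self):
        self.blocks = []

    def append(self, record: dict) -> str:
        prev = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = json.dumps({"prev": prev, "ts": time.time(), "record": record},
                             sort_keys=True)
        h = hashlib.sha256(payload.encode()).hexdigest()
        self.blocks.append({"hash": h, "payload": payload})
        return h

class DeviceRegistry:
    """Trusted-group membership: only registered devices may interact,
    and every request (granted or not) leaves a ledger entry."""
    def __init__(self, ledger: ProvenanceLedger):
        self.ledger = ledger
        self.members = set()

    def handle_join(self, device_id: str, approved: bool) -> bool:
        self.ledger.append({"op": "join", "device": device_id, "ok": approved})
        if approved:
            self.members.add(device_id)
        return approved

    def handle_access(self, requester: str, data_owner: str) -> bool:
        ok = requester in self.members and data_owner in self.members
        self.ledger.append({"op": "access", "from": requester,
                            "to": data_owner, "ok": ok})
        return ok
```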
An abundance of data in many disciplines of science, engineering, national security, health care, and business has led to the emerging field of Big Data Analytics, which runs in cloud computing environments. To process massive quantities of data in the cloud, developers leverage Data-Intensive Scalable Computing (DISC) systems such as Google's MapReduce, Hadoop, and Spark. Currently, developers do not have easy means to debug DISC applications. The use of cloud computing makes application development feel more like submitting batch jobs, and debugging is therefore post-mortem in nature. Developers of big data applications write code that implements a data processing pipeline and test it on their local workstation with a small data sample downloaded from a TB-scale data warehouse. They cross their fingers and hope that the program works in the expensive production cloud. When a job fails or produces a suspicious result, data scientists spend hours guessing at the source of the error and digging through post-mortem logs. In such cases, they may want to pinpoint the root cause of the error by investigating a subset of the corresponding input records. The vision of my work is to provide interactive, real-time, and automated debugging services for big data processing programs in modern DISC systems with minimal performance impact. My work investigates the following research questions in the context of big data analytics: (1) What are the necessary debugging primitives for interactive big data processing? (2) What scalable fault localization algorithms are needed to help the user localize and characterize the root causes of errors? (3) How can we improve testing efficiency during iterative development of DISC applications by reasoning about the semantics of dataflow operators and of the user-defined functions used inside them, in tandem? To answer these questions, we synthesize and innovate on ideas from software engineering, big data systems, and program analysis, and coordinate innovations across the software stack from the user-facing API all the way down to the systems infrastructure.
Software-defined networking (SDN) continues to grow in popularity because of its programmable and extensible control plane realized through network applications (apps). However, apps introduce significant security challenges that can systemically disrupt network operations, since apps must access or modify data in a shared control plane state. If our understanding of how such data propagate within the control plane is inadequate, apps can co-opt other apps, causing them to poison the control plane’s integrity.
We present a class of SDN control plane integrity attacks that we call cross-app poisoning (CAP), in which an unprivileged app manipulates the shared control plane state to trick a privileged app into taking actions on its behalf. We demonstrate how role-based access control (RBAC) schemes are insufficient for preventing such attacks because they neither track information flow nor enforce information flow control (IFC). We also present a defense, ProvSDN, that uses data provenance to track information flow and serves as an online reference monitor to prevent CAP attacks. We implement ProvSDN on the ONOS SDN controller and demonstrate that information flow can be tracked with low-latency overheads.
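The toy Python sketch below (illustrative only; ProvSDN itself is implemented on the ONOS controller) captures the core idea of a provenance-based reference monitor: writes by a low-privilege app to shared state that a privileged app acts upon are denied, closing the cross-app poisoning channel that RBAC alone misses.

```python
class ProvGraph:
    """Toy provenance record over the shared control-plane state."""
    def __init__(self):
        self.writer = {}   # state key -> app that last wrote it
        self.priv = {}     # app name -> "low" or "high"

class ProvSDNMonitor:
    """Sketch in the spirit of ProvSDN's online reference monitor: interpose on
    writes to shared state and deny those that would let a low-privilege app's
    data reach keys a privileged app acts upon (a cross-app poisoning gadget).
    The policy and data model here are assumptions, not ONOS's actual API."""
    def __init__(self, graph, high_read_keys):
        self.g = graph
        self.high_read_keys = set(high_read_keys)

    def write(self, app, key, value, store):
        if self.g.priv.get(app) == "low" and key in self.high_read_keys:
            return False              # deny: untrusted data would flow to a privileged app
        self.g.writer[key] = app      # record provenance of the successful write
        store[key] = value
        return True

# Example: a low-privilege app may not taint the host store a firewall app reads.
g = ProvGraph(); g.priv.update({"stats_app": "low", "firewall_app": "high"})
monitor = ProvSDNMonitor(g, high_read_keys={"host_store"})
state = {}
print(monitor.write("stats_app", "host_store", {"h1": "10.0.0.9"}, state))  # False
```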
Blockchain technology has emerged as an attractive solution to address performance and security issues in distributed systems. Blockchain's public, distributed peer-to-peer ledger benefits cloud computing services that require functions such as assured data provenance, auditing, management of digital assets, and distributed consensus. Blockchain's underlying consensus mechanism makes it possible to build a tamper-proof environment in which transactions on digital assets are verified by a set of authentic participants, or miners. Through strong cryptographic methods, blocks of transactions are chained together to make the records immutable. However, achieving consensus demands computational power from the miners in exchange for a handsome reward, so greedy miners always try to exploit the system by augmenting their mining power. In this paper, we first discuss blockchain's capability to provide assured data provenance in the cloud and present vulnerabilities in the blockchain cloud. We then model the block withholding (BWH) attack in a blockchain cloud under distinct pool reward mechanisms. The BWH attack gives a rogue miner ample resources in the blockchain cloud to disrupt honest miners' mining efforts, as we verify through simulations.
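A back-of-the-envelope model makes the incentive concrete. The sketch below uses a simple proportional pool-reward formulation, which is only one of several reward mechanisms a pool might use and is not necessarily the exact model analyzed in the paper; parameter names are illustrative.

```python
def bwh_revenue(alpha, p, f):
    """Expected revenue share of a block-withholding attacker under a simple
    proportional pool-reward model (a common textbook formulation).

    alpha : attacker's fraction of total network mining power
    p     : victim pool's honest fraction of total network mining power
    f     : fraction of the attacker's power infiltrating the pool (withheld)
    """
    infiltrating = f * alpha              # produces shares but never reveals blocks
    solo = (1 - f) * alpha                # mined honestly elsewhere
    effective = 1 - infiltrating          # withheld power finds no blocks for anyone
    solo_revenue = solo / effective
    pool_revenue = p / effective          # pool's overall reward rate
    attacker_pool_cut = infiltrating / (p + infiltrating)
    return solo_revenue + attacker_pool_cut * pool_revenue

# Example: a 20% attacker sending a quarter of its power into a 30% pool
# earns about 20.3% of total rewards, i.e. slightly more than mining honestly.
print(bwh_revenue(alpha=0.20, p=0.30, f=0.25))
```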
Developing Big Data Analytics workloads often involves trial-and-error debugging, due to the unclean nature of datasets or wrong assumptions made about the data. When errors (e.g., program crashes or outlier results) arise, developers are often interested in identifying a subset of the input data that reproduces the problem. BigSift is a new faulty-data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BigSift redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations specifically geared towards the iterative nature of automated debugging workloads. BigSift improves the accuracy of fault localization by several orders of magnitude ($\sim 10^3$ to $10^7\times$) compared to Titian data provenance, and improves performance by up to 66$\times$ compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BigSift is able to localize fault-inducing data within 62% of the original job's running time.
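The isolation loop that BigSift accelerates can be sketched as classic delta debugging driven by a test oracle. The simplified Python below reduces only via complements of partitions and omits the provenance-based pruning and operator-aware optimizations that give BigSift its speedups; it is a sketch of the general technique, not BigSift's algorithm.

```python
def ddmin(records, failing):
    """Shrink `records` to a small subset for which the test oracle `failing`
    still returns True (i.e., the fault is still reproducible)."""
    n = 2
    while len(records) >= 2:
        chunk = max(1, len(records) // n)
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i in range(len(subsets)):
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if failing(complement):            # fault reproducible without subset i
                records, n = complement, max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(records):
                break
            n = min(len(records), n * 2)       # increase granularity and retry
    return records

# Example oracle: the "job" fails whenever a negative record is present.
faulty = ddmin(list(range(-3, 100)), lambda rs: any(r < 0 for r in rs))
print(faulty)   # a small failure-inducing subset (here, a single negative record)
```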
Data provenance describes how data came to be in its present form. It includes data sources and the transformations that have been applied to them. Data provenance has many uses, from forensics and security to aiding the reproducibility of scientific experiments. We present CamFlow, a whole-system provenance capture mechanism that integrates easily into a PaaS offering. While there have been several prior whole-system provenance systems that captured a comprehensive, systemic, and ubiquitous record of a system's behavior, none have been widely adopted. They either A) impose too much overhead, B) are designed for long-outdated kernel releases and are hard to port to current systems, C) generate too much data, or D) are designed for a single system. CamFlow addresses these shortcomings by: 1) leveraging the latest kernel design advances to achieve efficiency; 2) using a self-contained, easily maintainable implementation relying on a Linux Security Module, NetFilter, and other existing kernel facilities; 3) providing a mechanism to tailor the captured provenance data to the needs of the application; and 4) making it easy to integrate provenance across distributed systems. The provenance we capture is streamed and consumed by tenant-built auditor applications. We illustrate the usability of our implementation by describing three such applications: demonstrating compliance with data regulations; performing fault/intrusion detection; and implementing data loss prevention. We also show how CamFlow can be leveraged to capture meaningful provenance without modifying existing applications.
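As an illustration of the tenant-built auditors mentioned above, the sketch below consumes a stream of provenance edges and flags flows from a sensitive entity toward a network socket (a toy data-loss-prevention check). The record format and entity names here are hypothetical and are not CamFlow's actual schema or API.

```python
import json, sys

# Hypothetical edge shape: {"from": "<entity>", "to": "<entity>"}
SENSITIVE = {"/var/data/patient_records"}

def audit(stream, sensitive=SENSITIVE):
    """Read provenance edges line by line, propagate taint from sensitive
    entities, and report edges that move tainted data to a socket."""
    tainted = set(sensitive)
    alerts = []
    for line in stream:
        edge = json.loads(line)
        if edge.get("from") in tainted:
            dst = edge.get("to", "")
            tainted.add(dst)
            if dst.startswith("socket:"):
                alerts.append(edge)
    return alerts

if __name__ == "__main__":
    print(audit(sys.stdin))
```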
Digital forensic investigators today face numerous problems when recovering footprints of criminal activity that involve the use of computer systems. Investigators need the ability to recover evidence in a forensically sound manner, even when criminals actively work to alter the integrity, veracity, and provenance of the data, applications, and software used to support illicit activities. In many ways, operating systems (OS) can be strengthened from a technological viewpoint to support verifiable, accurate, and consistent recovery of system data when needed for forensic collection efforts. In this paper, we extend ideas for forensic-friendly OS design by proposing the use of a practical form of computing on encrypted data (CED) and computing with encrypted functions (CEF), building on prior work on component encryption (in circuits) and white-box cryptography (in software). We conduct experiments on sample programs to analyze the approach in terms of security and efficiency, illustrating how component encryption can strengthen key OS functions and improve tamper resistance against anti-forensic activities. We analyze the tradeoff space for using the algorithm in a holistic approach that provides additional security and properties comparable to fully homomorphic encryption (FHE).
Building secure systems used to mean ensuring a secure perimeter, but that is no longer the case. Today's systems are ill-equipped to deal with attackers who are able to pierce perimeter defenses. Data provenance is a critical technology for building resilient systems that can recover from attackers who manage to overcome the "hard-shell" defenses. In this paper, we provide background on data provenance and detail provenance collection, analysis, and storage techniques and their challenges. Data provenance is well situated to address the challenging problem of allowing a system to "fight through" an attack, and we identify the work needed to ensure that future systems are resilient.
The vision of smart environments, systems, and services is driven by the development of the Internet of Things (IoT). IoT devices produce large amounts of data, and this data is used to make critical decisions in many systems. The data produced by these devices has to satisfy various security-related requirements in order to be useful in practical scenarios. One of these requirements is data provenance, which allows a user to trust the data regarding its origin and location. The low cost of many IoT devices and the fact that they may be deployed in unprotected spaces require security protocols that are efficient and secure against physical attacks. This paper proposes a lightweight protocol for data provenance in the IoT. The proposed protocol uses physical unclonable functions (PUFs) to provide physical security and to uniquely identify an IoT device. Moreover, wireless channel characteristics are used to uniquely identify the wireless link between an IoT device and a server/user. A brief security and performance analysis is presented as a preliminary validation of the protocol.
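To make the protocol idea concrete, the sketch below simulates PUF-based enrollment and verification together with a wireless-link fingerprint check. The HMAC-based "PUF", the fingerprint representation, and the message flow are assumptions for illustration, not the paper's exact protocol.

```python
import hashlib, hmac, os, secrets

def simulated_puf(device_secret: bytes, challenge: bytes) -> bytes:
    """Stand-in for a hardware PUF: in reality the response comes from
    uncloneable physical variation, not a stored secret."""
    return hmac.new(device_secret, challenge, hashlib.sha256).digest()

class Server:
    """Verification side: enrolled challenge-response pairs identify the device;
    a channel fingerprint (e.g., derived from link characteristics such as RSSI)
    identifies the wireless link."""
    def __init__(self):
        self.crp_store = {}     # device_id -> list of (challenge, response)
        self.link_prints = {}   # device_id -> expected channel fingerprint

    def enroll(self, device_id, device_secret, link_print, n=8):
        self.crp_store[device_id] = [
            (c, simulated_puf(device_secret, c))
            for c in (os.urandom(16) for _ in range(n))
        ]
        self.link_prints[device_id] = link_print

    def verify(self, device_id, respond, observed_link_print) -> bool:
        challenge, expected = self.crp_store[device_id].pop()   # use each CRP once
        return (secrets.compare_digest(respond(challenge), expected)
                and observed_link_print == self.link_prints[device_id])

# Device side: the secret stands in for the PUF's physical randomness.
dev_secret = os.urandom(32)
srv = Server(); srv.enroll("sensor-42", dev_secret, link_print="ch-fprint-A")
print(srv.verify("sensor-42", lambda c: simulated_puf(dev_secret, c), "ch-fprint-A"))
```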
Open Science Big Data is emerging as an important area of research and software development. Although there are several high quality frameworks for Big Data, additional capabilities are needed for Open Science Big Data. These include data provenance, citable reusable data, data sources providing links to research literature, relationships to other data and theories, transparent analysis/reproducibility, data privacy, new optimizations/advanced algorithms, data curation, data storage and transfer. An important part of science is explanation of results, ideally leading to theory formation. In this paper, we examine means for supporting the use of theory in big data analytics as well as using big data to assist in theory formation. One approach is to fit data in a way that is compatible with some theory, existing or new. Functional Data Analysis allows precise fitting of data as well as penalties for lack of smoothness or even departure from theoretical expectations. This paper discusses principal differential analysis and related techniques for fitting data where, for example, a time-based process is governed by an ordinary differential equation. Automation in theory formation is also considered. Case studies in the fields of computational economics and finance are considered.
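One standard way to write down principal differential analysis (a textbook formulation, not necessarily the exact criterion used in the paper) is as a penalized fit in which departure from an assumed linear differential equation is penalized:

```latex
% Fit a curve x(t) to observations y_i at times t_i, penalizing departure from
% the null space of a linear differential operator L (the "theory", e.g. an ODE):
\[
  L x \;=\; D^{m} x \;+\; \sum_{j=0}^{m-1} \beta_{j}(t)\, D^{j} x ,
\]
\[
  \min_{x}\; \sum_{i=1}^{n} \bigl( y_i - x(t_i) \bigr)^{2}
  \;+\; \lambda \int \bigl( L x(t) \bigr)^{2} \, dt .
\]
% Principal differential analysis additionally estimates the weight functions
% \beta_j(t) so that the fitted curves (nearly) satisfy L x = 0.
```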
Multi-agent simulations are useful for exploring collective patterns of individual behavior in social, biological, economic, network, and physical systems. However, there is no provenance support for multi-agent models (MAMs) in a distributed setting. To this end, we introduce ProvMASS, a novel approach to capturing the provenance of MAMs in distributed memory by combining inter-process identification, lightweight coordination of in-memory provenance storage, and adaptive provenance capture. ProvMASS is built on top of the Multi-Agent Spatial Simulation (MASS) library, a framework that combines multi-agent systems with large-scale, fine-grained agent-based models, or MAMs. Unlike other environments supporting MAMs, MASS parallelizes simulations with distributed memory, where agents and spatial data are shared application resources. We evaluate our approach with provenance queries supporting three use cases and with performance measurements. Initial results indicate that our approach can support a variety of provenance queries for MAMs at reasonable performance overhead.
Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance metadata have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Users interested in provenance are often not experts in databases or query languages; rather, they are typically domain experts on the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios it is difficult to analyze and explore provenance data using queries alone. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we present some fundamental concepts of visualization before discussing possible visualizations for provenance.