Biblio

Found 12044 results

Filters: Keyword is Resiliency
2017-05-17
Huang, Zhenqi, Wang, Yu, Mitra, Sayan, Dullerud, Geir.  2016.  Controller Synthesis for Linear Dynamical Systems with Adversaries. Proceedings of the Symposium and Bootcamp on the Science of Security. :53–62.

We present a controller synthesis algorithm for a reach-avoid problem in the presence of adversaries. Our model of the adversary abstractly captures typical malicious attacks envisioned on cyber-physical systems such as sensor spoofing, controller corruption, and actuator intrusion. After formulating the problem in a general setting, we present a sound and complete algorithm for the case with linear dynamics and an adversary with a budget on the total L2-norm of its actions. The algorithm relies on a result from linear control theory that enables us to decompose and compute the reachable states of the system in terms of a symbolic simulation of the adversary-free dynamics and the total uncertainty induced by the adversary. With this decomposition, the synthesis problem eliminates the universal quantifier on the adversary's choices, and the symbolic controller actions can be effectively solved using an SMT solver. The constraints induced by the adversary are computed by solving second-order cone programs. The algorithm is later extended to synthesize state-dependent controllers and to generate attacks for the adversary. We present preliminary experimental results that show the effectiveness of this approach on several example problems.
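
The decomposition lends itself to a compact illustration. Below is a minimal sketch (not the authors' implementation) using the z3 SMT solver on a scalar system x(k+1) = a*x(k) + u(k) + w(k): the worst-case deviation the adversary can induce is pre-computed, which removes the universal quantifier, and the solver is then asked for inputs that keep the inflated nominal trajectory safe and drive it into the goal set. For simplicity the adversary budget here bounds the sum of |w(k)| rather than an L2-norm, and all constants are invented.

    # Sketch only: robust reach-avoid for x(k+1) = a*x(k) + u(k) + w(k),
    # adversary budget: sum of |w(k)| <= W. Requires: pip install z3-solver
    from z3 import Real, Solver, sat

    a, W, T = 0.9, 0.5, 6            # dynamics gain, adversary budget, horizon
    x0, safe, goal = 4.0, 6.0, 1.0   # initial state, |x| <= safe, |x(T)| <= goal

    # With |a| <= 1, a disturbance does the most damage when injected at the
    # previous step, so the worst-case deviation at every step k >= 1 is W.
    dev = [0.0] + [W] * T

    s = Solver()
    u = [Real('u%d' % k) for k in range(T)]
    x = [x0]
    for k in range(T):
        x.append(a * x[k] + u[k])            # adversary-free (nominal) dynamics
        s.add(u[k] <= 2, u[k] >= -2)         # actuator limits
    for k in range(T + 1):
        s.add(x[k] + dev[k] <= safe, x[k] - dev[k] >= -safe)   # robust safety
    s.add(x[T] + dev[T] <= goal, x[T] - dev[T] >= -goal)       # robust reach

    if s.check() == sat:
        m = s.model()
        print([m[uk] for uk in u])           # one robust open-loop input sequence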

Kang, Eunsuk, Adepu, Sridhar, Jackson, Daniel, Mathur, Aditya P.  2016.  Model-based Security Analysis of a Water Treatment System. Proceedings of the 2nd International Workshop on Software Engineering for Smart Cyber-Physical Systems. :22–28.

An approach to analyzing the security of a cyber-physical system (CPS) is proposed, where the behavior of a physical plant and its controller are captured in approximate models, and their interaction is rigorously checked to discover potential attacks that involve a varying number of compromised sensors and actuators. As a preliminary study, this approach has been applied to a fully functional water treatment testbed constructed at the Singapore University of Technology and Design. The analysis revealed previously unknown attacks that were confirmed to pose serious threats to the safety of the testbed, and it suggests a number of research challenges and opportunities for applying similar formal analyses to cyber-physical security.
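
The flavor of such an analysis can be conveyed by a deliberately tiny sketch (all dynamics, thresholds, and names invented, with exhaustive simulation standing in for the paper's rigorous model checking): enumerate which sensors an attacker might compromise, simulate the closed-loop plant model under the spoofed readings, and report any combination that drives the plant into an unsafe state.

    from itertools import combinations

    TANK_CAPACITY = 100.0           # unsafe above this level (overflow)
    INFLOW, OUTFLOW = 5.0, 3.0      # per-step flow rates (invented)

    def simulate(compromised, steps=50):
        """Toy tank whose controller trusts a level sensor; a compromised
        sensor always reports a low level, holding the inlet valve open."""
        level = 50.0
        for _ in range(steps):
            reading = 10.0 if 'level' in compromised else level
            if reading < 80.0:                  # naive control law
                level += INFLOW
            level = max(level - OUTFLOW, 0.0)
            if level > TANK_CAPACITY:
                return False                    # unsafe state reached
        return True

    sensors = ['level', 'flow_in', 'flow_out']
    for k in range(len(sensors) + 1):
        for subset in combinations(sensors, k):
            if not simulate(set(subset)):
                print('unsafe under compromised sensors:', subset)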

Wang, Kun, Du, Miao, Yang, Dejun, Zhu, Chunsheng, Shen, Jian, Zhang, Yan.  2016.  Game-Theory-Based Active Defense for Intrusion Detection in Cyber-Physical Embedded Systems. ACM Trans. Embed. Comput. Syst. 16:18:1–18:21.

Cyber-Physical Embedded Systems (CPESs) are distributed embedded systems integrated with various actuators and sensors. When it comes to CPES security, the most significant problem is the security of Embedded Sensor Networks (ESNs). With the continuous growth of ESNs, the security of transferring data from sensors to their destinations has become an important research area. Due to limitations in power, storage, and processing capabilities, existing security mechanisms for wired or wireless networks cannot be applied directly to ESNs. Meanwhile, ESNs face many different kinds of attacks in industrial scenarios. Therefore, there is a need to develop new techniques or modify current security mechanisms to overcome these problems. In this article, we focus on Intrusion Detection (ID) techniques and propose a new attack-defense game model to detect malicious nodes using a repeated game approach. As a direct consequence of the game model, attackers and defenders make different strategies to achieve optimal payoffs. Importantly, erroneous detections and missed detections are taken into consideration in Intrusion Detection Systems (IDSs), where a game tree model is introduced to solve this problem. In addition, we analyze and prove the existence of pure Nash equilibrium and mixed Nash equilibrium. Simulations show that the proposed model can reduce energy consumption by up to 50% compared with the existing All Monitor (AM) model and improve the detection rate by 10% to 15% compared with the existing Cluster Head (CH) monitor model.
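
To make the equilibrium analysis concrete: for a two-strategy zero-sum attack-defense game, the mixed Nash equilibrium follows in closed form from the players' indifference conditions. A minimal sketch with invented payoffs (any resemblance to the paper's game is schematic only):

    # Toy 2x2 zero-sum attack-defense game (illustrative payoffs only).
    # Rows: defender {monitor, idle}; columns: attacker {attack, wait}.
    # M[i][j] is the defender's payoff; the attacker's payoff is -M[i][j].
    M = [[ 3.0, -1.0],    # monitor: catch an attack, or waste energy
         [-5.0,  0.0]]    # idle:    miss an attack, or save energy

    # At a mixed equilibrium the defender monitors with probability p chosen
    # so the attacker is indifferent between its two columns:
    #   p*M[0][0] + (1-p)*M[1][0] = p*M[0][1] + (1-p)*M[1][1]
    den = M[0][0] - M[0][1] - M[1][0] + M[1][1]
    assert den != 0, "game has a pure-strategy saddle point"
    p = (M[1][1] - M[1][0]) / den

    # Symmetrically, the attacker attacks with probability q that makes the
    # defender indifferent between its two rows:
    q = (M[1][1] - M[0][1]) / den

    print("defender monitors with p = %.3f" % p)   # 5/9, about 0.556
    print("attacker attacks with q  = %.3f" % q)   # 1/9, about 0.111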

Ali, Sk Subidh, Ibrahim, Mohamed, Sinanoglu, Ozgur, Chakrabarty, Krishnendu, Karri, Ramesh.  2016.  Security Assessment of Cyberphysical Digital Microfluidic Biochips. IEEE/ACM Trans. Comput. Biol. Bioinformatics. 13:445–458.

A digital microfluidic biochip (DMFB) is an emerging technology that enables miniaturized analysis systems for point-of-care clinical diagnostics, DNA sequencing, and environmental monitoring. A DMFB reduces the rate of sample and reagent consumption, and automates the analysis of assays. In this paper, we provide the first assessment of the security vulnerabilities of DMFBs. We identify result-manipulation attacks on a DMFB that maliciously alter the assay outcomes. Two practical result-manipulation attacks are shown on a DMFB platform performing enzymatic glucose assay on serum. In the first attack, the attacker adjusts the concentration of the glucose sample and thereby modifies the final result. In the second attack, the attacker tampers with the calibration curve of the assay operation. We then identify denial-of-service attacks, where the attacker can disrupt the assay operation by tampering either with the droplet-routing algorithm or with the actuation sequence. We demonstrate these attacks using a digital microfluidic synthesis simulator. The results show that the attacks are easy to implement and hard to detect. Therefore, this work highlights the need for effective protections against malicious modifications in DMFBs.

Bertino, Elisa, Choo, Kim-Kwang Raymond, Georgakopolous, Dimitrios, Nepal, Surya.  2016.  Internet of Things (IoT): Smart and Secure Service Delivery. ACM Trans. Internet Technol. 16:22:1–22:7.

The Internet of Things (IoT) is the latest Internet evolution that incorporates a diverse range of things such as sensors, actuators, and services deployed by different organizations and individuals to support a variety of applications. The information captured by the IoT presents an unprecedented opportunity to solve large-scale problems in those application domains and deliver services; example applications include precision agriculture, environment monitoring, smart health, smart manufacturing, and smart cities. Like many other Internet-based services in the past, IoT-based services are being developed and deployed without security considerations. By nature, IoT devices and services are vulnerable to malicious cyber threats, as they cannot be given the same protection that enterprise services receive within an enterprise perimeter. While IoT services will play an important role in our daily life, resulting in improved productivity and quality of life, the trend has also “encouraged” cyber-exploitation and the evolution and diversification of malicious cyber threats. Hence, there is a need for coordinated efforts from the research community to address the resulting concerns, such as those presented in this special section. Several potential research topics are also identified in this special section.

Fremantle, Paul.  2016.  Privacy-enhancing Federated Middleware for the Internet of Things. Proceedings of the Doctoral Symposium of the 17th International Middleware Conference. :4:1–4:4.

The Internet of Things (IoT) offers new opportunities, but alongside those come many challenges for security and privacy. Most IoT devices offer users no choice over where data is published, which data is made available, and what identities are used for both devices and users. The aim of this work is to explore new middleware models and techniques that can provide users with more choice as well as enhance privacy and security. This paper outlines a new model and a prototype of a middleware system that implements this model.

2017-05-16
Ren, Kun, Diamond, Thaddeus, Abadi, Daniel J., Thomson, Alexander.  2016.  Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems. Proceedings of the 2016 International Conference on Management of Data. :1539–1551.

As it becomes increasingly common for transaction processing systems to operate on datasets that fit within the main memory of a single machine or a cluster of commodity machines, traditional mechanisms for guaranteeing transaction durability, which typically involve synchronous log flushes, incur increasingly unappealing costs to otherwise lightweight transactions. Many applications have turned to periodically checkpointing full database state. However, existing checkpointing methods, even those which avoid freezing the storage layer, often come with significant costs to operation throughput, end-to-end latency, and total memory usage. This paper presents Checkpointing Asynchronously using Logical Consistency (CALC), a lightweight, asynchronous technique for capturing database snapshots that does not require a physical point of consistency to create a checkpoint, and avoids conspicuous latency spikes incurred by other database snapshotting schemes. Our experiments show that CALC can capture frequent checkpoints across a variety of transactional workloads with extremely small cost to transactional throughput and low additional memory usage compared to other state-of-the-art checkpointing systems.
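
The idea of a logical rather than physical point of consistency can be sketched briefly. The toy below is not the authors' implementation; it uses a single lock and two versions per record: writers lazily preserve the pre-checkpoint value of each record they touch, so a background thread can persist a snapshot consistent with the instant the checkpoint began, without blocking updates.

    import json, threading

    class Store:
        """Toy asynchronous checkpointing in the spirit of CALC (sketch)."""
        def __init__(self):
            self.live = {}          # key -> current value
            self.stable = {}        # key -> value as of checkpoint begin
            self.in_checkpoint = False
            self.lock = threading.Lock()

        def write(self, key, value):
            with self.lock:
                if self.in_checkpoint and key not in self.stable:
                    # lazily preserve the pre-checkpoint value, exactly once
                    self.stable[key] = self.live.get(key)
                self.live[key] = value

        def checkpoint(self, path):
            with self.lock:                 # the "logical point": flip a flag
                self.in_checkpoint = True
                self.stable = {}
                keys = list(self.live)
            snapshot = {}
            for k in keys:                  # long phase; writers not blocked
                with self.lock:
                    snapshot[k] = self.stable.get(k, self.live.get(k))
            with self.lock:
                self.in_checkpoint = False
            with open(path, 'w') as f:      # persist off the critical path
                json.dump(snapshot, f)

    s = Store()
    s.write('a', 1); s.write('b', 2)
    s.checkpoint('snapshot.json')
    print(open('snapshot.json').read())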

Mendizabal, Odorico M., Dotti, Fernando Luís, Pedone, Fernando.  2016.  Analysis of Checkpointing Overhead in Parallel State Machine Replication. Proceedings of the 31st Annual ACM Symposium on Applied Computing. :534–537.

State machine replication (SMR) is a well-established technique for building fault-tolerant systems. In part, this is explained by the simplicity of the approach and its strong consistency guarantees. Recently, several proposals have suggested parallelizing the execution of state machine replicas to achieve high throughput. Concurrent execution of commands has many implications, including for the recovery of replicas from failures. Conventional checkpointing techniques, for example, must be revisited in parallelized models. In this paper, we review parallel variations of state machine replication and discuss how checkpointing procedures apply to these models. Moreover, we evaluate the impact of checkpointing techniques on recovery through simulations.

Wu, Hao, Mao, Jiangyun, Sun, Weiwei, Zheng, Baihua, Zhang, Hanyuan, Chen, Ziyang, Wang, Wei.  2016.  Probabilistic Robust Route Recovery with Spatio-Temporal Dynamics. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :1915–1924.

Vehicle trajectories are among the most important data in location-based services, and their quality directly affects those services. However, in real applications, trajectory data are not always sampled densely. In this paper, we study the problem of recovering the entire route between two distant consecutive locations in a trajectory. Most existing works solve the problem without using informative historical data, or solve it in an empirical way. We claim that a data-driven and probabilistic approach is actually more suitable, as long as data sparsity can be well handled. We propose a novel route recovery system in a fully probabilistic way which incorporates both temporal and spatial dynamics, and we address the data sparsity problems introduced by the probabilistic method. It outperforms existing works with high accuracy (over 80%) and shows strong robustness even when the routes to be recovered are very long (about 30 road segments) or the data are very sparse.
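
Framed probabilistically, route recovery between two observed points is a most-probable-path search: with transition probabilities estimated from historical trajectories, maximizing the product of probabilities equals minimizing the sum of negative log-probabilities. A minimal sketch (graph and probabilities invented):

    import heapq, math

    # P[u][v]: probability that a vehicle on segment u moves next to v,
    # as would be estimated from historical trajectories (numbers invented).
    P = {
        'A': {'B': 0.7, 'C': 0.3},
        'B': {'D': 0.9, 'C': 0.1},
        'C': {'D': 1.0},
        'D': {},
    }

    def most_probable_route(src, dst):
        """Dijkstra on -log(p) edge weights, i.e. max-product path search."""
        pq = [(0.0, src, [src])]
        best = {}
        while pq:
            cost, u, path = heapq.heappop(pq)
            if u == dst:
                return path, math.exp(-cost)
            if u in best and best[u] <= cost:
                continue
            best[u] = cost
            for v, p in P[u].items():
                heapq.heappush(pq, (cost - math.log(p), v, path + [v]))
        return None, 0.0

    print(most_probable_route('A', 'D'))   # (['A', 'B', 'D'], 0.63)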

Najafi, Ali, Rudell, Jacques C., Sathe, Visvesh.  2016.  Regenerative Breaking: Recovering Stored Energy from Inactive Voltage Domains for Energy-efficient Systems-on-Chip. Proceedings of the 2016 International Symposium on Low Power Electronics and Design. :94–99.

Modern Systems-on-Chip (SoCs) frequently power off individual voltage domains to save leakage power across a variety of applications, from large-scale heterogeneous computing to ultra-low power systems in IoT applications. However, the considerable energy stored within the capacitance of the powered-off domain is lost through leakage. In this paper, we present an approach to leverage existing voltage regulators to recover this energy from the disabled voltage domain back into the supply using a low-overhead all-digital runtime control system. Simulation experiments conducted in an industrial 65nm CMOS process indicate that over 90% of the stored energy can be recovered across a range of system operating voltages from 0.4 V to 1 V.
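
The scale of the opportunity follows from the energy held on the domain's rail, E = (1/2)CV^2. A back-of-the-envelope sketch (capacitance value invented; the 90% recovery fraction is the paper's simulated figure):

    C = 2e-9      # domain decoupling + parasitic capacitance, farads (invented)
    V = 1.0       # supply voltage at power-off, volts
    eta = 0.90    # fraction recoverable, per the paper's simulations

    E_stored = 0.5 * C * V**2            # energy held on the rail at power-off
    print("stored    : %.2e J" % E_stored)          # 1.00e-09 J
    print("recovered : %.2e J" % (eta * E_stored))  # 9.00e-10 J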

Yao, Chang, Agrawal, Divyakant, Chen, Gang, Ooi, Beng Chin, Wu, Sai.  2016.  Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases. Proceedings of the 2016 International Conference on Management of Data. :1119–1134.

By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, in-memory systems still need to flush the log to disk, which incurs a substantial number of I/Os. Recently, command logging has been proposed to replace the traditional data log (e.g., ARIES logging) in in-memory databases. Instead of recording how the tuples are updated, command logging only tracks the transactions that are being executed, thereby effectively reducing the size of the log and improving the performance. However, when a failure occurs, all the transactions in the log after the last checkpoint must be redone sequentially and this significantly increases the cost of recovery. In this paper, we first extend the command logging technique to a distributed system, where all the nodes can perform their recovery in parallel. We show that in a distributed system, the only bottleneck of recovery caused by command logging is the synchronization process that attempts to resolve the data dependency among the transactions. We then propose an adaptive logging approach by combining data logging and command logging. The percentage of data logging versus command logging becomes a tuning knob between the performance of transaction processing and recovery to meet different OLTP requirements, and a model is proposed to guide such tuning. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging.
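
The tuning knob can be pictured per transaction: a command log record stores only the invoked procedure and its arguments (small, but costly to redo), while a data log record stores the resulting tuple writes (larger, but cheap to redo). A deliberately simplified sketch, with an invented cost heuristic in place of the paper's model:

    def log_records(txns, data_log_fraction=0.3):
        """Choose data logging for the most re-execution-expensive
        transactions and command logging for the rest (sketch only)."""
        budget = int(len(txns) * data_log_fraction)
        # Prioritize transactions whose redo would be costly, e.g. those
        # with many dependencies or heavy computation (invented metric).
        ranked = sorted(txns, key=lambda t: t['redo_cost'], reverse=True)
        chosen = {t['id'] for t in ranked[:budget]}
        log = []
        for t in txns:
            if t['id'] in chosen:
                log.append(('DATA', t['id'], t['writes']))          # tuple images
            else:
                log.append(('CMD', t['id'], t['proc'], t['args']))  # replayable call
        return log

    txns = [
        {'id': 1, 'proc': 'pay', 'args': (7, 9, 5.0), 'redo_cost': 40,
         'writes': {'acct7': 95.0, 'acct9': 105.0}},
        {'id': 2, 'proc': 'ping', 'args': (3,), 'redo_cost': 1,
         'writes': {'hb3': 'ok'}},
    ]
    for rec in log_records(txns, 0.5):
        print(rec)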

Lee, Sungwon, Moon, Eunbae, Kim, Dongkyun.  2016.  Consistency of Path Based Upward Path Recovery Method to Reduce Path Recovery Delay for RPL. Proceedings of the International Conference on Research in Adaptive and Convergent Systems. :117–120.

In IoT (Internet of Things) networks, RPL (IPv6 Routing Protocol for Low-Power and Lossy Networks) is preferred for reducing routing overhead. In RPL, a node selects as its parent the neighbor with the lowest routing metric, and the other neighbors are stored as immediate successors. If the selected parent node is lost, the node selects a new parent node from among the immediate successors. However, if the new path also includes the same intermediate node that was lost in the previous path, it too fails to deliver upward packets. This procedure may be repeated until a path is selected that does not include the lost intermediate node. In this paper, we therefore propose a new path recovery method that avoids this unnecessary repetition during upward path recovery. When a node receives a routing message, it calculates a hash value and sets the corresponding bit to 1 in a new field of the routing message. Based on this field, a node estimates the approximate number of ancestors shared between paths. When loss of the upward path is detected, the node selects a new path according to both this approximate number and the routing metric. As a result, a new path that does not share the same ancestors with the previous path is selected, and data transmission can resume immediately.
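
The shared-ancestor estimate can be pictured as a small Bloom-filter-style bit field carried in the routing message: each ancestor hashes its identifier to one bit, and the overlap of two fields approximates how many ancestors two paths share. A sketch of that idea (field width and tie-breaking invented):

    FIELD_BITS = 32

    def path_field(path):
        """Fold each ancestor's id into a fixed-width bit field."""
        field = 0
        for node_id in path:
            field |= 1 << (hash(node_id) % FIELD_BITS)
        return field

    def shared_ancestors(field_a, field_b):
        """Approximate count of ancestors common to both paths."""
        return bin(field_a & field_b).count('1')

    failed = path_field(['root', 'n4', 'n9'])        # path that just failed
    candidates = {                                   # successor -> (path, metric)
        'n7':  (['root', 'n4', 'n7'],  3),
        'n12': (['root', 'n2', 'n12'], 5),
    }
    # Prefer the candidate sharing the fewest ancestors with the failed path,
    # breaking ties with the routing metric (weighting invented).
    best = min(candidates.items(),
               key=lambda kv: (shared_ancestors(failed, path_field(kv[1][0])),
                               kv[1][1]))
    print('switch to parent', best[0])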

Mokhtar, Maizura, Hunt, Ian, Burns, Stephen, Ross, Dave.  2016.  Optimising a Waste Heat Recovery System Using Multi-Objective Evolutionary Algorithm. Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion. :913–920.

A waste heat recovery system (WHRS) on a process with variable output is an example of an intermittent renewable process. A WHRS recycles waste heat into usable energy; for example, waste heat produced by refrigeration can be used to provide hot water. However, as with most intermittent renewable energy systems, the likelihood of waste heat being available at times of demand is low. For this reason, the WHRS may be coupled with a hot water reservoir (HWR) acting as the energy storage system, which aims to maintain the desired hot water temperature Td (and therefore energy) at times of demand. The coupling of the WHRS and the HWR must be optimised to ensure high efficiency given the intermittent mismatch between demand and heat availability. Efficiency of a WHRS can be defined in terms of multiple objectives, including minimising the need for back-up energy to achieve Td and minimising the waste heat not captured (when the reservoir volume Vres is too small). This paper investigates the application of a Multi-Objective Evolutionary Algorithm (MOEA) to optimise the parameters of the WHRS, including Vres and the depth of discharge (DoD), that affect the WHRS efficiency. Results show that one of the optimum solutions obtained requires a combination of high Vres, high DoD, low water feed-in rate, a low-power external back-up heater, and high excess temperature for the HWR to ensure the efficiency of the WHRS.

Koskinen, Eric, Yang, Junfeng.  2016.  Reducing Crash Recoverability to Reachability. Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. :97–108.

Software applications run on a variety of platforms (filesystems, virtual slices, mobile hardware, etc.) that do not provide 100% uptime. As such, these applications may crash at any unfortunate moment losing volatile data and, when re-launched, they must be able to correctly recover from potentially inconsistent states left on persistent storage. From a verification perspective, crash recovery bugs can be particularly frustrating because, even when it has been formally proved for a program that it satisfies a property, the proof is foiled by these external events that crash and restart the program. In this paper we first provide a hierarchical formal model of what it means for a program to be crash recoverable. Our model captures the recoverability of many real world programs, including those in our evaluation which use sophisticated recovery algorithms such as shadow paging and write-ahead logging. Next, we introduce a novel technique capable of automatically proving that a program correctly recovers from a crash via a reduction to reachability. Our technique takes an input control-flow automaton and transforms it into an encoding that blends the capture of snapshots of pre-crash states into a symbolic search for a proof that recovery terminates and every recovered execution simulates some crash-free execution. Our encoding is designed to enable one to apply existing abstraction techniques in order to do the work that is necessary to prove recoverability. We have implemented our technique in a tool called Eleven82, capable of analyzing C programs to detect recoverability bugs or prove their absence. We have applied our tool to benchmark examples drawn from industrial file systems and databases, including GDBM, LevelDB, LMDB, PostgreSQL, SQLite, VMware and ZooKeeper. Within minutes, our tool is able to discover bugs or prove that these fragments are crash recoverable.
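
The property itself can be illustrated with a brute-force checker, a toy stand-in for the paper's symbolic reduction: for every prefix of the writes a program persists (assuming, for simplicity, that writes persist in order), run recovery on the crashed state and check that the outcome matches some crash-free execution.

    def run_program():
        """Writes issued to 'disk', in order: a naive two-file update."""
        return [('tmp', 'v2'), ('data', 'v2'), ('tmp', None)]   # None = delete

    def recover(disk):
        """Recovery rule: an existing tmp file wins (write-ahead flavor)."""
        if disk.get('tmp') is not None:
            disk['data'] = disk['tmp']
            disk['tmp'] = None
        return disk

    def consistent(disk):
        # legal outcomes: the old state or the fully-applied new state
        return disk.get('data') in ('v1', 'v2') and disk.get('tmp') is None

    writes = run_program()
    for crash_point in range(len(writes) + 1):
        disk = {'data': 'v1', 'tmp': None}          # initial persistent state
        for key, val in writes[:crash_point]:       # writes persisted pre-crash
            disk[key] = val
        assert consistent(recover(dict(disk))), crash_point
    print('recoverable at every crash point')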

Gu, Tianxiao, Sun, Chengnian, Ma, Xiaoxing, Lü, Jian, Su, Zhendong.  2016.  Automatic Runtime Recovery via Error Handler Synthesis. Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. :684–695.

Software systems are often subject to unexpected runtime errors. Automatic runtime recovery (ARR) techniques aim at recovering them from erroneous states and keeping them functional in the field. This paper proposes Ares, a novel, practical approach to performing ARR. Our key insight is to leverage a system's already built-in error handling support to recover from unexpected errors. To this end, we synthesize error handlers via two methods: error transformation and early return. We also equip Ares with a lightweight in-vivo testing infrastructure to select the right synthesis methods and avoid potentially dangerous error handlers. Unlike existing ARR techniques based on heavyweight mechanisms (e.g., checkpoint-restart and runtime monitoring), our approach expands the intrinsic capability of runtime error resilience already present in software systems to handle unexpected errors. Ares's lightweight mechanism makes it practical and easy to integrate into production environments. We have implemented Ares on top of both the Java HotSpot VM and Android ART, and applied it to 52 real-world bugs. The results are promising: Ares successfully recovers from 39 of them and incurs low overhead.
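
The two synthesis methods can be pictured as source-level transformations. A toy Python analogue (Ares itself synthesizes handlers inside the JVM and ART, not at the Python level): an unexpected error is either transformed into an exception type the caller already handles, or converted into an early return with a neutral value.

    import functools

    def synthesize_handler(transform_to=None, early_return=None):
        """Toy analogue of error transformation / early return."""
        def deco(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if transform_to is not None:
                        # error transformation: re-raise as a type the
                        # caller's existing handlers already cover
                        raise transform_to(str(e))
                    return early_return   # early return: skip the failure
            return wrapper
        return deco

    @synthesize_handler(early_return=[])
    def parse_tags(blob):
        return blob['tags'].split(',')    # raises TypeError on bad input

    print(parse_tags({'tags': 'a,b'}))    # ['a', 'b']
    print(parse_tags(None))               # [] instead of a crash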

Gensh, Rem, Romanovsky, Alexander, Yakovlev, Alex.  2016.  On Structuring Holistic Fault Tolerance. Proceedings of the 15th International Conference on Modularity. :130–133.

Computer systems are developed taking into account that they should be easily maintained in the future; this is one of the main requirements for sound architectural design. Existing approaches to introducing fault tolerance rely on recursive system structuring out of functional components, which typically results in non-optimal fault tolerance. The paper proposes a vision of structuring complex many-core systems by introducing a special component supporting system-wide fault tolerance coordination. The component acts as a central module making decisions about the fault tolerance strategies to be implemented by individual system components, depending on the performance and energy requirements specified as system operating modes.

McClatchey, Richard, Branson, Andrew, Shamdasani, Jetendr.  2016.  Provenance Support for Biomedical Big Data Analytics. Proceedings of the 20th International Database Engineering & Applications Symposium. :386–391.

One essential requirement for supporting analytics in Big Medical Data systems is the provision of a suitable level of traceability to data or processes ('Items') in large volumes of data. Systems should be designed from the outset to support usage of such Items across the spectrum of medical use and over time, in order to promote traceability, to simplify maintenance and to assist analytics. The philosophy proposed in this paper is to design medical data systems using a 'description-driven' approach, in which metadata and the descriptions of medical items are saved alongside the data, simplifying item re-use and thereby enabling the traceability of these items over time and their use in analytics. Details are given of a big data system in neuroimaging to demonstrate aspects of provenance data capture, collaborative analysis and longitudinal information traceability. Evidence is presented that the description-driven approach leads to simplicity of design and ease of maintenance following the adoption of a unified approach to Item management.

Arab, Bahareh Sadat, Gawlick, Dieter, Krishnaswamy, Vasudha, Radhakrishnan, Venkatesh, Glavic, Boris.  2016.  Reenactment for Read-Committed Snapshot Isolation. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :841–850.

Provenance for transactional updates is critical for many applications such as auditing and debugging of transactions. Recently, we have introduced MV-semirings, an extension of the semiring provenance model that supports updates and transactions. Furthermore, we have proposed reenactment, a declarative form of replay with provenance capture, as an efficient and non-invasive method for computing this type of provenance. However, this approach is limited to the snapshot isolation (SI) concurrency control protocol, while many real-world applications use the read-committed version of snapshot isolation (RC-SI) to improve performance at the cost of consistency. We present non-trivial extensions of the model and the reenactment approach to compute the provenance of RC-SI transactions efficiently. In addition, we develop techniques for applying reenactment across multiple RC-SI transactions. Our experiments demonstrate that our implementation in the GProM system supports efficient reconstruction and querying of provenance.

Chirigati, Fernando, Rampin, Rémi, Shasha, Dennis, Freire, Juliana.  2016.  ReproZip: Computational Reproducibility With Ease. Proceedings of the 2016 International Conference on Management of Data. :2085–2088.

We present ReproZip, the recommended packaging tool for the SIGMOD Reproducibility Review. ReproZip was designed to simplify the process of making an existing computational experiment reproducible across platforms, even when the experiment was put together without reproducibility in mind. The tool creates a self-contained package for an experiment by automatically tracking and identifying all its required dependencies. The researcher can share the package with others, who can then use ReproZip to unpack the experiment, reproduce the findings on their favorite operating system, as well as modify the original experiment for reuse in new research, all with little effort. The demo will consist of examples of non-trivial experiments, showing how these can be packed in a Linux machine and reproduced on different machines and operating systems. Demo visitors will also be able to pack and reproduce their own experiments.

Stevens, Ryan, Crussell, Jonathan, Chen, Hao.  2016.  On the Origin of Mobile Apps: Network Provenance for Android Applications. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. :160–171.

Many mobile services consist of two components: a server providing an API, and an application running on smartphones and communicating with the API. An unresolved problem in this design is that it is difficult for the server to authenticate which app is accessing the API. This causes many security problems. For example, the provider of a private network API has to embed secrets in its official app to ensure that only this app can access the API; however, attackers can uncover the secret by reverse-engineering. As another example, malicious apps may send automatic requests to ad servers to commit ad fraud. In this work, we propose a system that allows a network API to authenticate the mobile app that sends each request, so that the API can make an informed access control decision. Our system, the Mobile Trusted-Origin Policy, consists of two parts: 1) an app provenance mechanism that annotates outgoing HTTP(S) requests with information about which app generated the network traffic, and 2) a code isolation mechanism that separates code within an app that should have different app provenance signatures into mobile origins. As motivation for our work, we present two previously unknown families of apps that perform click fraud, and examine how the lack of mobile origin information enables the attacks. Based on our observations, we propose Trusted Cross-Origin Requests to handle point (1), which automatically includes mobile origin information in outgoing HTTP requests. Servers may then decide, based on the mobile origin data, whether to process the request. We implement a prototype of our system for Android and evaluate its performance, security, and deployability. We find that our system can achieve our security and utility goals with negligible overhead.
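
One way to picture the provenance annotation: a per-app key provisioned by a trusted platform service binds an origin identifier to each outgoing request, and the server verifies the tag before making its access control decision. A sketch using only the Python standard library (header names and key provisioning invented; the paper's mechanism lives in the Android platform, not in app code):

    import hmac, hashlib

    APP_KEYS = {'com.example.news': b'provisioned-by-platform'}   # server side

    def annotate(request_bytes, origin, key):
        """Client/platform side: attach origin id + MAC over the request."""
        tag = hmac.new(key, origin.encode() + b'|' + request_bytes,
                       hashlib.sha256).hexdigest()
        return {'X-Mobile-Origin': origin, 'X-Origin-MAC': tag}

    def verify(request_bytes, headers):
        """Server side: authenticate which app originated the request."""
        origin = headers.get('X-Mobile-Origin', '')
        key = APP_KEYS.get(origin)
        if key is None:
            return False
        expected = hmac.new(key, origin.encode() + b'|' + request_bytes,
                            hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, headers.get('X-Origin-MAC', ''))

    body = b'GET /ads/click?id=42'
    h = annotate(body, 'com.example.news', APP_KEYS['com.example.news'])
    print(verify(body, h))   # True; a repackaged app without the key fails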

Tian, Dave (Jing), Bates, Adam, Butler, Kevin R.B., Rangaswami, Raju.  2016.  ProvUSB: Block-level Provenance-Based Data Protection for USB Storage Devices. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. :242–253.

Defenders of enterprise networks have a critical need to quickly identify the root causes of malware and data leakage. Increasingly, USB storage devices are the media of choice for data exfiltration, malware propagation, and even cyber-warfare. We observe that a critical aspect of explaining and preventing such attacks is understanding the provenance of data (i.e., the lineage of data from its creation to current state) on USB devices as a means of ensuring their safe usage. Unfortunately, provenance tracking is not offered by even sophisticated modern devices. This work presents ProvUSB, an architecture for fine-grained provenance collection and tracking on smart USB devices. ProvUSB maintains data provenance by recording reads and writes at the block layer and reliably identifying hosts editing those blocks through attestation over the USB channel. Our evaluation finds that ProvUSB imposes a one-time 850 ms overhead during USB enumeration, but approaches nearly-bare-metal runtime performance (90% of throughput) on larger files during normal execution, and less than 0.1% storage overhead for provenance in real-world workloads. ProvUSB thus provides essential new techniques in the defense of computer systems and USB storage devices.
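
At its core, block-level provenance is an append-only ledger keyed by block number, with writers identified via attestation. A minimal sketch (record format invented):

    import time

    class BlockProvenance:
        """Toy block-level provenance ledger (a sketch, not ProvUSB itself)."""
        def __init__(self):
            self.log = []   # append-only: (timestamp, host, op, block)

        def record(self, host_id, op, block_no):
            assert op in ('read', 'write')
            self.log.append((time.time(), host_id, op, block_no))

        def lineage(self, block_no):
            """All hosts that ever touched the given block, in order."""
            return [(h, op) for _, h, op, b in self.log if b == block_no]

    prov = BlockProvenance()
    prov.record('host-attested-A', 'write', 17)   # host id from attestation
    prov.record('host-attested-B', 'read', 17)
    print(prov.lineage(17))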

Karsai, Linus, Fekete, Alan, Kay, Judy, Missier, Paolo.  2016.  Clustering Provenance: Facilitating Provenance Exploration Through Data Abstraction. Proceedings of the Workshop on Human-In-the-Loop Data Analytics. :6:1–6:5.

As digital objects become increasingly important in people's lives, people may need to examine the provenance, or lineage and history, of an important digital object in order to understand how it was produced. This is particularly important for objects created from large, multi-source collections of personal data. As the metadata describing provenance, Provenance Data, is commonly represented as a labelled directed acyclic graph, the challenge is to create effective interfaces onto such graphs so that people can understand the provenance of key digital objects. This unsolved problem is especially challenging for novice and intermittent users and for complex provenance graphs. We tackle it with an interface based on a clustering approach, designed to enable users to view provenance graphs and to simplify complex graphs by combining several nodes. Our core contribution is the design of a prototype interface that supports clustering, and its analytic evaluation in terms of desirable properties of visualisation interfaces.

Maruseac, Mihai, Ghinita, Gabriel.  2016.  Privacy-Preserving Mining of Sequential Association Rules from Provenance Workflows. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. :127–129.

Provenance workflows capture the movement and transformation of data in complex environments, such as document management in large organizations, content generation and sharing in social media, scientific computations, etc. Sharing and processing of provenance workflows brings numerous benefits, e.g., improving productivity in an organization, understanding social media interaction patterns, etc. However, directly sharing provenance may also disclose sensitive information such as confidential business practices, or private details about participants in a social network. We propose an algorithm that privately extracts sequential association rules from provenance workflow datasets. Finding such rules has numerous practical applications, such as capacity planning or identifying hot-spots in provenance graphs. Our approach provides good accuracy and strong privacy by leveraging the exponential mechanism of differential privacy. We propose a heuristic that identifies promising candidate rules and makes judicious use of the privacy budget. Experimental results show that our approach is fast and accurate, and clearly outperforms the state-of-the-art. We also identify influential factors in improving accuracy, which helps in choosing promising directions for future improvement.
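
The exponential mechanism at the heart of the approach selects a candidate with probability proportional to exp(eps*u/(2*du)), where u is the candidate's utility and du its sensitivity. A sketch with invented candidate rules and support counts:

    import numpy as np

    def exponential_mechanism(candidates, utilities, epsilon, sensitivity=1.0):
        """Pick one candidate with prob ~ exp(eps*u / (2*sensitivity))."""
        u = np.asarray(utilities, dtype=float)
        scores = epsilon * u / (2.0 * sensitivity)
        scores -= scores.max()                  # for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum()
        return candidates[np.random.choice(len(candidates), p=probs)]

    rules = ['create->review', 'review->publish', 'create->publish']
    support = [120, 85, 10]    # how often each rule occurs (invented)
    # Repeated selections each consume privacy budget; budgets compose.
    print(exponential_mechanism(rules, support, epsilon=0.1))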

Chen, Ang, Wu, Yang, Haeberlen, Andreas, Zhou, Wenchao, Loo, Boon Thau.  2016.  The Good, the Bad, and the Differences: Better Network Diagnostics with Differential Provenance. Proceedings of the 2016 ACM SIGCOMM Conference. :115–128.

In this paper, we propose a new approach to diagnosing problems in complex distributed systems. Our approach is based on the insight that many of the trickiest problems are anomalies. For instance, in a network, problems often affect only a small fraction of the traffic (e.g., perhaps a certain subnet), or they only manifest infrequently. Thus, it is quite common for the operator to have “examples” of both working and non-working traffic readily available – perhaps a packet that was misrouted, and a similar packet that was routed correctly. In this case, the cause of the problem is likely to be wherever the two packets were treated differently by the network. We present the design of a debugger that can leverage this information using a novel concept that we call differential provenance. Differential provenance tracks the causal connections between network states and state changes, just like classical provenance, but it can additionally perform root-cause analysis by reasoning about the differences between two provenance trees. We have built a diagnostic tool that is based on differential provenance, and we have used our tool to debug a number of complex, realistic problems in two scenarios: software-defined networks and MapReduce jobs. Our results show that differential provenance can be maintained at relatively low cost, and that it can deliver very precise diagnostic information; in many cases, it can even identify the precise root cause of the problem.
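
The root-cause step can be pictured as a recursive diff of two provenance trees: descend through aligned causes and report the deepest pair of vertices that still differ. A minimal sketch (tree encoding and labels invented):

    # A provenance vertex: (label, [child vertices that caused it]).
    def diverge(good, bad):
        """Deepest difference: if some aligned child pair differs, the root
        cause lies in that subtree; otherwise this vertex pair is it."""
        if good == bad:
            return None
        for g_child, b_child in zip(good[1], bad[1]):
            d = diverge(g_child, b_child)
            if d is not None:
                return d
        return good[0], bad[0]

    good = ('delivered(pktA)', [('route(s2->h1)', [('rule r7 @ s2', [])])])
    bad  = ('dropped(pktB)',   [('route(s2->h2)', [('rule r9 @ s2', [])])])
    print(diverge(good, bad))   # ('rule r7 @ s2', 'rule r9 @ s2')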

Herschel, Melanie, Hlawatsch, Marcel.  2016.  Provenance: On and Behind the Screens. Proceedings of the 2016 International Conference on Management of Data. :2213–2217.

Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance meta-data have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Often, users interested in provenance are not necessarily experts in databases or query languages, as they are typically domain experts of the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios, it is difficult to use solely queries for analyzing and exploring provenance data. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we will present some fundamental concepts of visualization before we discuss possible visualizations for provenance.