CAREER: Scalable Information Flow Monitoring and Enforcement through Data Provenance Unification

Project Details

Lead PI

Performance Period

Apr 01, 2018 - Mar 31, 2023

Institution(s)

University of Illinois at Urbana-Champaign

Sponsor(s)

National Science Foundation

Award Number


System intrusions have become more subtle and complex. Attackers now covertly observe and probe systems for prolonged periods before launching devastating attacks. In such an environment, it has grown prohibitively difficult for system administrators to identify suspicious events, correlate these events into an attack pattern, and determine an appropriate response. Data provenance is a method of modeling a system's execution as a causal relationship graph, allowing investigators to trace the ancestry of data objects and identify relationships between seemingly independent events. The goal of the proposed work is to develop techniques that enable the use of data provenance as an expressive and efficient monitoring tool in large distributed systems. These mechanisms will provide an unprecedented capability to reason about system events, centrally monitor activities within data centers, and express fine-grained enforcement of security properties based on the historical flow of data. Research and software artifacts will be made available to the broader community through the Linux provenance web site.
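To make the idea of a causal relationship graph concrete, the following is a minimal sketch of ancestry tracing over recorded system events. It is an illustration only and not the project's software: the `ProvenanceGraph` class, its `record` and `ancestry` methods, and the event names are all hypothetical.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Toy causal graph (hypothetical): each node maps to the nodes it was derived from."""

    def __init__(self):
        self.parents = defaultdict(set)

    def record(self, source, target):
        """Record that `target` causally depends on `source`."""
        self.parents[target].add(source)

    def ancestry(self, node):
        """Return every node that `node` transitively depends on (breadth-first)."""
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Illustrative events; the names are made up for this sketch.
graph = ProvenanceGraph()
graph.record("file:/etc/passwd", "proc:cat")          # a process read a file
graph.record("proc:cat", "file:/tmp/out")             # and wrote its output
graph.record("file:/tmp/out", "proc:curl")            # a second process read that output
graph.record("proc:curl", "socket:203.0.113.5:443")   # and sent data over the network

ancestors = graph.ancestry("socket:203.0.113.5:443")
print(sorted(ancestors))
```

In this sketch, querying the ancestry of the network socket links it back to the sensitive file even though the read and the send were performed by different processes, which is the kind of relationship between seemingly independent events described above.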

The proposed work will examine central challenges related to expressivity and scalability that currently prevent the wider adoption of provenance-based auditing techniques. To address the semantic gap that has traditionally prevented system-layer auditing from explaining higher-level application behaviors, this project pursues the design of universal provenance mechanisms that leverage binary analysis to transparently identify siloed application-layer logging activities, extract their semantics, and graft the information onto a causal relationship graph that encodes the entire system's execution. Grammar induction techniques will be leveraged to overcome the tremendous storage burden of provenance and provide a scalable central monitoring framework for data centers. After enriching system-layer auditing and enabling the efficient communication of suspicious activities via provenance traces, data provenance will be integrated into enforcement mechanisms to address critical security challenges including regulatory compliance, information flow control, and fault attribution. Advancing the state of the art in provenance-based tracing and enforcement should establish a new baseline for reasoning about the flow of data in today's complex computing systems.
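As a rough illustration of enforcement driven by the historical flow of data, the sketch below denies a flow when the source's ancestry carries a label that is forbidden at the sink. It assumes a simple label-based policy; the functions `labels_in_history` and `allow_flow`, the `PHI` label, and all node names are hypothetical and do not describe the project's actual mechanism.

```python
def labels_in_history(parents, labels, node):
    """Collect the labels attached to `node` or to any of its ancestors."""
    found, stack, seen = set(), [node], {node}
    while stack:
        current = stack.pop()
        found |= labels.get(current, set())
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return found


def allow_flow(parents, labels, forbidden, source, sink):
    """Permit source -> sink unless the source's history carries a label forbidden at the sink."""
    history = labels_in_history(parents, labels, source)
    return not (history & forbidden.get(sink, set()))


# Hypothetical provenance: each node maps to the nodes it was derived from.
parents = {
    "proc:report": {"file:patients.db"},
    "file:summary.txt": {"proc:report"},
    "proc:mailer": {"file:summary.txt"},
}
labels = {"file:patients.db": {"PHI"}}        # assumed sensitivity label
forbidden = {"socket:external": {"PHI"}}      # assumed policy: no PHI leaves the host

external_ok = allow_flow(parents, labels, forbidden, "proc:mailer", "socket:external")
local_ok = allow_flow(parents, labels, forbidden, "proc:mailer", "file:local.log")
print(external_ok, local_ok)
```

Here the mailer process never touched the labeled database directly, yet the external send is refused because the label is reachable through its provenance, while the local write is allowed.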