Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization

Title: Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Miao, Hui; Deshpande, Amol
Conference Name: 2019 IEEE 35th International Conference on Data Engineering (ICDE)
Date Published: April
Keywords: Approximation algorithms, approximation theory, boundary criteria, composability, computational complexity, context free language, context-free grammar, context-free grammars, data handling, Data models, Data Science, data science lifecycle provenance, data science platforms, Databases, destination vertices, graph data models, graph query, graph query model, graph segmentation operator, graph summarization operator, graph theory, high-level graph query operators, Human Behavior, meta data, metadata, Metrics, model management, optimisation, Pipelines, Provenance, provenance ingestion mechanisms, provenance management, PSPACE-complete, pubcrawl, query processing, query techniques, Resiliency, Semantics, Skeleton, source vertices, Writing
Abstract: Increasingly, modern data science platforms have non-intrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the resulting "provenance graphs" are quite verbose and evolving; further, it is very difficult for users to compose queries and use this valuable information with the standard graph query model alone. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the retrospective provenance between a set of source vertices and a set of destination vertices via flexible boundary criteria to help users gain insight into the derivation relationships among those vertices. We show the semantics of such a query in terms of a context-free grammar, and develop efficient algorithms that run orders of magnitude faster than the state of the art. Second, we propose a graph summarization operator that combines similar segments together to query prospective provenance of the underlying project. The operator allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures the provenance meaning using path constraints. We show that the optimal summary problem is PSPACE-complete and develop effective approximation algorithms. We implement the operators on top of Neo4j, evaluate our query techniques extensively, and show the effectiveness and efficiency of the proposed methods.
DOI: 10.1109/ICDE.2019.00179
Citation Key: miao_understanding_2019
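
Illustrative note: the abstract describes a segmentation operator that extracts the provenance lying between a set of source vertices and a set of destination vertices, subject to boundary criteria. The sketch below is a deliberately simplified illustration of that idea and is not the paper's algorithm: it ignores the context-free-grammar semantics and simply keeps the vertices that lie on some directed path from a source to a destination. The edge direction (input to output), the function names `reachable` and `segment`, the `allowed` boundary predicate, and the toy artifact names are all assumptions made for this example.

```python
from collections import defaultdict, deque


def reachable(adj, starts, allowed):
    """Breadth-first search from `starts`, visiting only vertices accepted by `allowed`."""
    seen = {v for v in starts if allowed(v)}
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen and allowed(w):
                seen.add(w)
                queue.append(w)
    return seen


def segment(edges, sources, destinations, allowed=lambda v: True):
    """Return (vertices, edges) lying on allowed paths from any source to any destination.

    Simplifying assumption: each edge (u, v) means "v was derived from u".
    """
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    # A vertex is kept if a source reaches it AND it reaches a destination.
    keep = reachable(fwd, sources, allowed) & reachable(bwd, destinations, allowed)
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]


# Hypothetical toy provenance graph: files and the activities that link them.
edges = [
    ("raw.csv", "clean"), ("clean", "clean.csv"),
    ("clean.csv", "train"), ("train", "model.pkl"),
    ("clean.csv", "plot"), ("notes.txt", "plot"), ("plot", "fig.png"),
]
vertices, seg_edges = segment(edges, {"raw.csv"}, {"model.pkl"})

# A boundary criterion that excludes the "train" activity cuts the only path.
blocked, _ = segment(edges, {"raw.csv"}, {"model.pkl"}, allowed=lambda v: v != "train")
```

Here the plotting branch (`plot`, `notes.txt`, `fig.png`) falls outside the segment because it does not lead to `model.pkl`. In a property-graph store such as Neo4j, a comparable restriction would typically be expressed in the query language rather than in application code.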