Fine-Grained Provenance for Matching ETL

Submitted by aekwall on Mon, 09/23/2019 - 10:40am

Title	Fine-Grained Provenance for Matching ETL
Publication Type	Conference Paper
Year of Publication	2019
Authors	Zheng, N., Alawini, A., Ives, Z. G.
Conference Name	2019 IEEE 35th International Conference on Data Engineering (ICDE)
Keywords	algebra, application program interfaces, arbitrary code, code versions, common ETL, composability, Data analysis, data mining, data objects, data provenance tools, data science tasks, data types, data warehouses, database management systems, database provenance tools, database-style provenance techniques, ETL, Human Behavior, Instruments, matching computations, matching tasks, Metrics, Optimization, Provenance, provenance APIs, provenance-driven troubleshooting tool, PROVision, pubcrawl, Record linking, Resiliency, Semantics, support optimization, Task Analysis, Tools, track provenance, tuple-level provenance, workflow, workflow management software, workflow provenance systems
Abstract	Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance, but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support a limited subset of data science tasks. None of these solutions are well suited for tracing errors introduced during common ETL, record alignment, and matching tasks - for data types such as strings, images, etc. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.
DOI	10.1109/ICDE.2019.00025
Citation Key	zheng_fine-grained_2019
