Title | Fine-Grained Provenance for Matching & ETL |
Publication Type | Conference Paper |
Year of Publication | 2019 |
Authors | Zheng, N., Alawini, A., Ives, Z. G. |
Conference Name | 2019 IEEE 35th International Conference on Data Engineering (ICDE) |
Keywords | algebra, application program interfaces, arbitrary code, code versions, common ETL, composability, Data analysis, data mining, data objects, data provenance tools, data science tasks, data types, data warehouses, database management systems, database provenance tools, database-style provenance techniques, ETL, Human Behavior, Instruments, matching computations, matching tasks, Metrics, Optimization, Provenance, provenance APIs, provenance-driven troubleshooting tool, PROVision, pubcrawl, Record linking, Resiliency, Semantics, support optimization, Task Analysis, Tools, track provenance, tuple-level provenance, workflow, workflow management software, workflow provenance systems |
Abstract | Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks for data types such as strings and images. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks. |
DOI | 10.1109/ICDE.2019.00025 |
Citation Key | zheng_fine-grained_2019 |