Title | Big Provenance Stream Processing for Data Intensive Computations |
Publication Type | Conference Paper |
Year of Publication | 2018 |
Authors | Suriarachchi, I., Withana, S., Plale, B. |
Conference Name | 2018 IEEE 14th International Conference on e-Science (e-Science) |
Keywords | Big Data, Big Provenance, big provenance stream processing, Business, composability, Data analysis, data integrity, data intensive computations, Data Lake, data provenance, data-parallel frameworks, Human Behavior, Lakes, Metrics, on-the-fly provenance processing, Out of order, out-of-order provenance streams, parallel algorithm, parallel algorithms, parallel stream, Partitioning algorithms, Provenance, provenance events, pubcrawl, Resiliency, Sparks, stream processing, tagging, Twitter |
Abstract | In today's business and research landscape, data analysis consumes public and proprietary data from numerous sources and uses one or more popular data-parallel frameworks such as Hadoop, Spark, and Flink. In the Data Lake setting, these frameworks coexist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out of order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy. |
DOI | 10.1109/eScience.2018.00039 |
Citation Key | suriarachchi_big_2018 |