Big Provenance Stream Processing for Data Intensive Computations

Submitted by aekwall on Mon, 09/23/2019 - 10:40am

Title	Big Provenance Stream Processing for Data Intensive Computations
Publication Type	Conference Paper
Year of Publication	2018
Authors	Suriarachchi, I., Withana, S., Plale, B.
Conference Name	2018 IEEE 14th International Conference on e-Science (e-Science)
Keywords	Big Data, Big Provenance, big provenance stream processing, Business, composability, Data analysis, data integrity, data intensive computations, Data Lake, data provenance, data-parallel frameworks, Human Behavior, Lakes, Metrics, on-the-fly provenance processing, Out of order, out-of-order provenance streams, parallel algorithm, parallel algorithms, parallel stream, Partitioning algorithms, Provenance, provenance events, pubcrawl, Resiliency, Sparks, stream processing, tagging, Twitter
Abstract	In today's business and research landscape, data analysis consumes public and proprietary data from numerous sources and uses one or more popular data-parallel frameworks such as Hadoop, Spark, and Flink. In the Data Lake setting, these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out of order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
DOI	10.1109/eScience.2018.00039
Citation Key	suriarachchi_big_2018
