Big Provenance Stream Processing for Data Intensive Computations

Title: Big Provenance Stream Processing for Data Intensive Computations
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Suriarachchi, I., Withana, S., Plale, B.
Conference Name: 2018 IEEE 14th International Conference on e-Science (e-Science)
Keywords: Big Data, Big Provenance, big provenance stream processing, Business, composability, Data analysis, data integrity, data intensive computations, Data Lake, data provenance, data-parallel frameworks, Human Behavior, Lakes, Metrics, on-the-fly provenance processing, Out of order, out-of-order provenance streams, parallel algorithm, parallel algorithms, parallel stream, Partitioning algorithms, Provenance, provenance events, pubcrawl, Resiliency, Sparks, stream processing, tagging, Twitter
Abstract: In the business and research landscape of today, data analysis consumes public and proprietary data from numerous sources, and utilizes any one or more of popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out-of-order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
DOI: 10.1109/eScience.2018.00039
Citation Key: suriarachchi_big_2018
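
Illustrative note: the abstract describes reducing fine-grained provenance on the fly while preserving backward and forward provenance, even when events arrive out of order. The sketch below is not the paper's algorithm; it is a minimal single-threaded illustration of the general idea, under several assumptions. Provenance events are assumed to be `(source, destination)` derivation edges, and the names `ProvenanceReducer`, `reduce_stream`, and the `is_intermediate` predicate are hypothetical. Intermediate nodes are dropped, and each retained input is linked directly to every retained output it reaches, regardless of the order in which edges arrive.

```python
from collections import defaultdict, deque


class ProvenanceReducer:
    """Hypothetical sketch: collapse intermediate nodes in a provenance stream."""

    def __init__(self, is_intermediate):
        # is_intermediate(node) -> bool is an assumed, user-supplied predicate.
        self.is_intermediate = is_intermediate
        self.children = defaultdict(set)  # node -> direct successors seen so far
        self.origins = defaultdict(set)   # intermediate node -> retained ancestors
        self.reduced = set()              # reduced (retained_src, retained_dst) edges

    def _origins_of(self, node):
        # A retained node is its own origin; an intermediate node forwards
        # whatever retained ancestors have reached it so far.
        if self.is_intermediate(node):
            return set(self.origins[node])
        return {node}

    def add_edge(self, src, dst):
        """Consume one provenance event; edges may arrive in any order."""
        self.children[src].add(dst)
        queue = deque([(dst, self._origins_of(src))])
        while queue:
            node, incoming = queue.popleft()
            if not incoming:
                continue
            if not self.is_intermediate(node):
                # Reached a retained node: emit reduced edges and stop here.
                self.reduced.update((origin, node) for origin in incoming)
                continue
            new = incoming - self.origins[node]
            if not new:
                continue  # nothing new to propagate (also guards against cycles)
            self.origins[node] |= new
            for child in self.children[node]:
                queue.append((child, new))

    def backward(self, node):
        """Retained ancestors of a retained node (backward provenance)."""
        return {s for s, d in self.reduced if d == node}

    def forward(self, node):
        """Retained descendants of a retained node (forward provenance)."""
        return {d for s, d in self.reduced if s == node}


def reduce_stream(edges, is_intermediate):
    """Feed a sequence of edges through a fresh reducer and return the result."""
    reducer = ProvenanceReducer(is_intermediate)
    for src, dst in edges:
        reducer.add_edge(src, dst)
    return reducer.reduced


# Usage: node names starting with "t" are treated as intermediate (an assumption).
events = [("t2", "out1"), ("in2", "t2"), ("t1", "t2"), ("in1", "t1")]
print(sorted(reduce_stream(events, lambda n: n.startswith("t"))))
```

Because each edge only propagates origins that a node has not yet seen, the reduced edge set depends on which edges have arrived and not on their arrival order. A parallel version, as evaluated in the paper, would additionally need a strategy for partitioning the stream across workers, which this sketch does not attempt.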