Storage and Querying of Large Provenance Graphs Using NoSQL DSE

Title: Storage and Querying of Large Provenance Graphs Using NoSQL DSE
Publication Type: Conference Paper
Year of Publication: 2020
Authors: Kashliev, Andrii
Conference Name: 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS)
Keywords: Big Data, composability, DSE, graph, Human Behavior, Metrics, NoSQL, Provenance, pubcrawl, Resiliency
Abstract: Provenance metadata captures history of derivation of an entity, such as a dataset obtained through numerous data transformations. It is of great importance for science, among other fields, as it enables reproducibility and greater intelligibility of research results. With the avalanche of provenance produced by today's society, there is a pressing need for storing and low-latency querying of large provenance graphs. To address this need, in this paper we present a scalable approach to storing and querying provenance graphs using a popular NoSQL column family database system called DataStax Enterprise (DSE). Specifically, we i) propose a storage scheme, including two novel indices that enable efficient traversal of provenance graphs along causality lines, ii) present an algorithm for building our proposed indices for a given provenance graph, iii) implement our algorithm and conduct a performance study in which we store and query a provenance graph with over five million vertices using a DSE cluster running in AWS cloud. Our performance study results further validate scalability and performance efficiency of our approach.
DOI: 10.1109/BigDataSecurity-HPSC-IDS49724.2020.00054
Citation Key: kashliev_storage_2020
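
Illustrative note: the abstract mentions two indices that support traversal of a provenance graph along causality lines, but this record does not describe their design. The Python sketch below is therefore not the paper's scheme. It only illustrates the general idea under stated assumptions: edges are given as (effect, cause) pairs, and two in-memory dictionaries stand in for whatever lookup tables a column family store such as DSE would hold. All names (`build_indices`, `traverse`, `causes_of`, `effects_of`) and the sample file names are hypothetical.

```python
from collections import defaultdict, deque

def build_indices(edges):
    """Build two adjacency indices from (effect, cause) edge pairs.

    Assumption: these dictionaries are a stand-in for database tables;
    they are not the indices proposed in the paper.
    """
    causes_of = defaultdict(set)   # vertex -> its direct causes (upstream)
    effects_of = defaultdict(set)  # vertex -> its direct effects (downstream)
    for effect, cause in edges:
        causes_of[effect].add(cause)
        effects_of[cause].add(effect)
    return causes_of, effects_of

def traverse(index, start):
    """Return every vertex reachable from `start` through the given index."""
    seen = set()
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        # .get avoids creating empty entries for leaf vertices
        for neighbor in index.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Hypothetical provenance graph: each pair reads "effect was derived from cause".
edges = [
    ("cleaned.csv", "raw.csv"),
    ("model.pkl", "cleaned.csv"),
    ("report.pdf", "model.pkl"),
    ("report.pdf", "cleaned.csv"),
]
causes_of, effects_of = build_indices(edges)
lineage = traverse(causes_of, "report.pdf")  # everything the report depends on
impact = traverse(effects_of, "raw.csv")     # everything derived from the raw data
```

Keeping one index per direction means a lineage query (walking backward to causes) and an impact query (walking forward to effects) each need only direct key lookups, which is the access pattern a key-partitioned store tends to favor.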