Title | Storage and Querying of Large Provenance Graphs Using NoSQL DSE |
Publication Type | Conference Paper |
Year of Publication | 2020 |
Authors | Kashliev, Andrii |
Conference Name | 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS) |
Keywords | Big Data, composability, DSE, graph, Human Behavior, Metrics, NoSQL, Provenance, pubcrawl, Resiliency |
Abstract | Provenance metadata captures the derivation history of an entity, such as a dataset obtained through numerous data transformations. It is of great importance for science, among other fields, as it enables reproducibility and greater intelligibility of research results. With the avalanche of provenance produced by today's society, there is a pressing need for storage and low-latency querying of large provenance graphs. To address this need, in this paper we present a scalable approach to storing and querying provenance graphs using a popular NoSQL column-family database system called DataStax Enterprise (DSE). Specifically, we i) propose a storage scheme, including two novel indices that enable efficient traversal of provenance graphs along causality lines, ii) present an algorithm for building our proposed indices for a given provenance graph, and iii) implement our algorithm and conduct a performance study in which we store and query a provenance graph with over five million vertices using a DSE cluster running in the AWS cloud. Our performance study results further validate the scalability and performance efficiency of our approach. |
DOI | 10.1109/BigDataSecurity-HPSC-IDS49724.2020.00054 |
Citation Key | kashliev_storage_2020 |
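
The abstract mentions indices that support traversal of a provenance graph along causality lines but does not describe their structure. The sketch below is only an illustration of the general idea, not the paper's scheme: it assumes one backward index (entity to its direct sources) and one forward index (entity to its direct derivations), held in memory rather than in DSE. The function names `build_indices` and `traverse`, and the sample edge list, are hypothetical.

```python
from collections import defaultdict, deque


def build_indices(edges):
    """Build two adjacency indices from (source, derived) pairs.

    Assumption for illustration: each pair means `derived` was
    produced from `source`. These are NOT the indices from the paper.
    """
    parents = defaultdict(set)   # backward index: derived -> direct sources
    children = defaultdict(set)  # forward index: source -> direct derivations
    for source, derived in edges:
        parents[derived].add(source)
        children[source].add(derived)
    return parents, children


def traverse(index, start):
    """Breadth-first walk over one index; returns every vertex reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in index.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical provenance graph: a raw dataset is cleaned, then analyzed.
edges = [
    ("raw", "cleaned"),
    ("cleaned", "stats"),
    ("cleaned", "plot"),
    ("config", "stats"),
]
parents, children = build_indices(edges)

lineage_of_stats = traverse(parents, "stats")  # everything "stats" depends on
impact_of_raw = traverse(children, "raw")      # everything derived from "raw"
```

In a column-family store, each index would plausibly map to a table partitioned by vertex identifier so that one hop of the traversal is a single-partition read; that mapping is likewise an assumption and is not taken from the paper.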