Biblio

Filters: Keyword is relational databases
2017-12-28
Shih, M. H., Chang, J. M.  2017.  Design and analysis of high performance crypt-NoSQL. 2017 IEEE Conference on Dependable and Secure Computing. :52–59.

NoSQL databases have become popular with enterprises due to their scalable and flexible storage management of big data. Nevertheless, their popularity also brings security concerns. Most NoSQL databases lack secure data encryption, relying on developers to implement cryptographic methods at the application level or in a middleware layer that wraps the database. While this approach protects the integrity of data, it increases the difficulty of executing queries. We were motivated to design a system that not only provides NoSQL databases with the necessary data security but also supports the execution of queries over encrypted data. A further challenge is how to exploit the distributed nature of NoSQL databases to deliver high performance and scalability under massive client access. In this research, we introduce Crypt-NoSQL, the first prototype to support high-performance execution of queries over encrypted data on NoSQL databases. We propose three different models of Crypt-NoSQL and evaluate their performance with the Yahoo! Cloud Serving Benchmark (YCSB) under an enormous number of clients. Our experimental results show that Crypt-NoSQL can process queries over encrypted data with high performance and scalability. We also propose guidance for establishing service level agreements (SLAs) for Crypt-NoSQL as a cloud service.
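
The abstract does not describe how queries run over ciphertext, but one common building block for encrypted equality lookups can be illustrated with a minimal, purely hypothetical sketch (not the paper's design): the client derives a deterministic search token for each value with a keyed hash, so the store can match equal values without seeing plaintext, while the stored record itself is encrypted with a randomized cipher. The `EncryptedKVStore` class, key names, and use of the `cryptography` package below are assumptions made for illustration only.

```python
# Minimal illustrative sketch of equality queries over encrypted values.
# Assumptions (not from the paper): HMAC-based deterministic search tokens
# and Fernet for value encryption; an in-memory dict stands in for the
# NoSQL store.
import hmac, hashlib
from cryptography.fernet import Fernet  # pip install cryptography

class EncryptedKVStore:
    def __init__(self):
        self._index_key = b"hypothetical-index-key"   # PRF key for search tokens
        self._fernet = Fernet(Fernet.generate_key())  # randomized value encryption
        self._store = {}                              # token -> list of ciphertexts

    def _token(self, plaintext: str) -> str:
        # Deterministic token: equal plaintexts yield equal tokens, so the
        # server can match them without learning the plaintext itself.
        return hmac.new(self._index_key, plaintext.encode(), hashlib.sha256).hexdigest()

    def put(self, field_value: str, record: str) -> None:
        self._store.setdefault(self._token(field_value), []).append(
            self._fernet.encrypt(record.encode()))

    def query_equals(self, field_value: str) -> list:
        # The "database" only ever sees tokens and ciphertexts.
        return [self._fernet.decrypt(c).decode()
                for c in self._store.get(self._token(field_value), [])]

store = EncryptedKVStore()
store.put("alice", '{"user": "alice", "balance": 42}')
print(store.query_equals("alice"))
```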

2017-12-12
Stephan, E., Raju, B., Elsethagen, T., Pouchard, L., Gamboa, C.  2017.  A scientific data provenance harvester for distributed applications. 2017 New York Scientific Data Summit (NYSDS). :1–9.

Data provenance provides a way for scientists to observe how experimental data originates; it conveys process history and explains influential factors such as experimental rationale and associated environmental factors from system metrics measured at runtime. The US Department of Energy Office of Science Integrated end-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD) project has developed a provenance harvester that is capable of collecting observations from file-based evidence typically produced by distributed applications. To achieve this, file-based evidence is extracted and transformed into an intermediate data format, the Harvester Provenance Application Interface (HAPI) syntax, inspired in part by the W3C CSV on the Web recommendations. This syntax provides a general means to pre-stage provenance into messages that are both human readable and capable of being written to a provenance store, the Provenance Environment (ProvEn). HAPI is being applied to harvest provenance from climate ensemble runs for the Accelerated Climate Modeling for Energy (ACME) project, funded under the U.S. Department of Energy's Office of Biological and Environmental Research (BER) Earth System Modeling (ESM) program. ACME informally provides provenance in a native form through configuration files, directory structures, and log files that contain success/failure indicators, code traces, and performance measurements. Because of its generic format, HAPI is also being applied to harvest tabular job management provenance from Belle II DIRAC scheduler relational database tables as well as from other scientific applications that log provenance-related information.
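
As a rough, hypothetical illustration of file-based provenance harvesting (the abstract does not specify the actual HAPI layout), the sketch below scans a log file for success/failure indicators and packages the findings as a CSV-on-the-Web-style message: a small JSON metadata header followed by CSV rows. The column names and the `harvest` function are invented for this example.

```python
# Illustrative sketch of harvesting file-based provenance into a tabular
# message. The column names and metadata wrapper are hypothetical; the
# abstract does not define the real HAPI syntax.
import csv, io, json, re
from pathlib import Path

STATUS_RE = re.compile(r"(SUCCESS|FAILURE)", re.IGNORECASE)

def harvest(log_path: Path) -> str:
    """Extract run-status lines from a log file and package them as a
    CSV-on-the-Web-like message: JSON metadata followed by CSV rows."""
    rows = []
    for lineno, line in enumerate(log_path.read_text().splitlines(), start=1):
        match = STATUS_RE.search(line)
        if match:
            rows.append({"source": log_path.name, "line": lineno,
                         "status": match.group(1).upper(), "evidence": line.strip()})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source", "line", "status", "evidence"])
    writer.writeheader()
    writer.writerows(rows)
    metadata = {"tableSchema": {"columns": [{"name": c} for c in writer.fieldnames]}}
    return json.dumps(metadata) + "\n" + buf.getvalue()

# Usage: print(harvest(Path("run.log")))  # given an application log on disk
```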

2017-08-18
DiScala, Michael, Abadi, Daniel J.  2016.  Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data. Proceedings of the 2016 International Conference on Management of Data. :295–310.

Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
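
The core intuition, that attributes functionally determined by another attribute can be factored into their own relational table, can be shown with a toy sketch. This is not the authors' algorithm, which handles far more (approximate dependencies, entity matching, and merging); it only illustrates the dependency-discovery step on flattened records, using hypothetical helper names.

```python
# Toy sketch: flatten nested records, then find attributes that are
# functionally determined by a chosen attribute and could move into a
# separate table. Purely illustrative, not the paper's schema generator.
from collections import defaultdict

def flatten(record, prefix=""):
    flat = {}
    for k, v in record.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten(v, key + "."))
        else:
            flat[key] = v
    return flat

def determined_by(records, determinant):
    """Attributes whose value never varies for a given determinant value."""
    seen = defaultdict(dict)                   # attr -> {determinant value: attr value}
    candidates = set(records[0]) - {determinant}
    for r in records:
        for attr in list(candidates):
            prev = seen[attr].setdefault(r[determinant], r[attr])
            if prev != r[attr]:
                candidates.discard(attr)       # dependency violated
    return candidates

docs = [
    {"order": 1, "customer": {"id": 7, "name": "Ada"}, "total": 30},
    {"order": 2, "customer": {"id": 7, "name": "Ada"}, "total": 12},
]
flat = [flatten(d) for d in docs]
dep = determined_by(flat, "customer.id")
print("customer table columns:", {"customer.id"} | dep)  # name factors out of orders
```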

2017-03-07
Jain, Shrainik, Moritz, Dominik, Halperin, Daniel, Howe, Bill, Lazowska, Ed.  2016.  SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment. Proceedings of the 2016 International Conference on Management of Data. :281–293.

We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.
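
A minimal sketch of the "relaxed schemas and schema inference" idea, assuming a simple type-guessing rule over CSV input (not SQLShare's actual inference logic), is shown below: each column's type is inferred from its sample values and a CREATE TABLE statement is emitted for dataset-at-a-time ingest.

```python
# Hypothetical schema inference for dataset-at-a-time ingest: guess a
# column type from its values and emit a CREATE TABLE statement. The type
# rules here are illustrative, not SQLShare's implementation.
import csv, io

def infer_type(values):
    def all_parse(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False
    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "REAL"
    return "TEXT"

def infer_schema(csv_text, table_name):
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = [f'"{name}" {infer_type([r[i] for r in data])}'
            for i, name in enumerate(header)]
    return f'CREATE TABLE "{table_name}" ({", ".join(cols)});'

sample = "gene,expression,replicate\nBRCA1,2.31,1\nTP53,0.07,2\n"
print(infer_schema(sample, "uploaded_dataset"))
# CREATE TABLE "uploaded_dataset" ("gene" TEXT, "expression" REAL, "replicate" INTEGER);
```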

2015-05-06
Xingbang Tian, Baohua Huang, Min Wu.  2014.  A transparent middleware for encrypting data in MongoDB. 2014 IEEE Workshop on Electronics, Computer and Applications. :906–909.

Due to the development of cloud computing and NoSQL databases, more and more sensitive information is stored in NoSQL databases, which exposes quite a few security vulnerabilities. This paper discusses the security features of the MongoDB database and proposes a transparent middleware implementation. Analysis of the experimental results shows that this transparent middleware can efficiently encrypt sensitive data specified by users at the dataset level. Existing application systems require only minor modifications to adopt this middleware.
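
A rough sketch of such a transparent field-level encryption wrapper, written against pymongo's collection interface and the `cryptography` package (the authors' implementation details are not given in the abstract), might look like the following: user-specified fields are encrypted before insert_one and decrypted when documents are read back.

```python
# Sketch of a transparent field-level encryption wrapper in the spirit of
# the middleware described above (not the authors' implementation).
# Assumes pymongo and the cryptography package; field values are stored
# as Fernet ciphertext and returned as strings on read.
from cryptography.fernet import Fernet

class EncryptedCollection:
    def __init__(self, collection, sensitive_fields, key=None):
        self._coll = collection                  # e.g. a pymongo Collection
        self._fields = set(sensitive_fields)
        self._fernet = Fernet(key or Fernet.generate_key())

    def insert_one(self, doc):
        # Encrypt only the fields the user marked as sensitive.
        enc = {k: (self._fernet.encrypt(str(v).encode()) if k in self._fields else v)
               for k, v in doc.items()}
        return self._coll.insert_one(enc)

    def find(self, *args, **kwargs):
        # Decrypt sensitive fields transparently as documents are read back.
        for doc in self._coll.find(*args, **kwargs):
            yield {k: (self._fernet.decrypt(v).decode() if k in self._fields else v)
                   for k, v in doc.items()}

# Usage (requires a running MongoDB instance):
# from pymongo import MongoClient
# coll = EncryptedCollection(MongoClient().testdb.patients, ["ssn", "diagnosis"])
# coll.insert_one({"name": "Alice", "ssn": "123-45-6789"})
# print(next(coll.find({"name": "Alice"})))   # queries on unencrypted fields still work
```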

Pathan, A. C., Potey, M. A.  2014.  Detection of Malicious Transaction in Database Using Log Mining Approach. 2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies (ICESC). :262–265.

Data mining is the process of finding correlations in relational databases. There are different techniques for identifying malicious database transactions. While many existing approaches profile SQL query structures and database user activities to detect intrusions, the log mining approach automatically discovers rules for identifying anomalous database transactions. Data mining helps end users extract useful business information from large databases. Multi-level and multi-dimensional data mining are employed to discover data item dependency rules, data sequence rules, domain dependency rules, and domain sequence rules from the database log containing legitimate transactions. Database transactions that do not comply with the rules are identified as malicious. The log mining approach can achieve the desired true and false positive rates when the confidence and support thresholds are set appropriately. The implemented system incrementally maintains the data dependency rule sets and optimizes the performance of the intrusion detection process.
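
A toy sketch of the data-dependency mining idea described above, with illustrative rule forms and thresholds rather than the paper's exact formulation, is shown below: rules of the form "a write to X is preceded in the same transaction by a read of Y" are mined from a log of legitimate transactions subject to minimum support and confidence, and transactions that violate a learned rule are flagged.

```python
# Toy sketch of data-dependency rule mining for intrusion detection.
# Rule form and thresholds are illustrative, not the paper's formulation.
from collections import defaultdict

def mine_rules(transactions, min_support=2, min_confidence=0.8):
    write_count = defaultdict(int)            # how often attribute X is written
    pair_count = defaultdict(int)             # how often Y was read before writing X
    for txn in transactions:
        reads = set()
        for op, attr in txn:                  # operations in execution order
            if op == "r":
                reads.add(attr)
            else:                             # "w"
                write_count[attr] += 1
                for y in reads:
                    pair_count[(attr, y)] += 1
    return {(x, y) for (x, y), c in pair_count.items()
            if c >= min_support and c / write_count[x] >= min_confidence}

def is_malicious(txn, rules):
    reads = set()
    for op, attr in txn:
        if op == "r":
            reads.add(attr)
        else:
            required = {y for (x, y) in rules if x == attr}
            if required - reads:              # a learned dependency is violated
                return True
    return False

log = [[("r", "balance"), ("w", "balance")],
       [("r", "balance"), ("w", "balance")],
       [("r", "balance"), ("r", "limit"), ("w", "balance")]]
rules = mine_rules(log)                                           # {("balance", "balance")}
print(is_malicious([("w", "balance")], rules))                    # True: write without prior read
print(is_malicious([("r", "balance"), ("w", "balance")], rules))  # False: rule satisfied
```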