Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data
Title | Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data |
Publication Type | Conference Paper |
Year of Publication | 2016 |
Authors | DiScala, Michael; Abadi, Daniel J. |
Conference Name | Proceedings of the 2016 International Conference on Management of Data |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-3531-7 |
Keywords | Deduplication, denormalized data, entity extraction, functional dependencies, functional dependency mining, Human Behavior, JSON, Key Management, key-value data, Metrics, normalization, pubcrawl, relational databases, Resiliency, Scalability, schema extraction, schema generation, schema matching, semistructured data, semistructured-to-relational mappings |
Abstract | Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets. |
URL | http://doi.acm.org/10.1145/2882903.2882924 |
DOI | 10.1145/2882903.2882924 |
Citation Key | discala_automatic_2016 |
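
The abstract describes discovering relationships across the attributes of nested records and using them to organize attributes into tables. The following is only a minimal sketch of that general idea, not the authors' algorithm: it flattens nested JSON-like records into dotted attribute names, checks single-attribute functional dependencies by brute force, and projects one dependent group into its own table. The helper names (`flatten`, `holds`, `mine_fds`) and the sample records are hypothetical, and the sketch assumes every record has the same keys and only hashable scalar values.

```python
from itertools import permutations


def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dotted attribute names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def holds(rows, lhs, rhs):
    """Return True if each lhs value is paired with exactly one rhs value."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def mine_fds(rows):
    """Brute-force all single-attribute dependencies lhs -> rhs."""
    attrs = sorted(rows[0])
    return {(a, b) for a, b in permutations(attrs, 2) if holds(rows, a, b)}


# Hypothetical sample data: the nested "user" object repeats across records.
records = [
    {"id": 1, "text": "hello", "user": {"id": 10, "name": "ann"}},
    {"id": 2, "text": "again", "user": {"id": 10, "name": "ann"}},
    {"id": 3, "text": "hello", "user": {"id": 11, "name": "bob"}},
]

rows = [flatten(r) for r in records]
fds = mine_fds(rows)

# user.id determines user.name, so the pair can move to its own table;
# the original rows then keep only user.id as a foreign key.
users = {(row["user.id"], row["user.name"]) for row in rows}
posts = [(row["id"], row["text"], row["user.id"]) for row in rows]
```

In this toy data the `users` table holds two rows instead of three repeated copies, which illustrates the reduction in data repetition that the abstract mentions. A realistic approach would also need to handle arrays, missing keys, multi-attribute dependencies, and noisy data, none of which this sketch attempts.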