Title | DataSynthesizer: Privacy-Preserving Synthetic Datasets |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Ping, Haoyue; Stoyanovich, Julia; Howe, Bill |
Conference Name | Proceedings of the 29th International Conference on Scientific and Statistical Database Management |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-5282-6 |
Keywords | Collaboration, compositionality, Data Sanitization, data sharing, Differential privacy, Human Behavior, human factors, policy, privacy, pubcrawl, Resiliency, Synthetic Data |
Abstract | To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability -- the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules -- DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/DataSynthesizer. |
URL | http://doi.acm.org/10.1145/3085504.3091117 |
DOI | 10.1145/3085504.3091117 |
Citation Key | ping_datasynthesizer:_2017 |
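
The describe-then-sample flow summarized in the abstract can be illustrated with a minimal sketch for a single categorical attribute. This is not the DataSynthesizer code or its API: the function names `describe` and `generate`, the privacy parameter `epsilon`, the per-count sensitivity of 1, and the toy data are all assumptions made for illustration, and attribute correlations are ignored entirely.

```python
import random
from collections import Counter


def describe(values, epsilon, rng):
    """Return a noisy, normalized histogram for one categorical attribute.

    Hypothetical helper: assumes each record changes one count by at most 1,
    so Laplace noise with scale 1/epsilon is added to every count.
    """
    counts = Counter(values)
    scale = 1.0 / epsilon
    noisy = {}
    for category, count in counts.items():
        # A Laplace(0, scale) draw is the difference of two exponential draws.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[category] = max(count + noise, 0.0)  # clamp negative counts to zero
    total = sum(noisy.values())
    if total == 0:
        # All counts were clamped to zero: fall back to a uniform distribution.
        return {c: 1.0 / len(noisy) for c in noisy}
    return {c: v / total for c, v in noisy.items()}


def generate(summary, n, rng):
    """Sample n synthetic values from the noisy summary (hypothetical helper)."""
    categories = list(summary)
    weights = [summary[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
private = ["A"] * 60 + ["B"] * 30 + ["C"] * 10  # made-up stand-in for sensitive data
summary = describe(private, epsilon=1.0, rng=rng)
synthetic = generate(summary, n=1000, rng=rng)
```

Only `summary` and `synthetic` would leave the data owner's hands in this sketch; the `private` list is never shared. A smaller `epsilon` adds more noise, which trades accuracy of the summary for stronger privacy.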