PrivateClean: Data Cleaning and Differential Privacy

Title: PrivateClean: Data Cleaning and Differential Privacy
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Krishnan, Sanjay; Wang, Jiannan; Franklin, Michael J.; Goldberg, Ken; Kraska, Tim
Conference Name: Proceedings of the 2016 International Conference on Management of Data
Date Published: June 2016
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-3531-7
Keywords: data cleaning, Differential privacy, local differential privacy, pubcrawl170201
Abstract

Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.
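
To make the abstract's idea of a private dataset and a corrected aggregate query concrete, the following is a minimal sketch and not the paper's implementation. It assumes randomized response for discrete attributes (each value is replaced, with probability p, by a uniform draw from the attribute's domain) and Laplace noise for numerical attributes, and it shows how a count query can be debiased when p and the domain are known. The function names, the parameter names p and b, and the sample data are all hypothetical, and the sketch omits the cleaning step and the bipartite dirty-to-clean graph that the abstract describes.

```python
import random


def randomize_discrete(values, domain, p, rng):
    """Assumed mechanism: with probability p, replace a value with a
    uniform draw from `domain`; otherwise keep it unchanged."""
    return [rng.choice(domain) if rng.random() < p else v for v in values]


def add_laplace_noise(values, b, rng):
    """Add Laplace(0, b) noise to each number. The difference of two
    independent exponentials with mean b is Laplace-distributed."""
    return [v + rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
            for v in values]


def estimate_count(private_values, target, domain, p):
    """Debiased count of `target` in the randomized column.

    Under the mechanism above,
        E[observed] = (1 - p) * true + p * n / len(domain),
    so we solve for `true`. Requires p < 1.
    """
    n = len(private_values)
    observed = sum(1 for v in private_values if v == target)
    return (observed - p * n / len(domain)) / (1.0 - p)


if __name__ == "__main__":
    rng = random.Random(0)
    domain = ["a", "b", "c"]                      # hypothetical attribute domain
    column = ["a"] * 6000 + ["b"] * 3000 + ["c"] * 1000
    private = randomize_discrete(column, domain, p=0.2, rng=rng)
    naive = sum(1 for v in private if v == "a")   # biased toward n / |domain|
    corrected = estimate_count(private, "a", domain, p=0.2)
    print("naive count:", naive)
    print("corrected estimate:", round(corrected, 1))
```

In this sketch a larger p gives stronger privacy but a noisier corrected estimate, which is the privacy and accuracy trade-off named in point (1) of the abstract. Cleaning operations that merge or rewrite values after randomization would change the expected count, so the simple correction shown here would no longer hold on its own.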

URL: https://dl.acm.org/doi/10.1145/2882903.2915248
DOI: 10.1145/2882903.2915248
Citation Key: krishnan_privateclean:_2016