Visible to the public Biblio

Filters: Author is Goldberg, Ken  [Clear All Filters]
2017-05-22
Krishnan, Sanjay, Wang, Jiannan, Franklin, Michael J., Goldberg, Ken, Kraska, Tim.  2016.  PrivateClean: Data Cleaning and Differential Privacy. Proceedings of the 2016 International Conference on Management of Data. :937–951.

Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.

2017-03-07
Krishnan, Sanjay, Wang, Jiannan, Franklin, Michael J., Goldberg, Ken, Kraska, Tim.  2016.  PrivateClean: Data Cleaning and Differential Privacy. Proceedings of the 2016 International Conference on Management of Data. :937–951.

Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.

Krishnan, Sanjay, Franklin, Michael J., Goldberg, Ken, Wang, Jiannan, Wu, Eugene.  2016.  ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning. Proceedings of the 2016 International Conference on Management of Data. :2117–2120.

Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug.Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.