Visible to the public EAGER: Bridging The Gap between Theory and Practice in Data PrivacyConflict Detection Enabled

Project Details

Lead PI

Performance Period

Sep 01, 2016 - Aug 31, 2018

Institution(s)

Purdue University

Award Number


This project aims to bridge the gap between theory and practice in privacy-preserving data sharing and analysis. Data collected by organizations and agencies are a key resource in today's information age. However, the disclosure of those data poses serious threats to individual privacy. While differential privacy provides a solid foundation for developing techniques to balance privacy and utility in data sharing, currently there is a significant gap between theory and practice in research in this area. In the current state of the art, each task requires specialized algorithms to achieve acceptable trade-off of privacy and utility. The process of designing new algorithms is manual and challenging. Furthermore, research in this area tends to take either a pure theoretical approach or a pure experimental approach; both have significant limitations. This project aims to develop algorithms that can be broadly and automatically applied, and methodologies for combining theoretical analysis with experimental validations, focusing on concrete (instead of asymptotic) analysis where constants are spelled out. Advances in data privacy techniques will benefit society by providing a better balance between the need to release data to serve public interest and the need to protect individuals' privacy.

The project pursues the following research goals to advance the state-of-the-art of data privacy. One goal is to develop a general method that can take a non-private data analysis algorithm as a blackbox, and make it private. This may require the development of a data privacy notion that is more relaxed than differential privacy. Another goal is to develop a concrete approach to understanding the utility of data analysis algorithms. The theoretical approach of proving asymptotic utility bounds is limited for a number of reasons. Asymptotic analysis ignores constants (and oftentimes poly-logarithmic terms as well), which are critical for utility in practice. A method with an appealing asymptotic utility bound often performs poorly except for very large parameters, when applying the method requires an unacceptable amount of space and time computing resources. As the utility bound must hold for all datasets (including pathological ones), such bounds can be so loose that they are meaningless once the actual parameters are plugged in. Bridging this gap requires better understanding of the factors affecting utility, better utility metrics, and methods to formalize the dependencies of utility on dataset features. The resulting concrete approach combines theoretical analysis with heuristic approximations and experimental validations, and can more effectively guide the development of practically effective algorithms.