Safely managing the release of data containing confidential information about individuals is a problem of great societal importance. Governments, institutions, and researchers collect data whose release can have enormous benefits to society by influencing public policy or advancing scientific knowledge. But these data can be disseminated only if the privacy of the respondents is preserved or the amount of disclosure is limited. The goal of this research project is to bridge the gap between the statistics and computer science communities, and between theory and practice, in limiting disclosure. The research focuses on limiting statistical disclosure using synthetic data, the most advanced method from the statistics community, which enables the construction of public data sets with strong statistical properties; the research incorporates formal privacy guarantees from the computer science community into this approach. The techniques focus on household data and relational data, addressing real problems motivated by the U.S. Census Bureau and related agencies. The team's approach is based on three developments: novel techniques for boosting the utility of synthetic data generation methods with formal privacy guarantees; novel formal privacy models that formalize the attackers implicitly considered in the statistics literature, together with new attacker models that allow an exploration of the space between weak and strong adversaries; and novel techniques designed for census or survey data about households, which have a relational structure. The research has broad impact by influencing the methodology of statistical agencies around the world. The project also develops an open-source toolkit for limiting disclosure in data publishing with formal privacy guarantees, integrates undergraduate students into research, and creates educational material for practitioners responsible for safe data handling.