One of the keys to scientific progress is the sharing of research data. When the data contain information about human subjects, however, there are strong incentives not to share. The chief concern is privacy: specific information about individuals must be protected at all times. Recent advances in mathematical notions of privacy have raised the hope that data can be properly sanitized and distributed to other research groups without revealing information about any individual. For this effort to be worthwhile, the sanitized data must remain useful for statistical analysis. This project addresses the research challenges in making sanitized data useful. The first part of the project concerns the design of algorithms that produce useful sanitized data subject to privacy constraints. The second part concerns the development of tools for the statistical analysis of sanitized data. Existing statistical routines are not designed for the complex noise patterns found in sanitized data; their naive use will often result in missed discoveries or false claims of statistical significance. The target application for this project is a social science dataset with geographic characteristics. The intellectual merit of this proposal is the development of a utility theory for data-sanitizing algorithms and of statistical tools for analyzing their output. The broader impact is the improved ability of research groups to share useful, privacy-preserving research data.
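As a minimal illustration of the kind of noise a sanitizer introduces, the sketch below releases a single counting query through the Laplace mechanism, a standard differentially private primitive; the mechanism, parameter values, and function names here are assumptions made for exposition, not the algorithms proposed in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

def release_count(true_count, epsilon):
    """Laplace mechanism: a counting query has sensitivity 1, so adding
    Laplace(1/epsilon) noise yields epsilon-differential privacy."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical tabulation from the raw (confidential) data.
true_count = 130
noisy_count = release_count(true_count, epsilon=0.1)

# A naive analyst who treats the released value as exact ignores the
# extra variance 2 / epsilon**2 (here 200) added by sanitization, so
# confidence intervals come out too narrow and apparent "significant"
# findings may be spurious.
print(noisy_count)
```

An analysis that models the added noise explicitly, rather than treating the released count as the true count, avoids this understatement of uncertainty; developing such tools for realistic sanitizers is the aim of the second part of the project.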