Recent improvements in computing capabilities, data collection, and data science have enabled tremendous advances in scientific data analysis. However, the relevant data are often highly sensitive (e.g., Census records, tax records, medical records). This project addresses an emerging and critical scientific problem: privacy concerns limit access to raw data that might reveal information about individuals, while techniques to "sanitize" such data (e.g., anonymization) can degrade the quality of the scientific results that rely on the data. How can we provide data that protect the privacy of individuals but also accurately support scientific analyses?
The project addresses three challenges in the analysis of privacy-preserving sanitized data: (1) How can sanitized data be analyzed so that conclusions will stand up to peer review? (2) What workflows and visualizations must be supported by privacy technology? (3) How can scientists assess bias introduced by sanitization without access to the raw data? The project focuses specifically on "social flow analysis," in which analysis is performed on sensitive social flow data (e.g., commuting patterns, migration trajectories) of individuals or families. The researchers are developing an ecological model of networks of neighborhoods that are linked by social flows and studying how social flows are formed and maintained. The project is cataloging the types of data access and visualization needed to develop such theories, studying alternative analyses that are both scalable and statistically robust, developing preliminary privacy-preserving data protection methods, and evaluating whether the privacy-preserving methods support the same conclusions as access to the raw data.