CAREER: Privacy-preserving learning for distributed data

Project Details

Lead PI:

Performance Period: Jul 01, 2015 - Jun 30, 2020

Institution(s): Rutgers University New Brunswick

Award Number:

Medical technologies such as imaging and sequencing make it possible to gather massive amounts of information at ever lower cost. Sharing data from studies can advance scientific understanding and improve healthcare outcomes. Concerns about patient privacy, however, can preclude open data sharing, thus hampering progress in understanding stigmatized conditions such as mental health disorders. This research seeks to understand how to analyze and learn from sensitive data held at different sites (such as medical centers) in a way that quantifiably and rigorously protects the privacy of the data.

The framework used in this research is differential privacy, a recently proposed model for measuring the privacy risk of data sharing. Differentially private algorithms provide approximate (noisy) answers to protect sensitive data, creating a tradeoff between privacy and utility. This research studies how to combine private approximations from different sites to improve the overall quality, or utility, of the result. The main goals of this research are to understand the fundamental limits of private data sharing, to design algorithms for producing private approximations and rules for combining them, and to understand the consequences of sites having more complex privacy and sharing restrictions. The methods used to address these problems draw on mathematical techniques from statistics, computer science, and electrical engineering.
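
To make the privacy-utility tradeoff concrete, the Python sketch below illustrates the general idea under simplifying assumptions: each site holds scalar values in [0, 1] and releases a differentially private mean via the standard Laplace mechanism, and a central aggregator combines the noisy site-level estimates. The function names and parameter choices are illustrative assumptions, not the project's own algorithms.

    import numpy as np

    def private_mean(data, epsilon, rng):
        """Release an epsilon-differentially private mean of values in [0, 1].

        Changing one record moves the mean by at most 1/n, so Laplace noise
        with scale 1/(n * epsilon) yields epsilon-differential privacy.
        """
        n = len(data)
        return float(np.mean(data)) + rng.laplace(scale=1.0 / (n * epsilon))

    def combine_site_estimates(estimates, sizes):
        """Average noisy per-site means, weighting each site by sample size.

        Larger sites add less noise per record, so size-weighting improves
        the utility of the aggregate without sharing any raw data.
        """
        return float(np.average(estimates, weights=np.asarray(sizes, dtype=float)))

    rng = np.random.default_rng(0)
    sites = [rng.uniform(size=n) for n in (200, 500, 1000)]  # simulated local datasets
    eps = 0.5  # per-site privacy budget: smaller eps = stronger privacy, more noise
    noisy = [private_mean(d, eps, rng) for d in sites]
    print(combine_site_estimates(noisy, [len(d) for d in sites]))

Weighting by site size is only one simple combination rule; designing and analyzing better rules for combining private approximations is exactly the kind of question this research addresses.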

The educational component of this research will involve designing introductory university courses and materials on data science, mentoring undergraduate research projects, developing curricular materials for graduate courses, and conducting outreach to the growing data-hacker community through presentations, tutorial materials, and open-source software.

The primary aim of this research is to bridge the gap between theory and practice by developing algorithmic principles for practical privacy-preserving algorithms. These algorithms will be validated on neuroimaging data used to understand and diagnose mental health disorders. Implementing the results of this research will create a blueprint for building practical privacy-preserving learning systems for research in healthcare and other fields. The tradeoffs between privacy and utility in distributed systems lead naturally to more general questions of cost-benefit tradeoffs in learning problems, and the same algorithmic principles will shed light on information processing and machine learning in general distributed systems where messages may be noisy or corrupted.