Iterative Classification for Sanitizing Large-Scale Datasets
Title | Iterative Classification for Sanitizing Large-Scale Datasets |
Publication Type | Conference Paper |
Year of Publication | 2015 |
Authors | Li, Bo, Vorobeychik, Yevgeniy, Li, Muqun, Malin, Bradley |
Conference Name | SIAM International Conference on Data Mining |
Date Published | 05/2015 |
Publisher | SIAM |
Conference Location | Vancouver, British Columbia, Canada |
Keywords | Foundations, Hierarchical Coordination and Control, Resilient Systems, Science of decentralized security, science of security, SURE Project |
Abstract | Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text), and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect, and relying upon them to sanitize data can leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of data published and the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted to be non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations, and ensures a low utility for a resource-limited adversary. Moreover, using several real-world natural language corpora, we illustrate that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations. |
Citation Key | liiterative |
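The abstract's publisher strategy, repeatedly train a classifier, withhold every instance it predicts sensitive, and stop once nothing new is flagged, can be sketched as a minimal loop. This is a hypothetical illustration, not the paper's implementation: the `train_classifier` below is a toy token-matching learner standing in for the real NLP classifiers the authors use, and all record contents are invented.

```python
def train_classifier(records):
    """Toy learner: memorize tokens seen in labelled-sensitive records.

    Stand-in for the paper's learned classifiers; returns a predicate
    that flags any text sharing a token with a known-sensitive record.
    """
    sensitive_tokens = set()
    for text, is_sensitive in records:
        if is_sensitive:
            sensitive_tokens.update(text.split())
    return lambda text: any(tok in sensitive_tokens for tok in text.split())

def sanitize(records, max_iters=10):
    """Iteratively drop instances predicted sensitive until a fixed point.

    records: list of (text, is_sensitive_label) pairs.
    Returns the texts that survive every round of classification.
    """
    published = list(records)
    for _ in range(max_iters):
        predict = train_classifier(published)
        kept = [(t, s) for t, s in published if not predict(t)]
        if len(kept) == len(published):  # nothing newly flagged: converged
            break
        published = kept
    return [t for t, _ in published]

# Toy corpus: one labelled-sensitive record plus a record that leaks
# the same name without being labelled.
records = [
    ("patient john smith admitted", True),
    ("blood pressure normal", False),
    ("john returned for followup", False),
    ("lab results pending", False),
]
print(sanitize(records))  # only the two name-free records remain
```

The loop converges because each iteration either removes at least one instance or terminates, mirroring the paper's claim of at most a linear number of iterations in the corpus size.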