Iterative Classification for Sanitizing Large-Scale Datasets
Title | Iterative Classification for Sanitizing Large-Scale Datasets |
Publication Type | Conference Paper |
Year of Publication | 2015 |
Authors | Li, Bo, Vorobeychik, Yevgeniy, Li, Muqun, Malin, Bradley |
Conference Name | SIAM International Conference on Data Mining |
Date Published | 05/2015 |
Publisher | SIAM |
Conference Location | Vancouver, British Columbia, Canada |
Keywords | Foundations, Hierarchical Coordination and Control, Resilient Systems, Science of decentralized security, science of security, SURE Project |
Abstract | Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text), and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect, and relying upon them to sanitize data can leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of data published and the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted to be non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations, and ensures a low utility for a resource-limited adversary. Moreover, using several real-world natural language corpora, we illustrate that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations. |
Citation Key | liiterative |
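The abstract's publisher strategy, repeatedly train a classifier, withhold every instance it predicts sensitive, and stop once nothing new is flagged, can be sketched as a minimal loop. This is a hypothetical illustration, not the paper's implementation: the `train_classifier` below is a toy token-matching learner standing in for the real NLP classifiers the authors use, and all record contents are invented.

```python
def train_classifier(records):
    """Toy learner: memorize tokens seen in labelled-sensitive records.

    Stand-in for the paper's learned classifiers; returns a predicate
    that flags any text sharing a token with a known-sensitive record.
    """
    sensitive_tokens = set()
    for text, is_sensitive in records:
        if is_sensitive:
            sensitive_tokens.update(text.split())
    return lambda text: any(tok in sensitive_tokens for tok in text.split())

def sanitize(records, max_iters=10):
    """Iteratively drop instances predicted sensitive until a fixed point.

    records: list of (text, is_sensitive_label) pairs.
    Returns the texts that survive every round of classification.
    """
    published = list(records)
    for _ in range(max_iters):
        predict = train_classifier(published)
        kept = [(t, s) for t, s in published if not predict(t)]
        if len(kept) == len(published):  # nothing newly flagged: converged
            break
        published = kept
    return [t for t, _ in published]

# Toy corpus: one labelled-sensitive record plus a record that leaks
# the same name without being labelled.
records = [
    ("patient john smith admitted", True),
    ("blood pressure normal", False),
    ("john returned for followup", False),
    ("lab results pending", False),
]
print(sanitize(records))  # only the two name-free records remain
```

The loop converges because each iteration either removes at least one instance or terminates, mirroring the paper's claim of at most a linear number of iterations in the corpus size.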