ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning

Title: ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Krishnan, Sanjay; Franklin, Michael J.; Goldberg, Ken; Wang, Jiannan; Wu, Eugene
Conference Name: Proceedings of the 2016 International Conference on Management of Data
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-3531-7
Keywords: data cleaning, machine learning, pubcrawl170201
Abstract

Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.
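
The iterative update described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a one-parameter least-squares model, and the names `grad`, `progressive_clean_train`, and `clean_fn` (a stand-in for the analyst who returns the corrected record) are hypothetical. Each round samples a small batch of not-yet-cleaned records, cleans them, and takes a gradient step that combines the already-cleaned records with an importance-weighted estimate for the rest. The sketch omits dirty data detection, so records that were already correct are also passed to `clean_fn`.

```python
import random


def grad(theta, x, y):
    """Gradient of the squared loss 0.5 * (theta * x - y) ** 2 w.r.t. theta."""
    return (theta * x - y) * x


def progressive_clean_train(dirty, clean_fn, batch_size=5, steps=200, lr=0.05, seed=0):
    """Train theta while cleaning small batches; returns (theta, records cleaned)."""
    rng = random.Random(seed)
    data = list(dirty)   # working copy; entries are replaced as they get cleaned
    cleaned = []         # indices of records that have been cleaned so far
    theta = 0.0
    n = len(data)
    for _ in range(steps):
        cleaned_set = set(cleaned)
        remaining = [i for i in range(n) if i not in cleaned_set]
        # Exact gradient sum over the records cleaned in earlier rounds.
        g_clean = sum(grad(theta, *data[i]) for i in cleaned)
        g_rest = 0.0
        if remaining:
            m = len(remaining)
            mags = [abs(grad(theta, *data[i])) for i in remaining]
            total = sum(mags)
            # Assumption: mix uniform and gradient-proportional sampling so that
            # no importance weight 1 / p exceeds 2 * m.
            probs = [0.5 / m + (0.5 * g / total if total > 0 else 0.5 / m) for g in mags]
            k = min(batch_size, m)
            picks = rng.choices(range(m), weights=probs, k=k)  # with replacement
            for j in picks:
                i = remaining[j]
                data[i] = clean_fn(i)  # the "analyst" supplies the clean record
                # Importance weighting: dividing by the sampling probability makes
                # the batch average an unbiased estimate of the gradient sum over
                # all remaining records, evaluated on their clean values.
                g_rest += grad(theta, *data[i]) / probs[j]
            g_rest /= k
            cleaned.extend(remaining[j] for j in set(picks))
        theta -= lr * (g_clean + g_rest) / n
    return theta, len(cleaned)


# Toy data: the true relation is y = 2x, and every fourth label is corrupted to 0.
xs = [0.1 * (i + 1) for i in range(20)]
truth = [(x, 2.0 * x) for x in xs]
dirty = [(x, 0.0) if i % 4 == 0 else (x, y) for i, (x, y) in enumerate(truth)]
theta, n_cleaned = progressive_clean_train(dirty, clean_fn=lambda i: truth[i])
```

In this sketch only cleaned values ever enter the gradient; the dirty values influence nothing but the sampling probabilities, which is why the importance weights are needed to keep the estimate unbiased.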

URL: http://doi.acm.org/10.1145/2882903.2899409
DOI: 10.1145/2882903.2899409
Citation Key: krishnan_activeclean:_2016