ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning
Title | ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning |
Publication Type | Conference Paper |
Year of Publication | 2016 |
Authors | Krishnan, Sanjay; Franklin, Michael J.; Goldberg, Ken; Wang, Jiannan; Wu, Eugene |
Conference Name | Proceedings of the 2016 International Conference on Management of Data |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-3531-7 |
Keywords | data cleaning, machine learning, pubcrawl170201 |
Abstract | Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem. |
URL | http://doi.acm.org/10.1145/2882903.2899409 |
DOI | 10.1145/2882903.2899409 |
Citation Key | krishnan_activeclean:_2016 |