Data Cleaning: Overview and Emerging Challenges

Submitted by grigby1 on Tue, 03/07/2017 - 12:50pm

Title	Data Cleaning: Overview and Emerging Challenges
Publication Type	Conference Paper
Year of Publication	2016
Authors	Chu, Xu; Ilyas, Ihab F.; Krishnan, Sanjay; Wang, Jiannan
Conference Name	Proceedings of the 2016 International Conference on Management of Data
Date Published	June 2016
Publisher	ACM
Conference Location	New York, NY, USA
ISBN Number	978-1-4503-3531-7
Keywords	data cleaning, data quality, integrity constraints, pubcrawl170201, sampling, statistical cleaning
Abstract	Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia in data cleaning problems, including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
URL	https://dl.acm.org/doi/10.1145/2882903.2912574
DOI	10.1145/2882903.2912574
Citation Key	chu_data_2016
