Automated Debugging in Data-Intensive Scalable Computing
Title | Automated Debugging in Data-Intensive Scalable Computing |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Gulzar, Muhammad Ali, Interlandi, Matteo, Han, Xueyuan, Li, Mingda, Condie, Tyson, Kim, Miryung |
Conference Name | Proceedings of the 2017 Symposium on Cloud Computing |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-4503-5028-0 |
Keywords | automated debugging, Big Data, compositionality, data cleaning, data provenance, data-intensive scalable computing (DISC), fault localization, Metrics, pubcrawl, resilience, Resiliency, Scalability, scalable verification |
Abstract | Developing Big Data Analytics workloads often involves trial-and-error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g., program crash, outlier results) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BigSift is a new faulty-data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BigSift redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BigSift improves the accuracy of fault localization by several orders of magnitude (~10^3x to 10^7x) compared to Titian data provenance, and improves performance by up to 66x compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BigSift is able to localize fault-inducing data within 62% of the original job running time. |
URL | https://dl.acm.org/citation.cfm?doid=3127479.3131624 |
DOI | 10.1145/3127479.3131624 |
Citation Key | gulzar_automated_2017 |
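
The abstract describes driving input minimization with a test oracle function, in the spirit of Delta Debugging. The sketch below is only an illustration of that general idea, not BigSift's actual implementation (which runs on Apache Spark and additionally exploits Titian-style data provenance and its own optimizations). It is a classic ddmin-style search in Scala for a minimal failure-inducing subset of input records; the oracle `failing` and all other names are hypothetical.

```scala
// Minimal sketch of ddmin-style delta debugging over an input dataset.
// `failing` is a user-supplied test oracle that returns true when running
// the program on the given records reproduces the fault.
object DeltaDebugSketch {
  def ddmin[T](input: Seq[T], failing: Seq[T] => Boolean, granularity: Int = 2): Seq[T] = {
    if (input.size <= 1) return input
    val n = granularity.min(input.size)
    val chunks = splitInto(input, n)

    // 1. If some single chunk alone still reproduces the fault, recurse into it.
    chunks.find(failing) match {
      case Some(chunk) => return ddmin(chunk, failing, 2)
      case None        =>
    }

    // 2. If removing one chunk still reproduces the fault, keep the complement.
    chunks.indices
      .map(i => chunks.patch(i, Nil, 1).flatten)
      .find(failing) match {
      case Some(complement) => return ddmin(complement, failing, math.max(n - 1, 2))
      case None             =>
    }

    // 3. Otherwise refine the partition until chunks are single records.
    if (n < input.size) ddmin(input, failing, math.min(input.size, 2 * n))
    else input
  }

  // Split the input into at most n roughly equal-sized chunks.
  private def splitInto[T](xs: Seq[T], n: Int): Seq[Seq[T]] = {
    val size = math.ceil(xs.size.toDouble / n).toInt
    xs.grouped(size).toSeq
  }
}
```

Hypothetical usage: isolating the one record that makes an aggregate exceed a threshold.

```scala
// Oracle: the "fault" is the sum exceeding 50000.
val faulty = DeltaDebugSketch.ddmin(Seq(1, 2, 99999, 3), (s: Seq[Int]) => s.sum > 50000)
// faulty == Seq(99999)
```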