Interactive and Automated Debugging for Big Data Analytics

Title: Interactive and Automated Debugging for Big Data Analytics
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Gulzar, Muhammad Ali
Conference Name: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings
Publisher: ACM
ISBN Number: 978-1-4503-5663-3
Keywords: and data cleaning, automated debugging, Big Data, compositionality, data provenance, data-intensive scalable computing (DISC), debugging and testing, fault localization, Metrics, pubcrawl, resilience, Resiliency, Scalability, scalable verification, test minimization
Abstract

An abundance of data in many disciplines of science, engineering, national security, health care, and business has led to the emerging field of Big Data Analytics, which runs in cloud computing environments. To process massive quantities of data in the cloud, developers leverage Data-Intensive Scalable Computing (DISC) systems such as Google's MapReduce, Hadoop, and Spark. Currently, developers do not have easy means to debug DISC applications. The use of cloud computing makes application development feel more like batch jobs, and the nature of debugging is therefore post-mortem. Developers of big data applications write code that implements a data processing pipeline and test it on their local workstation with a small data sample downloaded from a TB-scale data warehouse. They cross their fingers and hope that the program works in the expensive production cloud. When a job fails or they get a suspicious result, data scientists spend hours guessing at the source of the error, digging through post-mortem logs. In such cases, the data scientists may want to pinpoint the root cause of errors by investigating a subset of corresponding input records. The vision of my work is to provide interactive, real-time, and automated debugging services for big data processing programs in modern DISC systems with minimal performance impact. My work investigates the following research questions in the context of big data analytics: (1) What are the necessary debugging primitives for interactive big data processing? (2) What scalable fault localization algorithms are needed to help the user localize and characterize the root causes of errors? (3) How can we improve testing efficiency during iterative development of DISC applications by reasoning about the semantics of dataflow operators and the user-defined functions used inside them in tandem?
To answer these questions, we synthesize and innovate ideas from software engineering, big data systems, and program analysis, and coordinate innovations across the software stack from the user-facing API all the way down to the systems infrastructure.
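The fault-localization and test-minimization goals described above, isolating a small subset of input records that still reproduces a failure, are commonly approached with delta debugging. The sketch below is illustrative only and is not the paper's actual tooling; the function names and the record-based failure predicate are assumptions for the example.

```python
def ddmin(records, fails):
    """Delta-debugging-style minimization: shrink `records` to a small
    subset on which the failure predicate `fails` still holds.

    `records` is a list of input records; `fails(subset)` returns True
    when running the pipeline on `subset` reproduces the failure.
    """
    n = 2  # current partition granularity
    while len(records) >= 2:
        chunk = len(records) // n
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Try each subset first, then its complement at finer granularity.
            if fails(subset):
                records, n, reduced = subset, 2, True
                break
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if n > 2 and fails(complement):
                records, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(records):
                break  # already at single-record granularity
            n = min(n * 2, len(records))
    return records

# Hypothetical usage: a record containing "bad" makes the job fail.
minimal = ddmin(["a", "bad", "c", "d"], lambda rs: "bad" in rs)
# `minimal` is now a failure-inducing subset much smaller than the input.
```

In a DISC setting the predicate would re-run the (expensive) pipeline on each candidate subset, which is why the systems discussed here aim to prune candidates using data provenance rather than blind re-execution.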

URL: https://dl.acm.org/citation.cfm?doid=3183440.3190334
DOI: 10.1145/3183440.3190334
Citation Key: gulzar_interactive_2018