Visible to the public CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing

TitleCoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
Publication TypeConference Paper
Year of Publication2017
AuthorsVora, Keval, Tian, Chen, Gupta, Rajiv, Hu, Ziang
Conference NameProceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
PublisherACM
Conference LocationNew York, NY, USA
ISBN Number978-1-4503-4465-4
Keywordscomposability, confinement, Cyber-physical systems, distributed processing, Fault tolerance, graph processing, privacy, pubcrawl, resilience, Resiliency
AbstractExisting distributed asynchronous graph processing systems employ checkpointing to capture globally consistent snapshots and rollback all machines to most recent checkpoint to recover from machine failures. In this paper we argue that recovery in distributed asynchronous graph processing does not require the entire execution state to be rolled back to a globally consistent state due to the relaxed asynchronous execution semantics. We define the properties required in the recovered state for it to be usable for correct asynchronous processing and develop CoRAL, a lightweight checkpointing and recovery algorithm. First, this algorithm carries out confined recovery that only rolls back graph execution states of the failed machines to affect recovery. Second, it relies upon lightweight checkpoints that capture locally consistent snapshots with a reduced peak network bandwidth requirement. Our experiments using real-world graphs show that our technique recovers from failures and finishes processing 1.5x to 3.2x faster compared to the traditional asynchronous checkpointing and recovery mechanism when failures impact 1 to 6 machines of a 16 machine cluster. Moreover, capturing locally consistent snapshots significantly reduces intermittent high peak bandwidth usage required to save the snapshots - the average reduction in 99th percentile bandwidth ranges from 22% to 51% while 1 to 6 snapshot replicas are being maintained.
URLhttp://doi.acm.org/10.1145/3037697.3037747
DOI10.1145/3037697.3037747
Citation Keyvora_coral:_2017