Increased Fault-Tolerance and Real-Time Performance Resiliency for Stream Processing Workloads through Redundancy

Title: Increased Fault-Tolerance and Real-Time Performance Resiliency for Stream Processing Workloads through Redundancy
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Tran, Geoffrey Phi; Walters, John Paul; Crago, Stephen
Conference Name: 2019 IEEE International Conference on Services Computing (SCC)
Date Published: July
ISBN Number: 978-1-7281-2720-0
Keywords: aggregate performance metrics, business analytics, communication faults, data analytics, Embedded systems, failed communication, Fault tolerance, fault tolerant computing, fine-grained acknowledgment schemes, fine-grained acknowledgment tracking scheme, Garbage Collection, multiprocessing systems, performance faults, Processor scheduling, pubcrawl, real time, Redundancy, resilience, Resiliency, software fault tolerance, stream processing, stream processing workloads, tail latency, telemetry

Data analytics and telemetry have become paramount to monitoring and maintaining quality of service, as well as to business analytics. Stream processing, a model in which a network of operators receives and processes continuously arriving discrete elements, is well suited to these needs. Current and previous studies and frameworks have focused on continuity of operations and aggregate performance metrics. However, real-time performance and tail latency are also important, and timing errors caused by either performance faults or failed communication faults affect real-time performance more drastically than they affect aggregate metrics. In this paper, we introduce redundancy in the stream data to improve real-time performance and resiliency to timing errors caused by either type of fault. We also address limitations in previous solutions by using a fine-grained acknowledgment tracking scheme, which both increases the effectiveness of redundancy against performance faults and makes it effective against failed communication faults. Our results show that fine-grained acknowledgment schemes can improve the tail and mean latencies by approximately 30%. We also show that these schemes can improve resiliency to performance faults compared with existing work: our improvements result in 47.4% to 92.9% fewer missed deadlines, compared with 17.3% to 50.6% for comparable topologies and redundancy levels in the state of the art. Finally, we show that redundancies of 25% to 100% can reduce the number of data elements that miss their deadline constraints by 0.76% to 14.04% for applications with high fan-out and by 7.45% up to 50% for applications with no fan-out.

Citation Key: tran_increased_2019