Increased Fault-Tolerance and Real-Time Performance Resiliency for Stream Processing Workloads through Redundancy

Submitted by grigby1 on Wed, 02/26/2020 - 4:48pm

Title	Increased Fault-Tolerance and Real-Time Performance Resiliency for Stream Processing Workloads through Redundancy
Publication Type	Conference Paper
Year of Publication	2019
Authors	Tran, Geoffrey Phi, Walters, John Paul, Crago, Stephen
Conference Name	2019 IEEE International Conference on Services Computing (SCC)
Date Published	July
ISBN Number	978-1-7281-2720-0
Keywords	aggregate performance metrics, business analytics, communication faults, data analytics, Embedded systems, failed communication, Fault tolerance, fault tolerant computing, fine-grained acknowledgment schemes, fine-grained acknowledgment tracking scheme, Garbage Collection, multiprocessing systems, performance faults, Processor scheduling, pubcrawl, real time, Redundancy, resilience, Resiliency, software fault tolerance, stream processing, stream processing workloads, tail latency, telemetry
Abstract	Data analytics and telemetry have become paramount to monitoring and maintaining quality of service, in addition to business analytics. Stream processing, a model in which a network of operators receives and processes continuously arriving discrete elements, is well suited to these needs. Current and previous studies and frameworks have focused on continuity of operations and aggregate performance metrics. However, real-time performance and tail latency are also important. Timing errors caused by either performance faults or failed communication faults affect real-time performance more drastically than they affect aggregate metrics. In this paper, we introduce redundancy in the stream data to improve real-time performance and resiliency to timing errors caused by either kind of fault. We also address limitations in previous solutions by using a fine-grained acknowledgment tracking scheme that both increases the effectiveness of resiliency to performance faults and enables effectiveness against failed communication faults. Our results show that fine-grained acknowledgment schemes can improve the tail and mean latencies by approximately 30%. We also show that these schemes can improve resiliency to performance faults compared to existing work. Our improvements result in 47.4% to 92.9% fewer missed deadlines, compared to 17.3% to 50.6% for comparable topologies and redundancy levels in the state of the art. Finally, we show that redundancies of 25% to 100% can reduce the number of data elements that miss their deadline constraints by 0.76% to 14.04% for applications with high fan-out and by 7.45% up to 50% for applications with no fan-out.
URL	https://ieeexplore.ieee.org/document/8814158
DOI	10.1109/SCC.2019.00021
Citation Key	tran_increased_2019
