Increased Fault-Tolerance and Real-Time Performance Resiliency for Stream Processing Workloads through Redundancy
Title | Increased Fault-Tolerance and Real-Time Performance Resiliency for Stream Processing Workloads through Redundancy |
Publication Type | Conference Paper |
Year of Publication | 2019 |
Authors | Tran, Geoffrey Phi, Walters, John Paul, Crago, Stephen |
Conference Name | 2019 IEEE International Conference on Services Computing (SCC) |
Date Published | July |
ISBN Number | 978-1-7281-2720-0 |
Keywords | aggregate performance metrics, business analytics, communication faults, data analytics, Embedded systems, failed communication, Fault tolerance, fault tolerant computing, fine-grained acknowledgment schemes, fine-grained acknowledgment tracking scheme, Garbage Collection, multiprocessing systems, performance faults, Processor scheduling, pubcrawl, real time, Redundancy, resilience, Resiliency, software fault tolerance, stream processing, stream processing workloads, tail latency, telemetry |
Abstract | Data analytics and telemetry have become paramount to monitoring and maintaining quality-of-service in addition to business analytics. Stream processing, a model where a network of operators receives and processes continuously arriving discrete elements, is well suited for these needs. Current and previous studies and frameworks have focused on continuity of operations and aggregate performance metrics. However, real-time performance and tail latency are also important. Timing errors caused by either performance or failed communication faults also affect real-time performance more drastically than aggregate metrics. In this paper, we introduce redundancy in the stream data to improve the real-time performance and resiliency to timing errors caused by either performance or failed communication faults. We also address limitations in previous solutions using a fine-grained acknowledgment tracking scheme to both increase the effectiveness for resiliency to performance faults and enable effectiveness for failed communication faults. Our results show that fine-grained acknowledgment schemes can improve the tail and mean latencies by approximately 30%. We also show that these schemes can improve resiliency to performance faults compared to existing work. Our improvements result in 47.4% to 92.9% fewer missed deadlines compared to 17.3% to 50.6% for comparable topologies and redundancy levels in the state of the art. Finally, we show that redundancies of 25% to 100% can reduce the number of data elements that miss their deadline constraints by 0.76% to 14.04% for applications with high fan-out and by 7.45% up to 50% for applications with no fan-out. |
URL | https://ieeexplore.ieee.org/document/8814158 |
DOI | 10.1109/SCC.2019.00021 |
Citation Key | tran_increased_2019 |