Biblio

Found 102 results

Filters: Keyword is Fault tolerance
2009
Chen, Jing, Du, Ruiying.  2009.  Fault Tolerance and Security in Forwarding Packets Using Game Theory. 2009 International Conference on Multimedia Information Networking and Security. 2:534–537.
In self-organized wireless networks, such as ad hoc, sensor, or mesh networks, nodes are independent individuals with different interests. Selfish nodes therefore refuse to forward packets for other nodes in order to save energy, which causes network faults. At the same time, some nodes may be malicious and aim to damage the network. In this paper, we analyze cooperation stimulation and security in self-organized wireless networks under a game-theoretic framework. We first analyze a four-node wireless network in which nodes share the channel by relaying for others during their idle periods; to help the other nodes, each node has to use a part of its available channel capacity. Then, the fault tolerance and security problem is modeled as a non-cooperative game in which each player maximizes its own utility function. The goal of the game is to maximize the utility function under the given conditions in order to obtain better network efficiency. Finally, to characterize the efficiency of Nash equilibria, we analyze the so-called price of anarchy, defined as the ratio between the objective function at the worst Nash equilibrium and the optimal objective function. Our results show that the players get the largest payoff if they follow a cooperative strategy.
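The price-of-anarchy ratio described in this abstract can be illustrated with a minimal sketch. The two-node game, the action names, and the payoff values below are hypothetical and are not taken from the paper; the objective function is assumed to be the sum of the players' utilities.

```python
from itertools import product

# Hypothetical two-node forwarding game (payoffs are illustrative only).
# Each node either forwards ("F") the other's packets or drops ("D") them.
# PAYOFFS[(a0, a1)] = (utility of node 0, utility of node 1)
ACTIONS = ("F", "D")
PAYOFFS = {
    ("F", "F"): (3, 3),
    ("F", "D"): (0, 4),
    ("D", "F"): (4, 0),
    ("D", "D"): (1, 1),
}


def is_nash(profile):
    """Return True if no node gains by unilaterally changing its action."""
    for player in (0, 1):
        for alt in ACTIONS:
            deviated = list(profile)
            deviated[player] = alt
            if PAYOFFS[tuple(deviated)][player] > PAYOFFS[profile][player]:
                return False
    return True


def welfare(profile):
    """Assumed objective function: the sum of both nodes' utilities."""
    return sum(PAYOFFS[profile])


def price_of_anarchy():
    """Return the pure Nash equilibria and the worst-equilibrium/optimum ratio."""
    profiles = list(product(ACTIONS, repeat=2))
    equilibria = [p for p in profiles if is_nash(p)]
    worst_equilibrium = min(welfare(p) for p in equilibria)
    optimum = max(welfare(p) for p in profiles)
    return equilibria, worst_equilibrium / optimum


if __name__ == "__main__":
    print(price_of_anarchy())
```

In this toy one-shot game the only pure equilibrium is mutual dropping, whose welfare is one third of the cooperative optimum, which is the kind of gap that cooperation-stimulation mechanisms aim to close.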
2014
Turguner, C..  2014.  Secure fault tolerance mechanism of wireless Ad-Hoc networks with mobile agents. Signal Processing and Communications Applications Conference (SIU), 2014 22nd. :1620-1623.

Mobile ad hoc networks are dynamic, self-organizing wireless networks in which many mobile nodes are loosely connected to each other. Compared with traditional networks, they suffer failures that prevent the system from working properly. In addition, they must cope with many security issues, such as unauthorized attempts, security threats, and reliability. Using mobile agents in ad hoc networks that have a low level of fault tolerance provides fault masking that users never notice. Mobile agents that migrate among nodes, choose alternative paths autonomously, and have a high level of fault tolerance make networks with low bandwidth and a high failure ratio more reliable. In this paper, we describe the fault tolerance characteristics of mobile agents and existing fault tolerance methods based on mobile agents. For ad hoc networks that need security precautions in addition to fault tolerance, we also present a new model: the Secure Mobile Agent Based Fault Tolerance Model.

Wenbing Zhao.  2014.  Application-Aware Byzantine Fault Tolerance. Dependable, Autonomic and Secure Computing (DASC), 2014 IEEE 12th International Conference on. :45-50.

Byzantine fault tolerance has been intensively studied over the past decade as a way to enhance the intrusion resilience of computer systems. However, state-machine-based Byzantine fault tolerance algorithms require deterministic application processing and sequential execution of totally ordered requests. One way of increasing the practicality of Byzantine fault tolerance is to exploit the application semantics, which we refer to as application-aware Byzantine fault tolerance. Application-aware Byzantine fault tolerance makes it possible to facilitate concurrent processing of requests, to minimize the use of Byzantine agreement, and to identify and control replica nondeterminism. In this paper, we provide an overview of recent works on application-aware Byzantine fault tolerance techniques. We elaborate the need for exploiting application semantics for Byzantine fault tolerance and the benefits of doing so, provide a classification of various approaches to application-aware Byzantine fault tolerance, and outline the mechanisms used in achieving application-aware Byzantine fault tolerance according to our classification.

Kirsch, J., Goose, S., Amir, Y., Dong Wei, Skare, P..  2014.  Survivable SCADA Via Intrusion-Tolerant Replication. Smart Grid, IEEE Transactions on. 5:60-70.

Providers of critical infrastructure services strive to maintain the high availability of their SCADA systems. This paper reports on our experience designing, architecting, and evaluating the first survivable SCADA system, one that is able to ensure correct behavior with minimal performance degradation even during cyber attacks that compromise part of the system. We describe the challenges we faced when integrating modern intrusion-tolerant protocols with a conventional SCADA architecture and present the techniques we developed to overcome these challenges. The results illustrate that our survivable SCADA system not only functions correctly in the face of a cyber attack, but that it also processes in excess of 20 000 messages per second with a latency of less than 30 ms, making it suitable for even large-scale deployments managing thousands of remote terminal units.

Liu, J.N.K., Yanxing Hu, You, J.J., Yulin He.  2014.  An advancing investigation on reduct and consistency for decision tables in Variable Precision Rough Set models. Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International Conference on. :1496-1503.

The Variable Precision Rough Set (VPRS) model is one of the most important extensions of the Classical Rough Set (RS) theory. It employs a majority inclusion relation mechanism in order to make the Classical RS model more fault tolerant, and therefore the generalization of the model is improved. This paper can be viewed as an extension of previous investigations on the attribute reduction problem in the VPRS model. In our investigation, we illustrate with examples that the previously proposed reduct definitions may spoil the hidden classification ability of a knowledge system by ignoring certain essential attributes in some circumstances. Consequently, by proposing a new β-consistent notion, we analyze the relationship between the structures of a Decision Table (DT) and different definitions of reduct in the VPRS model. Then we give a new notion of β-complement reduct that can avoid the defects of reduct notions defined in previous literature. We also supply a method to obtain the β-complement reduct using a decision table splitting algorithm, and finally demonstrate the feasibility of our approach with sample instances.

Hua Chai, Wenbing Zhao.  2014.  Towards trustworthy complex event processing. Software Engineering and Service Science (ICSESS), 2014 5th IEEE International Conference on. :758-761.

Complex event processing has become an important technology for big data and intelligent computing because it facilitates the creation of actionable, situational knowledge from a potentially large number of events in soft real time. Complex event processing can be instrumental for many mission-critical applications, such as business intelligence, algorithmic stock trading, and intrusion detection. Hence, the servers that carry out complex event processing must be made trustworthy. In this paper, we present a threat analysis on complex event processing systems and describe a set of mechanisms that can be used to control various threats. By exploiting the application semantics for typical event processing operations, we are able to design lightweight mechanisms that incur minimum runtime overhead appropriate for soft real-time computing.

Di Benedetto, M.D., D'Innocenzo, A., Smarra, F..  2014.  Fault-tolerant control of a wireless HVAC control system. Communications, Control and Signal Processing (ISCCSP), 2014 6th International Symposium on. :235-238.

In this paper we address the problem of designing a fault tolerant control scheme for an HVAC control system where sensing and actuation data are exchanged with a centralized controller via a wireless sensors and actuators network where the communication nodes are subject to permanent failures and malicious intrusions.

Rui Zhou, Rong Min, Qi Yu, Chanjuan Li, Yong Sheng, Qingguo Zhou, Xuan Wang, Kuan-Ching Li.  2014.  Formal Verification of Fault-Tolerant and Recovery Mechanisms for Safe Node Sequence Protocol. Advanced Information Networking and Applications (AINA), 2014 IEEE 28th International Conference on. :813-820.

Fault tolerance has a huge impact on embedded safety-critical systems. The Safe Node Sequence Protocol (SNSP) is a technology designed to contribute to that impact. In this paper, we present a mechanism for fault tolerance and recovery based on SNSP to strengthen system robustness, and we use it to analyze and verify the correctness of a fault-tolerant prototype system. To verify correctness under more than thirty failure modes, we partitioned the complete protocol state machine into several subsystems and then injected the corresponding fault classes into dedicated independent models. Experiments demonstrate that this method effectively reduces the size of the overall state space, and verification results indicate that the protocol is able to recover from the faults in the fault model in a fault-tolerant system and continue to operate as errors occur.

Aiash, M., Mapp, G., Gemikonakli, O..  2014.  Secure Live Virtual Machines Migration: Issues and Solutions. Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on. :160-165.

In recent years, there has been a huge trend towards running network-intensive applications, such as Internet servers and Cloud-based services, in virtual environments, where multiple virtual machines (VMs) running on the same machine share the machine's physical and network resources. In such an environment, the virtual machine monitor (VMM) virtualizes the machine's resources in terms of CPU, memory, storage, network and I/O devices to allow multiple operating systems running in different VMs to operate and access the network concurrently. A key feature of virtualization is live migration (LM), which allows the transfer of a virtual machine from one physical server to another without interrupting the services running in the virtual machine. Live migration facilitates workload balancing, fault tolerance, online system maintenance, and consolidation of virtual machines. However, live migration is still in an early stage of implementation and its security is yet to be evaluated. The security concern of live migration is a major factor for its adoption by the IT industry. Therefore, this paper uses the X.805 security standard to investigate attacks on live virtual machine migration. The analysis highlights the main sources of threats and suggests approaches to tackle them. The paper also surveys and compares different proposals in the literature to secure live migration.

2015
Ansari, M. R., Yu, S., Yu, Q..  2015.  "IntelliCAN: Attack-resilient Controller Area Network (CAN) for secure automobiles". 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS). :233–236.

Controller Area Network (CAN) is the main bus network that connects electronic control units in automobiles. Although CAN protocols have been revised to improve vehicle safety, the security weaknesses of CAN have not been fully addressed. Security threats on automobiles might come from external wireless communication or from internal malicious CAN nodes mounted on the CAN bus. Despite the various threat sources, the security weakness of CAN is the root of the security problems. Due to the limited computation power and storage capacity on each CAN node, there is a lack of hardware-efficient protection methods for the CAN system that do not lose compatibility with CAN protocols. To save cost and maintain compatibility, we propose to exploit the built-in CAN fault confinement mechanism to detect masquerade attacks originating from malicious CAN devices on the CAN bus. Simulation results show that our method achieves an attack misdetection rate on the order of 10^-5 and reduces the encryption latency by up to 68% over the complete frame encryption method.

Yang, J.-S., Chang, J.-M., Pai, K.-J., Chan, H.-C..  2015.  Parallel Construction of Independent Spanning Trees on Enhanced Hypercubes. Parallel and Distributed Systems, IEEE Transactions on. PP:1-1.

The use of multiple independent spanning trees (ISTs) for data broadcasting in networks provides a number of advantages, including the increase of fault-tolerance, bandwidth and security. Thus, the designs of multiple ISTs on several classes of networks have been widely investigated. In this paper, we give an algorithm to construct ISTs on enhanced hypercubes Q_{n,k}, which contain folded hypercubes as a subclass. Moreover, we show that these ISTs are near optimal for heights and path lengths. Let D(Q_{n,k}) denote the diameter of Q_{n,k}. If n - k is odd or n - k ∈ {2, n}, we show that all the heights of ISTs are equal to D(Q_{n,k}) + 1, and thus are optimal. Otherwise, we show that each path from a node to the root in a spanning tree has length at most D(Q_{n,k}) + 2. In particular, no more than 2.15 percent of nodes have the maximum path length. As a by-product, we improve the upper bound of wide diameter (respectively, fault diameter) of Q_{n,k} from these path lengths.

Yilin Mo, Sinopoli, B..  2015.  Secure Estimation in the Presence of Integrity Attacks. Automatic Control, IEEE Transactions on. 60:1145-1151.

We consider the estimation of a scalar state based on m measurements that can be potentially manipulated by an adversary. The attacker is assumed to have full knowledge about the true value of the state to be estimated and about the value of all the measurements. However, the attacker has limited resources and can only manipulate up to l of the m measurements. The problem is formulated as a minimax optimization, where one seeks to construct an optimal estimator that minimizes the “worst-case” expected cost against all possible manipulations by the attacker. We show that if the attacker can manipulate at least half the measurements (l ≥ m/2), then the optimal worst-case estimator should ignore all measurements and be based solely on the a-priori information. We provide the explicit form of the optimal estimator when the attacker can manipulate less than half the measurements (l < m/2), which is based on (m choose 2l) local estimators. We further prove that such an estimator can be reduced into simpler forms for two special cases, i.e., either the estimator is symmetric and monotone or m = 2l + 1. Finally we apply the proposed methodology in the case of Gaussian measurements.
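The role of the half-of-the-measurements threshold can be illustrated with a small sketch. It uses the sample median as a stand-in robust estimator, which is an assumption made for illustration and is not the optimal estimator derived in the paper; the measurement values and the `attack` helper are hypothetical.

```python
import statistics


def median_estimate(measurements):
    """Stand-in robust estimator: the sample median."""
    return statistics.median(measurements)


def attack(honest, l, value):
    """Hypothetical adversary: overwrite the first l measurements with `value`."""
    return [value] * l + list(honest[l:])


if __name__ == "__main__":
    honest = [9.8, 10.1, 10.0, 9.9, 10.2]  # m = 5 readings of a state near 10

    # l = 2 < m/2: the median stays inside the range of honest readings.
    print(median_estimate(attack(honest, 2, 1e6)))

    # l = 3 >= m/2: the adversary fully controls the median.
    print(median_estimate(attack(honest, 3, 1e6)))
```

With fewer than half the readings corrupted, the estimate remains bounded by honest values; once half or more are corrupted, the adversary can place the estimate anywhere, which is consistent with the abstract's statement that the measurements then carry no usable information in the worst case.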

2016
Mendizabal, Odorico M., Dotti, Fernando Luís, Pedone, Fernando.  2016.  Analysis of Checkpointing Overhead in Parallel State Machine Replication. Proceedings of the 31st Annual ACM Symposium on Applied Computing. :534–537.

State machine replication (SMR) is a well-established technique for building fault-tolerant systems. In part, this is explained by the simplicity of the approach and its strong consistency guarantees. Recently, several proposals have suggested parallelizing the execution of state machine replicas to achieve high throughput. Concurrent execution of commands has many implications, including the recovery of replicas from failures. Conventional checkpointing techniques, for example, must be revisited in parallelized models. In this paper, we review parallel variations of state machine replication and discuss how checkpointing procedures apply to these models. Moreover, we evaluate the impact caused by checkpointing techniques on recovery through simulations.

Regainia, L., Salva, S., Ecuhcurs, C..  2016.  A classification methodology for security patterns to help fix software weaknesses. 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA). :1–8.

Security patterns are generic solutions that can be applied from the early stages of software life to overcome recurrent security weaknesses. Their generic nature and growing number make their choice difficult, even for experts in system design. To help them choose patterns, this paper proposes a semi-automatic classification methodology and the classification itself, which exposes relationships among software weaknesses, security principles and security patterns. It expresses which patterns remove a given weakness with respect to the security principles that have to be addressed to fix the weakness. The methodology is based on seven steps, which anatomize patterns and weaknesses into a set of more precise sub-properties that are associated through a hierarchical organization of security principles. These steps provide the detailed justifications of the resulting classification and allow its upgrade. Without loss of generality, this classification has been established for Web applications and covers 185 software weaknesses, 26 security patterns and 66 security principles. Research supported by the industrial chair on Digital Confidence (http://confiance-numerique.clermont-universite.fr/index-en.html).

Levy, Scott, Ferreira, Kurt B..  2016.  An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart. Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale. :35–42.

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases) the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases when the failure distribution has relatively little impact.
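As a rough illustration of choosing a checkpoint interval under the exponential-failure assumption mentioned in this abstract, the sketch below uses Young's classical first-order approximation, interval ≈ sqrt(2 × checkpoint cost × mean time between failures). This formula is a well-known stand-in chosen for illustration; the abstract does not state which model the authors use, and the function name and sample numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed known).
    mtbf_s: mean time between failures, in seconds (assumed exponential failures).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical system: 5-minute checkpoints.
    for mtbf_hours in (24, 6, 1):
        interval = young_interval(300.0, mtbf_hours * 3600.0)
        print(f"MTBF {mtbf_hours:>2} h -> checkpoint every {interval / 60:.1f} min")
```

The interval shrinks with the square root of the failure rate, so more frequent failures call for more frequent checkpoints.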

Peres, Bruna Soares, Souza, Otavio Augusto de Oliveira, Santos, Bruno Pereira, Junior, Edson Roteia Araujo, Goussevskaia, Olga, Vieira, Marcos Augusto Menezes, Vieira, Luiz Filipe Menezes, Loureiro, Antonio Alfredo Ferreira.  2016.  Matrix: Multihop Address Allocation and Dynamic Any-to-Any Routing for 6LoWPAN. Proceedings of the 19th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems. :302–309.

Standard routing protocols for IPv6 over Low power Wireless Personal Area Networks (6LoWPAN) are mainly designed for data collection applications and work by establishing a tree-based network topology, which enables packets to be sent upwards, from the leaves to the root, adapting to dynamics of low-power communication links. The routing tables in such unidirectional networks are very simple and small since each node just needs to maintain the address of its parent in the tree, providing the best-quality route at every moment. In this work, we propose Matrix, a platform-independent routing protocol that utilizes the existing tree structure of the network to enable reliable and efficient any-to-any data traffic. Matrix uses hierarchical IPv6 address assignment in order to optimize routing table size, while preserving bidirectional routing. Moreover, it uses a local broadcast mechanism to forward messages to the right subtree when persistent node or link failures occur. We implemented Matrix on TinyOS and evaluated its performance both analytically and through simulations on TOSSIM. Our results show that the proposed protocol is superior to available protocols for 6LoWPAN, when it comes to any-to-any data communication, in terms of reliability, message efficiency, and memory footprint.

Azaiez, Meriem, Chainbi, Walid.  2016.  A Multi-agent System Architecture for Self-Healing Cloud Infrastructure. Proceedings of the International Conference on Internet of Things and Cloud Computing. :7:1–7:6.

The popularity of Cloud computing has considerably increased during the last years. The increase of Cloud users and their interactions with the Cloud infrastructure raise the risk of resources faults. Such a problem can lead to a bad reputation of the Cloud environment which slows down the evolution of this technology. To address this issue, the dynamic and the complex architecture of the Cloud should be taken into account. Indeed, this architecture requires that resources protection and healing must be transparent and without external intervention. Unlike previous work, we suggest integrating the fundamental aspects of autonomic computing in the Cloud to deal with the self-healing of Cloud resources. Starting from the high degree of match between autonomic computing systems and multiagent systems, we propose to take advantage from the autonomous behaviour of agent technology to create an intelligent Cloud that supports autonomic aspects. Our proposed solution is a multi-agent system which interacts with the Cloud infrastructure to analyze the resources state and execute Checkpoint/Replication strategy or migration technique to solve the problem of failed resources.

Ghaffari, Mohsen, Parter, Merav.  2016.  Near-Optimal Distributed Algorithms for Fault-Tolerant Tree Structures. Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures. :387–396.

Tree structures such as breadth-first search (BFS) trees and minimum spanning trees (MST) are among the most fundamental graph structures in distributed network algorithms. However, by definition, these structures are not robust against failures and even a single edge's removal can disrupt their functionality. A well-studied concept which attempts to circumvent this issue is Fault-Tolerant Tree Structures, where the tree gets augmented with additional edges from the network so that the functionality of the structure is maintained even when an edge fails. These structures, or other equivalent formulations, have been studied extensively from a centralized viewpoint. However, despite the fact that the main motivations come from distributed networks, their distributed construction has not been addressed before. In this paper, we present distributed algorithms for constructing fault-tolerant BFS and MST structures. The time complexities of our algorithms are nearly optimal in the following strong sense: they almost match even the lower bounds of constructing (basic) BFS and MST trees.

Golab, Wojciech, Ramaraju, Aditya.  2016.  Recoverable Mutual Exclusion: [Extended Abstract]. Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing. :65–74.

Mutex locks have traditionally been the most common mechanism for protecting shared data structures in parallel programs. However, the robustness of such locks against process failures has not been studied thoroughly. Most (user-level) mutex algorithms are designed around the assumption that processes are reliable, meaning that a process may not fail while executing the lock acquisition and release code, or while inside the critical section. If such a failure does occur, then the liveness properties of a conventional mutex lock may cease to hold until the application or operating system intervenes by cleaning up the internal structure of the lock. For example, a process that is attempting to acquire an otherwise starvation-free mutex may be blocked forever waiting for a failed process to release the critical section. Adding to the difficulty, if the failed process recovers and attempts to acquire the same mutex again without appropriate cleanup, then the mutex may become corrupted to the point where it loses safety, notably the mutual exclusion property. We address this challenge by formalizing the problem of recoverable mutual exclusion, and proposing several solutions that vary both in their assumptions regarding hardware support for synchronization, and in their time complexity. Compared to known solutions, our algorithms are more robust as they do not restrict where or when a process may crash, and provide stricter guarantees in terms of time complexity, which we define in terms of remote memory references.

Beevi, L. S., Merlin, G., MoganaPriya, G..  2016.  Security and privacy for smart grid using scalable key management. 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). :4716–4721.

This paper focuses on the issues of secure key management for the smart grid. Present key management schemes do not yield sufficient security for deployment in the smart grid. This paper proposes a novel key management scheme that merges an elliptic curve public key technique with a symmetric key technique. The symmetric key scheme is based on the Needham-Schroeder authentication protocol. Well-known threats such as replay attacks and man-in-the-middle attacks can be successfully eliminated in the smart grid. The benefits of the proposed system are fault tolerance, accessibility, strong security, scalability, and efficiency.

Zheng, J., Okamura, H., Dohi, T..  2016.  Performance Evaluation of VM-based Intrusion Tolerant Systems with Poisson Arrivals. 2016 Fourth International Symposium on Computing and Networking (CANDAR). :181–187.

Computer security has become an increasingly important topic in the computer and communication industry, since it is essential for supporting critical business processes and protecting personal and sensitive information. Computer security aims to preserve the security attributes (confidentiality, integrity and availability) of computer systems, which face threats such as denial-of-service (DoS), viruses and intrusion. To ensure high computer security, intrusion tolerance techniques based on fault-tolerant schemes have been widely applied. This paper presents a quantitative performance evaluation of a virtual machine (VM) based intrusion tolerant system. Concretely, two security measures are derived: MTTSF (mean time to security failure) and the effective traffic intensity. The mathematical analysis uses Laplace-Stieltjes transforms, following the analysis of the M/G/1 queueing system.

Duan, S., Li, Y., Levitt, K..  2016.  Cost sensitive moving target consensus. 2016 IEEE 15th International Symposium on Network Computing and Applications (NCA). :272–281.

Consensus is a fundamental approach to implementing fault-tolerant services through replication. It is well known that there exists a tradeoff between the cost and the resilience. For instance, Crash Fault Tolerant (CFT) protocols have a low cost but can only handle crash failures while Byzantine Fault Tolerant (BFT) protocols handle arbitrary failures but have a higher cost. Hybrid protocols enjoy the benefits of both high performance without failures and high resiliency under failures by switching among different subprotocols. However, it is challenging to determine which subprotocols should be used. We propose a moving target approach to switch among protocols according to the existing system and network vulnerability. At the core of our approach is a formalized cost model that evaluates the vulnerability and performance of consensus protocols based on real-time Intrusion Detection System (IDS) signals. Based on the evaluation results, we demonstrate that a safe, cheap, and unpredictable protocol is always used and a high IDS error rate can be tolerated.
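The cost/resilience tradeoff between CFT and BFT protocols mentioned in this abstract can be made concrete with the standard replica-count bounds for tolerating f faulty replicas: 2f + 1 replicas for crash faults and 3f + 1 for Byzantine faults. The sketch below only encodes these textbook bounds; the function names are hypothetical, and it does not model the paper's cost function or its protocol-switching logic.

```python
def replicas_required(f, model):
    """Minimum number of replicas needed to tolerate f faulty replicas.

    model: "CFT" (crash faults, 2f + 1) or "BFT" (Byzantine faults, 3f + 1).
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if model == "CFT":
        return 2 * f + 1
    if model == "BFT":
        return 3 * f + 1
    raise ValueError(f"unknown fault model: {model!r}")


def quorum_size(f, model):
    """Quorum size at the minimum replica count: f + 1 (CFT) or 2f + 1 (BFT)."""
    if f < 0:
        raise ValueError("f must be non-negative")
    if model == "CFT":
        return f + 1
    if model == "BFT":
        return 2 * f + 1
    raise ValueError(f"unknown fault model: {model!r}")


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, replicas_required(f, "CFT"), replicas_required(f, "BFT"))
```

For every additional fault tolerated, a BFT deployment needs one more replica than a CFT deployment, which is one concrete source of the higher cost the abstract refers to.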

2017
Vora, Keval, Tian, Chen, Gupta, Rajiv, Hu, Ziang.  2017.  CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. :223–236.
Existing distributed asynchronous graph processing systems employ checkpointing to capture globally consistent snapshots and roll back all machines to the most recent checkpoint to recover from machine failures. In this paper we argue that recovery in distributed asynchronous graph processing does not require the entire execution state to be rolled back to a globally consistent state due to the relaxed asynchronous execution semantics. We define the properties required in the recovered state for it to be usable for correct asynchronous processing and develop CoRAL, a lightweight checkpointing and recovery algorithm. First, this algorithm carries out confined recovery that only rolls back graph execution states of the failed machines to effect recovery. Second, it relies upon lightweight checkpoints that capture locally consistent snapshots with a reduced peak network bandwidth requirement. Our experiments using real-world graphs show that our technique recovers from failures and finishes processing 1.5x to 3.2x faster compared to the traditional asynchronous checkpointing and recovery mechanism when failures impact 1 to 6 machines of a 16-machine cluster. Moreover, capturing locally consistent snapshots significantly reduces intermittent high peak bandwidth usage required to save the snapshots – the average reduction in 99th percentile bandwidth ranges from 22% to 51% while 1 to 6 snapshot replicas are being maintained.
Hoepman, Jaap-Henk.  2017.  Privacy Friendly Aggregation of Smart Meter Readings, Even When Meters Crash. Proceedings of the 2nd Workshop on Cyber-Physical Security and Resilience in Smart Grids. :3–7.
A well-studied privacy problem in the area of smart grids is the question of how to aggregate the sum of a set of smart meter readings in a privacy friendly manner, i.e., in such a way that individual meter readings are not revealed to the adversary. Much less well studied is how to deal with arbitrary meter crashes during such aggregation protocols: current privacy friendly aggregation protocols cannot deal with this type of failure. Such failures do happen in practice, though. We therefore propose two privacy friendly aggregation protocols that tolerate such crash failures, up to a predefined maximum number of smart meters. The basic protocol tolerates meter crashes at the start of each aggregation round only. The full, more complex, protocol tolerates meter crashes at arbitrary moments during an aggregation round. It runs in a constant number of phases, cleverly avoiding the otherwise applicable consensus protocol lower bound.