Bibliography
Model compression is considered an effective way to reduce the implementation cost of deep neural networks (DNNs) while maintaining inference accuracy. Many recent studies have developed efficient model compression algorithms and accelerator implementations on various devices. Protecting the integrity of DNN inference against fault attacks is important for diverse deep-learning-enabled applications. However, there has been little research investigating the fault resilience of DNNs and the impact of model compression on fault tolerance. In this work, we consider faults on different data types and develop a simulation framework for understanding the fault resilience of compressed DNN models compared to uncompressed models. We perform our experiments on two common DNNs, LeNet-5 and VGG16, and evaluate their fault resilience under different types of compression. The results show that binary quantization can effectively increase the fault resilience of DNN models by 10000x for both LeNet-5 and VGG16. Finally, we propose software and hardware mitigation techniques to further increase the fault resilience of DNN models.
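As a concrete illustration of the kind of fault-injection experiment this abstract describes, the following is a minimal Python sketch that flips randomly chosen bits in an integer-quantized weight tensor. The function name, bit width, and fault rate are illustrative assumptions, not the paper's actual framework.

```python
import numpy as np

def inject_bit_flips(weights_q, bit_width=8, fault_rate=1e-4, seed=0):
    """Flip random bits in an integer-quantized weight tensor.

    weights_q : np.ndarray of uint8 (quantized weights)
    fault_rate: probability that any given stored bit is flipped
    Repeated hits on the same bit cancel out, so the realized flip
    count is approximate.
    """
    rng = np.random.default_rng(seed)
    faulty = weights_q.copy()
    total_bits = faulty.size * bit_width
    n_faults = rng.binomial(total_bits, fault_rate)
    flat = faulty.reshape(-1)
    for _ in range(n_faults):
        idx = rng.integers(flat.size)          # which weight
        bit = rng.integers(bit_width)          # which bit position
        flat[idx] ^= np.uint8(1 << bit)        # XOR flips the chosen bit
    return faulty

# Example: 8-bit weights of a hypothetical layer
w = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
w_faulty = inject_bit_flips(w, bit_width=8, fault_rate=1e-3)
print("bits flipped:", np.unpackbits(w ^ w_faulty).sum())
```

Running the corrupted weights through the model and comparing accuracy against the clean baseline, across fault rates and data types, is the essence of such a resilience study.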
Data analytics and telemetry have become paramount to monitoring and maintaining quality of service, in addition to business analytics. Stream processing, a model in which a network of operators receives and processes continuously arriving discrete elements, is well suited to these needs. Current and previous studies and frameworks have focused on continuity of operations and aggregate performance metrics. However, real-time performance and tail latency are also important, and timing errors caused by performance faults or failed communication affect real-time performance more drastically than aggregate metrics. In this paper, we introduce redundancy in the stream data to improve real-time performance and resilience to timing errors caused by performance or failed-communication faults. We also address limitations of previous solutions with a fine-grained acknowledgment tracking scheme that both increases resilience to performance faults and enables resilience to failed-communication faults. Our results show that fine-grained acknowledgment schemes can improve tail and mean latencies by approximately 30%. We also show that these schemes improve resilience to performance faults compared to existing work: our improvements result in 47.4% to 92.9% fewer missed deadlines, compared to 17.3% to 50.6% for comparable topologies and redundancy levels in the state of the art. Finally, we show that redundancies of 25% to 100% can reduce the number of data elements that miss their deadline constraints by 0.76% to 14.04% for applications with high fan-out, and by 7.45% up to 50% for applications with no fan-out.
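A minimal sketch of the fine-grained acknowledgment idea, assuming an element counts as delivered as soon as any one of its redundant copies is acknowledged; the class and method names are hypothetical, not taken from the paper.

```python
from collections import defaultdict
import time

class AckTracker:
    """Track per-copy acknowledgments for redundantly emitted stream elements.

    An element is considered delivered as soon as ANY of its redundant
    copies is acknowledged; remaining copies are then dropped. This is
    what lets redundancy mask slow (performance-faulty) paths.
    """
    def __init__(self):
        self.pending = defaultdict(set)   # element_id -> outstanding copy ids
        self.deadline = {}                # element_id -> absolute deadline

    def emit(self, element_id, copy_ids, deadline):
        self.pending[element_id] = set(copy_ids)
        self.deadline[element_id] = deadline

    def ack(self, element_id, copy_id):
        """True the first time any copy of the element is acked in time."""
        if element_id not in self.pending:
            return False                  # duplicate: element already completed
        self.pending.pop(element_id)      # first ack wins; cancel other copies
        return time.monotonic() <= self.deadline.pop(element_id)

tracker = AckTracker()
tracker.emit("e1", ["e1-a", "e1-b"], deadline=time.monotonic() + 0.05)
print(tracker.ack("e1", "e1-b"))   # True if within the 50 ms deadline
print(tracker.ack("e1", "e1-a"))   # False: duplicate of a completed element
```

Tracking acknowledgments per copy rather than per batch is what makes it possible to cancel, retry, or re-route individual stragglers before a deadline is blown.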
Emerging intelligent systems have stringent constraints, including cost and power consumption. When they are used in critical applications, resiliency becomes another key requirement. Much research into techniques for fault tolerance and dependability has been successfully applied to highly critical systems, such as those used in space, where cost is not an overriding constraint. Further, most resiliency techniques have focused on dealing with failures in the hardware and bugs in the software. The next generation of systems used in critical applications will also have to tolerate test escapes after manufacturing, soft errors and transients in the electronics, hardware bugs, hardware and software Trojans and viruses, as well as intrusions and other security attacks during operation. This paper assesses the impact of these threats on the results produced by a critical system and proposes solutions to each of them. It is argued that run-time checks at the application level are necessary to deal with errors in the results.
Atomic multicast is a communication primitive that delivers messages to multiple groups of processes according to some total order, with each group receiving the projection of the total order onto the messages addressed to it. To be scalable, atomic multicast needs to be genuine, meaning that only the destination processes of a message should participate in ordering it. In this paper we propose a novel genuine atomic multicast protocol that, in the absence of failures, takes as few as 3 message delays to deliver a message when no other messages are multicast concurrently to its destination groups, and 5 message delays in the presence of concurrency. This improves the latencies of both the fault-tolerant version of Skeen's classical multicast protocol (6 or 12 message delays, depending on concurrency) and its recent improvement by Coelho et al. (4 or 8 message delays). To achieve such low latencies, we depart from the typical way of guaranteeing fault tolerance by replicating each group with Paxos. Instead, we weave Paxos and Skeen's protocol together into a single coherent protocol, exploiting opportunities for white-box optimisations. We experimentally demonstrate that the superior theoretical characteristics of our protocol are reflected in practical performance pay-offs.
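For orientation, here is a toy single-process simulation of the propose/commit timestamp core of Skeen's protocol, which the paper above builds on and optimizes. It omits fault tolerance and the paper's Paxos weaving entirely; all names are illustrative.

```python
class Group:
    """A destination group: logical clock plus a buffer of pending messages."""
    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.pending = {}        # msg -> (timestamp, committed?)

    def propose(self, msg):
        """First message delay: each group proposes a local timestamp."""
        self.clock += 1
        self.pending[msg] = (self.clock, False)
        return self.clock

    def commit(self, msg, final_ts):
        """Second delay: the max of all proposals becomes the final timestamp."""
        self.clock = max(self.clock, final_ts)
        self.pending[msg] = (final_ts, True)

    def deliverable(self):
        """Deliver committed messages that no pending message can precede."""
        out = []
        while self.pending:
            # ties broken by message id, as in full implementations
            msg, (ts, done) = min(self.pending.items(),
                                  key=lambda kv: (kv[1][0], kv[0]))
            if not done:
                break  # an uncommitted message could still end up earlier
            out.append(msg)
            del self.pending[msg]
        return out

def multicast(msg, dest_groups):
    """Only destination groups participate: the genuineness property."""
    final_ts = max(g.propose(msg) for g in dest_groups)
    for g in dest_groups:
        g.commit(msg, final_ts)

g1, g2 = Group("g1"), Group("g2")
multicast("m1", [g1, g2])
multicast("m2", [g2])
print(g1.deliverable(), g2.deliverable())   # ['m1'] ['m1', 'm2']
```

Each propose/commit exchange costs message delays; making groups fault tolerant by running each as a Paxos replica group multiplies those delays, which is the overhead the paper's white-box weaving attacks.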
This paper presents a novel technique to quantify the operational resilience of power electronic-based components affected by High-Impact Low-Frequency (HILF) weather-related events such as high-speed winds. In this study, resilience quantification is used to investigate how promptly the system returns to the pre-disturbance state or to another stable operational state. A complexity quantification metric is used to assess system resilience. The test system is a Solid-State Transformer (SST), representing a complex, nonlinear interconnected system. Results show the effectiveness of the proposed technique for quantifying operational resilience in systems affected by weather-related disturbances.
Peer-to-peer (P2P) computing refers to the well-known technology that lets peers collaborate spontaneously as equals in a network, using appropriate information and communication systems without central server coordination. Today, interconnecting several P2P networks has become a genuine solution for increasing system reliability, fault tolerance and resource availability. However, the existence of security threats in such networks leads us to investigate the safety of users against P2P threats by studying the effects of competition between these interconnected networks. In this paper, we present an e-epidemic model to characterize worm propagation in an interconnected peer-to-peer network. We address this issue by introducing a model of network competition in which an unprotected network is willing to partially weaken its own safety in order to more severely damage a more protected network. The unprotected network can infect all peers in the competing networks if they fail to react against the passive worm propagation. Our model also evaluates the effect of immunization strategies adopted by the protected network to resist attacking networks. The launch time of immunization strategies in the protected network, the number of synapse peers connected to both networks, and other relevant parameters are also investigated in this paper.
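The abstract's e-epidemic model is not specified here; the following is a toy SIR-style sketch, under assumed contact and cure rates, of worm spread from an unprotected network into a protected one through shared (synapse) peers. All parameter names and values are hypothetical.

```python
def simulate(beta_uu=0.4, beta_up=0.25, gamma_p=0.15, synapse=0.1,
             s_u=0.99, i_u=0.01, s_p=1.0, i_p=0.0,
             dt=0.01, steps=5000):
    """Euler integration of a toy two-network worm propagation model.

    i_u: infected fraction in the unprotected network
    i_p: infected fraction in the protected network
    `synapse` scales cross-network contact via peers joined to both
    networks; `gamma_p` is the cure rate from the protected network's
    immunization strategy.
    """
    history = []
    for k in range(steps):
        new_u = beta_uu * s_u * i_u              # spread inside unprotected net
        new_p = beta_up * synapse * s_p * i_u    # spillover via synapse peers
        cured = gamma_p * i_p                    # immunization in protected net
        s_u, i_u = s_u - new_u * dt, i_u + new_u * dt
        s_p, i_p = s_p - new_p * dt, i_p + (new_p - cured) * dt
        history.append((k * dt, i_u, i_p))
    return history

peak_p = max(ip for _, _, ip in simulate())
print(f"peak infected fraction in protected network: {peak_p:.3f}")
```

Sweeping `gamma_p` and the immunization launch time (e.g., holding `gamma_p = 0` until some threshold step) reproduces the kind of sensitivity analysis the abstract describes.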
Fog computing provides computing, storage and communication resources at the edge of the network, near the physical world. End devices close to the physical world can thus enjoy interesting properties such as short delays, responsiveness, optimized communications and privacy. However, these end devices have low stability and are prone to failures. There is consequently a need for failure management protocols for IoT applications in the Fog. The design of such solutions is complex due to the specificities of the environment: (i) a dynamic infrastructure, where entities join and leave without synchronization; (ii) high heterogeneity in terms of functions, communication models, network, processing and storage capabilities; and (iii) cyber-physical interactions, which introduce non-deterministic events that depend on physical-world space and time. This paper presents a fault tolerance approach taking these three characteristics of the Fog-IoT environment into account. Fault tolerance is achieved by saving the state of the application in an uncoordinated way. When a failure is detected, notifications are propagated to limit the impact of the failure and dynamically reconfigure the application. Data stored during the state-saving process are used for recovery, taking consistency with respect to the physical world into account. The approach was validated through practical experiments on a smart home platform.
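A minimal sketch of uncoordinated state saving with physical-world consistency approximated by a freshness window; the validity window and the API are assumptions for illustration, not the paper's protocol.

```python
import time

class Checkpointer:
    """Uncoordinated state saving: each device checkpoints on its own schedule.

    On recovery, a checkpoint is only reused if it is still consistent
    with the physical world, approximated here by a freshness window:
    a stale checkpoint of, say, a room temperature is worse than none.
    """
    def __init__(self, validity_s):
        self.validity_s = validity_s
        self.checkpoints = []             # list of (timestamp, state)

    def save(self, state):
        self.checkpoints.append((time.monotonic(), state))

    def recover(self):
        """Latest checkpoint still within the validity window, else None."""
        now = time.monotonic()
        for ts, state in reversed(self.checkpoints):
            if now - ts <= self.validity_s:
                return state
        return None                       # too stale: re-sense the environment

cp = Checkpointer(validity_s=30.0)
cp.save({"lamp": "on", "target_temp": 21.0})
print(cp.recover())                       # recent enough, so state is reused
```

Because saving is uncoordinated, no global synchronization is required when devices join and leave, which matches the dynamic-infrastructure constraint above.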
As an information hub for various trades and professions in the era of big data, the cloud data center bears the responsibility of providing uninterrupted service. To cope with the impact of failures and interruptions on Quality of Service (QoS) during operation, it is important to guarantee the resilience of the cloud data center. Thus, different resilience actions are conducted over its life cycle, together constituting a resilience strategy. To measure the effect of a resilience strategy on system resilience, this paper proposes a new approach to model and evaluate resilience strategies for cloud data centers, focusing on the core of service provision: the IT architecture. A comprehensive resilience metric based on resilience loss is put forward, considering the characteristics of the cloud data center, and a mapping model between system resilience and resilience strategy is built. Then, based on a hierarchical colored generalized stochastic Petri net (HCGSPN) model depicting how the system processes service requests, simulation is conducted to evaluate the resilience strategy through the metric calculation. A case study of a company's cloud data center demonstrates the applicability and correctness of the approach.
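The abstract does not spell out its resilience-loss metric, but a common formulation in the resilience literature, which such a metric plausibly adapts, integrates the gap between the target QoS and the QoS actually delivered over the disruption window:

$$\mathrm{RL} = \int_{t_0}^{t_r} \left[ Q_{\text{target}} - Q(t) \right] \, dt$$

where $Q(t)$ is the delivered QoS at time $t$, $t_0$ is the onset of the disturbance, and $t_r$ is the time of full recovery. A smaller resilience loss $\mathrm{RL}$, i.e., a shallower and shorter degradation, indicates a more effective resilience strategy; the paper itself should be consulted for the exact definition used.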
As robotic capabilities improve and robots become more capable as team members, a better understanding of effective human-robot teaming is needed. In this paper, we investigate failures by robots in various team configurations in space EVA operations. The paper describes how we extend Work Models that Compute (WMC), a computational simulation framework, and apply it to model robot failures, interruptions, and the resolutions they require. Using these models, we investigate how different team configurations respond to a robot's failure to correctly complete the task and the overall mission. We also identify key factors that impact the teamwork metrics, for team designers to keep in mind while assembling teams and assigning taskwork to the agents. We highlight the team-performance metrics that these failures affect through the varying components of teaming and interaction involved. Finally, we discuss the implications of this work and the future work needed to investigate function allocation in human-robot teams.
Wireless sensor networks have attracted substantial research interest in recent years because of unique features such as fault tolerance and autonomous operation. Maximizing coverage under resource scarcity is a crucial problem in wireless sensor networks, and approaches that address it while maximizing network lifetime are considered prominent. Node scheduling is one such mechanism. A scheduling strategy that addresses the target coverage problem based on coverage probability and trust values was proposed in the Energy Efficient Coverage Protocol (EECP). In this paper, optimized decision rules for determining the number of active nodes are obtained using rough set theory. The results show that the proposed extension yields fewer decision rules to consider when determining node states in the network; it therefore improves network efficiency by reducing the number of packets transmitted and the associated overhead.
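To make the rough-set step concrete, this hypothetical sketch brute-forces a reduct: a minimal subset of condition attributes that preserves the node-state decision of a decision table. The table, its attributes (coverage, trust, energy), and its values are invented for illustration and are not from the paper.

```python
from itertools import combinations

# Hypothetical decision table: (coverage, trust, energy) -> node state
table = [
    (("high", "high", "high"), "active"),
    (("high", "low",  "high"), "sleep"),
    (("low",  "high", "high"), "sleep"),
    (("high", "high", "low"),  "active"),
    (("low",  "low",  "low"),  "sleep"),
]
attrs = ["coverage", "trust", "energy"]

def consistent(subset):
    """True if rows that agree on `subset` never disagree on the decision."""
    seen = {}
    for cond, decision in table:
        key = tuple(cond[i] for i in subset)
        if seen.setdefault(key, decision) != decision:
            return False
    return True

# The smallest attribute subsets preserving the decision are the reducts.
for r in range(1, len(attrs) + 1):
    reducts = [c for c in combinations(range(len(attrs)), r) if consistent(c)]
    if reducts:
        print([[attrs[i] for i in c] for c in reducts])  # [['coverage', 'trust']]
        break
```

Here the reduct shows that the energy attribute is redundant for this table, so fewer rules (over fewer attributes) suffice to decide node states, which is the kind of rule reduction the abstract credits for the efficiency gain.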
The evolution of convolutional neural networks (CNNs) into more complex forms of organization, with additional layers, larger convolutions and increasing connections, has established the state of the art in error rates for detection and classification challenges on images. Moreover, as they have evolved to a point where Gigabytes of memory are required for their operation, we have reached a stage where it becomes fundamental to understand how their inference capabilities can be impaired if data elements somehow become corrupted in memory. This paper introduces fault injection in these systems by simulating failing bit-cells in hardware memories, brought on by relaxing the assumption of 100% reliable operation. We analyze the behavior of these networks under severe fault-injection rates and apply fault mitigation strategies to improve the CNNs' resilience. For the MNIST dataset, we show that 8x less memory is required for the feature-map memory space, and that in sub-100% reliable operation, fault-injection rates up to 10^-1 can be withstood (with most-significant-bit protection) at the cost of only a 1% degradation in error probability. Furthermore, considering the offload of the feature-map memory to an embedded dynamic RAM (eDRAM) system, using technology nodes from 65 down to 28 nm, improvements in power efficiency of 73% to 80% can be obtained.
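A minimal sketch of the kind of feature-map fault injection with most-significant-bit protection described above; the bit width, shapes, and fault rates are assumptions, and the paper's actual simulator is not reproduced here.

```python
import numpy as np

def inject_faults(fmap_q, fault_rate, protect_msb=True, bit_width=8, seed=0):
    """Flip each stored bit independently with probability `fault_rate`,
    optionally sparing the MSB (bit 7), mirroring MSB-protection schemes."""
    rng = np.random.default_rng(seed)
    faulty = fmap_q.copy()
    usable_bits = range(bit_width - 1) if protect_msb else range(bit_width)
    for bit in usable_bits:
        flips = rng.random(faulty.shape) < fault_rate   # cells flipping this bit
        faulty[flips] ^= np.uint8(1 << bit)
    return faulty

fmap = np.random.randint(0, 256, size=(32, 28, 28), dtype=np.uint8)  # toy maps
for rate in (1e-3, 1e-2, 1e-1):
    bad = inject_faults(fmap, rate)
    mean_err = np.abs(fmap.astype(int) - bad.astype(int)).mean()
    print(f"rate {rate:.0e}: mean absolute activation error {mean_err:.2f}")
```

Protecting only the MSB is cheap (one guarded bit per value) yet removes the largest-magnitude corruptions, which is why tolerable fault rates rise so sharply with it.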
In this paper, we propose principles of information control and sharing that support ORCON (ORiginator COntrolled access control) models while simultaneously improving the components of confidentiality, availability, and integrity needed to inherently support, when needed, responsibility-to-share policies, rapid information dissemination, data provenance, and data redaction. This new paradigm, which provides unfettered and unimpeded access to information by authorized users while making access by unauthorized users impossible, contrasts with historical approaches to information sharing that have focused on the need to know rather than the need (or responsibility) to share.
Covert operations involving clandestine dealings and communication through cryptic and hidden messages have existed since time immemorial. While these have a negative connotation, they have had their fair share of use in situations and applications beneficial to society in general. A "Dead Drop" is one such method of espionage tradecraft, used to physically exchange items or information between two individuals via a secret rendezvous point. With a "Dead Drop", to maintain operational security, the exchange itself is asynchronous. Hiding information in slack space is one modern technique that has been used extensively. Slack space is the unused space within the last block allocated to a stored file. However, hiding in slack space operates under significant constraints, with little resilience and fault tolerance. In this paper, we propose FROST, a novel asynchronous "Digital Dead Drop" robust to detection and data loss, with tunable fault tolerance. Fault tolerance is a critical attribute of a secure and robust system design. Through extensive validation of a FROST prototype implementation on Ubuntu Linux, we confirm the performance and robustness of the proposed digital dead drop against detection and data loss. We verify the recoverability of the secret message under various operating conditions, ranging from block corruption and drive defragmentation to growing existing files on the target drive.
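For background, a short sketch of the slack-space capacity calculation that such hiding schemes work within; the 4 KiB block size is an assumption (real filesystems vary), and this is not FROST's mechanism, only the constraint it improves on.

```python
import os

BLOCK_SIZE = 4096   # assumed filesystem block size

def slack_bytes(path, block_size=BLOCK_SIZE):
    """Unused bytes between end-of-file and the end of its last block.

    This is the classic slack-space capacity. Note that it shrinks to
    zero whenever the file grows to a block boundary, which is exactly
    the fragility (no fault tolerance) the abstract points out.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0
    return (block_size - size % block_size) % block_size

# Example: a 10-byte file leaves 4086 hideable bytes in a 4096-byte block
with open("demo.bin", "wb") as f:
    f.write(b"0123456789")
print(slack_bytes("demo.bin"))   # -> 4086
```

Anything hidden there is silently destroyed if the host file grows or the drive is defragmented, which motivates FROST's redundancy and tunable fault tolerance.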
This paper offers a new approach to modelling the effect of cyber-attacks on the reliability of software used in industrial control applications. The model is based on the view that successful cyber-attacks introduce failure regions that are not present in non-compromised software. The model is then extended to cover a fault-tolerant architecture, such as the 1-out-of-2 software popular for building industrial protection systems. The model is used to study the effectiveness of software maintenance policies such as patching and "cleansing" ("proactive recovery") under different adversary models, ranging from independent attacks to sophisticated synchronized attacks on the channels. We demonstrate that the effect of attacks on the reliability of diverse software depends significantly on the adversary model. Under synchronized attacks, system reliability may be more than an order of magnitude worse than under independent attacks on the channels. These findings, although not surprising, highlight the importance of using an adequate adversary model when assessing how effective various cyber-security controls are.
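To see why the adversary model dominates the result, consider an illustrative calculation (not the paper's model) for a 1-out-of-2 architecture, which fails only when both channels fail. If attacks inflate the per-channel failure probabilities to $p_A$ and $p_B$ but strike the channels independently, then roughly

$$P_{\text{sys}} \approx p_A \, p_B,$$

whereas a synchronized attack correlates the attack-induced failure regions of the two channels, pushing the joint failure probability toward the single-channel level:

$$P_{\text{sys}} \approx \min(p_A, p_B) \gg p_A \, p_B.$$

For example, with $p_A = p_B = 10^{-2}$, independence gives about $10^{-4}$ while full synchronization gives about $10^{-2}$, consistent in spirit with the order-of-magnitude gap the abstract reports.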
This paper addresses the problem of state estimation of a linear time-invariant system when some of the sensors and/or actuators are under adversarial attack. In our setup, the adversarial agent attacks a sensor (actuator) by manipulating its measurement (input), and we impose no constraint on how the measurements (inputs) are corrupted. We introduce the notion of "sparse strong observability" to characterize systems for which state estimation is possible, given bounds on the number of attacked sensors and actuators. Furthermore, we develop a secure state estimator based on Satisfiability Modulo Theory (SMT) solvers.
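The paper's estimator is SMT-based; as a conceptual stand-in, this sketch brute-forces subsets of presumed-healthy sensors for a static estimation problem and accepts a state that explains them exactly. It conveys the combinatorial search that SMT solvers make efficient; uniqueness of the estimate in general requires the sparse-strong-observability condition the paper characterizes.

```python
from itertools import combinations
import numpy as np

def secure_estimate(C, y, max_attacked, tol=1e-6):
    """Brute-force analogue of SMT-based secure state estimation.

    C: (p x n) output matrix; y: (p,) measurements; at most `max_attacked`
    sensors are corrupted arbitrarily. Try every candidate set of healthy
    sensors and keep a state explaining those rows exactly (up to tol).
    """
    p, n = C.shape
    for healthy in combinations(range(p), p - max_attacked):
        Ch, yh = C[list(healthy)], y[list(healthy)]
        x, *_ = np.linalg.lstsq(Ch, yh, rcond=None)
        if np.linalg.norm(Ch @ x - yh) < tol:      # rows mutually consistent
            return x, set(range(p)) - set(healthy)  # estimate + suspects
    return None, None

# 4 sensors observing a 2-dim state; sensor 2 is attacked
x_true = np.array([1.0, -2.0])
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
y = C @ x_true
y[2] += 5.0                                        # arbitrary corruption
x_hat, suspects = secure_estimate(C, y, max_attacked=1)
print(x_hat, suspects)                             # ~[1, -2], {2}
```

The exponential number of candidate subsets is precisely why encoding the search as an SMT problem, as the paper does, pays off.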
Mobility and multihoming have become the norm in Internet access, e.g., smartphones with Wi-Fi and LTE, and connected vehicles with LTE and DSRC links that change rapidly. Mobility creates challenges for active session continuity when provider-aggregatable locators are used, while multihoming brings opportunities for improving resiliency and allocative efficiency. This paper proposes a novel migration protocol in the context of the eXpressive Internet Architecture (XIA): the XIA Migration Protocol. We compare it with Mobile IPv6 with respect to handoff latency and overhead, flow migration support, and defense against spoofing and replay of protocol messages. Handoff latencies of the XIA Migration Protocol and Mobile IPv6 Enhanced Route Optimization are comparable, and neither protocol opens up avenues for spoofing or replay attacks. However, XIA requires no mobility anchor point to support client mobility, while Mobile IPv6 always depends on a home agent. We show that XIA has a significant advantage over IPv6 for multihomed hosts and networks in terms of resiliency, scalability, load balancing and allocative efficiency. IPv6 multihoming solutions either forgo scalability (BGP-based) or sacrifice resiliency (NAT-based), while XIA's fallback-based multihoming provides fault tolerance without a heavyweight protocol. XIA also allows fine-grained incoming load balancing and QoS matching by supporting flow migration. Flow migration is not possible with Mobile IPv6 when a single IPv6 address is associated with multiple flows. From a protocol design and architectural perspective, the key enablers of these benefits are flow-level migration, XIA's DAG-based locators and self-certifying identifiers.
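A minimal sketch of the fallback idea behind DAG-based locators: an address is a directed acyclic graph whose out-edges are tried in priority order, so a dead primary path degrades to the fallback instead of breaking the session. The graph, node names, and reachability model here are invented for illustration and greatly simplify XIA's actual forwarding.

```python
def next_hop(dag, node, reachable):
    """Pick the first usable out-edge of `node`; later edges are fallbacks."""
    for neighbor in dag[node]:           # out-edges listed in priority order
        if neighbor in reachable:
            return neighbor
    return None                          # no route at all

# A toy DAG locator: prefer the LTE provider, fall back to the DSRC roadside unit
dag = {
    "src":      ["lte_gw", "dsrc_rsu"],  # primary first, fallback second
    "lte_gw":   ["dst"],
    "dsrc_rsu": ["dst"],
    "dst":      [],
}
print(next_hop(dag, "src", reachable={"lte_gw", "dsrc_rsu", "dst"}))  # lte_gw
print(next_hop(dag, "src", reachable={"dsrc_rsu", "dst"}))            # dsrc_rsu
```

Because the fallback lives inside the locator itself, failover needs no BGP-scale routing state and no NAT middlebox, which is the resiliency/scalability trade-off the abstract contrasts with IPv6 multihoming.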
A novel method, consisting of fault detection, rough set generation, element isolation and parameter estimation, is presented for multiple-fault diagnosis on analog circuits with tolerance. Firstly, a linear-programming concept is developed to transform fault detection of a circuit with limited accessible terminals into checking whether a feasible solution exists under tolerance constraints. Secondly, a fault characteristic equation is deduced to generate a fault rough set. It is proved that the node voltages of the nominal circuit can be used in the fault characteristic equation under fault tolerance. Lastly, fault detection of the circuit with a revised deviation restriction for suspected fault elements is performed to locate faulty elements and estimate their parameters. The diagnosis accuracy and parameter identification precision of the method are verified by simulation results.
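A minimal sketch of the feasibility-check idea in the first step: fault detection reduces to asking whether the measured node-voltage deviations are explainable by component deviations within tolerance, posed as a linear program with a zero objective. The sensitivity matrix, tolerance value, and measurements below are hypothetical; scipy is used for the solve.

```python
import numpy as np
from scipy.optimize import linprog

def fault_free_feasible(A, v_meas, tol=0.05):
    """Is there a parameter deviation d with |d_i| <= tol such that
    A @ d matches the measured node-voltage deviations? If the LP is
    infeasible, the deviations exceed tolerance: a fault is detected."""
    n = A.shape[1]
    # Feasibility only: zero objective, equality constraints, box bounds.
    res = linprog(c=np.zeros(n), A_eq=A, b_eq=v_meas,
                  bounds=[(-tol, tol)] * n, method="highs")
    return res.status == 0               # 0 = solved, i.e., feasible

# Toy sensitivity matrix mapping 3 component deviations to 2 node voltages
A = np.array([[0.8, 0.1, 0.2],
              [0.3, 0.9, 0.1]])
print(fault_free_feasible(A, np.array([0.02, -0.01])))  # True: within tolerance
print(fault_free_feasible(A, np.array([0.50,  0.40])))  # False: fault flagged
```

An infeasible LP means no in-tolerance assignment of component deviations explains the measurements, so at least one element must lie outside its tolerance band, triggering the rough-set isolation step.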