Biblio
Developing software is hard. Developing software that is resilient and does not crash at the occurrence of unexpected inputs or events is even harder, especially with IoT devices and real-time requirements, e.g., due to interactions with human beings. Therefore, there is a need for a software architecture that helps software developers to build fault-tolerant software with as little pain and effort as possible. To this end, we have designed a fault tolerance framework for automation systems that lets developers be mostly oblivious to fault tolerance issues. Thus they can focus on the application logic encapsulated in (micro)services. That is, the developer only needs to specify the required fault tolerance level by description, not implementation. The fault tolerance aspects are transparent to the developer, as the framework takes care of them. This approach is particularly suited for the development for mixed-criticality systems, where different parts have very different and demanding functional and non-functional requirements. For such systems highly specialized developers are needed and removing the burden of fault tolerance results in faster time to market and safer and more dependable systems.
Cloud computation has become prominent with seemingly unlimited amount of storage and computation available to users. Yet, security is a major issue that hampers the growth of cloud. In this research we investigate a collaborative Intrusion Detection System (IDS) based on the ensemble learning method. It uses weak classifiers, and allows the use of untapped resources of cloud to detect various types of attacks on the cloud system. In the proposed system, tasks are distributed among available virtual machines (VM), individual results are then merged for the final adaptation of the learning model. Performance evaluation is carried out using decision trees and using fuzzy classifiers, on KDD99, one of the largest datasets for IDS. Segmentation of the dataset is done in order to mimic the behavior of real-time data traffic occurred in a real cloud environment. The experimental results show that the proposed approach reduces the execution time with improved accuracy, and is fault-tolerant when handling VM failures. The system is a proof-of-concept model for a scalable, cloud-based distributed system that is able to explore untapped resources, and may be used as a base model for a real-time hierarchical IDS.
Due to the growing performance requirements, embedded systems are increasingly more complex. Meanwhile, they are also expected to be reliable. Guaranteeing reliability on complex systems is very challenging. Consequently, there is a substantial need for designs that enable the use of unverified components such as real-time operating system (RTOS) without requiring their correctness to guarantee safety. In this work, we propose a novel approach to design a controller that enables the system to restart and remain safe during and after the restart. Complementing this controller with a switching logic allows the system to use complex, unverified controller to drive the system as long as it does not jeopardize safety. Such a design also tolerates faults that occur in the underlying software layers such as RTOS and middleware and recovers from them through system-level restarts that reinitialize the software (middleware, RTOS, and applications) from a read-only storage. Our approach is implementable using one commercial off-the-shelf (COTS) processing unit. To demonstrate the efficacy of our solution, we fully implement a controller for a 3 degree of freedom (3DOF) helicopter. We test the system by injecting various types of faults into the applications and RTOS and verify that the system remains safe.
This paper focuses on the issues of secure key management for smart grid. With the present key management schemes, it will not yield security for deployment in smart grid. A novel key management scheme is proposed in this paper which merges elliptic curve public key technique and symmetric key technique. Based on the Needham-Schroeder authentication protocol, symmetric key scheme works. Well known threats like replay attack and man-in-the-middle attack can be successfully abolished using Smart Grid. The benefits of the proposed system are fault-tolerance, accessibility, Strong security, scalability and Efficiency.
Motivated by the growing complexity and heterogeneity of modern data centers, and the prevalence of commodity component failures, this paper studies the failure-aware placement problem of placing tasks of a parallel job on machines in the data center with the goal of increasing availability. We consider two models of failures: adversarial and probabilistic. In the adversarial model, each node has a weight (higher weight implying higher reliability) and the adversary can remove any subset of nodes of total weight at most a given bound W and our goal is to find a placement that incurs the least disruption against such an adversary. In the probabilistic model, each node has a probability of failure and we need to find a placement that maximizes the probability that at least K out of N tasks survive at any time. For adversarial failures, we first show that (i) the problems are in Σ2, the second level of the polynomial hierarchy, (ii) a basic variant, that we call RobustFAP, is co-NP-hard, and (iii) an all-or-nothing version of RobustFAP is Σ2-complete. We then give a PTAS for RobustFAP, a key ingredient of which is a solution that we design for a fractional version of RobustFAP. We then study fractional RobustFAP over hierarchies, denoted HierRobustFAP, and introduce a notion of hierarchical max-min fairness/ and a novel Generalized Spreading/ algorithm which is simultaneously optimal for all W. These generalize the classical notion of max-min fairness to work with nodes of differing capacities, differing reliability weights and hierarchical structures. Using randomized rounding, we extend this to give an algorithm for integral HierRobustFAP. For the probabilistic version, we first give an algorithm that achieves an additive ε approximation in the failure probability for the single level version, called ProbFAP, while giving up a (1 + ε) multiplicative factor in the number of failures. We then extend the result to the hierarchical version, HierProbFAP, achieving an ε additive approximation in failure probability while giving up an (L + ε) multiplicative factor in the number of failures, where \$L\$ is the number of levels in the hierarchy.
In the context of service-oriented applications, the self-healing property provides reliable execution in order to support failures and assist automatic recovery techniques. This paper presents a knowledge-based approach for self-healing Composite Service (CS) applications. A CS is an application composed by a set of services interacting each other and invoked on the Web. Our approach is supported by Service Agents, which are in charge of the CS fault-tolerance execution control, making decisions about the selection of recovery and proactive strategies. Service Agents decisions are based on the information they have about the whole application, about themselves, and about what it is expected and what it is really happening at run-time. Hence, application knowledge for decision making comprises off-line precomputed global and local information, user QoS preferences, and propagated actual run-time information. Our approach is evaluated experimentally using a case study.
This article focuses on the design of safe and attack-resilient Cyber-Physical Systems (CPS) equipped with multiple sensors measuring the same physical variable. A malicious attacker may be able to disrupt system performance through compromising a subset of these sensors. Consequently, we develop a precise and resilient sensor fusion algorithm that combines the data received from all sensors by taking into account their specified precisions. In particular, we note that in the presence of a shared bus, in which messages are broadcast to all nodes in the network, the attacker’s impact depends on what sensors he has seen before sending the corrupted measurements. Therefore, we explore the effects of communication schedules on the performance of sensor fusion and provide theoretical and experimental results advocating for the use of the Ascending schedule, which orders sensor transmissions according to their precision starting from the most precise. In addition, to improve the accuracy of the sensor fusion algorithm, we consider the dynamics of the system in order to incorporate past measurements at the current time. Possible ways of mapping sensor measurement history are investigated in the article and are compared in terms of the confidence in the final output of the sensor fusion. We show that the precision of the algorithm using history is never worse than the no-history one, while the benefits may be significant. Furthermore, we utilize the complementary properties of the two methods and show that their combination results in a more precise and resilient algorithm. Finally, we validate our approach in simulation and experiments on a real unmanned ground robot.
Bulk electric systems include hundreds of synchronous generators. Faults in such systems can induce oscillations in the generators which if not detected and controlled can destabilize the system. Mode estimation is a popular method for oscillation detection. In this paper, we propose a resilient algorithm to estimate electro-mechanical oscillation modes in large scale power system in the presence of false data. In particular, we add a fault tolerance mechanism to a variant of alternating direction method of multipliers (ADMM) called S-ADMM. We evaluate our method on an IEEE 68-bus test system under different attack scenarios and show that in all the scenarios our algorithm converges well.
Fault-tolerance has huge impact on embedded safety-critical systems. As technology that assists to the development of such improvement, Safe Node Sequence Protocol (SNSP) is designed to make part of such impact. In this paper, we present a mechanism for fault-tolerance and recovery based on the Safe Node Sequence Protocol (SNSP) to strengthen the system robustness, from which the correctness of a fault-tolerant prototype system is analyzed and verified. In order to verify the correctness of more than thirty failure modes, we have partitioned the complete protocol state machine into several subsystems, followed to the injection of corresponding fault classes into dedicated independent models. Experiments demonstrate that this method effectively reduces the size of overall state space, and verification results indicate that the protocol is able to recover from the fault model in a fault-tolerant system and continue to operate as errors occur.