Biblio
Failure prediction for virtual machines (VMs) underpins the reliability of cloud platforms. However, uncertainty about the VM security state affects the reliability and task-processing capability of the entire cloud platform. In this study, a VM failure prediction method based on an AdaBoost-enhanced Hidden Markov Model was proposed to improve the reliability of VMs and the overall performance of cloud platforms. The method analyzed the deep relationship between the observed state and the hidden state of the VM through the hidden Markov model (HMM), demonstrated the influence of the AdaBoost algorithm on the HMM, and realized prediction of the VM failure state. Results show that the proposed method adapts to the complex, dynamic cloud platform environment, effectively predicts the failure state of VMs, and improves the ability to predict the VM security state.
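As a rough illustration of the HMM component only (the states, transition and emission probabilities, and observation alphabet below are invented, and the AdaBoost weighting of multiple models is not shown), a minimal Viterbi decoding of a hidden healthy/failing VM state from discretized load observations might look like this:

    import numpy as np

    # Hypothetical 2-state HMM: hidden states = {healthy, failing},
    # observations = discretized resource metrics {low, medium, high load}.
    states = ["healthy", "failing"]
    start_p = np.array([0.95, 0.05])
    trans_p = np.array([[0.90, 0.10],     # healthy -> healthy / failing
                        [0.30, 0.70]])    # failing -> healthy / failing
    emit_p = np.array([[0.6, 0.3, 0.1],   # healthy emits low / medium / high
                       [0.1, 0.3, 0.6]])  # failing emits low / medium / high

    def viterbi(obs):
        """Most likely hidden-state sequence for an observation sequence."""
        T, N = len(obs), len(states)
        logv = np.full((T, N), -np.inf)
        back = np.zeros((T, N), dtype=int)
        logv[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
        for t in range(1, T):
            for j in range(N):
                scores = logv[t - 1] + np.log(trans_p[:, j])
                back[t, j] = np.argmax(scores)
                logv[t, j] = scores[back[t, j]] + np.log(emit_p[j, obs[t]])
        path = [int(np.argmax(logv[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return [states[s] for s in reversed(path)]

    print(viterbi([0, 1, 2, 2, 2]))  # rising load -> the decoded path ends in "failing"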
This paper presents an overview of the H2020 project VESSEDIA [9], aimed at verifying the security and safety of modern connected systems, also called the IoT. Its originality lies in using Formal Methods, inherited from highly critical application domains, to analyze the source code at different levels of intensity and to uncover possible faults and weaknesses. The analysis methods are mostly exhaustive and guarantee that, after analysis, the source code of the application is error-free. This paper is structured as follows: after an introductory section 1 giving some factual data, section 2 presents the aims and the problems addressed; section 3 describes the project's use cases, and section 4 describes the proposed approach for solving these problems and the results achieved so far; finally, section 5 discusses the remaining future work.
Data analytics and telemetry have become paramount to monitoring and maintaining quality of service, in addition to business analytics. Stream processing, a model in which a network of operators receives and processes continuously arriving discrete elements, is well suited to these needs. Current and previous studies and frameworks have focused on continuity of operations and aggregate performance metrics. However, real-time performance and tail latency are also important. Timing errors caused by either performance faults or failed-communication faults affect real-time performance more drastically than aggregate metrics. In this paper, we introduce redundancy in the stream data to improve real-time performance and resiliency to timing errors caused by either kind of fault. We also address limitations in previous solutions with a fine-grained acknowledgment tracking scheme that both increases the effectiveness of resiliency to performance faults and enables resiliency to failed-communication faults. Our results show that fine-grained acknowledgment schemes can improve tail and mean latencies by approximately 30%. We also show that these schemes improve resiliency to performance faults compared with existing work. Our improvements result in 47.4% to 92.9% fewer missed deadlines, compared to 17.3% to 50.6% for comparable topologies and redundancy levels in the state of the art. Finally, we show that redundancies of 25% to 100% can reduce the number of data elements that miss their deadline constraints by 0.76% to 14.04% for applications with high fan-out and by 7.45% up to 50% for applications with no fan-out.
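A minimal sketch of what fine-grained, per-element acknowledgment tracking with redundant copies could look like (class and field names are assumptions, not the paper's implementation):

    import time

    class AckTracker:
        """Per-element (fine-grained) acknowledgment tracking: each stream
        element may be processed redundantly, and the first ack completes it."""
        def __init__(self):
            self.pending = {}  # element_id -> [deadline, outstanding copies]

        def emit(self, element_id, deadline, redundancy=2):
            self.pending[element_id] = [deadline, redundancy]

        def ack(self, element_id):
            # The first acknowledgment from any redundant copy completes the element.
            entry = self.pending.pop(element_id, None)
            return entry is not None and time.time() <= entry[0]

        def expired(self):
            # Elements whose deadline passed without any ack: timing errors.
            now = time.time()
            return [eid for eid, (dl, _) in self.pending.items() if now > dl]

    tracker = AckTracker()
    tracker.emit("tuple-42", deadline=time.time() + 0.05, redundancy=2)
    print(tracker.ack("tuple-42"))  # True: acknowledged before its deadline
    print(tracker.expired())        # []: nothing has missed its deadline yet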
Emerging intelligent systems have stringent constraints, including cost and power consumption. When they are used in critical applications, resiliency becomes another key requirement. Much research into techniques for fault tolerance and dependability has been successfully applied to highly critical systems, such as those used in space, where cost is not an overriding constraint. Furthermore, most resiliency techniques have focused on dealing with failures in the hardware and bugs in the software. The next generation of systems used in critical applications will also have to tolerate test escapes after manufacturing, soft errors and transients in the electronics, hardware bugs, hardware and software Trojans and viruses, as well as intrusions and other security attacks during operation. This paper assesses the impact of these threats on the results produced by a critical system and proposes solutions to each of them. It is argued that run-time checks at the application level are necessary to deal with errors in the results.
A prioritized cyber defense remediation plan is critical for effective risk management in cyber-physical systems (CPS). The increased integration of Information Technology (IT) and Operational Technology (OT) in CPS has led to the need to identify the critical assets which, when affected, will impact resilience and safety. In this work, we propose a methodology for a prioritized cyber risk remediation plan that balances operational resilience and economic loss (safety impacts) in CPS. We present a platform for modeling and analyzing the effect of cyber threats and random system faults on the safety of CPS that could lead to catastrophic damage. We propose to develop a data-driven attack-graph and fault-graph model to characterize the exploitability and impact of threats in CPS. We develop an operational impact assessment to quantify the damage. Finally, we propose the development of a strategic response decision capability that proposes optimal mitigation actions and policies balancing the trade-off between operational resilience (tactical risk) and strategic risk.
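A toy sketch of the attack-graph-based prioritization idea (the graph, probabilities, impact values, and weights are all invented for illustration):

    # Hypothetical attack graph: edges carry exploit likelihoods, assets carry
    # operational impact (resilience loss) and economic loss if compromised.
    edges = {                       # (foothold, asset): P(exploit succeeds)
        ("internet", "hmi"): 0.6,
        ("hmi", "plc"): 0.4,
        ("internet", "historian"): 0.3,
    }
    impact = {                      # asset: (operational impact, economic loss)
        "hmi": (0.4, 10_000),
        "plc": (0.9, 250_000),
        "historian": (0.2, 5_000),
    }

    def reach_probability(target, source="internet", prob=1.0, seen=()):
        """Maximum probability that the attacker reaches `target` along any path."""
        if target == source:
            return prob
        best = 0.0
        for (u, v), p in edges.items():
            if v == target and u not in seen:
                best = max(best, reach_probability(u, source, prob * p, seen + (target,)))
        return best

    def remediation_priority(w_ops=0.5, w_econ=0.5, econ_scale=250_000):
        scores = {}
        for asset, (ops, econ) in impact.items():
            scores[asset] = round(reach_probability(asset) * (w_ops * ops + w_econ * econ / econ_scale), 3)
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(remediation_priority())   # assets ranked by blended risk, highest first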
This paper describes a modification of the Attack Tree Analysis (ATA) technique, called Availability Tree Analysis (AvTA), for assessing the dependability (reliability, availability, and cyber security) of instrumentation and control systems (ICS). The FMEA, FMECA, and IMECA techniques applied to carry out a preliminary semi-formal, criticality-oriented analysis before the AvTA-based assessment are described. AvTA models combine reliability and cyber security subtrees, considering the probabilities of ICS recovery in the case of hardware (physical) and software (design) failures and of attacks on components causing failures. Successful recovery events (SREs) avert the corresponding failures in the tree, via OR gates, if the probability of an SRE within the assumed time exceeds the required value. A case study of AvTA-based dependability assessment (model, availability function, and decision-making technology for choosing component and system parameters) for a smart building ICS (Building Automation System, BAS) is discussed.
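A minimal sketch of the OR-gate/recovery idea behind AvTA, with illustrative probabilities that are not taken from the paper:

    # Each basic failure cause (hardware, software, attack) is averted if a
    # successful recovery event (SRE) is sufficiently probable within the
    # assumed time; the remaining causes are combined through an OR gate.

    def effective_failure(p_failure, p_recovery_in_time, p_required):
        """A branch drops out of the tree if its recovery probability meets the requirement."""
        if p_recovery_in_time >= p_required:
            return 0.0                       # SRE averts this failure branch
        return p_failure * (1.0 - p_recovery_in_time)

    def or_gate(probabilities):
        """Probability that at least one remaining cause leads to system failure."""
        no_failure = 1.0
        for p in probabilities:
            no_failure *= (1.0 - p)
        return 1.0 - no_failure

    causes = [
        effective_failure(p_failure=1e-3, p_recovery_in_time=0.99, p_required=0.95),  # hardware (physical)
        effective_failure(p_failure=5e-4, p_recovery_in_time=0.80, p_required=0.95),  # software (design)
        effective_failure(p_failure=2e-3, p_recovery_in_time=0.60, p_required=0.95),  # attack on components
    ]
    print("availability estimate:", 1.0 - or_gate(causes))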
Approaches for the automatic analysis of security policies at the source code level cannot trivially be applied to binaries. This is due to the lack of high-level semantics in low-level object code and to the fundamental problem that control-flow recovery from binaries is difficult. We present a novel approach to recovering the control flow of binaries that is both safe and efficient. The key idea of our approach is to use the information contained in security mechanisms to approximate the targets of computed branches. To achieve this, we first define a restricted control transition intermediate language (RCTIL), which restricts the number of possible targets for each branch to a finite number of given targets. Based on this intermediate language, we demonstrate how a safe model of the control flow can be recovered without data-flow analyses. Our evaluation shows that this makes our solution more efficient than existing solutions.
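The following toy (addresses and structures invented) illustrates the general idea of giving each computed branch a finite set of admissible targets and building a safe, over-approximate control-flow graph from them; it is not the RCTIL formalism itself:

    # Every computed branch carries a finite set of admissible targets, so a
    # safe control-flow graph can be built without data-flow analysis.
    instructions = {
        0x1000: {"kind": "direct",   "targets": {0x1004}},
        0x1004: {"kind": "computed", "targets": {0x1010, 0x1020}},  # e.g. a CFI-checked call
        0x1010: {"kind": "direct",   "targets": {0x1030}},
        0x1020: {"kind": "direct",   "targets": {0x1030}},
        0x1030: {"kind": "return",   "targets": set()},
    }

    def build_cfg(instrs):
        """Over-approximate CFG: every admissible target becomes an edge."""
        return {addr: sorted(info["targets"]) for addr, info in instrs.items()}

    def reachable(cfg, entry):
        seen, stack = set(), [entry]
        while stack:
            addr = stack.pop()
            if addr not in seen:
                seen.add(addr)
                stack.extend(cfg.get(addr, []))
        return sorted(seen)

    cfg = build_cfg(instructions)
    print([hex(a) for a in reachable(cfg, 0x1000)])  # addresses reachable in the safe approximation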
As safety-critical systems become increasingly interconnected, a system's operations depend on the reliability and security of the computing components and the interconnections among them. Therefore, a growing body of research seeks to tie safety analysis to security analysis. Specifically, it is important to analyze system safety under different attacker models. In this paper, we develop generic parameterizable state automaton templates to model the effects of an attack. Then, given an attacker model, we generate a state automaton that represents the system operation under the threat of the attacker model. We use a railway signaling system as our case study and consider threats to the communication protocol and the commands issued to physical devices. Our results show that while less skilled attackers are not able to violate system safety, more dedicated and skilled attackers can affect system safety. We also consider several countermeasures and show how well they can deter attacks.
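A minimal sketch of the parameterizable-automaton idea (states, transitions, and capability names are invented): transitions fire only when the attacker model grants the required capability, and reachability of an unsafe state is then checked:

    # Each transition may require an attacker capability; given an attacker
    # model (a set of capabilities), we check whether "unsafe" is reachable.
    transitions = [
        ("idle",          "spoofed_msg",   "inject_protocol_message"),
        ("spoofed_msg",   "wrong_command", "forge_device_command"),
        ("wrong_command", "unsafe",        None),   # no extra capability needed
    ]

    def reachable_states(capabilities):
        reached, changed = {"idle"}, True
        while changed:
            changed = False
            for src, dst, needs in transitions:
                if src in reached and dst not in reached and (needs is None or needs in capabilities):
                    reached.add(dst)
                    changed = True
        return reached

    low_skill_attacker = {"inject_protocol_message"}
    skilled_attacker = {"inject_protocol_message", "forge_device_command"}
    print("unsafe" in reachable_states(low_skill_attacker))  # False: safety holds
    print("unsafe" in reachable_states(skilled_attacker))    # True: safety can be violated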
Predicting software faults before software testing activities can help in the rational distribution of time and resources. Software metrics are used for software fault prediction because of their close relationship with software faults. Thanks to their non-linear fitting ability, neural networks are increasingly used in such prediction models. We first filter the metric set of the embedded software by statistical methods to reduce the dimensionality of the model input. We then build a back-propagation neural network with a simple structure but good performance and apply it to two practical embedded software projects. The verification results show that the model has a good ability to predict software faults.
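A hedged sketch of the two steps, statistical metric filtering followed by a small back-propagation network, using synthetic data and scikit-learn rather than the authors' projects:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Illustrative only: random "software metrics" stand in for a real metric set.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))           # 200 modules, 20 candidate metrics
    y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    # Step 1: statistical filtering - keep metrics whose correlation with the
    # fault label exceeds a threshold, reducing the input dimensionality.
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    selected = np.where(corr > 0.2)[0]

    # Step 2: a small back-propagation network on the filtered metrics.
    model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    model.fit(X[:150, selected], y[:150])
    print("selected metrics:", selected.tolist())
    print("holdout accuracy:", model.score(X[150:, selected], y[150:]))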
This paper offers a new approach to modelling the effect of cyber-attacks on the reliability of software used in industrial control applications. The model is based on the view that successful cyber-attacks introduce failure regions that are not present in non-compromised software. The model is then extended to cover a fault-tolerant architecture, such as 1-out-of-2 software, popular for building industrial protection systems. The model is used to study the effectiveness of software maintenance policies such as patching and "cleansing" ("proactive recovery") under different adversary models, ranging from independent attacks to sophisticated synchronized attacks on the channels. We demonstrate that the effect of attacks on the reliability of diverse software depends significantly on the adversary model. Under synchronized attacks, system reliability may be more than an order of magnitude worse than under independent attacks on the channels. These findings, although not surprising, highlight the importance of using an adequate adversary model in assessing how effective various cyber-security controls are.
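A Monte Carlo sketch of the underlying intuition (all probabilities invented): a 1-out-of-2 system fails only when both channels fail on the same demand, and synchronized attacks make such joint failures markedly more likely than independent attacks:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 1_000_000
    p_inherent = 1e-3        # per-channel failure probability with no attack
    p_attack_region = 0.1    # extra failure probability inside an attack-induced failure region
    p_attack = 0.2           # probability that an attack is active on a given demand

    def system_failure_rate(synchronized):
        if synchronized:
            hit = rng.random(N) < p_attack      # one campaign compromises both channels at once
            a_hit, b_hit = hit, hit
        else:
            a_hit = rng.random(N) < p_attack    # channels attacked independently
            b_hit = rng.random(N) < p_attack
        p_a = p_inherent + p_attack_region * a_hit
        p_b = p_inherent + p_attack_region * b_hit
        both_fail = (rng.random(N) < p_a) & (rng.random(N) < p_b)
        return both_fail.mean()

    print("independent attacks :", system_failure_rate(synchronized=False))
    print("synchronized attacks:", system_failure_rate(synchronized=True))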
Resilience in High Performance Computing (HPC) is a constraining factor for bringing applications to the upcoming exascale systems. Resilience techniques must be able to scale to handle the increasing number of expected errors in an energy-efficient manner. Since the purpose of running applications on HPC systems is to perform large-scale computations as quickly as possible, resilience methods should not add a large delay to the application's time to completion. In this paper we introduce a novel technique to detect and recover from transient errors in HPC applications. One of the features of our technique is that the energy budget allocated to resilience can be adjusted depending on the operator's resilience needs. For example, on synthetic data, the technique can detect about 50% of transient errors while using only 20% of the dynamic energy required for running the application. For a 60% energy budget, an application that uses 10k cores and takes 128 hours to run will require only 10% longer to complete.
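A toy sketch of budget-tunable detection by selective re-execution (not the paper's technique; fault rates and budgets are illustrative): a fraction of tasks, determined by the resilience energy budget, is recomputed and compared to flag transient errors:

    import random

    def run_task(x, fault_rate=0.02):
        result = x * x
        if random.random() < fault_rate:      # simulated transient (soft) error
            result += 1
        return result

    def detect_with_budget(inputs, energy_budget):
        """energy_budget in [0, 1]: fraction of tasks we can afford to recompute."""
        detected = 0
        for x in inputs:
            first = run_task(x)
            if random.random() < energy_budget:
                if run_task(x, fault_rate=0.0) != first:   # reference re-execution
                    detected += 1
        return detected

    random.seed(0)
    inputs = list(range(10_000))
    for budget in (0.2, 0.6, 1.0):
        print(f"budget {budget:.0%}: detected {detect_with_budget(inputs, budget)} transient errors")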
The evolution of convolutional neural networks (CNNs) into more complex forms of organization, with additional layers, larger convolutions, and more connections, has established the state of the art in accuracy for detection and classification challenges on images. Moreover, as they have evolved to the point where gigabytes of memory are required for their operation, it becomes fundamental to understand how their inference capability can be impaired if data elements become corrupted in memory. This paper introduces fault injection into these systems by simulating failing bit-cells in hardware memories, brought on by relaxing the assumption of 100% reliable operation. We analyze the behavior of these networks during inference under severe fault-injection rates and apply fault-mitigation strategies to improve the resilience of the CNNs. For the MNIST dataset, we show that 8x less memory is required for the feature-map memory space, and that in sub-100% reliable operation, fault-injection rates up to 10^-1 (with most-significant-bit protection) can be withstood with only a 1% degradation in error probability. Furthermore, considering the offload of the feature-map memory to an embedded dynamic RAM (eDRAM) system, using technology nodes from 65 nm down to 28 nm, up to 73% to 80% improved power efficiency can be obtained.
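A small sketch of such memory fault injection on a quantized (uint8) feature map, with optional most-significant-bit protection; shapes and failure rates are illustrative:

    import numpy as np

    def inject_faults(feature_map, rate, protect_msb=True):
        """Flip random bit-cells of a uint8 feature map at the given failure rate."""
        faulty = feature_map.copy()
        bits = 8
        # One Bernoulli trial per bit-cell of the memory holding the feature map.
        flip = np.random.default_rng(0).random(faulty.shape + (bits,)) < rate
        for b in range(bits):
            if protect_msb and b == bits - 1:
                continue                      # MSB cells assumed protected / reliable
            faulty[flip[..., b]] ^= np.uint8(1 << b)
        return faulty

    fmap = np.random.default_rng(1).integers(0, 256, size=(16, 16, 32), dtype=np.uint8)
    for rate in (1e-3, 1e-2, 1e-1):
        corrupted = inject_faults(fmap, rate)
        print(f"bit-cell failure rate {rate:.0e}: {np.mean(corrupted != fmap):.1%} of activations altered")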
Creating and implementing fault-tolerant distributed algorithms is a challenging task in highly safety-critical industries. Using formal methods supports the design and development of complex algorithms. However, formal methods are often perceived as an unjustifiable overhead. This paper presents the experience and insights gained when using TLA+ and PlusCal to model and develop fault-tolerant and safety-critical modules for TAS Control Platform, a platform for railway control applications up to safety integrity level (SIL) 4. We show how formal methods helped us improve the correctness of the algorithms and our development efficiency, and how part of the gap between model and implementation has been closed by translation to C code. Additionally, we describe how we gained trust in the formal model and tools by following a specific design process called property-driven design, which also implicitly addresses software quality metrics such as code coverage.
Vulnerability, a buzzword of our time, is among the most important terms relating to software and operating systems. Because software is continually being developed, loopholes and incompleteness introduced during the development phase mean that a vulnerability can surface at any time. Detecting a vulnerability is one thing; predicting its occurrence over time is another. If we can anticipate the vulnerabilities of a piece of software over time, this serves as an early alarm for developers to build sounder, improved software the next time. This proposal describes an implementation of the idea using an artificial neural network, where different data sets are given as input for further analysis toward successful results. While models already exist for studying vulnerabilities in software and networks, this paper, in addition to the current work, sheds light on the predictability of vulnerabilities over time.
In this paper, we present AnomalyDetect, an approach for detecting anomalies in cloud services. A cloud service consists of a set of interacting applications/processes running on one or more interconnected virtual machines. AnomalyDetect uses the Kalman Filter as the basis for predicting the states of virtual machines running cloud services. It uses the cloud service's virtual machine historical data to forecast potential anomalies. AnomalyDetect has been integrated with the AutoMigrate framework and serves as the means for detecting anomalies to automatically trigger live migration of cloud services to preserve their availability. AutoMigrate is a framework for developing intelligent systems that can monitor and migrate cloud services to maximize their availability in case of cloud disruption. We conducted a number of experiments to analyze the performance of the proposed AnomalyDetect approach. The experimental results highlight the feasibility of AnomalyDetect as an approach to autonomic cloud availability.
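A minimal one-dimensional Kalman filter sketch of the prediction-and-residual idea (not the AnomalyDetect implementation; the metric, noise parameters, and threshold are assumptions):

    import numpy as np

    def kalman_anomalies(observations, q=1e-4, r=4e-4, threshold=4.0):
        """Flag samples whose innovation exceeds `threshold` predicted standard deviations."""
        x, p = observations[0], 1.0           # state estimate and its variance
        anomalies = []
        for t, z in enumerate(observations[1:], start=1):
            p_pred = p + q                    # predict step (random-walk process model)
            innovation = z - x
            s = p_pred + r                    # innovation variance
            if abs(innovation) > threshold * np.sqrt(s):
                anomalies.append(t)
                continue                      # keep the outlier from corrupting the state
            k = p_pred / s                    # Kalman gain and update step
            x = x + k * innovation
            p = (1 - k) * p_pred
        return anomalies

    cpu = np.concatenate([0.30 + 0.02 * np.random.default_rng(2).standard_normal(50),
                          [0.95],             # injected spike: candidate anomaly
                          0.30 + 0.02 * np.random.default_rng(3).standard_normal(20)])
    print(kalman_anomalies(cpu))              # expect [50], the injected spike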
The Internet of Things (IoT) connects not only computers and mobile devices, but also smart buildings, homes, and cities, as well as electrical grids, gas and water networks, automobiles, airplanes, etc. However, IoT applications introduce grand security challenges due to the increase in the attack surface. Current security approaches do not handle cybersecurity from a holistic point of view; hence, a systematic cybersecurity mechanism needs to be adopted when designing IoT-based applications. In this work, we present a risk management framework to deploy secure IoT-based applications for smart infrastructures at design time and at runtime. At design time, we propose a risk management method that is appropriate for smart infrastructures. At runtime, our framework relies on the Anomaly Behavior Analysis (ABA) methodology, enabled by the Autonomic Computing paradigm, and an intrusion detection system to detect any threat that can compromise IoT infrastructures. Our preliminary experimental results show that our framework can be used to detect threats and protect IoT premises and services.
Elliptic curve cryptosystems are highly susceptible to physical attacks such as fault injection. This paper aims at correctly implementing a fault injection attack against the Elliptic Curve Digital Signature Algorithm (ECDSA). More specifically, the proposed attack injects faults to sufficiently alter signatures produced with a vigilant periodic sequence algorithm, which offers efficient speed-up and security for the most prominent and well-known scalar multiplication algorithm used in ECDSA. The purpose is to inject the fault properly and to determine, according to the attack model and the predefined methodologies, whether any probable countermeasure thwarts the pseudo-code. We report the results of our experiment with bits acquired from the targeted implementation to determine the reliability of our attack.
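A toy sketch of the fault model only, on the small textbook curve y^2 = x^3 + 2x + 2 over GF(17) rather than a real ECDSA parameter set: flipping a single bit of the secret scalar during double-and-add scalar multiplication yields a visibly different point, which is the kind of deviation such an attack exploits:

    P_MOD, A = 17, 2                      # curve y^2 = x^3 + 2x + 2 over GF(17)
    G = (5, 1)                            # generator of the order-19 group

    def ec_add(p, q):
        if p is None: return q
        if q is None: return p
        if p[0] == q[0] and (p[1] + q[1]) % P_MOD == 0:
            return None                   # point at infinity
        if p == q:
            lam = (3 * p[0] * p[0] + A) * pow(2 * p[1] % P_MOD, -1, P_MOD) % P_MOD
        else:
            lam = (q[1] - p[1]) * pow((q[0] - p[0]) % P_MOD, -1, P_MOD) % P_MOD
        x = (lam * lam - p[0] - q[0]) % P_MOD
        return (x, (lam * (p[0] - x) - p[1]) % P_MOD)

    def scalar_mult(k, point):
        result = None
        while k:                          # least-significant-bit-first double-and-add
            if k & 1:
                result = ec_add(result, point)
            point = ec_add(point, point)
            k >>= 1
        return result

    k = 0b1011                            # "secret" scalar (11)
    faulty_k = k ^ (1 << 2)               # injected fault: one flipped scalar bit (15)
    print("correct result:", scalar_mult(k, G))
    print("faulty result :", scalar_mult(faulty_k, G))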
Consensus is a fundamental approach to implementing fault-tolerant services through replication. It is well known that there is a tradeoff between cost and resilience. For instance, Crash Fault Tolerant (CFT) protocols have a low cost but can only handle crash failures, while Byzantine Fault Tolerant (BFT) protocols handle arbitrary failures at a higher cost. Hybrid protocols enjoy the benefits of both high performance without failures and high resiliency under failures by switching among different subprotocols. However, it is challenging to determine which subprotocol should be used. We propose a moving target approach that switches among protocols according to the existing system and network vulnerability. At the core of our approach is a formalized cost model that evaluates the vulnerability and performance of consensus protocols based on real-time Intrusion Detection System (IDS) signals. Based on the evaluation results, we demonstrate that a safe, cheap, and unpredictable protocol is always used and that a high IDS error rate can be tolerated.
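A hedged sketch of the switching idea (cost figures, exposure values, and the IDS signal model are invented): pick the subprotocol with the lowest expected cost given the estimated probability of an ongoing intrusion:

    protocols = {
        # name: (performance cost per request, residual exposure to Byzantine faults)
        "CFT":    (1.0, 1.00),   # cheap, but offers no Byzantine tolerance
        "Hybrid": (1.6, 0.15),   # mostly protected, small exposure while switching
        "BFT":    (2.5, 0.00),   # expensive, tolerates arbitrary failures
    }

    def choose_protocol(ids_alert_rate, false_positive_rate=0.3, breach_penalty=200.0):
        # Crude estimate of the probability that a real intrusion is in progress.
        p_byzantine = max(0.0, ids_alert_rate - false_positive_rate)
        cost = {name: perf + p_byzantine * exposure * breach_penalty
                for name, (perf, exposure) in protocols.items()}
        return min(cost, key=cost.get)

    for alert_rate in (0.0, 0.31, 0.9):
        print(f"IDS alert rate {alert_rate:.2f} -> switch to {choose_protocol(alert_rate)}")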