Biblio

Filters: Keyword is fault tolerant computing
2021-03-01
Houzé, É, Diaconescu, A., Dessalles, J.-L., Mengay, D., Schumann, M..  2020.  A Decentralized Approach to Explanatory Artificial Intelligence for Autonomic Systems. 2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C). :115–120.
While Explanatory AI (XAI) is attracting increasing interest from academic research, most AI-based solutions still rely on black box methods. This is unsuitable for certain domains, such as smart homes, where transparency is key to gaining user trust and solution adoption. Moreover, smart homes are challenging environments for XAI, as they are decentralized systems that undergo runtime changes. We aim to develop an XAI solution for addressing problems that an autonomic management system either could not resolve or resolved in a surprising manner. This implies situations where the current state of affairs is not what the user expected, hence requiring an explanation. The objective is to solve the apparent conflict between expectation and observation through understandable logical steps, thus generating an argumentative dialogue. While focusing on the smart home domain, our approach is intended to be generic and transferable to other cyber-physical systems offering similar challenges. This position paper focuses on proposing a decentralized algorithm, called D-CAN, and its corresponding generic decentralized architecture. This approach is particularly suited for SISSY systems, as it enables XAI functions to be extended and updated when devices join and leave the managed system dynamically. We illustrate our proposal via several representative case studies from the smart home domain.
2020-12-14
Pilet, A. B., Frey, D., Taïani, F..  2020.  Foiling Sybils with HAPS in Permissionless Systems: An Address-based Peer Sampling Service. 2020 IEEE Symposium on Computers and Communications (ISCC). :1–6.
Blockchains and distributed ledgers have brought renewed interest in Byzantine fault-tolerant protocols and decentralized systems, two domains studied for several decades. Recent promising works have in particular proposed to use epidemic protocols to overcome the limitations of popular Blockchain mechanisms, such as proof-of-stake or proof-of-work. These works unfortunately assume a perfect peer-sampling service, immune to malicious attacks, a property that is difficult and costly to achieve. We revisit this fundamental problem in this paper, and propose a novel Byzantine-tolerant peer-sampling service that is resilient to Sybil attacks in open systems by exploiting the underlying structure of wide-area networks.
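
The paper's HAPS construction is not reproduced here; the sketch below only illustrates, under assumed names and parameters, the general idea behind address-based Sybil limiting: because Sybil identities tend to cluster in a few address ranges, a sampler can cap how many peers it accepts from any single network prefix.

```python
import random
from collections import defaultdict
from ipaddress import ip_address

def sample_peers(candidates, k, max_per_prefix=2, prefix_bits=16):
    """Return up to k peers, admitting at most max_per_prefix peers from any
    single /prefix_bits address block.  This caps the influence an attacker can
    gain from many Sybil identities hosted in the same address range
    (a hypothetical simplification of address-based peer sampling)."""
    per_prefix = defaultdict(int)
    sample = []
    for peer in random.sample(candidates, len(candidates)):  # shuffled copy
        prefix = int(ip_address(peer)) >> (32 - prefix_bits)
        if per_prefix[prefix] < max_per_prefix:
            per_prefix[prefix] += 1
            sample.append(peer)
        if len(sample) == k:
            break
    return sample

peers = ["10.0.0.%d" % i for i in range(1, 50)] + ["192.168.1.%d" % i for i in range(1, 50)]
# Only two peers per /16 are admitted, so Sybil-heavy ranges cannot flood the sample.
print(sample_peers(peers, k=8))
```
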
2020-12-02
Gliksberg, J., Capra, A., Louvet, A., García, P. J., Sohier, D..  2019.  High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract). 2019 IEEE Symposium on High-Performance Interconnects (HOTI). :9–12.
Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of HPC systems. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure. It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete re-routing of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact on running applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software (OpenSM), first for routing execution time to show feasibility at scale, and then for congestion risk under degradation to demonstrate robustness. The latter comparison is done using static analysis of routing tables under random permutation (RP), shift permutation (SP) and all-to-all (A2A) traffic patterns. Results for Dmodc show A2A and RP congestion risks under heavy degradation similar to those of the most stable algorithms compared, and near-optimal SP congestion risk for up to 1% random degradation.
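
Dmodc's exact closed form is given in the paper; the toy sketch below only illustrates the general d-mod-k flavor of such routing: the uplink chosen for a destination is a deterministic function of the destination index, a subtree size, and the number of uplinks, so forwarding tables can be recomputed quickly and consistently after failures. All names and values are illustrative assumptions.

```python
def uplink_for_destination(dest_id, num_uplinks, subtree_size):
    """Toy d-mod-k style port selection: divide the destination index by the
    size of the subtree below this switch (so destinations in the same subtree
    map together), then take the remainder modulo the number of uplinks.
    The rule is deterministic, so tables can be rebuilt quickly after faults."""
    return (dest_id // subtree_size) % num_uplinks

# Example: a switch with 4 uplinks whose downward subtree holds 8 nodes.
table = {d: uplink_for_destination(d, num_uplinks=4, subtree_size=8) for d in range(64)}
print([table[d] for d in range(16)])
```
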
2020-11-04
Torkura, K. A., Sukmana, M. I. H., Strauss, T., Graupner, H., Cheng, F., Meinel, C..  2018.  CSBAuditor: Proactive Security Risk Analysis for Cloud Storage Broker Systems. 2018 IEEE 17th International Symposium on Network Computing and Applications (NCA). :1–10.

Cloud Storage Brokers (CSB) provide seamless and concurrent access to multiple Cloud Storage Services (CSS) while abstracting cloud complexities from end-users. However, this multi-cloud strategy faces several security challenges including enlarged attack surfaces, malicious insider threats, security complexities due to integration of disparate components, and API interoperability issues. Novel security approaches are imperative to tackle these security issues. Therefore, this paper proposes CSBAuditor, a novel cloud security system that continuously audits CSB resources to detect malicious activities and unauthorized changes, e.g., bucket policy misconfigurations, and remediates these anomalies. The cloud state is maintained via a continuous snapshotting mechanism, thereby ensuring fault tolerance. We adopt the principles of chaos engineering by integrating BrokerMonkey, a component that continuously injects failure into our reference CSB system, CloudRAID. Hence, CSBAuditor is continuously tested for efficiency, i.e., its ability to detect the changes injected by BrokerMonkey. CSBAuditor employs security metrics for risk analysis by computing severity scores for detected vulnerabilities using the Common Configuration Scoring System, thereby overcoming the limitation of insufficient security metrics in existing cloud auditing schemes. CSBAuditor has been tested using various strategies including chaos engineering failure injection strategies. Our experimental evaluation validates the efficiency of our approach against the aforementioned security issues with a detection and recovery rate of over 96%.
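
CSBAuditor's rules, snapshotting mechanism, and CCSS scoring are not reproduced here; the following is a minimal sketch, under assumed resource names and attributes, of the core audit step of comparing an expected cloud state against the currently observed one and reporting drift.

```python
def audit(expected_state, observed_state):
    """Compare the last approved snapshot of cloud resources against the
    currently observed state and report drift (a hypothetical simplification
    of a CSB audit cycle; CSBAuditor's actual rules and scoring differ)."""
    findings = []
    for bucket, expected in expected_state.items():
        observed = observed_state.get(bucket)
        if observed is None:
            findings.append((bucket, "missing", None))
            continue
        for key, value in expected.items():
            if observed.get(key) != value:
                findings.append((bucket, key, observed.get(key)))
    return findings

expected = {"backup-bucket": {"acl": "private", "versioning": True}}
observed = {"backup-bucket": {"acl": "public-read", "versioning": True}}
for bucket, attribute, actual in audit(expected, observed):
    print(f"drift detected: {bucket}.{attribute} = {actual!r}")
```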

2020-09-04
Zhang, Xiao, Wang, Yanqiu, Wang, Qing, Zhao, Xiaonan.  2019.  A New Approach to Double I/O Performance for Ceph Distributed File System in Cloud Computing. 2019 2nd International Conference on Data Intelligence and Security (ICDIS). :68–75.
Block storage resources are essential in an Infrastructure-as-a-Service (IaaS) cloud computing system. They are used for storing virtual machine images and offer a persistent storage service even when the virtual machine is off. Distributed storage systems such as Amazon EBS, Cinder, Ceph, and Sheepdog are used to provide block storage services in IaaS. Ceph is widely used as the backend block storage service of the OpenStack platform. It converts block devices into objects of the same size and saves them on the local file system. The performance of block devices provided by Ceph is only 30% of that of hard disks in many cases. The three replicas kept for fault tolerance are commonly considered one of the key issues that affect the performance of Ceph, but our research finds that replicas are not the real reason for the slowdown. In this paper, we present a new approach to accelerate I/O operations. The experimental results show that by using our storage engine, Ceph can offer faster I/O performance than the hard disk in most cases; our new storage engine provides more than three times the performance of the original one.
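
The paper's storage-engine changes are not shown here; as a small illustration of the block-to-object mapping the abstract describes, the sketch below maps a byte offset in a virtual block device to a fixed-size object index and intra-object offset (the 4 MiB object size is an assumption).

```python
def block_to_object(byte_offset, object_size=4 * 1024 * 1024):
    """Map a byte offset inside a virtual block device to the fixed-size
    object that stores it and the offset within that object (a simplified
    view of how a block device is striped over same-size objects; 4 MiB is
    an assumed object size, and the paper's engine changes are not shown)."""
    return byte_offset // object_size, byte_offset % object_size

obj_index, obj_offset = block_to_object(10 * 1024 * 1024 + 123)
print(obj_index, obj_offset)   # -> object 2, offset 2097275
```
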
2020-08-24
Noor, Joseph, Ali-Eldin, Ahmed, Garcia, Luis, Rao, Chirag, Dasari, Venkat R., Ganesan, Deepak, Jalaian, Brian, Shenoy, Prashant, Srivastava, Mani.  2019.  The Case for Robust Adaptation: Autonomic Resource Management is a Vulnerability. MILCOM 2019 - 2019 IEEE Military Communications Conference (MILCOM). :821–826.
Autonomic resource management for distributed edge computing systems provides an effective means of enabling dynamic placement and adaptation in the face of network changes, load dynamics, and failures. However, adaptation in-and-of-itself offers a side channel by which malicious entities can extract valuable information. An attacker can take advantage of autonomic resource management techniques to fool a system into misallocating resources and crippling applications. Using a few scenarios, we outline how attacks can be launched using partial knowledge of the resource management substrate - with as little as a single compromised node. We argue that any system that provides adaptation must consider resource management as an attack surface. As such, we propose ADAPT2, a framework that incorporates concepts taken from Moving-Target Defense and state estimation techniques to ensure correctness and obfuscate resource management, thereby protecting valuable system and application information from leaking.
2020-07-27
Tun, May Thet, Nyaung, Dim En, Phyu, Myat Pwint.  2019.  Performance Evaluation of Intrusion Detection Streaming Transactions Using Apache Kafka and Spark Streaming. 2019 International Conference on Advanced Information Technologies (ICAIT). :25–30.
In the information era, network traffic has become large and complex because of massive Internet-based services and rapidly growing amounts of data. As network traffic has grown, cyberattacks have increased dramatically. Therefore, cybersecurity intrusion detection has been a research challenge in recent years. An intrusion detection system requires high-level protection and must detect modern and complex attacks with high accuracy. Nowadays, big data analytics is key to solving marketing, security, and privacy problems in an extremely competitive financial market and in government. When a huge amount of stream data arrives within a short period of time, it is difficult to analyze it for real-time decision making. Performance analysis is extremely important for administrators and developers to avoid bottlenecks. This paper aims to reduce processing time by using Apache Kafka and Spark Streaming. Experiments on the UNSW-NB15 dataset indicate that the integration of Apache Kafka and Spark Streaming can perform better in terms of processing time and fault tolerance on large amounts of data. According to the results, fault tolerance can be provided by the multiple brokers of Kafka and the parallel recovery of Spark Streaming, while using multiple Apache Kafka partitions increases the processing time of the integrated Kafka and Spark Streaming pipeline.
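
The paper's pipeline is not reproduced here; the sketch below is a minimal PySpark Structured Streaming job that consumes a Kafka topic of traffic records and applies a placeholder detection step. The broker address, topic name, and filter are assumptions, and running it requires Spark's Kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Hypothetical broker address and topic name; the UNSW-NB15 record schema is not shown.
spark = SparkSession.builder.appName("ids-streaming-sketch").getOrCreate()

records = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "network-traffic")
           .load()
           .selectExpr("CAST(value AS STRING) AS record"))

# Placeholder detection step: a real pipeline would parse each record and
# apply a trained classifier here.
alerts = records.filter(col("record").contains("attack"))

query = (alerts.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```
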
Babay, Amy, Tantillo, Thomas, Aron, Trevor, Platania, Marco, Amir, Yair.  2018.  Network-Attack-Resilient Intrusion-Tolerant SCADA for the Power Grid. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). :255–266.
As key components of the power grid infrastructure, Supervisory Control and Data Acquisition (SCADA) systems are likely to be targeted by nation-state-level attackers willing to invest considerable resources to disrupt the power grid. We present Spire, the first intrusion-tolerant SCADA system that is resilient to both system-level compromises and sophisticated network-level attacks and compromises. We develop a novel architecture that distributes the SCADA system management across three or more active sites to ensure continuous availability in the presence of simultaneous intrusions and network attacks. A wide-area deployment of Spire, using two control centers and two data centers spanning 250 miles, delivered nearly 99.999% of all SCADA updates initiated over a 30-hour period within 100ms. This demonstrates that Spire can meet the latency requirements of SCADA for the power grid.
Völp, Marcus, Esteves-Verissimo, Paulo.  2018.  Intrusion-Tolerant Autonomous Driving. 2018 IEEE 21st International Symposium on Real-Time Distributed Computing (ISORC). :130–133.
Fully autonomous driving is one, if not the, killer application for the upcoming decade of real-time systems. However, in the presence of increasingly sophisticated attacks by highly skilled and well-equipped adversarial teams, autonomous driving must not only guarantee timeliness and hence safety. It must also consider the dependability of the software with respect to these properties while the system is facing attacks. For distributed systems, fault-and-intrusion-tolerance toolboxes already offer a few solutions to tolerate partial compromise of the system behind a majority of healthy components operating in consensus. In this paper, we present a concept of an intrusion-tolerant architecture for autonomous driving. In such a scenario, predictability and recovery challenges arise from the inclusion of increasingly more complex software on increasingly less predictable hardware. We highlight how an intrusion-tolerant design can help solve these issues by allowing timeliness to emerge from a majority of complex components being fast enough, often enough, while preserving safety under attack through pre-computed fail-safes.
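
As a minimal illustration of the consensus-style masking the abstract refers to, where a majority of healthy replicas out-votes a compromised one, here is a hedged sketch of a replica voter; the paper's architecture is considerably richer (timing, recovery, rejuvenation).

```python
from collections import Counter

def vote(replica_outputs, quorum):
    """Return the output agreed on by at least `quorum` replicas, or None if
    no agreement is reached (a minimal consensus-style voter; a real
    fault-and-intrusion-tolerant stack also handles timing and recovery)."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    return value if count >= quorum else None

# Example: four replicated trajectory planners, one of them compromised.
outputs = ["brake", "brake", "accelerate", "brake"]
print(vote(outputs, quorum=3))   # -> "brake"
```
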
2020-07-16
Ma, Siyou, Yan, Yunqiang.  2018.  Simulation Testing of Fault-Tolerant CPS Based on Hierarchical Adaptive Policies. 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C). :443–449.

Cyber-physical systems (CPS) are often deployed in safety-critical key infrastructures and fields, and fault-tolerance policies are extensively applied in CPS to improve their credibility; the same-physical-backup (SPB) hardware-redundancy technique is frequently used because of its simple and reliable implementation. To resolve the challenges faced in simulation testing of SPB-CPS, this paper dynamically determines the test resources matched to the CPS scale by using adaptive allocation policies, and establishes hierarchical models and an inter-layer message transmission mechanism. Meanwhile, a collaborative simulation time-sequence push strategy and a sliding-window-based node activity test mechanism are designed to improve the execution efficiency of the simulation test. To validate the effectiveness of the proposed method, we built a fault-tolerant CPS simulation platform. Experiments showed that it can improve SPB-CPS simulation test efficiency.
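
The paper's sliding-window node activity test is described only at a high level; the sketch below is one plausible, simplified reading of it: track each node's recent heartbeats in a time window and flag nodes with no activity inside the window. Class and parameter names are assumptions.

```python
from collections import deque
import time

class ActivityMonitor:
    """Track the last heartbeats of each simulation node in a sliding time
    window and flag nodes with no recent activity (a hypothetical, simplified
    reading of the sliding-window node activity test)."""

    def __init__(self, window_seconds=5.0):
        self.window = window_seconds
        self.heartbeats = {}   # node id -> deque of timestamps

    def heartbeat(self, node_id, now=None):
        now = now if now is not None else time.monotonic()
        self.heartbeats.setdefault(node_id, deque()).append(now)

    def inactive_nodes(self, now=None):
        now = now if now is not None else time.monotonic()
        inactive = []
        for node_id, stamps in self.heartbeats.items():
            while stamps and now - stamps[0] > self.window:
                stamps.popleft()           # drop beats outside the window
            if not stamps:
                inactive.append(node_id)   # no activity inside the window
        return inactive

monitor = ActivityMonitor(window_seconds=5.0)
monitor.heartbeat("primary", now=0.0)
monitor.heartbeat("backup", now=4.0)
print(monitor.inactive_nodes(now=6.0))     # -> ['primary']
```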

2020-07-06
Saffar, Zahra, Mohammadi, Siamak.  2019.  Fault tolerant non-linear techniques for scalar multiplication in ECC. 2019 16th International ISC (Iranian Society of Cryptology) Conference on Information Security and Cryptology (ISCISC). :104–113.
Elliptic curve cryptography (ECC) has a shorter key length than other asymmetric cryptography algorithms, such as RSA, at the same security level. Faults in cryptographic computations can cause faulty results. If a fault occurs during encryption, false information will be sent to the destination, in which case channel error-detection codes are unable to detect the fault. In this paper, we consider error detection in elliptic curve scalar point multiplication, which is the most important operation in ECC. Our technique is based on non-linear error-detection codes. We consider a scalar point multiplication algorithm proposed by a Microsoft research group. The proposed technique has lower overhead for additions (36.36%) and multiplications (34.84%) in total, compared to previous works. Also, the proposed method can detect almost 100% of injected faults.
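
The paper's scheme is based on nonlinear error-detecting codes, which are not reproduced here; for illustration only, the sketch below shows a simpler, generic fault countermeasure on a toy curve: after a double-and-add scalar multiplication, verify that the result still satisfies the curve equation, so a fault that pushes the point off the curve is detected.

```python
# Toy short-Weierstrass curve y^2 = x^3 + a*x + b over F_p; not the paper's
# nonlinear-code scheme, just a generic point-validation fault check.
P_MOD, A, B = 97, 2, 3
INFINITY = None

def ec_add(p1, p2):
    if p1 is INFINITY:
        return p2
    if p2 is INFINITY:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INFINITY
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mult(k, point):
    result, addend = INFINITY, point
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result

def on_curve(point):
    if point is INFINITY:
        return True
    x, y = point
    return (y * y - (x ** 3 + A * x + B)) % P_MOD == 0

base = (3, 6)                      # satisfies 6^2 = 3^3 + 2*3 + 3 mod 97
q = scalar_mult(13, base)
print(q, on_curve(q))              # a fault that corrupts q would very likely fail the check
```
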
2020-05-15
Ravikumar, C.P., Swamy, S. Kendaganna, Uma, B.V..  2019.  A hierarchical approach to self-test, fault-tolerance and routing security in a Network-on-Chip. 2019 IEEE International Test Conference India (ITC India). :1–6.
Since the performance of bus interconnects does not scale with the number of processors connected to the bus, chip multiprocessors make use of on-chip networks that implement packet switching and virtual channel flow control to efficiently transport data. In this paper, we consider the test and fault-tolerance aspects of such a network-on-chip (NoC). Past work in this area has addressed the communication efficiency and deadlock-free properties of NoCs, but when routing externally received data, aspects of security must be addressed: a malicious denial-of-service attack or a power virus can be launched by a malicious external agent. We propose a two-tier solution to this problem, where a local self-test manager in each processing element runs test algorithms to detect faults in the local processing element and its associated physical and virtual channels. At the global level, the health of the NoC is tested using a sorting-based algorithm proposed in this paper. Similarly, we propose to handle fault-tolerance and security concerns in routing at two levels. At the local level, each node is capable of fault-tolerant routing by deflecting packets to an alternate path; when doing so, since a deadlock may be created, the local router must be capable of guesstimating a deadlock situation, switching to packet switching instead of flit switching, and attempting to reroute the packet. At the global level, a routing agent gathers fault data and provides fault information to nodes that seek it periodically. Similarly, the agent is capable of detecting malformed packets coming from an external source and prevents such packets from being injected into the network, thereby conserving network bandwidth. The agent also attempts to detect suspected denial-of-service attacks and power viruses and will reject such packets. Use of a two-tier approach helps keep the IP blocks modular and reduces their complexity, thereby making them easier to verify.
2020-05-11
Kinkelin, Holger, Hauner, Valentin, Niedermayer, Heiko, Carle, Georg.  2018.  Trustworthy configuration management for networked devices using distributed ledgers. NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium. :1–5.
Numerous IoT applications, like building automation or process control of industrial sites, exist today. These applications inherently have a strong connection to the physical world. Hence, IT security threats can cause not only problems like data leaks but also safety issues that might harm people. Attacks on IT systems are performed not only by outside attackers but also by insiders such as administrators. For this reason, we present ongoing work on a Byzantine fault-tolerant configuration management system (CMS) that provides control over administrators, restrains their rights, and enforces separation of concerns. We reach this goal by conducting a configuration management process that requires multi-party authorization for critical configurations to prevent individual malicious administrators from performing undesired actions. Only after a configuration has been authorized by multiple experts is it applied to the targeted devices. For the whole configuration management process, our CMS guarantees accountability and traceability. Lastly, our system is tamper-resistant as we leverage Hyperledger Fabric, which provides a distributed execution environment for our CMS and a blockchain-based distributed ledger that we use to store the configurations. A beneficial side effect of this approach is that our CMS is also suitable for managing configurations for infrastructure shared across different organizations that do not need to trust each other.
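
The CMS itself runs on Hyperledger Fabric with real signatures and chaincode; the sketch below only models the multi-party authorization predicate in plain Python: a configuration change is applied only once enough distinct, authorized administrators have approved it. All identifiers are hypothetical.

```python
def may_apply(change, approvals, required=2):
    """Apply a configuration change only if it carries approvals from at
    least `required` distinct, authorized administrators (a plain-Python
    sketch of the multi-party authorization rule; the real system enforces
    this with signatures inside Hyperledger Fabric)."""
    authorized = {a["admin"] for a in approvals
                  if a["change_id"] == change["id"] and a["admin"] in change["approvers"]}
    return len(authorized) >= required

change = {"id": "cfg-42", "device": "plc-7", "approvers": {"alice", "bob", "carol"}}
approvals = [{"change_id": "cfg-42", "admin": "alice"},
             {"change_id": "cfg-42", "admin": "alice"},   # duplicate does not count twice
             {"change_id": "cfg-42", "admin": "bob"}]
print(may_apply(change, approvals))   # -> True
```
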
2020-03-02
Zhang, Xuefei, Liu, Junjie, Li, Yijing, Cui, Qimei, Tao, Xiaofeng, Liu, Ren Ping.  2019.  Blockchain Based Secure Package Delivery via Ridesharing. 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP). :1–6.

Delivery service via ridesharing is a promising service to share travel costs and improve vehicle occupancy. Existing ridesharing systems require participating vehicles to periodically report individual private information (e.g., identity and location) to a central controller, which is a potential central point of failure, resulting in possible data leakage or tampering if the controller breaks down or comes under attack. In this paper, we propose a Blockchain-secured ridesharing delivery system, where the immutability and distributed architecture of the Blockchain can effectively prevent data tampering. However, this tamper-resistance property comes at the cost of a long confirmation delay caused by the consensus process. A Hash-oriented Practical Byzantine Fault Tolerance (PBFT)-based consensus algorithm is proposed to improve the Blockchain efficiency and reduce the transaction confirmation delay from 10 minutes to 15 seconds. The Hash-oriented PBFT effectively avoids the double-spending attack and Sybil attack. Security analysis and simulation results demonstrate that the proposed Blockchain-secured ridesharing delivery system offers strong security guarantees and satisfies the quality of delivery service in terms of confirmation delay and transaction throughput.
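
The hash-oriented PBFT variant is specific to the paper; as a minimal illustration of the quorum logic that any PBFT-style protocol relies on, the sketch below checks whether a block digest has collected 2f+1 matching votes from n = 3f+1 replicas.

```python
def committed(prepare_messages, n, f):
    """Check whether a block digest has gathered a PBFT-style quorum of
    2f + 1 matching prepare votes out of n = 3f + 1 replicas (a minimal
    quorum counter, not the full hash-oriented PBFT protocol)."""
    assert n >= 3 * f + 1
    votes = {}
    for replica, digest in prepare_messages:
        votes.setdefault(digest, set()).add(replica)
    return any(len(replicas) >= 2 * f + 1 for replicas in votes.values())

# 4 replicas tolerate f = 1 fault; three matching votes are enough to commit.
msgs = [("r0", "abc123"), ("r1", "abc123"), ("r2", "abc123"), ("r3", "zzz999")]
print(committed(msgs, n=4, f=1))   # -> True
```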

2020-02-26
Sabbagh, Majid, Gongye, Cheng, Fei, Yunsi, Wang, Yanzhi.  2019.  Evaluating Fault Resiliency of Compressed Deep Neural Networks. 2019 IEEE International Conference on Embedded Software and Systems (ICESS). :1–7.

Model compression is considered to be an effective way to reduce the implementation cost of deep neural networks (DNNs) while maintaining the inference accuracy. Many recent studies have developed efficient model compression algorithms and implementations in accelerators on various devices. Protecting the integrity of DNN inference against fault attacks is important for diverse deep-learning-enabled applications. However, there has been little research investigating the fault resilience of DNNs and the impact of model compression on fault tolerance. In this work, we consider faults on different data types and develop a simulation framework for understanding the fault resiliency of compressed DNN models as compared to uncompressed models. We perform our experiments on two common DNNs, LeNet-5 and VGG16, and evaluate their fault resiliency with different types of compression. The results show that binary quantization can effectively increase the fault resilience of DNN models by 10000x for both LeNet-5 and VGG16. Finally, we propose software and hardware mitigation techniques to increase the fault resiliency of DNN models.
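
The paper's simulation framework covers several data types and fault locations; the sketch below shows one representative ingredient, a random bit-flip fault model applied to an int8 (quantized) weight tensor, with assumed tensor contents and flip rate.

```python
import numpy as np

def inject_bit_flips(weights_int8, rate, rng=None):
    """Flip each bit of an int8 weight tensor independently with probability
    `rate` and return the faulty copy (a simple fault model for studying the
    resilience of compressed networks; the paper's framework covers more
    data types and fault locations)."""
    rng = rng or np.random.default_rng(0)
    bits = np.unpackbits(weights_int8.view(np.uint8)[..., np.newaxis], axis=-1)
    flips = rng.random(bits.shape) < rate
    faulty_bits = bits ^ flips.astype(np.uint8)
    return np.packbits(faulty_bits, axis=-1).squeeze(-1).view(np.int8)

weights = np.array([12, -3, 77, 0], dtype=np.int8)
print(inject_bit_flips(weights, rate=0.05))
```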

Tran, Geoffrey Phi, Walters, John Paul, Crago, Stephen.  2019.  Increased Fault-Tolerance and Real-Time Performance Resiliency for Stream Processing Workloads through Redundancy. 2019 IEEE International Conference on Services Computing (SCC). :51–55.

Data analytics and telemetry have become paramount to monitoring and maintaining quality of service in addition to business analytics. Stream processing, a model where a network of operators receives and processes continuously arriving discrete elements, is well-suited for these needs. Current and previous studies and frameworks have focused on continuity of operations and aggregate performance metrics. However, real-time performance and tail latency are also important. Timing errors caused by either performance faults or failed-communication faults also affect real-time performance more drastically than aggregate metrics. In this paper, we introduce redundancy in the stream data to improve real-time performance and resiliency to timing errors caused by either performance faults or failed-communication faults. We also address limitations in previous solutions using a fine-grained acknowledgment tracking scheme, both to increase the effectiveness of resiliency to performance faults and to enable effectiveness for failed-communication faults. Our results show that fine-grained acknowledgment schemes can improve the tail and mean latencies by approximately 30%. We also show that these schemes can improve resiliency to performance faults compared to existing work. Our improvements result in 47.4% to 92.9% fewer missed deadlines compared to 17.3% to 50.6% for comparable topologies and redundancy levels in the state of the art. Finally, we show that redundancies of 25% to 100% can reduce the number of data elements that miss their deadline constraints by 0.76% to 14.04% for applications with high fan-out and by 7.45% up to 50% for applications with no fan-out.
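
The framework and topology details are in the paper; the sketch below is a simplified model of the fine-grained acknowledgment idea: each element is sent over redundant paths, the first copy to arrive is acknowledged per element, later copies are deduplicated, and deadline misses are computed per element. Names and the deadline value are assumptions.

```python
class RedundantTracker:
    """Track per-element acknowledgments when each stream element is sent
    over several redundant operator paths: the first copy to arrive before
    the deadline satisfies the element, later copies are deduplicated.
    (A simplified model of fine-grained acknowledgment tracking.)"""

    def __init__(self, deadline):
        self.deadline = deadline
        self.delivered = {}          # element id -> arrival time of first copy

    def ack(self, element_id, arrival_time):
        if element_id not in self.delivered:       # ignore duplicate copies
            self.delivered[element_id] = arrival_time

    def missed_deadlines(self, sent_times):
        return [e for e, t in sent_times.items()
                if e not in self.delivered or self.delivered[e] - t > self.deadline]

tracker = RedundantTracker(deadline=0.10)
sent = {"e1": 0.00, "e2": 0.00}
tracker.ack("e1", 0.04)      # fast copy of e1
tracker.ack("e1", 0.30)      # slow redundant copy, ignored
print(tracker.missed_deadlines(sent))   # -> ['e2'] (no copy arrived)
```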

2020-02-17
Zhao, Guowei, Zhao, Rui, Wang, Qiang, Xue, Hui, Luo, Fang.  2019.  Virtual Network Mapping Algorithm for Self-Healing of Distribution Network. 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). :1442–1445.
This paper focuses on how to provide virtual networks (VNs) with survivability against node failures. In survivable virtual network embedding (SVNE) that responds to node failures, the backup mechanism provided by the VN initial mapping method should be as flexible as possible, so that backup resources can be shared among VNs; this provides survivability support for the most VNs with the least backup overhead, improves the utilization of backup resources, and also improves the survivability of VNs against multi-node failures. The VN remapping method needs higher efficiency, because it involves remapping both virtual nodes and the related virtual links, so that the affected VN can be restored to a normal state as soon as possible without affecting the user's business experience. Considering that SVNE methods that actively respond to node failures always dedicate a certain amount of backup resources, this paper provides an SVNE method that passively responds to node failures, and mainly introduces its survivable virtual network initial mapping method based on physical node recoverability.
Khalil, Kasem, Eldash, Omar, Kumar, Ashok, Bayoumi, Magdy.  2019.  Self-Healing Approach for Hardware Neural Network Architecture. 2019 IEEE 62nd International Midwest Symposium on Circuits and Systems (MWSCAS). :622–625.
Neural networks are used in many applications, and guarding their performance against faults is a research challenge. A self-healing neural network is a promising concept for achieving reliability, that is, the ability to detect and fix a fault in the system automatically. Most current self-healing neural networks are based on replication of hardware nodes, which causes significant area overhead. The proposed self-healing approach results in a modest area overhead and is suitable for complex neural networks. The proposed method is based on a shared operation and a spare node in each layer that compensates for any faulty node in the layer. Each faulty node is compensated for by its neighbor node, which performs the faulty node's operations as well as its own, sequentially. In case the neighbor is also faulty, the spare node compensates for it. The proposed method is implemented in VHDL, and simulation results are obtained on an Arria 10 GX FPGA for different numbers of nodes. The area overhead is very small for a complex network. The reliability of the proposed method is studied and compared with a traditional neural network.
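
The design is implemented in VHDL; the sketch below is only a behavioral, software-level model of the healing rule described in the abstract: a faulty node's work moves to its neighbor (which then runs two operations sequentially), and to the layer's spare node if the neighbor is faulty too.

```python
def heal_layer(num_nodes, faulty):
    """Return an assignment {logical node -> physical executor} under the
    layer-level healing rule: a healthy node runs its own operation, a faulty
    node's work is taken over by its right-hand neighbor (which then runs two
    operations sequentially), and if that neighbor is faulty as well, the
    layer's single spare node takes over (behavioral sketch, not the VHDL)."""
    assignment = {}
    for i in range(num_nodes):
        if i not in faulty:
            assignment[i] = f"node-{i}"
        elif (i + 1) % num_nodes not in faulty:
            assignment[i] = f"node-{(i + 1) % num_nodes}"   # neighbor compensates
        else:
            assignment[i] = "spare"                         # spare node compensates
    return assignment

print(heal_layer(num_nodes=4, faulty={1}))       # node-2 also does node 1's work
print(heal_layer(num_nodes=4, faulty={1, 2}))    # spare covers node 1, node-3 covers node 2
```
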
Hao, Lina, Ng, Bryan.  2019.  Self-Healing Solutions for Wi-Fi Networks to Provide Seamless Handover. 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). :639–642.
The dynamic nature of the wireless channel poses a challenge to services requiring seamless and uniform network quality of service (QoS). Self-healing, a promising approach under the self-organizing networks (SON) paradigm, has been shown to deal with unexpected network faults in cellular networks. In this paper, we use simple machine learning (ML) algorithms inspired by SON developments in cellular networks. Evaluation results show that the proposed approach identifies the faulty APs. Our proposed approach improves throughput by 63.6% and reduces packet loss rate by 16.6% compared with standard 802.11.
2019-03-04
Lin, F., Beadon, M., Dixit, H. D., Vunnam, G., Desai, A., Sankar, S..  2018.  Hardware Remediation at Scale. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). :14–17.
Large scale services have automated hardware remediation to maintain the infrastructure availability at a healthy level. In this paper, we share the current remediation flow at Facebook, and how it is being monitored. We discuss a class of hardware issues that are transient and typically have higher rates during heavy load. We describe how our remediation system was enhanced to be efficient in detecting this class of issues. As hardware and systems change in response to the advancement in technology and scale, we have also utilized machine learning frameworks for hardware remediation to handle the introduction of new hardware failure modes. We present an ML methodology that uses a set of predictive thresholds to monitor remediation efficiency over time. We also deploy a recommendation system based on natural language processing, which is used to recommend repair actions for efficient diagnosis and repair. We also describe current areas of research that will enable us to improve hardware availability further.
2018-06-07
Marques, J., Andrade, J., Falcao, G..  2017.  Unreliable memory operation on a convolutional neural network processor. 2017 IEEE International Workshop on Signal Processing Systems (SiPS). :1–6.

The evolution of convolutional neural networks (CNNs) into more complex forms of organization, with additional layers, larger convolutions, and increasing connections, established the state of the art in terms of error rates for detection and classification challenges in images. Moreover, as they evolved to a point where gigabytes of memory are required for their operation, we have reached a stage where it becomes fundamental to understand how their inference capabilities can be impaired if data elements somehow become corrupted in memory. This paper introduces fault injection in these systems by simulating failing bit-cells in hardware memories, brought on by relaxing the assumption of 100% reliable operation. We analyze the behavior of these networks when calculating inference under severe fault-injection rates and apply fault-mitigation strategies to improve the CNNs' resilience. For the MNIST dataset, we show that 8x less memory is required for the feature-map memory space, and that under sub-100% reliable operation, fault-injection rates up to 10^-1 (with most-significant-bit protection) can be withstood with only a 1% degradation in error probability. Furthermore, considering the offload of the feature-map memory to an embedded dynamic RAM (eDRAM) system using technology nodes from 65 down to 28 nm, up to 73-80% improved power efficiency can be obtained.
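
The paper's fault-injection setup and eDRAM modeling are not reproduced here; the sketch below shows a simplified version of the fault model it describes: each bit cell of a feature-map word fails independently with some probability, optionally leaving the most significant bit protected. Shapes and rates are assumptions.

```python
import numpy as np

def fail_bitcells(feature_map, cell_failure_rate, protect_msb=True, rng=None):
    """Model unreliable feature-map memory: each bit cell flips independently
    with probability `cell_failure_rate`; when `protect_msb` is set, the most
    significant bit of every 8-bit word is assumed to be stored reliably and
    is never flipped (a simplified version of the paper's fault model)."""
    rng = rng or np.random.default_rng(1)
    fmap = feature_map.astype(np.uint8)
    faulty = fmap.copy()
    for bit in range(8):
        if protect_msb and bit == 7:
            continue                                   # MSB cells kept reliable
        flips = (rng.random(fmap.shape) < cell_failure_rate).astype(np.uint8)
        faulty = faulty ^ (flips * np.uint8(1 << bit))
    return faulty

fmap = np.arange(16, dtype=np.uint8).reshape(4, 4)
print(fail_bitcells(fmap, cell_failure_rate=0.1))
```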

2018-03-05
Mfula, H., Nurminen, J. K..  2017.  Adaptive Root Cause Analysis for Self-Healing in 5G Networks. 2017 International Conference on High Performance Computing Simulation (HPCS). :136–143.

Root cause analysis (RCA) is a common and recurring task performed by operators of cellular networks. It is done mainly to keep customers satisfied with the quality of offered services and to maximize return on investment (ROI) by minimizing and, where possible, eliminating the root causes of faults in cellular networks. Currently, the actual detection and diagnosis of faults or potential faults is still a manual and slow process often carried out by network experts who manually analyze and correlate various pieces of network data such as alarms, call traces, configuration management (CM) and key performance indicator (KPI) data in order to come up with the most probable root cause of a given network fault. In this paper, we propose an automated fault detection and diagnosis solution called adaptive root cause analysis (ARCA). The solution uses measurements and other network data together with Bayesian network theory to perform automated evidence-based RCA. Compared to the current common practice, our solution is faster due to automation of the entire RCA process. The solution is also cheaper because it needs fewer or no personnel in order to operate, and it improves efficiency through domain knowledge reuse during adaptive learning. As it uses a probabilistic Bayesian classifier, it can work with incomplete data and it can handle large datasets with complex probability combinations. Experimental results from stratified synthesized data affirmatively validate the feasibility of using such a solution as a key part of self-healing (SH), especially in emerging self-organizing network (SON)-based solutions in LTE Advanced (LTE-A) and 5G.
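
ARCA builds a full Bayesian network from live network data; the sketch below is only a toy naive-Bayes stand-in with made-up priors and likelihoods, showing how discretized KPI and alarm symptoms can be turned into a ranked list of probable root causes.

```python
import math

# Hypothetical prior and conditional probabilities; a deployed system would
# learn these from alarms, call traces, CM and KPI data.
PRIORS = {"antenna_fault": 0.2, "congestion": 0.5, "misconfiguration": 0.3}
LIKELIHOODS = {
    "antenna_fault":    {"low_rsrp": 0.9, "high_drop_rate": 0.6, "high_load": 0.2},
    "congestion":       {"low_rsrp": 0.1, "high_drop_rate": 0.7, "high_load": 0.9},
    "misconfiguration": {"low_rsrp": 0.4, "high_drop_rate": 0.5, "high_load": 0.3},
}

def most_probable_cause(observed_symptoms):
    """Rank candidate root causes by naive-Bayes posterior given the boolean
    symptoms observed (a toy stand-in for the paper's Bayesian-network RCA)."""
    scores = {}
    for cause, prior in PRIORS.items():
        log_p = math.log(prior)
        for symptom, present in observed_symptoms.items():
            p = LIKELIHOODS[cause][symptom]
            log_p += math.log(p if present else 1.0 - p)
        scores[cause] = log_p
    return max(scores, key=scores.get), scores

cause, _ = most_probable_cause({"low_rsrp": True, "high_drop_rate": True, "high_load": False})
print(cause)   # -> antenna_fault (under these made-up probabilities)
```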

2017-12-28
Chatti, S., Ounelli, H..  2017.  Fault Tolerance in a Cloud of Databases Environment. 2017 31st International Conference on Advanced Information Networking and Applications Workshops (WAINA). :166–171.

We focus on the concept of serializability in order to ensure the correct processing of transactions. However, both serializability and related properties of transaction-based applications might be affected. Ensuring transaction serializability in corrupted systems is one of the requirements for properly handling interrelated transactions, as it prevents blocking situations in which neither a transaction nor its related sub-transactions can commit. In addition, some transactions may be marked as malicious and compromise the serializability of the running system. In this context, this paper proposes an approach for processing transactions in a cloud-of-databases environment that secures serializability of running transactions whether the system is compromised or not. We also propose an intrusion-tolerant scheme to ensure the continuity of the running transactions. A case study and simulation results are shown to illustrate the capabilities of the suggested system.

Mehetrey, P., Shahriari, B., Moh, M..  2016.  Collaborative Ensemble-Learning Based Intrusion Detection Systems for Clouds. 2016 International Conference on Collaboration Technologies and Systems (CTS). :404–411.

Cloud computation has become prominent, with a seemingly unlimited amount of storage and computation available to users. Yet, security is a major issue that hampers the growth of the cloud. In this research we investigate a collaborative Intrusion Detection System (IDS) based on the ensemble learning method. It uses weak classifiers and allows the use of the cloud's untapped resources to detect various types of attacks on the cloud system. In the proposed system, tasks are distributed among available virtual machines (VMs), and individual results are then merged for the final adaptation of the learning model. Performance evaluation is carried out using decision trees and fuzzy classifiers on KDD99, one of the largest datasets for IDS. Segmentation of the dataset is done in order to mimic the behavior of real-time data traffic as it occurs in a real cloud environment. The experimental results show that the proposed approach reduces execution time with improved accuracy, and is fault-tolerant when handling VM failures. The system is a proof-of-concept model for a scalable, cloud-based distributed system that is able to explore untapped resources, and may be used as a base model for a real-time hierarchical IDS.

2015-05-06
Zhu, Yunfeng, Lee, P. P. C., Xu, Yinlong, Hu, Yuchong, Xiang, Liping.  2014.  On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems. IEEE Transactions on Parallel and Distributed Systems. 25:1830–1840.

Modern storage systems stripe redundant data across multiple nodes to provide availability guarantees against node failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must also provide fast failure recovery to reduce the window of vulnerability. This work addresses the problem of speeding up the recovery of a single-node failure for general XOR-based erasure codes. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further extend the algorithm to adapt to the scenario where nodes have heterogeneous capabilities (e.g., processing power and transmission bandwidth). We implement our replace recovery algorithm atop a parallelized architecture to demonstrate its feasibility. We conduct experiments on a networked storage system testbed, and show that our replace recovery algorithm uses less recovery time than the conventional recovery approach.
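
The replace-recovery algorithm's neighborhood and cost model for XOR-based codes are specific to the paper; the sketch below gives only a generic hill-climbing skeleton in that spirit, starting from an initial recovery solution and greedily moving to cheaper neighboring solutions, with a toy cost model standing in for the real recovery constraints.

```python
import random

def hill_climb(initial, neighbors, cost, max_steps=1000, rng=None):
    """Generic hill-climbing skeleton in the spirit of the replace-recovery
    search: start from the conventional recovery solution and repeatedly move
    to a neighboring solution whenever it lowers the recovery cost, stopping
    at a local optimum (the paper's actual neighborhood and cost model for
    XOR-based codes are more specific)."""
    rng = rng or random.Random(0)
    current, current_cost = initial, cost(initial)
    for _ in range(max_steps):
        candidates = neighbors(current, rng)
        best = min(candidates, key=cost, default=None)
        if best is None or cost(best) >= current_cost:
            return current, current_cost          # local optimum reached
        current, current_cost = best, cost(best)
    return current, current_cost

# Toy usage: pick 4 of 8 surviving blocks to read, minimizing the total bytes
# read; a neighbor swaps one chosen block for an unchosen one.
SIZES = [4, 9, 2, 7, 5, 1, 8, 3]

def toy_cost(solution):
    return sum(SIZES[i] for i in solution)

def toy_neighbors(solution, rng):
    chosen = set(solution)
    swaps = []
    for _ in range(6):
        out_block = rng.choice(sorted(chosen))
        in_block = rng.choice([i for i in range(len(SIZES)) if i not in chosen])
        swaps.append(tuple(sorted(chosen - {out_block} | {in_block})))
    return swaps

best, reads = hill_climb((0, 1, 2, 3), toy_neighbors, toy_cost)
print(best, reads)   # converges toward the four cheapest blocks
```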