Biblio
Root cause analysis (RCA) is a common and recurring task performed by operators of cellular networks. It is done mainly to keep customers satisfied with the quality of offered services and to maximize return on investment (ROI) by minimizing and where possible eliminating the root causes of faults in cellular networks. Currently, the actual detection and diagnosis of faults or potential faults is still a manual and slow process often carried out by network experts who manually analyze and correlate various pieces of network data such as, alarms, call traces, configuration management (CM) and key performance indicator (KPI) data in order to come up with the most probable root cause of a given network fault. In this paper, we propose an automated fault detection and diagnosis solution called adaptive root cause analysis (ARCA). The solution uses measurements and other network data together with Bayesian network theory to perform automated evidence based RCA. Compared to the current common practice, our solution is faster due to automation of the entire RCA process. The solution is also cheaper because it needs fewer or no personnel in order to operate and it improves efficiency through domain knowledge reuse during adaptive learning. As it uses a probabilistic Bayesian classifier, it can work with incomplete data and it can handle large datasets with complex probability combinations. Experimental results from stratified synthesized data affirmatively validate the feasibility of using such a solution as a key part of self-healing (SH) especially in emerging self-organizing network (SON) based solutions in LTE Advanced (LTE-A) and 5G.
NoCs are a well established research topic and several Implementations have been proposed for Self-healing. Self-healing refers to the ability of a system to detect faults or failures and fix them through healing or repairing. The main problems in current self-healing approaches are area overhead and scalability for complex structure since they are based on redundancy and spare blocks. Also, faulty router can isolate PE from other router nodes which can reduce the overall performance of the system. This paper presents a self-healing for a router to avoid denied fault PE function and isolation PE from other nodes. In the proposed design, the neighbor routers receive signal from a faulty router which keeps them to send the data packet which has only faulted router destination to a faulty router. Control unite turns on switches to connect four input ports to local ports successively to send coming packets to PE. The reliability of the proposed technique is studied and compared to conventional system with different failure rates. This approach is capable of healing 50% of the router. The area overhead is 14% for the proposed approach which is much lower compared to other approaches using redundancy.
The restoration of power distribution systems has a crucial role in the electric utility environment, taking into account both the pressure experienced by the operators that must choose the corrective actions to be followed in emergency restoration plans and the goals imposed by the regulatory agencies. In this sense, decision-aiding systems and self-healing networks may be good alternatives since they either perform an automated analysis of the situation, providing consistent and high-quality restoration plans, or even directly perform the restoration fast and automatically in both cases reducing the impacts caused by network disturbances. This work proposes a new restoration strategy which is novel in the sense it deals with the problem from the operator viewpoint, without simplifications that are used in most literature works. In this proposal, a permutation based genetic algorithm is employed to restore the maximum amount of loads, in real time, without depending on a priori knowledge of the location of the fault. To validate the proposed methodology two large real systems were tested: one with 2 substations, 5 feeders, 703 buses, and 132 switches, and; the other with 3 substations, 7 feeders, 21,633 buses, and 2,808 switches. These networks were tested considering situations of single and multiple failures. The results obtained were achieved with very low processing time (of the order of ten seconds), while compliance with all operational requirements was ensured.
Existing compact routing schemes, e.g., Thorup and Zwick [SPAA 2001] and Chechik [PODC 2013], often have no means to tolerate failures, once the system has been setup and started. This paper presents, to our knowledge, the first self-healing compact routing scheme. Besides, our schemes are developed for low memory nodes, i.e., nodes need only O(log2 n) memory, and are thus, compact schemes. We introduce two algorithms of independent interest: The first is CompactFT, a novel compact version (using only O(log n) local memory) of the self-healing algorithm Forgiving Tree of Hayes et al. [PODC 2008]. The second algorithm (CompactFTZ) combines CompactFT with Thorup-Zwick's tree-based compact routing scheme [SPAA 2001] to produce a fully compact self-healing routing scheme. In the self-healing model, the adversary deletes nodes one at a time with the affected nodes self-healing locally by adding few edges. CompactFT recovers from each attack in only O(1) time and Δ messages, with only +3 degree increase and O(logΔ) graph diameter increase, over any sequence of deletions (Δ is the initial maximum degree). Additionally, CompactFTZ guarantees delivery of a packet sent from sender s as long as the receiver t has not been deleted, with only an additional O(y logΔ) latency, where y is the number of nodes that have been deleted on the path between s and t. If t has been deleted, s gets informed and the packet removed from the network.
In the context of service-oriented applications, the self-healing property provides reliable execution in order to support failures and assist automatic recovery techniques. This paper presents a knowledge-based approach for self-healing Composite Service (CS) applications. A CS is an application composed by a set of services interacting each other and invoked on the Web. Our approach is supported by Service Agents, which are in charge of the CS fault-tolerance execution control, making decisions about the selection of recovery and proactive strategies. Service Agents decisions are based on the information they have about the whole application, about themselves, and about what it is expected and what it is really happening at run-time. Hence, application knowledge for decision making comprises off-line precomputed global and local information, user QoS preferences, and propagated actual run-time information. Our approach is evaluated experimentally using a case study.
The popularity of Cloud computing has considerably increased during the last years. The increase of Cloud users and their interactions with the Cloud infrastructure raise the risk of resources faults. Such a problem can lead to a bad reputation of the Cloud environment which slows down the evolution of this technology. To address this issue, the dynamic and the complex architecture of the Cloud should be taken into account. Indeed, this architecture requires that resources protection and healing must be transparent and without external intervention. Unlike previous work, we suggest integrating the fundamental aspects of autonomic computing in the Cloud to deal with the self-healing of Cloud resources. Starting from the high degree of match between autonomic computing systems and multiagent systems, we propose to take advantage from the autonomous behaviour of agent technology to create an intelligent Cloud that supports autonomic aspects. Our proposed solution is a multi-agent system which interacts with the Cloud infrastructure to analyze the resources state and execute Checkpoint/Replication strategy or migration technique to solve the problem of failed resources.
A small battery driven bio-patch, attached to the human body and monitoring various vital signals such as temperature, humidity, heart activity, muscle and brain activity, is an example of a highly resource constrained system, that has the demanding task to assess correctly the state of the monitored subject (healthy, normal, weak, ill, improving, worsening, etc.), and its own capabilities (attached to subject, working sensors, sufficient energy supply, etc.). These systems and many other systems would benefit from a sense of itself and its environment to improve robustness and sensibility of its behavior. Although we can get inspiration from fields like neuroscience, robotics, AI, and control theory, the tight resource and energy constraints imply that we have to understand accurately what technique leads to a particular feature of awareness, how it contributes to improved behavior, and how it can be implemented cost-efficiently in hardware or software. We review the concepts of environment- and self-models, semantic interpretation, semantic attribution, history, goals and expectations, prediction, and self-inspection, how they contribute to awareness and self-awareness, and how they contribute to improved robustness and sensibility of behavior.