Biblio

Filters: Keyword is System recovery
2017-05-16
Najafi, Ali, Rudell, Jacques C., Sathe, Visvesh.  2016.  Regenerative Breaking: Recovering Stored Energy from Inactive Voltage Domains for Energy-efficient Systems-on-Chip. Proceedings of the 2016 International Symposium on Low Power Electronics and Design. :94–99.

Modern Systems-on-Chip (SoCs) frequently power off individual voltage domains to save leakage power across a variety of applications, from large-scale heterogeneous computing to ultra-low power systems in IoT applications. However, the considerable energy stored within the capacitance of the powered-off domain is lost through leakage. In this paper, we present an approach to leverage existing voltage regulators to recover this energy from the disabled voltage domain back into the supply using a low-overhead all-digital runtime control system. Simulation experiments conducted in an industrial 65nm CMOS process indicate that over 90% of the stored energy can be recovered across a range of operating system voltages from 0.4 V to 1 V.
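
As a rough illustration of the quantity at stake (not taken from the paper), the energy held on a domain's capacitance follows the standard capacitor relation. The symbols C (total domain capacitance), V_DD (domain supply voltage) and η (recovered fraction) are names assumed here for the sketch.

```latex
E_{\text{stored}} = \tfrac{1}{2}\, C\, V_{DD}^{2},
\qquad
E_{\text{recovered}} = \eta\, E_{\text{stored}}, \quad \eta > 0.9
```

For a hypothetical domain with C = 10 nF at V_DD = 1 V, the stored energy is 5 nJ, so recovering over 90% returns more than 4.5 nJ to the supply each time the domain is powered off.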

Yao, Chang, Agrawal, Divyakant, Chen, Gang, Ooi, Beng Chin, Wu, Sai.  2016.  Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases. Proceedings of the 2016 International Conference on Management of Data. :1119–1134.

By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, in-memory systems still need to flush the log to disk, which incurs a substantial number of I/Os. Recently, command logging has been proposed to replace the traditional data log (e.g., ARIES logging) in in-memory databases. Instead of recording how the tuples are updated, command logging only tracks the transactions that are being executed, thereby effectively reducing the size of the log and improving the performance. However, when a failure occurs, all the transactions in the log after the last checkpoint must be redone sequentially and this significantly increases the cost of recovery. In this paper, we first extend the command logging technique to a distributed system, where all the nodes can perform their recovery in parallel. We show that in a distributed system, the only bottleneck of recovery caused by command logging is the synchronization process that attempts to resolve the data dependency among the transactions. We then propose an adaptive logging approach by combining data logging and command logging. The percentage of data logging versus command logging becomes a tuning knob between the performance of transaction processing and recovery to meet different OLTP requirements, and a model is proposed to guide such tuning. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging.
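
The difference between the two log types described above can be sketched with a toy key-value store. This is only an illustration under stated assumptions: `ToyStore`, the stored procedures and both recovery functions are hypothetical names, the store is single-node, and nothing here reflects H-Store's actual implementation.

```python
from dataclasses import dataclass, field


# Hypothetical stored procedures. Command logging relies on them being
# deterministic, because recovery re-executes them.
def deposit(db, account, amount):
    db[account] = db.get(account, 0) + amount


def transfer(db, src, dst, amount):
    db[src] = db.get(src, 0) - amount
    db[dst] = db.get(dst, 0) + amount


PROCEDURES = {"deposit": deposit, "transfer": transfer}


@dataclass
class ToyStore:
    db: dict = field(default_factory=dict)
    command_log: list = field(default_factory=list)  # (procedure name, args)
    data_log: list = field(default_factory=list)     # (key, new value)

    def execute(self, name, *args):
        before = dict(self.db)
        PROCEDURES[name](self.db, *args)
        # Command log: one small record per transaction.
        self.command_log.append((name, args))
        # Data log: one record per tuple the transaction changed.
        for key, value in self.db.items():
            if before.get(key) != value:
                self.data_log.append((key, value))


def recover_from_commands(checkpoint, command_log):
    """Re-execute every logged transaction in order (sequential redo)."""
    db = dict(checkpoint)
    for name, args in command_log:
        PROCEDURES[name](db, *args)
    return db


def recover_from_data(checkpoint, data_log):
    """Reinstall logged tuple values; no transaction logic is re-run."""
    db = dict(checkpoint)
    for key, value in data_log:
        db[key] = value
    return db


store = ToyStore()
store.execute("deposit", "alice", 100)
store.execute("deposit", "bob", 20)
store.execute("transfer", "alice", "bob", 30)

from_commands = recover_from_commands({}, store.command_log)
from_data = recover_from_data({}, store.data_log)
```

Both recovery paths rebuild the same state, but the command log holds fewer records while its replay must re-run every transaction. An adaptive scheme in the spirit of the abstract would choose, per transaction, which of the two records to write.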

Lee, Sungwon, Moon, Eunbae, Kim, Dongkyun.  2016.  Consistency of Path Based Upward Path Recovery Method to Reduce Path Recovery Delay for RPL. Proceedings of the International Conference on Research in Adaptive and Convergent Systems. :117–120.

In IoT (Internet of Things) networks, RPL (IPv6 Routing Protocol for Low-Power and Lossy Networks) is preferred for reducing routing overhead. In RPL, a node selects as its parent the neighbor with the lowest routing metric, and stores the other neighbors as immediate successors. If the selected parent node is lost, the node selects a new parent from among the immediate successors. However, if the new path includes the same intermediate node that was lost on the previous path, upward packets still cannot be delivered. This procedure may repeat until a path is selected that does not include the lost intermediate node. In this paper, we therefore propose a new path recovery method to reduce the unnecessary repetition in upward path recovery. When a node receives a routing message, it calculates a hash value and sets a 1 in a new field of the routing message. Based on this field, the node estimates the approximate number of ancestors shared between paths. When loss of the upward path is detected, the node selects a new path according to both this approximate number and the routing metric. As a result, a new path that does not share ancestors with the previous path is selected, and data packet transmission can resume immediately.
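
One plausible reading of the hash-based field is a small Bloom-filter-style bitmap in which each ancestor on a path sets one bit. The sketch below follows that reading. The field width, the use of SHA-256, the node names and every function name are assumptions made for illustration, not details from the paper or from the RPL specification.

```python
import hashlib

FIELD_BITS = 32  # assumed width of the new routing-message field


def bit_for(node_id: str) -> int:
    """Map a node identifier to a single bit position in the field."""
    digest = hashlib.sha256(node_id.encode()).digest()
    return 1 << (int.from_bytes(digest[:4], "big") % FIELD_BITS)


def path_field(ancestors) -> int:
    """Field value after every ancestor on the path has set its bit."""
    value = 0
    for node_id in ancestors:
        value |= bit_for(node_id)
    return value


def shared_estimate(field_a: int, field_b: int) -> int:
    """Approximate count of shared ancestors: bits set in both fields."""
    return bin(field_a & field_b).count("1")


def pick_new_parent(failed_field: int, candidates):
    """candidates: iterable of (parent_id, field, routing_metric).

    Prefer the candidate sharing the fewest ancestors with the failed
    path, and break ties with the routing metric (lower is better).
    """
    best = min(candidates,
               key=lambda c: (shared_estimate(failed_field, c[1]), c[2]))
    return best[0]


failed_path = path_field(["n7", "n3", "root"])
candidates = [
    ("n8", path_field(["n3", "root"]), 2),  # still routes through n3
    ("n9", path_field(["n5", "root"]), 3),  # avoids n3
]
new_parent = pick_new_parent(failed_path, candidates)
```

Barring hash collisions, the candidate that avoids `n3` is preferred even though its routing metric is worse. Collisions make the count an estimate only, which matches the abstract's wording of an "approximate number".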

Mokhtar, Maizura, Hunt, Ian, Burns, Stephen, Ross, Dave.  2016.  Optimising a Waste Heat Recovery System Using Multi-Objective Evolutionary Algorithm. Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion. :913–920.

A waste heat recovery system (WHRS) on a process with variable output is an example of an intermittent renewable process. WHRS recycles waste heat into usable energy. As an example, waste heat produced from refrigeration can be used to provide hot water. However, consistent with most intermittent renewable energy systems, the likelihood of waste heat availability at times of demand is low. For this reason, the WHRS may be coupled with a hot water reservoir (HWR) acting as the energy storage system that aims to maintain desired hot water temperature Td (and therefore energy) at time of demand. The coupling of the WHRS and the HWR must be optimised to ensure higher efficiency given the intermittent mismatch of demand and heat availability. Efficiency of a WHRS can be defined as achieving multiple objectives, including to minimise the need for back-up energy to achieve Td, and to minimise waste heat not captured (when the reservoir volume Vres is too small). This paper investigates the application of a Multi Objective Evolutionary Algorithm (MOEA) to optimise the parameters of the WHRS, including the Vres and depth of discharge (DoD), that affect the WHRS efficiency. Results show that one of the optimum solutions obtained requires the combination of high Vres, high DoD, low water feed in rate, low power external back-up heater and high excess temperature for the HWR to ensure efficiency of the WHRS.
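
Multi-objective evolutionary algorithms generally rank candidates by Pareto dominance rather than by a single score. The sketch below shows only that building block for the two objectives named in the abstract, both minimised. The objective values are invented for illustration, and the function names are not from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (back-up energy needed, waste heat not captured) pairs,
# one per candidate WHRS configuration.
candidates = [(5.0, 1.0), (3.0, 2.0), (4.0, 2.5), (1.0, 6.0)]
front = pareto_front(candidates)
```

Here `(4.0, 2.5)` is dropped because `(3.0, 2.0)` is better on both objectives; the remaining three points are trade-offs among which a designer would still have to choose.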

Koskinen, Eric, Yang, Junfeng.  2016.  Reducing Crash Recoverability to Reachability. Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. :97–108.

Software applications run on a variety of platforms (filesystems, virtual slices, mobile hardware, etc.) that do not provide 100% uptime. As such, these applications may crash at any unfortunate moment losing volatile data and, when re-launched, they must be able to correctly recover from potentially inconsistent states left on persistent storage. From a verification perspective, crash recovery bugs can be particularly frustrating because, even when it has been formally proved for a program that it satisfies a property, the proof is foiled by these external events that crash and restart the program. In this paper we first provide a hierarchical formal model of what it means for a program to be crash recoverable. Our model captures the recoverability of many real world programs, including those in our evaluation which use sophisticated recovery algorithms such as shadow paging and write-ahead logging. Next, we introduce a novel technique capable of automatically proving that a program correctly recovers from a crash via a reduction to reachability. Our technique takes an input control-flow automaton and transforms it into an encoding that blends the capture of snapshots of pre-crash states into a symbolic search for a proof that recovery terminates and every recovered execution simulates some crash-free execution. Our encoding is designed to enable one to apply existing abstraction techniques in order to do the work that is necessary to prove recoverability. We have implemented our technique in a tool called Eleven82, capable of analyzing C programs to detect recoverability bugs or prove their absence. We have applied our tool to benchmark examples drawn from industrial file systems and databases, including GDBM, LevelDB, LMDB, PostgreSQL, SQLite, VMware and ZooKeeper. Within minutes, our tool is able to discover bugs or prove that these fragments are crash recoverable.

Gu, Tianxiao, Sun, Chengnian, Ma, Xiaoxing, Lü, Jian, Su, Zhendong.  2016.  Automatic Runtime Recovery via Error Handler Synthesis. Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. :684–695.

Software systems are often subject to unexpected runtime errors. Automatic runtime recovery (ARR) techniques aim at recovering them from erroneous states and keeping them functional in the field. This paper proposes Ares, a novel, practical approach to performing ARR. Our key insight is to leverage a system's already built-in error handling support to recover from unexpected errors. To this end, we synthesize error handlers via two methods: error transformation and early return. We also equip Ares with a lightweight in-vivo testing infrastructure to select the right synthesis methods and avoid potentially dangerous error handlers. Unlike existing ARR techniques based on heavyweight mechanisms (e.g., checkpoint-restart and runtime monitoring), our approach expands the intrinsic capability of runtime error resilience already existing in software systems to handle unexpected errors. Ares's lightweight mechanism makes it practical and easy to integrate into production environments. We have implemented Ares on top of both the Java HotSpot VM and Android ART, and applied it to 52 real-world bugs. The results are promising: Ares successfully recovers from 39 of them and incurs low overhead.
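
The two synthesis methods named above can be imitated at source level to make the idea concrete. Ares itself works inside the virtual machine, so the Python decorators below are only an analogy: `transform_errors`, `early_return`, `parse_port`, `read_port` and `average` are all hypothetical names, and the interpretation of each method is an assumption drawn from the abstract's wording.

```python
import functools


def transform_errors(expected):
    """Error transformation (as interpreted here): re-raise an unexpected
    exception as a type that the caller's existing handler catches."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except expected:
                raise  # already handled by existing code
            except Exception as exc:
                raise expected(f"transformed from {type(exc).__name__}") from exc
        return inner
    return wrap


def early_return(default):
    """Early return (as interpreted here): abandon the failing call and
    hand a default value back to the caller."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                return default
        return inner
    return wrap


@transform_errors(ValueError)
def parse_port(text):
    return int(text.strip())  # text=None raises AttributeError


def read_port(text):
    try:
        return parse_port(text)
    except ValueError:  # the application's pre-existing handler
        return 8080


@early_return(default=0)
def average(values):
    return sum(values) / len(values)  # empty list raises ZeroDivisionError
```

With these definitions, `read_port(None)` falls back to 8080 through the handler that was written for malformed strings, and `average([])` yields the default 0 instead of propagating the error. Whether such a recovery is safe is exactly what the in-vivo testing described in the abstract is meant to decide.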

Gensh, Rem, Romanovsky, Alexander, Yakovlev, Alex.  2016.  On Structuring Holistic Fault Tolerance. Proceedings of the 15th International Conference on Modularity. :130–133.

Computer systems are developed taking into account that they should be easily maintained in the future. It is one of the main requirements for the sound architectural design. The existing approaches to introducing fault tolerance rely on recursive system structuring out of functional components – this typically results in non-optimal fault tolerance. The paper proposes a vision of structuring complex many-core systems by introducing a special component supporting system-wide fault tolerance coordination. The component acts as a central module making decisions about fault tolerance strategies to be implemented by individual system components depending on the performance and energy requirements specified as system operating modes.

2015-05-06
Hardy, T.L..  2014.  Resilience: A holistic safety approach. Reliability and Maintainability Symposium (RAMS), 2014 Annual. :1-6.

Decreasing the potential for catastrophic consequences poses a significant challenge for high-risk industries. Organizations are under many different pressures, and they are continuously trying to adapt to changing conditions and recover from disturbances and stresses that can arise from both normal operations and unexpected events. Reducing risks in complex systems therefore requires that organizations develop and enhance traits that increase resilience. Resilience provides a holistic approach to safety, emphasizing the creation of organizations and systems that are proactive, interactive, reactive, and adaptive. This approach relies on disciplines such as system safety and emergency management, but also requires that organizations develop indicators and ways of knowing when an emergency is imminent. A resilient organization must be adaptive, using hands-on activities and lessons learned efforts to better prepare it to respond to future disruptions. It is evident from the discussions of each of the traits of resilience, including their limitations, that there are no easy answers to reducing safety risks in complex systems. However, efforts to strengthen resilience may help organizations better address the challenges associated with the ever-increasing complexities of their systems.

Verbeek, F., Schmaltz, J..  2014.  A Decision Procedure for Deadlock-Free Routing in Wormhole Networks. Parallel and Distributed Systems, IEEE Transactions on. 25:1935-1944.

Deadlock freedom is a key challenge in the design of communication networks. Wormhole switching is a popular switching technique, which is also prone to deadlocks. Deadlock analysis of routing functions is a manual and complex task. We propose an algorithm that automatically proves routing functions deadlock-free or outputs a minimal counter-example explaining the source of the deadlock. Our algorithm is the first to automatically check a necessary and sufficient condition for deadlock-free routing. We illustrate its efficiency in a complex adaptive routing function for torus topologies. Results are encouraging. Deciding deadlock freedom is co-NP-Complete for wormhole networks. Nevertheless, our tool proves a 13 × 13 torus deadlock-free within seconds. Finding minimal deadlocks is more difficult. Our tool needs four minutes to find a minimal deadlock in an 11 × 11 torus while it needs nine hours for a 12 × 12 network.
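
For context, a simpler and older check than the decision procedure described above is to build the channel dependency graph of a routing function and test it for cycles: an acyclic graph is a well-known sufficient condition for deadlock freedom. The sketch below implements only that classical check, not the paper's algorithm, and the channel names and example graphs are hypothetical.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph.

    graph maps each channel to the channels a packet holding it
    may request next under the routing function.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # back edge: cyclic dependency
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


# Four channels of a unidirectional ring where packets may wrap around.
ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}

# Same ring, but routing never continues from c3 into c0.
restricted = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": []}

ring_cyclic = has_cycle(ring)
restricted_cyclic = has_cycle(restricted)
```

The unrestricted ring contains a dependency cycle, so this check cannot certify it, while the restricted routing passes. For adaptive routing functions a cycle does not necessarily imply deadlock, which is why a necessary and sufficient condition such as the one in the abstract is harder to decide.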

Chieh-Hao Chang, Jung-Chun Kao, Fu-Wen Chen, Shih Hsun Cheng.  2014.  Many-to-all priority-based network-coding broadcast in wireless multihop networks. Wireless Telecommunications Symposium (WTS), 2014. :1-6.

This paper addresses the minimum transmission broadcast (MTB) problem for the many-to-all scenario in wireless multihop networks and presents a network-coding broadcast protocol with priority-based deadlock prevention. Our main contributions are as follows: First, we relate the many-to-all-with-network-coding MTB problem to a maximum out-degree problem. The solution of the latter can serve as a lower bound for the number of transmissions. Second, we propose a distributed network-coding broadcast protocol, which constructs efficient broadcast trees and dictates nodes to transmit packets in a network coding manner. Besides, we present the priority-based deadlock prevention mechanism to avoid deadlocks. Simulation results confirm that compared with existing protocols in the literature and the performance bound we present, our proposed network-coding broadcast protocol performs very well in terms of the number of transmissions.

Stephens, B., Cox, A.L., Singla, A., Carter, J., Dixon, C., Felter, W..  2014.  Practical DCB for improved data center networks. INFOCOM, 2014 Proceedings IEEE. :1824-1832.

Storage area networking is driving commodity data center switches to support lossless Ethernet (DCB). Unfortunately, to enable DCB for all traffic on arbitrary network topologies, we must address several problems that can arise in lossless networks, e.g., large buffering delays, unfairness, head of line blocking, and deadlock. We propose TCP-Bolt, a TCP variant that not only addresses the first three problems but reduces flow completion times by as much as 70%. We also introduce a simple, practical deadlock-free routing scheme that eliminates deadlock while achieving aggregate network throughput within 15% of ECMP routing. This small compromise in potential routing capacity is well worth the gains in flow completion time. We note that our results on deadlock-free routing are also of independent interest to the storage area networking community. Further, as our hardware testbed illustrates, these gains are achievable today, without hardware changes to switches or NICs.

Bi Hong, Wan Choi.  2014.  Asymptotic analysis of failed recovery probability in a distributed wireless storage system with limited sum storage capacity. Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. :6459-6463.

In distributed wireless storage systems, failed recovery probability depends not only on wireless channel conditions but also on the storage size of each distributed storage node. For efficient utilization of limited storage capacity, we asymptotically analyze the failed recovery probability of a distributed wireless storage system with a sum storage capacity constraint as the signal-to-noise ratio goes to infinity, and find the optimal storage allocation strategy across distributed storage nodes in terms of the asymptotic failed recovery probability. It is also shown that when the number of storage nodes is sufficiently large, the storage size required at each node to achieve a high exponential order of the failed recovery probability is not large.

Silei Xu, Runhui Li, Lee, P.P.C., Yunfeng Zhu, Liping Xiang, Yinlong Xu, Lui, J.C.S..  2014.  Single Disk Failure Recovery for X-Code-Based Parallel Storage Systems. Computers, IEEE Transactions on. 63:995-1007.

In modern parallel storage systems (e.g., cloud storage and data centers), it is important to provide data availability guarantees against disk (or storage node) failures via redundancy coding schemes. One coding scheme is X-code, which is double-fault tolerant while achieving the optimal update complexity. When a disk/node fails, recovery must be carried out to reduce the possibility of data unavailability. We propose an X-code-based optimal recovery scheme called minimum-disk-read-recovery (MDRR), which minimizes the number of disk reads for single-disk failure recovery. We make several contributions. First, we show that MDRR provides optimal single-disk failure recovery and reduces disk reads by about 25 percent compared to the conventional recovery approach. Second, we prove that any optimal recovery scheme for X-code cannot balance disk reads among different disks within a single stripe in general cases. Third, we propose an efficient logical encoding scheme that issues balanced disk reads in a group of stripes for any recovery algorithm (including the MDRR scheme). Finally, we implement our proposed recovery schemes and conduct extensive testbed experiments in a networked storage system prototype. Experiments indicate that MDRR reduces recovery time by around 20 percent compared to the conventional approach, showing that our theoretical findings are applicable in practice.

Azab, M..  2014.  Multidimensional Diversity Employment for Software Behavior Encryption. New Technologies, Mobility and Security (NTMS), 2014 6th International Conference on. :1-5.

Modern cyber systems and their integration with infrastructure have a clear and immense effect on productivity and quality of life. Their involvement in our daily life elevates the need for means to ensure their resilience against attacks and failure. One major threat is the software monoculture. Recent research has demonstrated the danger of software monoculture and presented diversity as a way to reduce the attack surface. In this paper, we propose ChameleonSoft, a multidimensional software diversity employment to, in effect, induce spatiotemporal software behavior encryption and a moving target defense. ChameleonSoft introduces a loosely coupled, online programmable software-execution foundation separating logic, state and physical resources. The elastic construction of the foundation enables ChameleonSoft to define running software as a set of behaviorally-mutated functionally-equivalent code variants. ChameleonSoft intelligently shuffles these variants at runtime while changing their physical location, inducing untraceable confusion and diffusion sufficient to encrypt the execution behavior of the running software. ChameleonSoft is also equipped with an autonomic failure recovery mechanism for enhanced resilience. In order to test the applicability of the proposed approach, we present a prototype of the ChameleonSoft Behavior Encryption (CBE) and recovery mechanisms. Further, using analysis and simulation, we study the performance and security aspects of the proposed system. This study aims to assess the provisioned level of security by measuring the avalanche effect percentage and the induced confusion and diffusion levels to evaluate the strength of the CBE mechanism. Further, we compute the computational cost of security provisioning and enhancing system resilience.

2015-05-01
Achouri, A., Hlaoui, Y.B., Jemni Ben Ayed, L..  2014.  Institution Theory for Services Oriented Applications. Computer Software and Applications Conference Workshops (COMPSACW), 2014 IEEE 38th International. :516-521.

In this paper, we present our approach to the transformation of workflow applications based on institution theory. The workflow application is modeled with a UML Activity Diagram (UML AD). Then, for formal verification purposes, the graphical model is translated to an Event-B specification. Institution theory is used at two levels. First, we define a local semantics for the UML AD and Event-B specifications using a categorical description of each. Second, we define an institution comorphism to link the two defined institutions. Because institution theory is used throughout, the theoretical foundations of our approach are studied within a single mathematical framework. The Event-B specification that results from applying the transformation approach is used for the formal verification of functional properties and for verifying the absence of problems such as deadlock. Additionally, with the institution comorphism, we define the semantic correctness and coherence of the model transformation.

2015-04-30
Wenbing Zhao.  2014.  Application-Aware Byzantine Fault Tolerance. Dependable, Autonomic and Secure Computing (DASC), 2014 IEEE 12th International Conference on. :45-50.

Byzantine fault tolerance has been intensively studied over the past decade as a way to enhance the intrusion resilience of computer systems. However, state-machine-based Byzantine fault tolerance algorithms require deterministic application processing and sequential execution of totally ordered requests. One way of increasing the practicality of Byzantine fault tolerance is to exploit the application semantics, which we refer to as application-aware Byzantine fault tolerance. Application-aware Byzantine fault tolerance makes it possible to facilitate concurrent processing of requests, to minimize the use of Byzantine agreement, and to identify and control replica nondeterminism. In this paper, we provide an overview of recent works on application-aware Byzantine fault tolerance techniques. We elaborate the need for exploiting application semantics for Byzantine fault tolerance and the benefits of doing so, provide a classification of various approaches to application-aware Byzantine fault tolerance, and outline the mechanisms used in achieving application-aware Byzantine fault tolerance according to our classification.
