Biblio

Found 93 results

Filters: Keyword is System recovery

2019-02-14

Leemaster, J., Vai, M., Whelihan, D., Whitman, H., Khazan, R.. 2018. Functionality and Security Co-Design Environment for Embedded Systems. 2018 IEEE High Performance Extreme Computing Conference (HPEC). :1-5.

For decades, embedded systems, ranging from intelligence, surveillance, and reconnaissance (ISR) sensors to electronic warfare and electronic signal intelligence systems, have been an integral part of U.S. Department of Defense (DoD) mission systems. These embedded systems are increasingly the targets of deliberate and sophisticated attacks. Developers thus need to focus equally on functionality and security in both hardware and software development. For critical missions, these systems must be entrusted to perform their intended functions, prevent attacks, and even operate with resilience under attacks. The processor in a critical system must thus provide not only a root of trust, but also a foundation to monitor mission functions, detect anomalies, and perform recovery. We have developed a Lincoln Asymmetric Multicore Processing (LAMP) architecture, which mitigates adversarial cyber effects with separation and cryptography and provides a foundation to build a resilient embedded system. We will describe a design environment that we have created to enable the co-design of functionality and security for mission assurance.

Chen, B., Lu, Z., Zhou, H.. 2018. Reliability Assessment of Distribution Network Considering Cyber Attacks. 2018 2nd IEEE Conference on Energy Internet and Energy System Integration (EI2). :1-6.

With the rapid development of the smart grid, a large number of intelligent sensors and meters have been introduced into the distribution network, which will inevitably increase the integration of physical networks and cyber networks, and bring potential security threats to the operating system. In this paper, the functions of the information system on the distribution network are described when cyber attacks appear at the intelligent electronic devices (IED) or at the distribution main station. The effect analysis of the distribution network under normal operating condition or in the fault recovery process is carried out, and the reliability assessment model of the distribution network considering cyber attacks is constructed. Finally, the IEEE-33-bus distribution system is taken as a test system to present the evaluation process based on the proposed model.

Richard, D. S., Rashidzadeh, R., Ahmadi, M.. 2018. Secure Scan Architecture Using Clock and Data Recovery Technique. 2018 IEEE International Symposium on Circuits and Systems (ISCAS). :1-5.

Design for Testability (DfT) techniques allow devices to be tested at various levels of the manufacturing process. Scan architecture is a dominantly used DfT technique, which supports a high level of fault coverage, observability and controllability. However, scan architecture can be used by hardware attackers to gain critical information stored within the device. The security threats due to the unrestricted access provided by scan architecture have to be addressed to ensure hardware security. In this work, a solution based on the Clock and Data Recovery (CDR) method has been presented to authenticate users and limit the access to the scan architecture to authorized users. Compared to the available solution, the proposed method presents a robust performance and reduces the area overhead by more than 10%.

Zhang, S., Wolthusen, S. D.. 2018. Efficient Control Recovery for Resilient Control Systems. 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC). :1-6.

Resilient control systems should efficiently restore control into physical systems not only after the sabotage of themselves, but also after breaking physical systems. To enhance resilience of control systems, given an originally minimal-input controlled linear time-invariant (LTI) physical system, we address the problem of efficient control recovery into it after removing a known system vertex by finding the minimum number of inputs. According to the minimum input theorem, given a digraph embedded into the LTI model and involving a precomputed maximum matching, this problem is modeled into recovering controllability of it after removing a known network vertex. Then, we recover controllability of the residual network by efficiently finding a maximum matching rather than recomputation. As a result, except for precomputing a maximum matching and the following removed vertex, the worst-case execution time of control recovery into the residual LTI physical system is linear.
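
The matching-based view in this abstract can be illustrated with a small sketch. It assumes the commonly cited form of the minimum input theorem, in which the minimum number of inputs for a digraph with N vertices and maximum matching M is max(N - |M|, 1). The function names are hypothetical, and the code recomputes the matching from scratch after a vertex is removed; it is a baseline for comparison, not the linear-time recovery procedure the paper proposes.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def maximum_matching(n: int, edges: List[Edge]) -> Dict[int, int]:
    """Maximum matching of a digraph on vertices 0..n-1, found with
    augmenting paths. Returns {matched target vertex: source vertex}."""
    adj: Dict[int, List[int]] = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
    match_of_target: Dict[int, int] = {}

    def try_augment(u: int, seen: Set[int]) -> bool:
        # Try to match source u, re-routing earlier matches if needed.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_of_target or try_augment(match_of_target[v], seen):
                match_of_target[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())
    return match_of_target


def minimum_inputs(n: int, edges: List[Edge]) -> int:
    """Unmatched vertices need their own input; at least one input is needed."""
    return max(n - len(maximum_matching(n, edges)), 1)


def minimum_inputs_after_removal(n: int, edges: List[Edge], removed: int) -> int:
    """Baseline: drop one vertex, relabel the rest, and recompute everything."""
    keep = [v for v in range(n) if v != removed]
    relabel = {old: new for new, old in enumerate(keep)}
    residual = [(relabel[u], relabel[v]) for u, v in edges
                if u != removed and v != removed]
    return minimum_inputs(len(keep), residual)
```

For a directed path 0 -> 1 -> 2 -> 3, `minimum_inputs(4, [(0, 1), (1, 2), (2, 3)])` gives one input; removing vertex 1 splits the path, and `minimum_inputs_after_removal` reports two.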

Anand, Priya, Ryoo, Jungwoo. 2018. Architectural Solutions to Mitigate Security Vulnerabilities in Software Systems. Proceedings of the 13th International Conference on Availability, Reliability and Security. :5:1-5:5.

Security issues emerging out of the constantly evolving software applications became a huge challenge to software security experts. In this paper, we propose a prototype to detect vulnerabilities by identifying their architectural sources and also use security patterns to mitigate the identified vulnerabilities. We emphasize the need to consider architectural relations to introduce an effective security solution. In this research, we focused on the taint-style vulnerabilities that can induce injection-based attacks like XSS, SQLI in web applications. With numerous tools available to detect the taint-style vulnerabilities in the web applications, we scanned for the presence of repetition of a vulnerable code pattern in the software. Very importantly, we attempted to identify the architectural source files or modules by developing a tool named ArT Analyzer. We conducted a case study on a leading health-care software by applying the proposed architectural taint analysis and identified the vulnerable spots. We could identify the architectural roots for those vulnerable spots with the use of our tool ArT Analyzer. We verified the results by sharing it with the lead software architect of the project. By adopting an architectural solution, we avoided changes to be done on 252 different lines of code by merely introducing 2 lines of code changes at the architectural roots. Eventually, this solution was integrated into the latest updated release of the health-care software.

Zhao, Z., Lu, W., Ma, J., Li, S., Zhou, L.. 2018. Fast Unloading Transient Recovery of Buck Converters Using Series-Inductor Auxiliary Circuit Based Sequence Switching Control. 2018 IEEE International Power Electronics and Application Conference and Exposition (PEAC). :1-5.

This paper presents a sequence switching control (SSC) scheme for buck converters with a series-inductor auxiliary circuit, aiming at improving the load transient response. During an unloading transient, the series inductor is controlled as a small equivalent inductance so as to achieve a fast transient regulation. While in the steady state, the series inductor behaves as a large inductance to reduce the output current ripple. Furthermore, on the basis of the proposed variable inductance circuit, a SSC control scheme is proposed and implemented in a digital form. With the proposed control scheme the unloading transient event is divided into n+1 sub-periods, and in each sub-period, the capacitor-charge balance principle is used to determine the switching time sequence. Furthermore, its feasibility is validated in experiment with a 12V-3.3V low-voltage high-current synchronous buck converter. Experimental results demonstrate that the voltage overshoot of the proposed SSC scheme has improved more than 74% compared to that of the time-optimal control (TOC) scheme.

Sun, A., Gao, G., Ji, T., Tu, X.. 2018. One Quantifiable Security Evaluation Model for Cloud Computing Platform. 2018 Sixth International Conference on Advanced Cloud and Big Data (CBD). :197-201.

Whether they use a public cloud, a private cloud, or a mixed cloud, users lack effective quantifiable security evaluation methods to grasp the overall security situation of their own information infrastructure. This paper provides a quantifiable security evaluation system for different clouds that can be accessed by a consistent API. The evaluation system includes a security scanning engine, a security recovery engine, a quantifiable security evaluation model, a visual display module, etc. The security evaluation model is composed of a set of evaluation elements corresponding to different fields, such as computing, storage, network, maintenance, application security, etc. Each element is assigned a three-tuple of vulnerabilities, score and repair method. The system adopts a "one vote vetoed" mechanism for one field to count its score, adds up the summary as the total score, and creates one security view. We implement the quantifiable evaluation for different cloud users based on our G-Cloud platform. It shows the dynamic security scanning score for one or multiple clouds with visual graphs and guides users to modify configuration, improve operation and repair vulnerabilities, so as to improve the security of their cloud resources.

Maqbali, F. A., Mitchell, C. J.. 2018. Email-Based Password Recovery - Risking or Rescuing Users? 2018 International Carnahan Conference on Security Technology (ICCST). :1-5.

Secret passwords are very widely used for user authentication to websites, despite their known shortcomings. Most websites using passwords also implement password recovery to allow users to re-establish a shared secret if the existing value is forgotten; many such systems involve sending a password recovery email to the user, e.g. containing a secret link. The security of password recovery, and hence the entire user-website relationship, depends on the email being acted upon correctly; unfortunately, as we show, such emails are not always designed to maximise security and can introduce vulnerabilities into recovery. To understand better this serious practical security problem, we surveyed password recovery emails for 50 of the top English language websites. We investigated a range of security and usability issues for such emails, covering their design, structure and content (including the nature of the user instructions), the techniques used to recover the password, and variations in email content from one web service to another. Many well-known web services, including Facebook, Dropbox, and Microsoft, suffer from recovery email design, structure and content issues. This is, to our knowledge, the first study of its type reported in the literature. This study has enabled us to formulate a set of recommendations for the design of such emails.

Bae, S., Shin, Y.. 2018. An Automated System Recovery Using BlockChain. 2018 Tenth International Conference on Ubiquitous and Future Networks (ICUFN). :897-901.

The existing Disaster Recovery (DR) system has a technique for integrity of the duplicated file to be used for recovery, but it could not be used if the file was changed. In this study, a duplicate file is generated as a block and managed as a block-chain. If the duplicate file is corrupted, the DR system will check the integrity of the duplicated file by referring to the block-chain and proceed with the recovery. The proposed technology is verified through recovery performance evaluation and scenarios.
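
As a rough illustration of the idea of recording duplicate files as chained blocks, the sketch below keeps a SHA-256 hash chain of file digests and checks a duplicate against the latest block before recovery. The block layout, the all-zero genesis hash, and every function name are assumptions made for this example; the abstract does not specify the authors' actual block format or consensus mechanism.

```python
import hashlib
from dataclasses import dataclass
from typing import List

GENESIS_HASH = "0" * 64  # assumed placeholder for the first block's predecessor


@dataclass(frozen=True)
class Block:
    index: int
    file_digest: str   # SHA-256 of the duplicate file's contents
    prev_hash: str     # hash of the previous block
    block_hash: str    # hash over this block's own fields


def _hash_block(index: int, file_digest: str, prev_hash: str) -> str:
    payload = f"{index}|{file_digest}|{prev_hash}".encode()
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Block], file_bytes: bytes) -> Block:
    """Record a new version of the duplicate file as the next block."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    prev = chain[-1].block_hash if chain else GENESIS_HASH
    block = Block(len(chain), digest, prev, _hash_block(len(chain), digest, prev))
    chain.append(block)
    return block


def chain_is_valid(chain: List[Block]) -> bool:
    """Every block must link to its predecessor and match its own hash."""
    prev = GENESIS_HASH
    for i, b in enumerate(chain):
        expected = _hash_block(b.index, b.file_digest, b.prev_hash)
        if b.index != i or b.prev_hash != prev or b.block_hash != expected:
            return False
        prev = b.block_hash
    return True


def duplicate_is_intact(chain: List[Block], file_bytes: bytes) -> bool:
    """Check a duplicate file against the most recent recorded digest."""
    if not chain or not chain_is_valid(chain):
        return False
    return chain[-1].file_digest == hashlib.sha256(file_bytes).hexdigest()
```

In this sketch a legitimate change to the file is recorded by calling `append_block` again, so the integrity check follows the latest version rather than failing once the file has changed.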

2019-01-31

Matos, David R., Pardal, Miguel L., Carle, Georg, Correia, Miguel. 2018. RockFS: Cloud-Backed File System Resilience to Client-Side Attacks. Proceedings of the 19th International Middleware Conference. :107–119.

Cloud-backed file systems provide on-demand, high-availability, scalable storage. Their security may be improved with techniques such as erasure codes and secret sharing to fragment files and encryption keys in several clouds. Attacking the server-side of such systems involves penetrating one or more clouds, which can be extremely difficult. Despite all these benefits, a weak side remains: the client-side. The client devices store user credentials that, if stolen or compromised, may lead to confidentiality, integrity, and availability violations. In this paper we propose RockFS, a cloud-backed file system framework that aims to make the client-side of such systems resilient to attacks. RockFS protects data in the client device and allows undoing unintended file modifications.

2018-06-07

Alazzawe, A., Kant, K.. 2017. Slice Swarms for HPC Application Resilience. 2017 Fifth International Symposium on Computing and Networking (CANDAR). :1–10.

Resilience in High Performance Computing (HPC) is a constraining factor for bringing applications to the upcoming exascale systems. Resilience techniques must be able to scale to handle the increasing number of expected errors in an energy efficient manner. Since the purpose of running applications on HPC systems is to perform large scale computations as quickly as possible, resilience methods should not add a large delay to the time to completion of the application. In this paper we introduce a novel technique to detect and recover from transient errors in HPC applications. One of the features of our technique is that the energy budget allocated to resilience can be adjusted depending on the operator's resilience needs. For example, on synthetic data, the technique can detect about 50% of transient errors while only using 20% of the dynamic energy required for running the application. For a 60% energy budget, an application that uses 10k cores and takes 128 hours to run will only require 10% longer to complete.

2017-12-12

Lee, S. Y., Chung, T. M.. 2017. A study on the fast system recovery: Selecting the number of surrogate nodes for fast recovery in industrial IoT environment. 2017 International Conference on Information and Communications (ICIC). :205–207.

This paper builds on previous research that selects the proper surrogate nodes for a fast recovery mechanism in the industrial IoT (Internet of Things) environment, which uses a variety of sensors to collect data and exchange the collected data in real time to create added value. We are going to suggest a way to decide the number of surrogate nodes automatically in different deployed industrial IoT environments so as to minimize the system recovery time when a central server such as the IoT gateway fails. We are going to use a network simulator to measure the recovery time depending on the number of selected surrogate nodes according to the sub-devices that are connected to the IoT gateway.

Pan, X., Yang, Y., Zhang, G., Zhang, B.. 2017. Resilience-based optimization of recovery strategies for network systems. 2017 Second International Conference on Reliability Systems Engineering (ICRSE). :1–6.

Network systems, such as transportation systems and water supply systems, play important roles in our daily life and industrial production. However, a variety of disruptive events occur during their lifetime, causing a series of serious losses. Due to the inevitability of disruption, we should not only focus on improving the reliability or the resistance of the system, but also pay attention to the ability of the system to respond in a timely manner and recover rapidly from disruptive events. That is to say, we need to pay more attention to resilience. In this paper, we describe two resilience models, quotient resilience and integral resilience, to measure the final recovered performance and the cumulative performance during recovery, respectively. Based on these two models, we implement the optimization of the system recovery strategies after disruption, focusing on the repair sequence of the damaged components and the allocation scheme of resources. The proposed research in this paper can serve as guidance to prioritize repair tasks and allocate resources reasonably.

Hosseini, Fateme S., Fotouhi, Pouya, Yang, Chengmo, Gao, Guang R.. 2017. Leveraging Compiler Optimizations to Reduce Runtime Fault Recovery Overhead. Proceedings of the 54th Annual Design Automation Conference 2017. :20:1–20:6.

Smaller feature size, lower supply voltage, and faster clock rates have made modern computer systems more susceptible to faults. Although previous fault tolerance techniques usually target a relatively low fault rate and consider error recovery less critical, with the advent of higher fault rates, recovery overhead is no longer negligible. In this paper, we propose a scheme that leverages and revises a set of compiler optimizations to design, for each application hotspot, a smart recovery plan that identifies the minimal set of instructions to be re-executed in different fault scenarios. Such fault scenario and recovery plan information is efficiently delivered to the processor for runtime fault recovery. The proposed optimizations are implemented in LLVM and GEM5. The results show that the proposed scheme can significantly reduce runtime recovery overhead by 72%.

Abdi, Fardin, Tabish, Rohan, Rungger, Matthias, Zamani, Majid, Caccamo, Marco. 2017. Application and System-level Software Fault Tolerance Through Full System Restarts. Proceedings of the 8th International Conference on Cyber-Physical Systems. :197–206.

Due to the growing performance requirements, embedded systems are increasingly more complex. Meanwhile, they are also expected to be reliable. Guaranteeing reliability on complex systems is very challenging. Consequently, there is a substantial need for designs that enable the use of unverified components such as real-time operating system (RTOS) without requiring their correctness to guarantee safety. In this work, we propose a novel approach to design a controller that enables the system to restart and remain safe during and after the restart. Complementing this controller with a switching logic allows the system to use complex, unverified controller to drive the system as long as it does not jeopardize safety. Such a design also tolerates faults that occur in the underlying software layers such as RTOS and middleware and recovers from them through system-level restarts that reinitialize the software (middleware, RTOS, and applications) from a read-only storage. Our approach is implementable using one commercial off-the-shelf (COTS) processing unit. To demonstrate the efficacy of our solution, we fully implement a controller for a 3 degree of freedom (3DOF) helicopter. We test the system by injecting various types of faults into the applications and RTOS and verify that the system remains safe.

Bos, Jeroen van den. 2017. Sustainable Automated Data Recovery: A Research Roadmap. Proceedings of the 1st ACM SIGSOFT International Workshop on Software Engineering and Digital Forensics. :6–9.

Digital devices contain increasingly more data and applications. This means more data to handle and a larger amount of different types of traces to recover and consider in digital forensic investigations. Both present a challenge to data recovery approaches, requiring higher performance and increased flexibility. In order to progress to a long-term sustainable approach to automated data recovery, this paper proposes a partitioning into manual, custom, formalized and self-improving approaches. These approaches are described along with research directions to consider: building universal abstractions, selecting appropriate techniques and developing user-friendly tools.

Gilbert, Anna C., Li, Yi, Porat, Ely, Strauss, Martin J.. 2017. For-All Sparse Recovery in Near-Optimal Time. ACM Trans. Algorithms. 13:32:1–32:26.

An approximate sparse recovery system in $\ell_1$ norm consists of parameters $k$, $\epsilon$, $N$; an $m$-by-$N$ measurement $\Phi$; and a recovery algorithm $R$. Given a vector, $x$, the system approximates $x$ by $\widehat{x} = R(\Phi x)$, which must satisfy $\|\widehat{x} - x\|_1 \le (1+\epsilon)\|x - x_k\|_1$. We consider the “for all” model, in which a single matrix $\Phi$, possibly “constructed” non-explicitly using the probabilistic method, is used for all signals $x$. The best existing sublinear algorithm by Porat and Strauss [2012] uses $O(\epsilon^{-3} k \log(N/k))$ measurements and runs in time $O(k^{1-\alpha} N^{\alpha})$ for any constant $\alpha > 0$. In this article, we improve the number of measurements to $O(\epsilon^{-2} k \log(N/k))$, matching the best existing upper bound (attained by super-linear algorithms), and the runtime to $O(k^{1+\beta}\,\mathrm{poly}(\log N, 1/\epsilon))$, with a modest restriction that $k \le N^{1-\alpha}$ and $\epsilon \le (\log k/\log N)^{\gamma}$ for any constants $\alpha, \beta, \gamma > 0$. When $k \le \log^{c} N$ for some $c > 0$, the runtime is reduced to $O(k\,\mathrm{poly}(N, 1/\epsilon))$. With no restrictions on $\epsilon$, we have an approximation recovery system with $m = O\left(\frac{k}{\epsilon}\log(N/k)\left((\log N/\log k)^{\gamma} + 1/\epsilon\right)\right)$ measurements. The overall architecture of this algorithm is similar to that of Porat and Strauss [2012] in that we repeatedly use a weak recovery system (with varying parameters) to obtain a top-level recovery algorithm. The weak recovery system consists of a two-layer hashing procedure (or with two unbalanced expanders for a deterministic algorithm). The algorithmic innovation is a novel encoding procedure that is reminiscent of network coding and that reflects the structure of the hashing stages. The idea is to encode the signal position index $i$ by associating it with a unique message $m_i$, which will be encoded to a longer message $m'_i$ (in contrast to Porat and Strauss [2012] in which the encoding is simply the identity). Portions of the message $m'_i$ correspond to repetitions of the hashing, and we use a regular expander graph to encode the linkages among these portions.
The decoding or recovery algorithm consists of recovering the portions of the longer messages $m'_i$ and then decoding to the original messages $m_i$, all the while ensuring that corruptions can be detected and/or corrected. The recovery algorithm is similar to list recovery introduced in Indyk et al. [2010] and used in Gilbert et al. [2013]. In our algorithm, the messages $m_i$ are independent of the hashing, which enables us to obtain a better result.

Wu, Yingjun, Guo, Wentian, Chan, Chee-Yong, Tan, Kian-Lee. 2017. Fast Failure Recovery for Main-Memory DBMSs on Multicores. Proceedings of the 2017 ACM International Conference on Management of Data. :267–281.

Main-memory database management systems (DBMS) can achieve excellent performance when processing massive volumes of on-line transactions on modern multi-core machines. But existing durability schemes, namely, tuple-level and transaction-level logging-and-recovery mechanisms, either degrade the performance of transaction processing or slow down the process of failure recovery. In this paper, we show that, by exploiting application semantics, it is possible to achieve speedy failure recovery without introducing any costly logging overhead to the execution of concurrent transactions. We propose PACMAN, a parallel database recovery mechanism that is specifically designed for lightweight, coarse-grained transaction-level logging. PACMAN leverages a combination of static and dynamic analyses to parallelize the log recovery: at compile time, PACMAN decomposes stored procedures by carefully analyzing dependencies within and across programs; at recovery time, PACMAN exploits the availability of the runtime parameter values to attain an execution schedule with a high degree of parallelism. As such, recovery performance is remarkably increased. We evaluated PACMAN in a fully-fledged main-memory DBMS running on a 40-core machine. Compared to several state-of-the-art database recovery mechanisms, PACMAN can significantly reduce recovery time without compromising the efficiency of transaction processing.

Huang, Jian, Xu, Jun, Xing, Xinyu, Liu, Peng, Qureshi, Moinuddin K.. 2017. FlashGuard: Leveraging Intrinsic Flash Properties to Defend Against Encryption Ransomware. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. :2231–2244.

Encryption ransomware is a malicious software that stealthily encrypts user files and demands a ransom to provide access to these files. Several prior studies have developed systems to detect ransomware by monitoring the activities that typically occur during a ransomware attack. Unfortunately, by the time the ransomware is detected, some files already undergo encryption and the user is still required to pay a ransom to access those files. Furthermore, ransomware variants can obtain kernel privilege, which allows them to terminate software-based defense systems, such as anti-virus. While periodic backups have been explored as a means to mitigate ransomware, such backups incur storage overheads and are still vulnerable as ransomware can obtain kernel privilege to stop or destroy backups. Ideally, we would like to defend against ransomware without relying on software-based solutions and without incurring the storage overheads of backups. To that end, this paper proposes FlashGuard, a ransomware tolerant Solid State Drive (SSD) which has a firmware-level recovery system that allows quick and effective recovery from encryption ransomware without relying on explicit backups. FlashGuard leverages the observation that the existing SSD already performs out-of-place writes in order to mitigate the long erase latency of flash memories. Therefore, when a page is updated or deleted, the older copy of that page is anyway present in the SSD. FlashGuard slightly modifies the garbage collection mechanism of the SSD to retain the copies of the data encrypted by ransomware and ensure effective data recovery. Our experiments with 1,447 manually labeled ransomware samples show that FlashGuard can efficiently restore files encrypted by ransomware. In addition, we demonstrate that FlashGuard has a negligible impact on the performance and lifetime of the SSD.

Jun, Jaeyung, Choi, Kyu Hyun, Kim, Hokwon, Yu, Sang Ho, Kim, Seon Wook, Han, Youngsun. 2017. Recovering from Biased Distribution of Faulty Cells in Memory by Reorganizing Replacement Regions Through Universal Hashing. ACM Trans. Des. Autom. Electron. Syst.. 23:16:1–16:21.

Recently, scaling down dynamic random access memory (DRAM) has become more of a challenge, with more faults than before and a significant degradation in yield. To improve the yield in DRAM, a redundancy repair technique with intra-subarray replacement has been extensively employed to replace faulty elements (i.e., rows or columns with defective cells) with spare elements in each subarray. Unfortunately, such technique cannot efficiently handle a biased distribution of faulty cells because each subarray has a fixed number of spare elements. In this article, we propose a novel redundancy repair technique that uses a hashing method to solve this problem. Our hashing technique reorganizes replacement regions by changing the way in which their replacement information is referred, thus making faulty cells become evenly distributed to the regions. We also propose a fast repair algorithm to find the best hash function among all possible candidates. Even if our approach requires little hardware overhead, it significantly improves the yield when compared with conventional redundancy techniques. In particular, the results of our experiment show that our technique saves spare elements by about 57% and 55% for a yield of 99% at BER 1e-6 and 5e-7, respectively.

Taing, Nguonly, Springer, Thomas, Cardozo, Nicolás, Schill, Alexander. 2017. A Rollback Mechanism to Recover from Software Failures in Role-based Adaptive Software Systems. Companion to the First International Conference on the Art, Science and Engineering of Programming. :11:1–11:6.

Context-dependent applications are relatively complex due to their multiple variations caused by context activation, especially in the presence of unanticipated adaptation. Testing these systems is challenging, as it is hard to reproduce the same execution environments. Therefore, a software failure caused by bugs is no exception. This paper presents a rollback mechanism to recover from software failures as part of a role-based runtime with support for unanticipated adaptation. The mechanism performs checkpoints before each adaptation and employs specialized sensors to detect bugs resulting from recent configuration changes. When the runtime detects a bug, it assumes that the bug belongs to the latest configuration. The runtime rolls back to the most recent checkpoint to recover and subsequently notifies the developer to fix the bug and re-apply the adaptation through unanticipated adaptation. We prototype the concept as part of our role-based runtime engine LyRT and demonstrate the applicability of the rollback recovery mechanism for unanticipated adaptation in erroneous situations.
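
The checkpoint-then-rollback cycle described in this abstract can be sketched in a few lines. The `Runtime` class, its `adapt` method, and the `health_check` callback standing in for the paper's "specialized sensors" are all hypothetical names chosen for this example; this is not LyRT's API, only a minimal model of the control flow under those assumptions.

```python
import copy
from typing import Any, Callable, Dict, List

Config = Dict[str, Any]


class Runtime:
    """Toy runtime that checkpoints its configuration before each adaptation."""

    def __init__(self, config: Config) -> None:
        self.config: Config = dict(config)
        self._checkpoints: List[Config] = []
        self.notifications: List[str] = []  # messages meant for the developer

    def adapt(self,
              name: str,
              change: Callable[[Config], None],
              health_check: Callable[[Config], bool]) -> bool:
        """Apply `change`; roll back to the checkpoint if the check fails."""
        # Checkpoint before the adaptation is applied.
        self._checkpoints.append(copy.deepcopy(self.config))
        try:
            change(self.config)
            healthy = health_check(self.config)
        except Exception:
            # A crash during adaptation is treated like a detected bug.
            healthy = False
        if healthy:
            return True
        # Blame the latest configuration and restore the previous state.
        self.config = self._checkpoints.pop()
        self.notifications.append(f"adaptation '{name}' rolled back")
        return False
```

A failed adaptation leaves the configuration exactly as it was before the call and records a notification, after which the developer could fix the change and pass it to `adapt` again.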

2017-11-03

Weckstén, M., Frick, J., Sjöström, A., Järpe, E.. 2016. A novel method for recovery from Crypto Ransomware infections. 2016 2nd IEEE International Conference on Computer and Communications (ICCC). :1354–1358.

Extortion using digital platforms is an increasing form of crime. A commonly seen problem is extortion in the form of an infection of a Crypto Ransomware that encrypts the files of the target and demands a ransom to recover the locked data. By analyzing the four most common Crypto Ransomwares at the time of writing, a clear vulnerability is identified: all infections rely on tools available on the target system to be able to prevent a simple recovery after the attack has been detected. By renaming the system tool that handles shadow copies, it is possible to recover from infections from all four of the most common Crypto Ransomwares. The solution is packaged in a single, easy-to-use script.

2017-05-16

Ren, Kun, Diamond, Thaddeus, Abadi, Daniel J., Thomson, Alexander. 2016. Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems. Proceedings of the 2016 International Conference on Management of Data. :1539–1551.

As it becomes increasingly common for transaction processing systems to operate on datasets that fit within the main memory of a single machine or a cluster of commodity machines, traditional mechanisms for guaranteeing transaction durability, which typically involve synchronous log flushes, incur increasingly unappealing costs to otherwise lightweight transactions. Many applications have turned to periodically checkpointing full database state. However, existing checkpointing methods, even those which avoid freezing the storage layer, often come with significant costs to operation throughput, end-to-end latency, and total memory usage. This paper presents Checkpointing Asynchronously using Logical Consistency (CALC), a lightweight, asynchronous technique for capturing database snapshots that does not require a physical point of consistency to create a checkpoint, and avoids conspicuous latency spikes incurred by other database snapshotting schemes. Our experiments show that CALC can capture frequent checkpoints across a variety of transactional workloads with extremely small cost to transactional throughput and low additional memory usage compared to other state-of-the-art checkpointing systems.

Mendizabal, Odorico M., Dotti, Fernando Luís, Pedone, Fernando. 2016. Analysis of Checkpointing Overhead in Parallel State Machine Replication. Proceedings of the 31st Annual ACM Symposium on Applied Computing. :534–537.

State machine replication (SMR) is a well-established technique for building fault-tolerant systems. In part, this is explained by the simplicity of the approach and its strong consistency guarantees. Recently, several proposals have suggested parallelizing the execution of state machine replicas to achieve high throughput. Concurrent execution of commands has many implications, including the recovery of replicas from failures. Conventional checkpointing techniques, for example, must be revisited in parallelized models. In this paper, we review parallel variations of state machine replication and discuss how checkpointing procedures apply to these models. Moreover, we evaluate the impact caused by checkpointing techniques on recovery through simulations.

Wu, Hao, Mao, Jiangyun, Sun, Weiwei, Zheng, Baihua, Zhang, Hanyuan, Chen, Ziyang, Wang, Wei. 2016. Probabilistic Robust Route Recovery with Spatio-Temporal Dynamics. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. :1915–1924.

Vehicle trajectories are among the most important data in location-based services. The quality of trajectories directly affects the services. However, in real applications, trajectory data are not always sampled densely. In this paper, we study the problem of recovering the entire route between two distant consecutive locations in a trajectory. Most existing works solve the problem without using informative historical data or solve it in an empirical way. We claim that a data-driven and probabilistic approach is actually more suitable as long as data sparsity can be well handled. We propose a novel route recovery system in a fully probabilistic way which incorporates both temporal and spatial dynamics and addresses the data sparsity problems introduced by the probabilistic method. It outperforms the existing works with a high accuracy (over 80%) and shows strong robustness even when the length of routes to be recovered is very long (about 30 road segments) or the data is very sparse.