On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems
Title | On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems |
Publication Type | Journal Article |
Year of Publication | 2014 |
Authors | Yunfeng Zhu, Lee, P.P.C., Yinlong Xu, Yuchong Hu, Liping Xiang |
Journal | Parallel and Distributed Systems, IEEE Transactions on |
Volume | 25 |
Pagination | 1830-1840 |
Date Published | July |
ISSN | 1045-9219 |
Keywords | Algorithm design and analysis, availability guarantees, data redundancy, Distributed databases, encoding, Equations, fast recovery solution, fault tolerant computing, Generators, hill-climbing technique, large-scale erasure-coded storage systems, Mathematical model, networked storage system testbed, node failures, parallelized architecture, recovery algorithm, replace recovery algorithm, single-node failure, single-node failure recovery, storage management, Strips, vulnerability window, XOR operations, XOR-based erasure codes, XOR-coded storage system |
Abstract | Modern storage systems stripe redundant data across multiple nodes to provide availability guarantees against node failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must also provide fast failure recovery to reduce the window of vulnerability. This work addresses the problem of speeding up the recovery of a single-node failure for general XOR-based erasure codes. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further extend the algorithm to adapt to the scenario where nodes have heterogeneous capabilities (e.g., processing power and transmission bandwidth). We implement our replace recovery algorithm atop a parallelized architecture to demonstrate its feasibility. We conduct experiments on a networked storage system testbed, and show that our replace recovery algorithm uses less recovery time than the conventional recovery approach. |
DOI | 10.1109/TPDS.2013.244 |
Citation Key | 6613479 |
- networked storage system testbed
- XOR-coded storage system
- XOR-based erasure codes
- XOR operations
- vulnerability window
- Strips
- storage management
- single-node failure recovery
- single-node failure
- replace recovery algorithm
- recovery algorithm
- parallelized architecture
- node failures
- Algorithm design and analysis
- Mathematical model
- large-scale erasure-coded storage systems
- hill-climbing technique
- Generators
- fault tolerant computing
- fast recovery solution
- Equations
- encoding
- Distributed databases
- data redundancy
- availability guarantees