Visible to the public On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems

TitleOn the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems
Publication TypeJournal Article
Year of Publication2014
AuthorsYunfeng Zhu, Lee, P.P.C., Yinlong Xu, Yuchong Hu, Liping Xiang
JournalParallel and Distributed Systems, IEEE Transactions on
Volume25
Pagination1830-1840
Date PublishedJuly
ISSN1045-9219
KeywordsAlgorithm design and analysis, availability guarantees, data redundancy, Distributed databases, encoding, Equations, fast recovery solution, fault tolerant computing, Generators, hill-climbing technique, large-scale erasure-coded storage systems, Mathematical model, networked storage system testbed, node failures, parallelized architecture, recovery algorithm, replace recovery algorithm, single-node failure, single-node failure recovery, storage management, Strips, vulnerability window, XOR operations, XOR-based erasure codes, XOR-coded storage system
Abstract

Modern storage systems stripe redundant data across multiple nodes to provide availability guarantees against node failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must also provide fast failure recovery to reduce the window of vulnerability. This work addresses the problem of speeding up the recovery of a single-node failure for general XOR-based erasure codes. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further extend the algorithm to adapt to the scenario where nodes have heterogeneous capabilities (e.g., processing power and transmission bandwidth). We implement our replace recovery algorithm atop a parallelized architecture to demonstrate its feasibility. We conduct experiments on a networked storage system testbed, and show that our replace recovery algorithm uses less recovery time than the conventional recovery approach.

DOI10.1109/TPDS.2013.244
Citation Key6613479