Biblio
Modern storage systems stripe redundant data across multiple nodes to provide availability guarantees against node failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to tolerating failures, a storage system must also provide fast failure recovery to reduce the window of vulnerability. This work addresses the problem of speeding up the recovery of a single-node failure for general XOR-based erasure codes. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further extend the algorithm to adapt to the scenario where nodes have heterogeneous capabilities (e.g., processing power and transmission bandwidth). We implement our replace recovery algorithm atop a parallelized architecture to demonstrate its feasibility. We conduct experiments on a networked storage system testbed, and show that our replace recovery algorithm uses less recovery time than the conventional recovery approach.