2023 NSF Workshop on Silent Data Corruption
Welcome to the 2023 NSF Workshop on Silent Data Corruption Homepage!
Hyperscalers are reporting frequent silent data corruptions (SDCs), also known as silent errors or corrupt execution errors (CEEs), in their cloud fleets, caused by silicon manufacturing defects. Remarkably, SDCs at scale occur at rates on the order of one fault per thousand devices. Meanwhile, hardware manufacturers target one hundred defective parts per million for the commercial domain and close to zero for the automotive domain. A large-scale infrastructure comprises hundreds of thousands of servers and millions of hardware devices (e.g., motherboards, CPUs, DIMMs, GPUs, hardware accelerators, NICs, HDDs, flash drives, and interconnect modules). Combined with these unprecedented error rates, that scale means there is a non-negligible probability that SDCs will propagate to and impact system-level applications. Unfortunately, state-of-the-art fault injection studies, and the software resiliency and fault tolerance approaches they support, assume that SDCs are a one-in-a-million occurrence. In general, naïvely scaling existing SDC detection and mitigation techniques to compensate for error rates that are at least an order of magnitude higher is not viable from a performance or efficiency perspective. With no existing solutions ready to substitute, the challenge of SDCs at scale calls for innovation spanning the entire hardware-software stack to guarantee high assurance for the important tasks we entrust to computers today.
The aim of the 2023 NSF Workshop on Silent Data Corruption is to facilitate cross-disciplinary interaction among participants from hardware circuits, computer architecture, systems (including networking, operating systems, and distributed systems), and theory, together with participants from key industry and government stakeholders. The goals are to propose novel solutions for post-manufacturing hardware testing, runtime error detection, software resilience and fault tolerance, and system security, and to spearhead research on SDCs at scale. Over the course of the two-day workshop, participants will contribute position papers, give live presentations, and take part in breakout discussions on the topics raised in the workshop proposal and beyond. The core deliverable to NSF will be a final report detailing key research questions and directions that, if addressed, would help alleviate the challenges imposed by SDCs at scale. As a secondary deliverable, we will work with industry participants to identify and provide access to infrastructure and data that can support the identified academic research efforts.
Sponsored by National Science Foundation Awards 2017863 and 2010810