Title | Hardware Remediation at Scale |
Publication Type | Conference Paper |
Year of Publication | 2018 |
Authors | Lin, F., Beadon, M., Dixit, H. D., Vunnam, G., Desai, A., Sankar, S. |
Conference Name | 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W) |
Date Published | June |
Keywords | anomaly detection, Automated Response Actions, composability, datacenter, fault tolerant computing, Film bulk acoustic resonators, Hardware, hardware availability, hardware failure modes, hardware issues, hardware remediation, learning (artificial intelligence), machine learning, maintenance engineering, Monitoring, natural language processing, pubcrawl, remediation efficiency, remediation flow, remediation system, Resiliency, Servers, Transient analysis, transient errors |
Abstract | Large-scale services have automated hardware remediation to maintain the infrastructure availability at a healthy level. In this paper, we share the current remediation flow at Facebook, and how it is being monitored. We discuss a class of hardware issues that are transient and typically have higher rates during heavy load. We describe how our remediation system was enhanced to be efficient in detecting this class of issues. As hardware and systems change in response to the advancement in technology and scale, we have also utilized machine learning frameworks for hardware remediation to handle the introduction of new hardware failure modes. We present an ML methodology that uses a set of predictive thresholds to monitor remediation efficiency over time. We also deploy a recommendation system based on natural language processing, which is used to recommend repair actions for efficient diagnosis and repair. We also describe current areas of research that will enable us to improve hardware availability further. |
DOI | 10.1109/DSN-W.2018.00015 |
Citation Key | lin_hardware_2018 |