Barcelona Supercomputing Center
Resilience Design Patterns for High-Performance Computing
Pages
2
Time to read
2 mins
Publication
Language
English
This presentation covers resilience design patterns for high-performance computing (HPC) systems and proposes a structured approach to managing resilience at extreme scale. The speaker, Saurabh Hukerikar, outlines the reliability challenges that future HPC systems face, in particular their anticipated high fault rates. Current resilience solutions are fragmented across application-level techniques and system-based approaches, and there are no formal methods or metrics for evaluating resilience comprehensively. The presentation introduces resilience design patterns, which are repeatable solutions to common problems in HPC. It presents a catalog of these patterns together with a framework that helps designers understand the constraints and opportunities of implementing them at different system layers. The framework is intended to support flexible fault management and to let designers weigh trade-offs among performance, resilience, and power consumption, with the ultimate aim of establishing a systematic methodology for evaluating resilience technologies in extreme-scale HPC systems.