An overview of fault-tolerant techniques for HPC

Yves Robert

Ecole Normale Supérieure de Lyon & University of Tennessee Knoxville

Yves.Robert@ens-lyon.fr, graal.ens-lyon.fr/~yrobert

Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance computing. It is organized along four main topics:

(i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols (coordinated, hierarchical, in-memory), possibly combined with replication and prediction;
(iii) Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications; and
(iv) Silent error detection and correction methods, either generic (verification and checkpoint) or application-specific.

Relevant execution scenarios will be evaluated and compared through quantitative models (from Young’s approximation to Daly’s formulas and recent work).
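To give a flavor of the quantitative models mentioned above, here is a minimal sketch of Young's first-order approximation of the checkpointing period, T = sqrt(2 C μ), where C is the checkpoint cost and μ is the platform MTBF. The function names, and the assumption that node failures are independent and identically distributed (so that the platform MTBF is the node MTBF divided by the node count), are illustrative choices and not taken from the tutorial notes. Daly's higher-order refinements are not reproduced here.

```python
import math


def platform_mtbf(node_mtbf: float, n_nodes: int) -> float:
    """Platform MTBF, assuming independent, identically distributed node failures."""
    return node_mtbf / n_nodes


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpointing period: sqrt(2*C*mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost with a given period.

    C/T accounts for checkpointing in the absence of failures; T/(2*mu)
    accounts for the work lost on average per failure (half a period).
    Downtime and recovery costs are ignored in this sketch.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)
```

With these first-order terms, the waste at Young's period reduces to sqrt(2C/μ), so it grows as the platform MTBF shrinks with the node count. The approximation is only meaningful when C is small compared to μ.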

The tutorial is open to all EuroPar’16 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models.

An earlier version of this EuroPar’2016 tutorial was given at the Winter School on Hot topics in Distributed Computing (HTDC’15). The tutorial notes for HTDC’15 are available at http://graal.ens-lyon.fr/~yrobert/htdc-flaine.pdf.

Introduction & Motivation (20 min)
- Large-scale computing platforms are failure-prone
- Failure types
- Failure probability distributions
Checkpointing: Protocols (20 min)
- Coordinated checkpointing
- Message logging
- Hierarchical checkpointing
Checkpointing: Probabilistic Models (50 min)
- Young’s approximation and Daly’s formula
- Coordinated versus hierarchical: performance prediction at scale
- In-memory checkpointing
- Coupling checkpointing with failure prediction
- Coupling checkpointing with replication
Application-Specific Fault-Tolerance Techniques (30 min)
- Bags of tasks
- Iterative algorithms and fixed-point convergence
- ABFT for linear algebra
Silent errors (40 min)
- Coupling checkpointing with verification mechanisms
- Application-specific methods
Larger Perspective and Conclusion (20 min)
- Wrap-up of key approaches
- Overview of existing tools and software
- Perspectives: resilience at exascale
- Bibliographic pointers for further reference

Durations are indicative. Questions from participants will be taken on the fly.
