Fault Tolerance for HPC: Theory and Practice at SC'16





Published on Sep 30, 2016

Resilience is becoming a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance computing, balancing practice and theory. It is organized around four main topics:

(i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);

(ii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, replication, prediction and silent error detection;

(iii) Application-specific techniques, such as algorithm-based fault tolerance (ABFT) for grid-based algorithms, fixed-point convergence for iterative applications, and user-level in-memory checkpointing; and

(iv) Practical deployment of fault-tolerant techniques with User Level Failure Mitigation (ULFM, a proposed MPI standard extension). In a hands-on session, relevant examples based on ubiquitous computational solver routines will be protected with a mix of checkpoint-restart and advanced recovery techniques.
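To make topics (i) and (ii) concrete, here is a minimal Python sketch. It is not taken from the tutorial material. It draws failure inter-arrival times from the three distributions named above and computes a checkpoint period with the classical Young/Daly first-order approximation, sqrt(2 * C * MTBF), which assumes exponentially distributed fail-stop failures. The function names, the Weibull shape (0.7), the Log-Normal sigma (1.0), and the platform figures in the usage section are illustrative assumptions only.

```python
import math
import random


def platform_mtbf(node_mtbf, num_nodes):
    """MTBF of a platform of independent, identical nodes.

    Assumption: node failures are independent and exponentially
    distributed, so the platform failure rate is the sum of node rates.
    """
    return node_mtbf / num_nodes


def young_daly_period(checkpoint_cost, mtbf):
    """First-order optimal checkpoint period, sqrt(2 * C * MTBF).

    This is the Young/Daly approximation; it is reasonable only when
    the checkpoint cost C is small compared to the MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def sample_interarrival(dist, mtbf, rng, shape=0.7, sigma=1.0):
    """Draw one failure inter-arrival time with mean `mtbf`.

    `shape` (Weibull) and `sigma` (Log-Normal) are illustrative values,
    not parameters fitted to any real failure log.
    """
    if dist == "exponential":
        return rng.expovariate(1.0 / mtbf)
    if dist == "weibull":
        # Choose the scale so that the mean equals mtbf.
        scale = mtbf / math.gamma(1.0 + 1.0 / shape)
        return rng.weibullvariate(scale, shape)
    if dist == "lognormal":
        # Mean of a Log-Normal is exp(mu + sigma^2 / 2).
        mu = math.log(mtbf) - sigma * sigma / 2.0
        return rng.lognormvariate(mu, sigma)
    raise ValueError(f"unknown distribution: {dist}")


if __name__ == "__main__":
    # Hypothetical platform: 100,000 nodes, 10-year node MTBF,
    # 10-minute checkpoint cost (all times in seconds).
    node_mtbf = 10 * 365 * 24 * 3600
    mu = platform_mtbf(node_mtbf, 100_000)
    period = young_daly_period(600.0, mu)
    print(f"platform MTBF: {mu:.1f} s, checkpoint period: {period:.1f} s")

    rng = random.Random(42)
    for dist in ("exponential", "weibull", "lognormal"):
        draws = [sample_interarrival(dist, mu, rng) for _ in range(5)]
        print(dist, [round(d) for d in draws])
```

All three samplers are parameterized to share the same mean, so the only difference between them is the shape of the distribution. The period formula is justified only for the exponential case; for Weibull or Log-Normal failures it is at best a heuristic starting point.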

The tutorial is open to all SC'16 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.

