Tutorial: Fault-Tolerance for HPC - Theory and Practice





Published on Sep 16, 2015

In this video, we introduce the tutorial “Fault-Tolerance for HPC - Theory and Practice”, to take place at Supercomputing 2015 (SC’15), in Austin: http://sc15.supercomputing.org/schedu...

Resilience has become a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance computing, with a fair balance between theory and practice. The tutorial is structured along four main topics:

(i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);

(ii) General-purpose techniques, including different flavors of checkpoint and rollback recovery protocols, replication, prediction and silent error detection;

(iii) Application-specific techniques, such as algorithm-based fault tolerance (ABFT) for grid-based algorithms, fixed-point convergence for iterative applications, and user-level in-memory checkpointing; and

(iv) Practical deployment of fault-tolerant techniques with User Level Fault Mitigation (a proposed MPI standard extension). Relevant examples, based on ubiquitous computational solver routines, will be protected with a mix of checkpoint-restart and advanced recovery techniques in a hands-on session.
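
As a small illustration of topic (i), the sketch below draws failure inter-arrival times from the three distributions named there, each tuned to the same mean time between failures (MTBF). It is not taken from the tutorial material: the function name, the Weibull shape (0.7), and the Log-Normal sigma (1.0) are assumptions chosen only for illustration.

```python
import math
import random


def sample_interarrival_times(mtbf, n, seed=0):
    """Draw n failure inter-arrival times per distribution, each with mean `mtbf`.

    Hypothetical helper for illustration only; the shape parameters below
    are assumed values, not figures from the tutorial.
    """
    rng = random.Random(seed)

    weibull_shape = 0.7      # assumed; shape < 1 means a decreasing failure rate
    lognormal_sigma = 1.0    # assumed

    # Weibull mean = scale * Gamma(1 + 1/shape), so solve for the scale.
    weibull_scale = mtbf / math.gamma(1.0 + 1.0 / weibull_shape)
    # Log-Normal mean = exp(mu + sigma^2 / 2), so solve for mu.
    lognormal_mu = math.log(mtbf) - lognormal_sigma ** 2 / 2.0

    return {
        "exponential": [rng.expovariate(1.0 / mtbf) for _ in range(n)],
        "weibull": [rng.weibullvariate(weibull_scale, weibull_shape) for _ in range(n)],
        "lognormal": [rng.lognormvariate(lognormal_mu, lognormal_sigma) for _ in range(n)],
    }


if __name__ == "__main__":
    samples = sample_interarrival_times(mtbf=100.0, n=100_000)
    for name, values in samples.items():
        print(name, sum(values) / len(values))
```

All three laws share the same mean here but differ in how failures cluster over time, which is one reason the choice of distribution matters when modeling a platform.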

The tutorial is open to all SC'15 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no theoretical prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.

Join us in Austin and participate in the tutorial!

