2006 Rochester Computational Science and Education Conference

An Overview of High Performance Computing and Self Adapting Numerical Software

Author: Jack Dongarra (University of Tennessee, Oak Ridge National Laboratory)

Abstract

In this talk we will look at how high performance computing has changed over the last 10 years and look toward the future in terms of trends. A new generation of software libraries and algorithms is needed for the effective and reliable use of (wide area) dynamic, distributed, and parallel environments. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder.

As the number of processors in today's high performance computers continues to grow, the mean time to failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures. Whenever a node fails, they have to abort and restart either from the beginning or from a checkpoint on stable storage.

Along these lines we will discuss work on the development of fault-tolerant linear algebra algorithms. We will present an approach to building fault-survivable high performance computing applications using diskless checkpointing with FT-MPI. We give a detailed presentation of how to write a fault-survivable application with FT-MPI using diskless checkpointing, and we evaluate the performance overhead of our fault tolerance approach using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that our fault tolerance approach can survive a small number of simultaneous processor failures with low performance overhead and little numerical impact.
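
The idea behind diskless checkpointing can be illustrated with a minimal sketch. It is not the FT-MPI code presented in the talk and makes no MPI calls: it simulates four processes inside one Python program, and it assumes a simple floating-point sum as the checkpoint encoding and a single failure. The function names and the data are hypothetical, chosen only for illustration.

```python
from typing import List, Optional

Vector = List[float]


def encode_checksum(local_states: List[Vector]) -> Vector:
    """Element-wise sum of every process's local state.

    In a diskless scheme this checksum would be held in the memory of a
    spare process instead of being written to stable storage.
    """
    return [sum(column) for column in zip(*local_states)]


def recover_lost_state(survivors: List[Optional[Vector]], checksum: Vector) -> Vector:
    """Rebuild the one missing state as checksum minus all surviving states."""
    lost = list(checksum)
    for state in survivors:
        if state is None:  # the failed process: nothing to subtract
            continue
        for i, value in enumerate(state):
            lost[i] -= value
    return lost


# Hypothetical local data held by four simulated processes.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
checksum = encode_checksum(states)

# Simulate the loss of one process (rank 2) and rebuild its data.
failed_rank = 2
after_failure = [s if r != failed_rank else None for r, s in enumerate(states)]
recovered = recover_lost_state(after_failure, checksum)
```

Because the encoding uses floating-point addition, data recovered this way can in general differ from the original by rounding error. Surviving more than one simultaneous failure would need more than one checksum, which this sketch does not attempt.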