|
Accession Number
|
DE2012-1044954
|
|
Title
|
Cooperative Application/OS DRAM Fault Recovery.
|
|
Publication Date
|
May 2012
|
|
Media Count
|
20p
|
|
Personal Author
|
K. B. Ferreira; M. Hoemmen; M. A. Heroux; P. G. Bridges; R. Brightwell
|
|
Abstract
|
Exascale systems will present considerable fault-tolerance challenges to applications and are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these challenges to effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.
|
|
Keywords
|
Computer codes; Computers; Convergence; Dynamic Random Access Memory (DRAM); Errors; Fault tolerant computers; Memory devices; Programming
|
|
|
Source Agency
|
Technical Information Center, Oak Ridge, Tennessee
|
|
NTIS Subject Category
|
62 - Computers, Control & Information Theory
|
|
Corporate Author
|
Sandia National Labs., Albuquerque, NM.
|
|
Document Type
|
Technical report
|
|
Title Note
|
N/A
|
|
NTIS Issue Number
|
1302
|
|
Contract Number
|
DE-AC04-94AL85000
|