Accession Number DE2012-1044952
Title Evaluating Operating System Vulnerability to Memory Errors.
Publication Date May 2012
Media Count 26p
Personal Author D. Fiala; F. Mueller; K. Pedretti; K. B. Ferreira; P. G. Bridges; R. Brightwell
Abstract Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While considerable effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically occupies only a small portion of a compute node's physical memory, recent studies show more memory errors in this region of memory than in the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.
Keywords Algorithms
Computer systems programs
Errors
Evaluation
Kernels
Memory
Reliability
Sandia National Laboratories
Targets
Vulnerability


 
Source Agency Technical Information Center Oak Ridge Tennessee
NTIS Subject Category 62B - Computer Software
77 - Nuclear Science & Technology
Corporate Author Sandia National Labs., Albuquerque, NM.
Document Type Technical report
Title Note N/A
NTIS Issue Number 1302
Contract Number DE-AC04-94AL85000
