Stanford EE Computer Systems Colloquium

4:15PM, Wednesday, Oct 21, 2009
NEC Auditorium, Gates Computer Science Building B03
http://ee380.stanford.edu

DRAM errors in the wild: A large-scale field study

Bianca Schroeder
Computer Science Department, University of Toronto

About the talk:

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.

The goal of this talk is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age?

We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field once all other factors are taken into account. Finally, contrary to common fears, we observe no indication that newer generations of DIMMs have worse error behavior.
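
To give a feel for the magnitude of the quoted rate, the following minimal sketch converts "errors per billion device hours per Mbit" into a mean number of errors per DIMM per year. The 1 GiB DIMM capacity and the function name are assumptions chosen for illustration; they do not come from the study.

```python
# Convert an error rate quoted in errors per billion device hours per Mbit
# into a mean number of errors per DIMM per year.

HOURS_PER_YEAR = 24 * 365      # 8760, ignoring leap years
BILLION_HOURS = 1e9            # the rate is quoted per 10^9 device hours


def errors_per_dimm_year(rate_per_mbit: float, dimm_mbit: int) -> float:
    """Mean errors per year for one DIMM of the given capacity in Mbit."""
    errors_per_hour = rate_per_mbit * dimm_mbit / BILLION_HOURS
    return errors_per_hour * HOURS_PER_YEAR


# Assumption for illustration only: a 1 GiB DIMM, i.e. 8 * 1024 Mbit.
ONE_GIB_IN_MBIT = 8 * 1024

low = errors_per_dimm_year(25_000, ONE_GIB_IN_MBIT)
high = errors_per_dimm_year(70_000, ONE_GIB_IN_MBIT)
print(f"{low:.0f} to {high:.0f} errors per DIMM per year")
```

Under that assumption the result is a fleet-wide mean of a few thousand errors per DIMM per year. Because only a little over 8% of DIMMs see any error in a given year, this mean would be concentrated on a minority of DIMMs rather than describing a typical one.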

Slides:

There is no downloadable version of the slides for this talk available at this time.

About the speaker:

Bianca is an assistant professor in the Computer Science Department at the University of Toronto and a member of the computer systems and networks group. Before joining UofT, she spent 2 years as a post-doc at Carnegie Mellon University working with Garth Gibson. She received her doctorate from the Computer Science Department at Carnegie Mellon University under the direction of Mor Harchol-Balter. She is a two-time winner of the IBM PhD fellowship and her work has won three best paper awards and one best presentation award. Both her work on hard drive reliability and her work on DRAM reliability have been featured in articles at a number of news sites, including Computerworld, Slashdot, PCWorld, StorageMojo and eWEEK.

Contact information:

Bianca Schroeder
Computer Science
email: biancia@cs.toronto.edu
web: http://www.cs.toronto.edu/~bianca/