Schedule:
Sept. 26, 2005: Speaker: Prof. Subhasish Mitra, Stanford University & Principal Engineer, Intel Corporation
Topic: Introduction to Robust Systems
Mitra Slides
Oct. 3, 2005: Dr. Lawrence Votta, Distinguished Engineer, Sun Microsystems
Title: Telemetry, Control and Failure Prediction of Computer Systems
Abstract
Each of us has experienced delays or failures of service
from computer systems: automobile traffic congestion from failed
signals, slow Visa charge transaction clearing, online banking
transaction denial, stock market order completion failure, and so on.
Compounding the problem is further automation using ever more complex
computer networks and systems. The resulting interdependencies make
the control and consequences of a failure difficult to predict, with
recovery actions that are error-prone and time-consuming.
In this talk, we will describe how the introduction of system
telemetry and field replaceable unit (FRU) black boxes (BB)
capturing temporal state information enables us
to show how event A causes event B. Moving past simple
correlation to causality operationalizes the goal of every
well-engineered, fault-tolerant, high-availability system:
"to learn from every failure."
Multivariate State Estimation Techniques (MSET) and the
Sequential Probability Ratio Test (SPRT) can use the recorded telemetry
and black box data to recognize, control and
predict specific computer system failures several hours before
the failure. We will show this predictive capability
for three common computer system failures: the hardware failure of a
disk spindle bearing, a software failure caused by a memory leak, and
degraded cooling from a restricted air intake. Finally, recent
research on next-generation supercomputers
in the Defense Advanced Research Projects Agency (DARPA) High
Productivity Computing Systems (HPCS) program shows that the productivity of
these systems depends on the fault-tolerant architecture of the
system. A brief summary of how this dependency arises will be
discussed. Computers will need to "compute correctly through failure"
using telemetry, control systems and failure prediction.
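The SPRT mentioned above can be sketched in a few lines. This is a minimal illustration of Wald's sequential test applied to a stream of telemetry residuals, deciding between a "healthy" and a "degraded" Gaussian hypothesis; the parameters and thresholds here are illustrative assumptions, not Sun's production settings.

```python
# Minimal SPRT sketch: decide between H0 ~ N(mu0, sigma) (healthy)
# and H1 ~ N(mu1, sigma) (degraded) from a stream of residuals.
# All parameter values are illustrative assumptions.
import math

def sprt(samples, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.01, beta=0.01):
    """Return ('H0' | 'H1' | 'undecided', number of samples consumed)."""
    upper = math.log((1 - beta) / alpha)   # cross it -> accept H1 (degraded)
    lower = math.log(beta / (1 - alpha))   # cross it -> accept H0 (healthy)
    llr = 0.0
    for n, x in enumerate(samples, 1):
        # Log-likelihood ratio increment for two Gaussians with equal sigma
        llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma**2
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", len(samples)
```

In a telemetry pipeline, the residuals fed to the test would be the differences between observed sensor values and MSET's estimates, so a sustained drift trips the H1 threshold well before a hard failure.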
Bio
Dr. Lawrence Votta received his B.S. degree in Physics from the
University of Maryland, College Park, Maryland in 1973, and his Ph.D.
degree in Physics from the Massachusetts Institute of Technology,
Cambridge, Massachusetts in 1979. He is a Distinguished Engineer at
Sun Microsystems Inc. improving the software and system reliability
and availability of Sun's products while pursuing his research
interest in high availability computing and empirical software
engineering. He is currently the Productivity Principal Investigator
for Sun's DARPA High Productivity Computing Systems Phase II program. Larry has
authored or coauthored more than 60 papers and chapters in two books in
software engineering (and 10 papers in physics), including empirical
studies of software development that range from highly controlled
experiments on the best methods for design reviews and code inspections
to anecdotal studies of developers' time usage in a large software
development organization. Recently, his work on combined hardware and
software telemetry of complex systems and its analysis has led to 9
patent submissions.
Larry is a member of the IEEE and ACM.
Oct. 10, 2005: Dr. Philip Shirvani, NVIDIA
Topic: Fault Tolerant Computing for Space Radiation Environments
Title: The ARGOS Experiment
Abstract:
A major concern in digital electronics used in space is radiation-induced transient errors. Radiation hardening is an effective yet costly solution to this problem. Commercial off-the-shelf (COTS) components have been considered as a low-cost alternative to radiation-hardened parts. In the ARGOS experiment, these two approaches were compared in an actual space experiment. We assessed the effectiveness of Software-Implemented Hardware Fault Tolerance (SIHFT) techniques in enhancing the reliability of COTS.
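One core SIHFT idea is time redundancy: execute a computation more than once in software and compare the results, so that a transient radiation-induced error, which should perturb at most one execution, is detected without any hardware support. The sketch below is an illustrative assumption of that pattern, not the actual ARGOS flight code.

```python
# Illustrative SIHFT-style time redundancy (not the ARGOS flight code):
# run a pure computation twice and compare; a transient (soft) error
# should corrupt at most one run, so a mismatch triggers re-execution.
def protected(fn, *args, retries=3):
    """Execute fn(*args) twice per attempt; accept only matching results."""
    for _ in range(retries):
        a = fn(*args)
        b = fn(*args)
        if a == b:          # duplicated executions agree -> accept result
            return a
    # Repeated disagreement suggests a permanent (hard) fault
    raise RuntimeError("persistent mismatch: possible hard fault")

# Hypothetical workload used only for demonstration
checksum = lambda data: sum(data) & 0xFFFF
```

Real SIHFT schemes such as duplicated instructions or control-flow checking apply the same detect-by-comparison principle at instruction granularity inside the compiled program.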
Bio:
Philip Shirvani received the B.S. degree (1991) in electrical engineering from Sharif University of Technology, Tehran, Iran; and the M.S. (1996) and Ph.D. (2001) degrees in electrical engineering from Stanford University. He is currently a hardware engineer at NVIDIA Corp.
Oct. 24, 2005: Prof. Hector Garcia-Molina, Stanford University
Topic: Key Configuration Management
Abstract: We explore and characterize the space of alternatives for managing
the security and longevity of sensitive data. This is joint work with Chris
Olston at CMU.
Garcia-Molina Slides
Oct. 31, 2005: Wendy Bartlett, Distinguished Technologist, HP
Topic: An Overview of the HP NonStop Server
This talk gives an introduction to the HP NonStop server, formerly known
as the Tandem system. This system was designed from the ground up 30
years ago to provide the commercial market with single fault tolerance,
high levels of data integrity and extreme scalability. These attributes
are the result of a combination of hardware and software features, which
will be described here. The talk includes a brief recap of the NonStop
Advanced Architecture that was described in a DSN talk last summer.
Wendy Bartlett is a Distinguished Technologist in the NonStop Enterprise
Division of HP. Her main focus is on system availability and
scalability. She holds an M.S. degree in Computer Science: Computer
Engineering from Stanford, and is a former RAT.
Nov. 7, 2005: Brendan Murphy, Microsoft Research
Topic: Estimating the Risk of Releasing Software
Abstract:
This talk will describe an ongoing investigation into the way
Windows software is developed and its subsequent quality on end
customer systems.
This whole research direction resulted from an investigation
into the relationship between software development attributes (such as
size, complexity, test coverage) and subsequent failures affecting end
customers. The results of this work were confusing and
counterintuitive. This spawned a subsequent investigation into the
relationship between software churn and quality. Based on the results
of this work, a model was derived to estimate the risk of releasing
software. The
accuracy and limitations of the model will also be examined as part of
the talk. The talk will also discuss the current focus of the research,
examining the relationship between the people writing the software and
the overall software quality.
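A churn-based risk model of the kind the abstract describes can be pictured as a scoring function over relative churn measures. The sketch below is a hypothetical illustration: the metric names and weights are invented for demonstration and are not the model derived in this research.

```python
# Hypothetical illustration of a churn-based release-risk score.
# Relative churn measures (churned lines / total lines, fraction of
# files touched) feed a logistic function; weights are invented for
# demonstration, not taken from the Microsoft Research model.
import math

def release_risk(churned_loc, total_loc, files_churned, total_files,
                 weights=(-3.0, 4.0, 2.5)):
    """Return a probability-like risk score in (0, 1)."""
    b0, w_loc, w_files = weights
    x1 = churned_loc / total_loc        # relative churned lines
    x2 = files_churned / total_files    # fraction of files touched
    z = b0 + w_loc * x1 + w_files * x2
    return 1.0 / (1.0 + math.exp(-z))   # logistic squashing to (0, 1)
```

The key property such a model needs is monotonicity: a release with more relative churn scores as riskier than one with less, holding the code base size fixed.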
Bio
After graduating from Newcastle University, Brendan joined ICL,
where he worked on fault simulators and ATPG tools. After ICL he moved to
Digital, where he ran a system reliability group responsible for
monitoring the reliability of VMS, UNIX and NT servers running on
customer sites. Brendan joined Microsoft Research in 1999, where he has
focused on the causes of system failures and the relationship between
the way software is developed and its subsequent end-user quality.
Nov. 14, 2005: Shekhar Borkar, Intel Fellow and Director, Microprocessor Technologies Laboratory
Topic: Designing Reliable Systems with Unreliable Components
Abstract
VLSI system performance increased by five orders of magnitude in the last three decades, made possible by continued technology scaling. This treadmill will continue, providing integration capacity of billions of transistors; however, power, energy, variability, and reliability will be the barriers.
As technology scales, variability will continue to become worse. Random dopant fluctuations in the transistor channel and sub-wavelength lithography will yield static variations. Increased power densities will stress power grids creating dynamic supply voltage and temperature variations, thereby affecting circuit performance and leakage power. Since the behavior of the fabricated design will be different from what was intended, the effect of these static and dynamic variations will look like inherent unreliability in the design.
Soft error rates due to cosmic rays will continue to get worse, and the total number of state bits in a design will also double every two years, increasing the intermittent error rate of a design by almost two orders of magnitude. As transistors become even smaller, degradation due to aging will worsen, reducing transistor current over time and impacting the performance of the design.
This talk will discuss these effects and propose research in microarchitecture, design, and testing for designing with billions of unreliable components to yield reliable systems.
Borkar Slides
Biography
Shekhar Borkar received an M.S. in Physics from the University of Bombay and an M.S.E.E. from
the University of Notre Dame in 1981, and then joined Intel Corporation. He worked on the 8051 family
of microcontrollers, the iWarp multicomputer project, and subsequently on Intel's
supercomputers. He is an Intel Fellow and director of Circuit Research. His research interests
are high performance and low power digital circuits, and high-speed signaling. Shekhar is an
adjunct faculty member at Oregon Graduate Institute, and teaches VLSI design.
Nov. 21, 2005: Holiday
Nov. 28, 2005: Prof. Gregory T. A. Kovacs, Stanford University
Please note time & location change
Gates 259, 3:00-3:50pm
Topic: Understanding the Loss of the Space Shuttle Columbia: An Overview of the
Vehicle Forensic Analysis
Dec. 5, 2005: TBD