Robust Systems Seminar: EE 392U Autumn 2005-2006
Location: Education 206
Time: 2:15-3:05pm, Every Monday, starting Sept. 26, 2005

Instructors: Prof. Subhasish Mitra
Prof. Edward J. McCluskey


Course Description: This seminar covers current research on understanding the causes of system failures, their impact, and techniques for building robust systems that either avoid or are resilient to such failures. New research problems in high-performance robust system design will also be discussed. Speakers are drawn from industry and academia.

Schedule:

Sept. 26, 2005: Speaker: Prof. Subhasish Mitra, Stanford University & Principal Engineer, Intel Corporation
Topic: Introduction to Robust Systems

Mitra Slides


Oct. 3, 2005: Dr. Lawrence Votta, Distinguished Engineer, Sun Microsystems
Title: Telemetry, Control and Failure Prediction of Computer Systems


Abstract
Each of us has experienced delays or failures of service from computer systems: automobile traffic congestion from failed signals, slow Visa charge clearing, denied online banking transactions, failed stock market order completions, etc. Compounding the problem is further automation using ever more complex computer networks and systems. The resulting interdependencies make the control and consequences of a failure difficult to predict, and recovery actions are error prone and time consuming.

In this talk, we will describe how the introduction of system telemetry and field replaceable unit (FRU) black boxes (BB) capturing temporal state information enables us to show how event A causes event B. Moving past simple correlation to causality operationalizes the goal of every well engineered, fault tolerant, high availability system: "to learn from every failure." Multivariate State Estimation Techniques (MSET) and the Sequential Probability Ratio Test (SPRT) can use the recorded telemetry and black box data to recognize, control, and predict specific computer system failures several hours before they occur. We will show this predictive capability for three common computer system failures: the hardware failure of a disk spindle bearing, a software failure caused by a memory leak, and the degradation of cooling from a restricted air intake.

Finally, recent research on next-generation supercomputers in the Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) program shows that the productivity of these systems depends on their fault tolerant architecture. A brief summary of how this dependency arises will be given. Computers will need to "compute correctly through failure" using telemetry, control systems, and failure prediction.
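
For readers unfamiliar with SPRT, the following minimal sketch (an illustration only, not Sun's MSET/SPRT implementation) shows how a sequential probability ratio test can flag drift in a single telemetry signal; the baseline and degraded means, the noise level, and the simulated drift are assumed values chosen for the example.

# Minimal SPRT sketch for one telemetry channel (illustration only).
# Tests H0: mean == MU0 (healthy) against H1: mean == MU1 (degraded),
# assuming Gaussian readings with known standard deviation SIGMA.
import math
import random

MU0, MU1, SIGMA = 40.0, 45.0, 2.0      # hypothetical healthy/degraded temperatures
ALPHA, BETA = 0.01, 0.01               # target false-alarm / missed-detection rates
UPPER = math.log((1 - BETA) / ALPHA)   # crossing this accepts H1: raise an alarm
LOWER = math.log(BETA / (1 - ALPHA))   # crossing this accepts H0: reset the test

def sprt_alarm(readings):
    """Return the sample index at which the cumulative log-likelihood ratio crosses UPPER."""
    llr = 0.0
    for i, x in enumerate(readings):
        # Log-likelihood ratio increment for a Gaussian mean-shift test.
        llr += (MU1 - MU0) / SIGMA**2 * (x - (MU0 + MU1) / 2)
        if llr >= UPPER:
            return i           # strong evidence of degradation: predict failure
        if llr <= LOWER:
            llr = 0.0          # strong evidence of health: restart the test
    return None

# Simulated telemetry: 200 healthy samples, then a slow upward drift.
random.seed(0)
healthy = [random.gauss(MU0, SIGMA) for _ in range(200)]
drifting = [random.gauss(MU0 + 0.05 * t, SIGMA) for t in range(200)]
print("alarm raised at sample:", sprt_alarm(healthy + drifting))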

Bio
Dr. Lawrence Votta received his B.S. degree in Physics from the University of Maryland, College Park, in 1973, and his Ph.D. degree in Physics from the Massachusetts Institute of Technology, Cambridge, Massachusetts, in 1979. He is a Distinguished Engineer at Sun Microsystems, Inc., improving the software and system reliability and availability of Sun's products while pursuing his research interests in high availability computing and empirical software engineering. He is currently the Productivity Principal Investigator for Sun's DARPA High Productivity Computing Systems (HPCS) Phase II program. Larry has authored or coauthored more than 60 papers and chapters in 2 books in software engineering (and 10 papers in physics), including empirical studies of software development ranging from highly controlled experiments investigating the best methods for design reviews and code inspections to anecdotal studies of a developer's time usage in a large software development effort. Recently, his work on combined hardware and software telemetry of complex systems and its analysis has led to 9 patent submissions. Larry is a member of the IEEE and ACM.

Oct. 10, 2005: Dr. Philip Shirvani, NVIDIA
Topic: Fault Tolerant Computing for Space Radiation Environments


Title: The ARGOS Experiment
Abstract:
A major concern in digital electronics used in space is radiation-induced transient errors. Radiation hardening is an effective yet costly solution to this problem. Commercial off-the-shelf (COTS) components have been considered as a low-cost alternative to radiation-hardened parts. In the ARGOS experiment, these two approaches were compared in an actual space flight. We assessed the effectiveness of Software-Implemented Hardware Fault Tolerance (SIHFT) techniques in enhancing the reliability of COTS components.
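
The specific SIHFT techniques evaluated are not listed in the abstract; purely as a generic illustration (not the ARGOS flight code), the sketch below shows one common SIHFT idea, duplicated execution with result comparison, which can detect many transient errors entirely in software. The decorator and the toy checksum workload are invented for the example.

# Generic SIHFT-style illustration: duplicated execution with comparison
# (time redundancy). Not the ARGOS flight software.

class TransientErrorDetected(Exception):
    """Raised when redundant executions disagree, suggesting a transient fault."""

def duplicated(func):
    """Run func twice on the same inputs and compare the two results."""
    def wrapper(*args, **kwargs):
        first = func(*args, **kwargs)
        second = func(*args, **kwargs)
        if first != second:
            raise TransientErrorDetected(f"{func.__name__}: {first!r} != {second!r}")
        return first
    return wrapper

@duplicated
def block_checksum(block):
    """Toy critical computation: 16-bit additive checksum of a data block."""
    return sum(block) & 0xFFFF

try:
    print("checksum:", block_checksum(bytes(range(64))))
except TransientErrorDetected as err:
    # A real system would retry the computation or invoke a recovery routine.
    print("transient error detected:", err)
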
Bio:
Philip Shirvani received the B.S. degree (1991) in electrical engineering from Sharif University of Technology, Tehran, Iran; and the M.S. (1996) and Ph.D. (2001) degrees in electrical engineering from Stanford University. He is currently a hardware engineer at NVIDIA Corp.

Oct. 24, 2005: Prof. Hector Garcia-Molina, Stanford University
Topic: Key Configuration Management


Abstract: We explore and characterize the space of alternatives for managing the security and longevity of sensitive data. This is joint work with Chris Olston at CMU.
Garcia-Molina Slides


Oct. 31, 2005: Wendy Bartlett, Distinguished Technologist, HP
Topic: An Overview of the HP NonStop Server


This talk gives an introduction to the HP NonStop server, formerly known as the Tandem system. This system was designed from the ground up 30 years ago to provide the commercial market with single fault tolerance, high levels of data integrity and extreme scalability. These attributes are the result of a combination of hardware and software features, which will be described here. The talk includes a brief recap of the NonStop Advanced Architecture that was described in a DSN talk last summer.
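
To give a flavor of the software side of those features, the sketch below illustrates the classic NonStop notion of a process pair, in which a primary process checkpoints its state to a backup that takes over when the primary fails. It is a minimal, assumed illustration, not HP's implementation.

# Minimal sketch of a NonStop-style "process pair" (illustration only):
# the primary checkpoints its state to a backup, and the backup resumes
# from the last checkpoint when the primary fails.

class ProcessPair:
    def __init__(self):
        self.primary_state = {"committed_requests": 0}
        self.backup_state = dict(self.primary_state)   # last checkpoint
        self.primary_alive = True

    def handle_request(self):
        if not self.primary_alive:
            # Takeover: the backup resumes from the last checkpointed state.
            self.primary_state = dict(self.backup_state)
            self.primary_alive = True
        self.primary_state["committed_requests"] += 1
        # Checkpoint after every committed request so the backup never lags
        # by more than the in-flight work.
        self.backup_state = dict(self.primary_state)

pair = ProcessPair()
for _ in range(3):
    pair.handle_request()
pair.primary_alive = False        # simulate a primary CPU/process failure
pair.handle_request()             # backup takes over and continues the work
print(pair.primary_state)         # {'committed_requests': 4}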

Wendy Bartlett is a Distinguished Technologist in the NonStop Enterprise Division of HP. Her main focus is on system availability and scalability. She holds an M.S. degree in Computer Science: Computer Engineering from Stanford, and is a former RAT.

Nov. 7, 2005: Brendan Murphy, Microsoft Research
Topic: Estimating the Risk of Releasing Software


Abstract:
This talk will describe an ongoing investigation into the way Windows software is developed and its subsequent quality on end-customer systems.
This research direction grew out of an investigation into the relationship between software development attributes (such as size, complexity, and test coverage) and subsequent failures affecting end customers. The results of this work were confusing and counterintuitive, which spawned a follow-up investigation into the relationship between software churn and quality. Based on the results of that work, a model was derived to estimate the risk of releasing software. The accuracy and limitations of the model will also be examined as part of the talk. The talk will also discuss the current focus of the research: the relationship between the people writing the software and the overall software quality.

Bio
After graduating from Newcastle University, Brendan joined ICL, where he worked on fault simulators and ATPG tools. He then moved to Digital, where he ran a system reliability group responsible for monitoring the reliability of VMS, UNIX, and NT servers running on customer sites. Brendan joined Microsoft Research in 1999, where he has focused on the causes of system failures and the relationship between the way software is developed and its subsequent end-user quality.

Nov. 14, 2005: Shekhar Borkar, Intel Fellow and Director, Microprocessor Technologies Laboratory
Topic: Designing Reliable Systems with Unreliable Components


Abstract
VLSI system performance has increased by five orders of magnitude in the last three decades, made possible by continued technology scaling. This treadmill will continue, providing integration capacity of billions of transistors; however, power, energy, variability, and reliability will be the barriers. As technology scales, variability will continue to get worse. Random dopant fluctuations in the transistor channel and sub-wavelength lithography will yield static variations. Increased power densities will stress power grids, creating dynamic supply voltage and temperature variations and thereby affecting circuit performance and leakage power. Since the behavior of the fabricated design will differ from what was intended, the effect of these static and dynamic variations will look like inherent unreliability in the design. Soft error rates due to cosmic rays will continue to get worse, and the total number of state bits in a design will double every two years, increasing the intermittent error rate of a design by almost two orders of magnitude. As transistors become even smaller, degradation due to aging will worsen, reducing transistor current over time and impacting the performance of the design. This talk will discuss these effects and propose research in microarchitecture, design, and testing for designing with billions of unreliable components to yield reliable systems.
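
The "almost two orders of magnitude" figure follows from simple compounding; as a rough check only (the abstract does not state a time horizon, and this assumes the intermittent error rate scales directly with the number of state bits), the snippet below computes how long a doubling every two years takes to reach a 100x increase.

# Back-of-the-envelope check (assumption, not a number from the talk):
# state bits double every two years; how long until they grow 100x,
# and with them an intermittent error rate that scales with state bits?
import math

doubling_period_years = 2.0
target_growth = 100.0                 # "almost two orders of magnitude"

years = doubling_period_years * math.log2(target_growth)
print(f"~{years:.1f} years for a {target_growth:.0f}x increase")   # ~13.3 years
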
Borkar Slides

Biography
Shekhar Borkar graduated with an M.S. in Physics from the University of Bombay and an M.S.E.E. from the University of Notre Dame in 1981, and then joined Intel Corporation. He worked on the 8051 family of microcontrollers, the iWarp multicomputer project, and subsequently on Intel's supercomputers. He is an Intel Fellow and director of Circuit Research. His research interests are high performance and low power digital circuits and high-speed signaling. Shekhar is an adjunct faculty member at the Oregon Graduate Institute, where he teaches VLSI design.

Nov. 21, 2005: Holiday

Nov. 28, 2005: Prof. Gregory T. A. Kovacs, Stanford University

Please note time & location change
Gates 259, 3:00-3:50pm


Topic: Understanding the Loss of the Space Shuttle Columbia - Overview of the Vehicle Forensic Analysis

Dec. 5, 2005: TBD