Lectures:
Tuesdays & Thursdays, 2:15-3:30 p.m.
Location:
Gates 260 or 359 (see the course calendar for the exact location)
Course description:
This is an advanced course that covers the unique challenges and opportunities
in building robust systems. Major topics include: causes of system failures;
state-of-the-art modeling techniques and their limitations; and techniques for
building robust systems that either avoid or are resilient to such failures
through built-in error detection, failure prediction, self-recovery, and
self-repair. Robust system design is a new and exciting area of research. EE386
is a research-oriented course and will explore new research problems.
Prerequisites:
Students are expected to have the necessary background in digital design
(EE108A, EE108B). Background in EE271 and EE282 is helpful but not strictly
necessary.
Textbook and course materials:
There is no textbook.
Lecture notes and paper references used in the course will be available
from the class web page.
Course requirements:
· Seminar: After the first month of class, which consists mainly of traditional lectures, students (preferably in groups of 2) will be asked to select a topic from an upcoming lecture. The students are then expected to research the topic and produce slides for leading the class discussion on it.
· Final project: In consultation with the course staff, students (preferably in groups of 2) will design and implement a research project relevant to the robust systems design topics taught in class. Projects may be related to ongoing robust systems research at Stanford or may be independently conceived, but should be of appropriate scale and interest for the research nature of this course.
· Paper Review: Students are expected to individually submit a short review of an assigned paper, identifying the following elements: strengths (1~3 sentences), weaknesses (1~3 sentences), and possible improvements and/or scope for future research (1~3 sentences). Reviews are due before each class and should be emailed to the TA with the subject “ee386 paper review”. The review should be included in the main body of the email, not as an attachment. If multiple papers are assigned for a day, choose only one of them to review.
Grading (tentative):
· Class participation: 10%
· Paper Review: 10%
· Seminar: 30%
· Project: 50%
Instructor:
Subhasish Mitra (subh at stanford dot edu)
Gates 333, (650) 724-1915
Office hours: Tuesday & Thursday, 3:30pm-4:30pm
Administrative assistant: Uma Mulukutla (uma at cs dot stanford dot edu)
Gates 303, (650) 725-3726, Fax (650) 725-6949
Teaching assistant: Sung-Boem Park (sbpark84 at stanford dot edu)
Gates 239
Office hours: Monday 2:30pm-3:30pm, Thursday 1pm-2pm
| Date | Location | Subject | Reviews | Optional Reading | Due |
| Tue 3/31 | | Introduction | | | |
| Thu 4/2 | | Hardware Testing: Fault Models, Test Metrics | | Fault_models* | |
| Tue 4/7 | | Design for Testability | | DFT* | |
| Thu 4/9 | Gates 260 | Built In Self Test | | | |
| Fri 4/10† | Gates 260 | Delay testing (1)† | [Hamzaoglu 99] | | Paper review; Group formation; Seminar topic selection; 1st meeting signup |
| Tue 4/14 | | No class | | | |
| Thu 4/16 | Gates 260 | Delay testing (2) | | | |
| Tue 4/21 | | No class | | | |
| Thu 4/23 | | No class | | | |
| Tue 4/28 | Gates 359 | Robust System Design – Models, Metrics & Redundancy (1) | | [Siewiorek 98], [Trivedi 01], [Pradhan 96] | Project proposal; 2nd meeting signup |
| Thu 4/30 | | No class | | | |
| Tue 5/5 | | No class | | | |
| Thu 5/7 | Gates 260 | Robust System Design – Models, Metrics & Redundancy (2) | | | |
| Fri 5/8† | Gates 300 | Error Correcting Codes† | | | |
| Mon 5/11† | Gates 260 | System Effects of Errors & Evaluation† | [Wang 04], [Mukherjee 03], [Seshia 07] | [Sanda 08], [Rivers 08], [Bender 08] | |
| Tue 5/12 | Gates 359 | Circuit & Logic-level Techniques | [Mitra 00], [Zhang 06] | [Sanda 08], [Patel 82], [Franco 94] | Paper review |
| Thu 5/14 | Gates 260 | Architectural-level Techniques | [Austin 04], [Reinhardt 00] | [Austin 99], [Mukherjee 02], [Bernick 05] | |
| Mon 5/18† | Gates 300 | Software Techniques† | [Mahmood 88], [Oh 02b], [Lovellette 02] | [Oh 02a] | Paper review |
| Tue 5/19 | Gates 359 | Application Dependent Techniques | [Li 07], [Pattabiraman 07] | [Banerjee 84], [Huang 84], [Swift 03], [Wang 07], [Hangal 05], [Ernst 07] | |
| Thu 5/21 | Gates 260 | Checkpointing & Recovery | [Meaney 05], [Bronevetsky 04] | [Wang 05], [Shyam 06], [Bronevetsky 06] | Project progress slides; 3rd meeting signup |
| Tue 5/26 | Gates 104 | Process variations - Guest Lecture: Jim Tschanz | | | Paper review |
| Thu 5/28 | Gates 260 | System and Circuit Failure Prediction | [Agarwal 07], [Gross 06] | [Gross 02] | |
| Mon 6/1† (5:30pm) | Gates 300 | On-line Self-Test & Validation† | [Constantinides 07], [Inoue 08], [Li 08] | | Paper review |
| Tue 6/2 | Gates 159 | Project Presentation | | | Project presentation |
| Tue 6/9 | | | | | Project final report |
· † = Supplementary classes scheduled to make up for missed classes.
· * = Incomplete book chapters that require registration to view.
[Agarwal 07] Agarwal, M., et al. “Circuit failure prediction and
its application to transistor aging”, in Proc.
VLSI Test Symp., 2007, pp.277-286.
[Agrawal 82] Agrawal, V.D., S.C. Seth, and P. Agrawal,
"Fault Coverage Requirements in Production Testing of LSI Circuits,"
in IEEE JSSC, vol.SC-17, pp.57-61,
Feb 1982.
[Austin 99] Austin T.M., “DIVA: a reliable substrate for deep
submicron microarchitecture design”, in Proc.
Intl. Symp. on Microarchitecture, 1999, pp.196-207.
[Austin 04] Austin T.M., “Making typical silicon matter with
Razor”, in Computer, vol. 37, no. 3,
Mar 2004, pp.57-65.
[Banerjee 84] Banerjee, P., and J.A. Abraham, “Fault-secure
algorithms for multiple processor systems”, in Proc. Intl. Symp. on Computer Architecture, 1984, pp.279-287.
[Bender 08] Bender, C., et
al. “Soft-error resilience of the IBM POWER6 processor input/output
subsystem”, in IBM Journal of Research
and Development, vol. 52, no. 3, 2008, pp.285-292.
[Bernick 05] Bernick, D., et
al. “NonStop® advanced architecture”, in Proc. Intl. Conf. on Dependable Systems and Networks, 2005,
pp.12-21.
[Bronevetsky 04] Bronevetsky, G., et al., “Application level checkpointing for shared memory
programs”, in Proc. Intl. Conf. on
Architectural Support for Programming Languages and Operating Systems,
2004, pp. 235-247.
[Bronevetsky 06] Bronevetsky, G., et al., “Recent Advances in Checkpoint Recovery Systems”, in Next Generation Systems Program Workshop at IPDPS, 2006.
[Constantinides 07] Constantinides, K., et al., “Software-based online detection of hardware defects:
mechanisms, architectural support, and evaluation”, in Proc. Intl. Symp. on Microarchitecture, 2007, pp. 97-108.
[Ernst 07] Ernst M.D., “The Daikon system for dynamic
detection of likely invariants”, Science
of Computer Programming. Vol. 69, no. 1-3, pp.35-45, Dec 2007.
[Franco 94] Franco, P., and E.J. McCluskey, “Online delay
testing of digital circuits”, in Proc. IEEE
VLSI Test Symp., 1994.
[Gross 02] Gross, K.C. and W. Lu, “Early detection of signal and process anomalies in
enterprise computing systems”, Proc.
Intl. Conf. on Machine Learning and Applications, 2002, pp.204-210.
[Gross 06] Gross, K.C., K. Whisnant and A. Urmanov,
“Electronic Prognostics through continuous system telemetry”, Proc. Meeting of the Society for Machinery
Failure Prevention Technology, 2006.
[Hamzaoglu 99] Hamzaoglu, I., J.H. Patel, “Reducing Test Application
Time for Full Scan Embedded Cores”, in Proc.
Intl. Symp. on Fault-Tolerant Computing, 1999, pp. 260-267.
[Hangal 05] Hangal S., et
al. “IODINE: a tool to automatically infer dynamic invariants for hardware
designs”, in Proc. IEEE/ACM Design
Automation Conference, 2005.
[Heragu 96] Heragu, K., J.H. Patel, V.D. Agrawal, “Segment
delay faults: a new fault model”, in Proc.
VLSI Test Symp., 1996, pp.32-39.
[Huang 84] Huang, K.H., and J.A. Abraham, “Algorithm-based
fault tolerance for matrix operations”, in IEEE
Trans. on Computers, vol. C-33, no. 6, June 1984, pp.518-528.
[Inoue 08] Inoue, H., Y. Li and S. Mitra, “VAST:
Virtualization-assisted Concurrent Autonomous Self-Test”, in Proc. Intl. Test Conference, 2008, pp. 1-10.
[Li 07] Li, X., and D. Yeung, “Application-level correctness
and its impact on fault tolerance”, in Proc.
Intl. Symp. on High Performance Computer Architecture, 2007, pp. 181-192.
[Li 08] Li Y., S. Makar, and S. Mitra, “CASP: Concurrent
Autonomous Chip Self-Test using Stored Test Patterns”, Proc. Design, Automation and Test in Europe, 2008, pp. 885-890.
[Lovellette 02] Lovellette, M.N. et al., “Strategies for fault-tolerant, space-based computing:
lessons learned from the ARGOS testbed”, in Proc.
Aerospace Conf., 2002, pp.2109-2119.
[Rudnick 91] Rudnick, E.M., T.M. Niermann, J.H. Patel, "Methods for
reducing events in sequential circuit fault simulation", in Proc. Intl. Conf. on Computer-Aided Design,
1991, pp. 546-549.
[Mahmood 88] Mahmood A., E.J. McCluskey, “Concurrent error detection using watchdog
processors-a survey”, IEEE Trans. On
Computers, vol. 37, no. 2, Feb 1988, pp.160-174.
[Meaney 05] Meaney, P.J. et
al., “IBM z990 soft error detection and recovery”, in IEEE Trans. on Device and Materials Reliability, vol. 5, no.3, Sept
2005, pp.419-427.
[Mitra 00] Mitra, S., E.J. McCluskey, “Which concurrent error detection scheme to
choose?”, in Proc. Intl. Test Conf., 2000,
pp.985-994.
[Mitra 04b] Mitra, S., and K.S. Kim, “X-compact: An Efficient
Response Compaction Technique”, in IEEE
TCAD, vol.23, no.3, pp.421-432, Mar 2004.
[Mitra 04a] Mitra, S., S.S. Lumetta, and M. Mitzenmacher,
“X-tolerant Signature Analysis”, in Proc.
Intl. Test Conf. 2004, pp.432-441.
[Mukherjee 02] Mukherjee, S.S., M. Kontz, and S.K. Reinhardt,
“Detailed design and evaluation of redundant multi-threading alternatives”, in Proc. Intl. Symp. on Computer Architecture, 2002,
pp. 99-110.
[Mukherjee 03] Mukherjee, S.S., et al., “A systematic methodology to compute the architectural
vulnerability factors for a high-performance microprocessor”, in Proc. Intl. Symp. on Microarchitecture, 2003,
pp.29-41.
[Nicolaidis
[Niermann 92] Niermann, T.M., W-T. Cheng, J.H. Patel, "PROOFS: a fast,
memory-efficient sequential circuit fault simulator" in IEEE Trans. On Computer-Aided Design of
Integrated Circuits and Systems, vol.11,
no.2, pp.198-207, Feb 1992.
[Oh 02a] Oh, N., P.P. Shirvani, E.J. McCluskey, “Control-flow
checking by software signatures”, in IEEE
Trans. on Reliability, vol. 51, no. 1, Mar 2002, pp.111-122.
[Oh 02b] Oh, N., P.P. Shirvani, E.J. McCluskey, “Error
detection by duplicated instructions in super-scalar processors” in IEEE Trans. on Reliability, vol. 51, no.
1, Mar 2002, pp. 63-75.
[Patel 82] Patel J.H., et
al. “Concurrent error detection in ALUs by Recomputing with Shifted
Operands”, in IEEE Trans. Computers, vol.
C-31, no. 7, Jul 1982, pp.589-595.
[Pattabiraman 07] Pattabiraman, K., Z. Kalbarczyk, R.K. Iyer,
"Automated Derivation of Application-aware Error Detectors Using Static
Analysis", in Proc. Intl. On-Line
Testing Symposium, 2007, pp.211-21.
[Pradhan 96] Pradhan, D.K., ed. “Fault tolerant computer
system design”, Prentice-Hall, 1996.
[Rajski 04] Rajski, J., et al. “Embedded Deterministic Test”,
in IEEE TCAD, vol. 23, no. 5, pp.
776-792, May 2004.
[Rivers 08] Rivers, J.A., et
al. “Phaser: Phased methodology for modeling the system-level effects of
soft errors”, in IBM Journal of Research
and Development, vol. 52, no. 3, 2008, pp.293-306.
[Reinhardt 00] Reinhardt, S.K. “Transient fault detection via
simultaneous multithreading”, in Proc.
Intl. Symp. Computer Architecture, 2000, pp.25-36.
[Sanda 08] Sanda, P.N.
et al., “Soft-error resilience of the IBM POWER6 processor”, in IBM Journal of Research and Development, vol.
52, no. 3, 2008, pp.275-284.
[Seshia 07] Seshia S.A., W.Li, S.Mitra, “Verification-guided
soft error resilience”, in Proc. Conf. on
Design, Automation and Test in Europe, 2007, pp.1442-1447.
[Shyam 06] Shyam, S., et
al., “Ultra low cost defect protection for microprocessor pipelines”, in Proc. Intl. Conf. on Architectural Support
for Programming Languages and Operating Systems, 2006, pp. 73-82.
[Siewiorek 98] Siewiorek, D.P. and R.S. Swarz, “Reliable
Computer Systems – Design and Evaluation”, Digital Press (distributed by
Butterworth), 1998.
[Swift 03] Swift, M. M., et
al. “Improving the Reliability of Commodity Operating Systems”, in Proc. ACM Symp. on Operating Systems
Principles, 2003.
[Trivedi 01] Trivedi, K.S., “Probability and statistics with
reliability, queuing, and computer science applications”, 2nd ed,
Wiley-Interscience, 2001.
[Wang 04] Wang, N.J., et
al. “Characterizing the effects of transient faults on a high performance
processor pipeline”, in Proc. Intl. Conf.
on Dependable Systems and Networks, 2004, pp 61-70.
[Wang 05] Wang, L., et
al., “Modeling coordinated checkpointing for large-scale supercomputers”,
in Proc. Intl. Conf. on Dependable
Systems and Networks, 2005, pp. 812-821.
[Wang 07] Wang, L., et
al., “An OS-level framework for providing application-aware reliability”, in
Proc. Pacific Rim Intl. Symp. on
Dependable Computing, 2005.
[Zhang 06] Zhang, M., et
al., “Sequential element design with built-in soft error resilience”, in IEEE Trans. On VLSI Systems, vol. 14,
no. 12, Dec 2006.
Project Group Formation:
· One person from each group should email the TA the names of all group members, with the subject “ee386 group members”.
Seminar Topic Selection:
· From the last eight topics listed on the course calendar, choose the ones you like and list them in order of preference.
· Email the list to the TA with the subject “ee386 seminar preference”.
1st meeting: 4/16, 4/17
· Discuss the seminar and project topics.
Project Proposal:
· Email the proposal to the TA with the subject “ee386 project proposal”.
· The proposal should contain: project description, motivation, related work, methodology, expected outcome, and schedule.
· The proposal should be no longer than 2 pages, double column, font size 10.
Seminar:
· The first draft of the slides must be submitted at least two weeks before your seminar.
· Decide on the most representative set of papers (1~3) that you want the students to review before your seminar. You do not have to select from the papers listed on the course calendar. The decision must be made at least two weeks before your seminar. Students are required to review only ONE PAPER PER WEEK, so do not assume that everybody has read your selected paper.
· The final draft of the slides must be submitted at least two days before your seminar.
· Your seminar will last for the entire class period (1 hour 15 minutes).
2nd meeting: 5/1, 5/2
· Discuss the project proposal.
Project Progress Slides:
· Email the slides to the TA with the subject “ee386 project progress”.
· The slides should contain: project description, methodology, preliminary results, and remaining work.
3rd meeting: 5/21, 5/22
· Present the project progress slides to the TA.
Final Report:
· Email the report to the TA with the subject “ee386 project report”.
· The report should contain: abstract, introduction & motivation, proposed design, methodology, experiments, related work, conclusion, and references.
· The report should be 3~6 pages, double column, font size 10.
· Here are some sample project reports from the previous year: paper1, paper2
Last updated: May 31, 2009