Lectures:
Tuesdays & Thursdays, 2:15-3:30 p.m.
Location:
Building 200, Room 030
Instructor:
Prof. Subhasish Mitra
Course description:
Electronic systems are an indispensable part of all our lives.
Malfunctions in these systems have
consequences ranging from annoying computer crashes, loss of data and
services, to financial and
productivity losses, or even loss of human life. For example, in 2009,
a glitch in a single circuit
board in the air-traffic control system resulted in hundreds of
flights being canceled or delayed.
Such impacts continue to increase as systems become more complex,
interconnected, and pervasive.
Malfunctions may be caused by hardware or software failures, design
errors, malicious attacks,
or incorrect human interactions. Robust system design is required to
ensure that future systems
perform correctly despite rising levels of complexity and increasing
disturbances. Hardware failures
are especially a growing concern because:
Robust system design is a new and exciting area of research.
EE386 is a research-oriented advanced course that will cover unique
challenges and opportunities in building robust systems,
ranging from immediate concerns blocking progress today to major obstacles in
the future. Topics that will be covered include causes of system failures;
state-of-the-art modeling and its inadequacies; effective post-silicon
validation techniques; and ways to
build robust systems that either avoid or are resilient to such failures
through built-in error detection, failure prediction, self-recovery, and
self-repair.
Prerequisites:
Students are expected to have the necessary background in digital design (EE108A,
EE108B). Background in EE271 and EE282 is helpful but not strictly necessary.
Textbook and course materials:
There is no textbook.
Lecture notes and paper references used in the course will be available
from the class web page.
Course requirements:
· Seminar: After the first month of class, which consists mainly of traditional
lectures, students (preferably in groups of 2) will be asked to select a topic from an
upcoming lecture. The students are then expected to research the topic and to
produce slides for leading the class discussion on it.
· Final project: In consultation with the course staff, students (preferably in
groups of 2) will design and implement a research project relevant to the robust
systems design topics taught in class. The projects may be related to ongoing
robust systems research at Stanford or may be independently conceived, but should
be of appropriate scale and interest for the research nature of this course.
· Paper review: Each student is expected to individually submit a short review
of the assigned papers, identifying the following elements: strengths (1~3
sentences), weaknesses (1~3 sentences), and possible improvements and/or scope
for future research (1~3 sentences). Reviews are due before each class and should
be emailed to the TA with the subject “ee386 paper review”. The review should be
included in the main body of the email, not as an attachment. If multiple papers
are assigned for a day, choose only one of them to review.
Grading (tentative):
· Class participation: 10%
· Paper Review: 10%
· Seminar: 30%
· Project: 50%
Instructor:
Subhasish Mitra (subh at stanford dot edu)
Gates 333, (650) 724-1915.
Office hours: Tuesday & Thursday, 3:30pm-4:30pm, Gates 333
Administrative assistant: Peche Turner (peche at cs dot stanford dot edu)
Gates 275, (650) 723-5396, Fax (650) 725-7411
Teaching assistant:
Date | Location | Subject | Reviews | Optional Reading | Due
Tue 3/29 | | Introduction | | |
Thu 3/31 | | Hardware Testing: Fault Models, Test Metrics | | |
Tue 4/5 | | Design for Testability | | |
Thu 4/7 | | Test Compression & Built-in Self-test | [Hamzaoglu 99], [Mitra 04a], [Mitra 04b], [Rajski 04] | |
Tue 4/12 | | No Class | | |
Thu 4/14 | | Delay Testing (1) | Delay Testing* | No readings due, but please prepare for your lecture |
Tue 4/19 | | Delay Testing (2) | | |
Thu 4/21 | Gates 459 2:15-4:45pm | Robust systems: models, metrics, redundancy | Modeling & Redundancy, [Siewiorek 98], [Trivedi 01], [Pradhan 96] | |
Tue 4/26 | | Error correcting codes | | Coding |
Thu 4/28 | | System-level effects of errors & evaluation - Slide | [Wang 04], [Mukherjee 03], [Seshia 07] | [Sanda 08], [Rivers 08], [Bender 08] |
Tue 5/3 | | No Class | | | Due end of Tuesday (5/3): Project plan (short, 1-paragraph description of what you will be doing for your project)
Thu 5/5 | | Extended lecture: Circuit- and logic-level techniques - Slide; Architectural techniques (by Claire) - Slide | [Mitra 00], [Zhang 06] | [Sanda 08], [Patel 82], [Franco 94] |
Tue 5/10 | | Software techniques (by Hai) - Slide | [Mahmood 88], [Oh 02b], [Lovellette 02], [Pattabiraman 07] | |
Thu 5/12 | | Extended lecture: Application-aware techniques (by Richard) - Slide; Checkpoint / Recovery (by Dan) - Slide | | |
Tue 5/17 | | Extended lecture: On-line Self-test & Diagnostics / Failure Prediction (by Chen) - Slide; Post-silicon validation (by Akshay and David) - Slide | [Meaney 05], [Bronevetsky 04], [Agarwal 07], [Gross 06], [Constantinides 07], [Inoue 08], [Li 08] | |
Thu 5/19 | | No Class | | |
Tue 5/24 | Gates 104 2:30-3:30 pm | Variations -- guest lecture by Jim Tschanz & Recovery | | |
Thu 5/26 | | No Class | | |
Tue 5/31 | | Project presentations | | |
· † = Supplementary classes planned for missed classes
· * = Incomplete book chapters that require registration to view.
[Agarwal 07] Agarwal, M., et al. “Circuit failure prediction and
its application to transistor aging”, in Proc.
VLSI Test Symp., 2007, pp.277-286.
[Agrawal 82] Agrawal, V.D., S.C. Seth, and P. Agrawal,
"Fault Coverage Requirements in Production Testing of LSI Circuits,"
in IEEE JSSC, vol. SC-17, pp.57-61, Feb 1982.
[Austin 99] Austin T.M., “DIVA: a reliable substrate for deep
submicron microarchitecture design”, in Proc.
Intl. Symp. on Microarchitecture, 1999, pp.196-207.
[Austin 04] Austin T.M., “Making typical silicon matter with
Razor”, in Computer, vol. 37, no. 3,
Mar 2004, pp.57-65.
[Banerjee 84] Banerjee, P., and J.A. Abraham, “Fault-secure
algorithms for multiple processor systems”, in Proc. Intl. Symp. on Computer Architecture, 1984, pp.279-287.
[Bender 08] Bender, C., et
al. “Soft-error resilience of the IBM POWER6 processor input/output
subsystem”, in IBM Journal of Research
and Development, vol. 52, no. 3, 2008, pp.285-292.
[Bernick 05] Bernick, D., et al. “NonStop® advanced architecture”,
in Proc. Intl. Conf. on Dependable Systems and Networks, 2005, pp. 12-21.
[Bronevetsky 04] Bronevetsky, G., et al., “Application level checkpointing for shared memory
programs”, in Proc. Intl. Conf. on
Architectural Support for Programming Languages and Operating Systems,
2004, pp. 235-247.
[Bronevetsky 06] Bronevetsky, G., et al., “Recent Advances in Checkpoint Recovery Systems”, in Next Generation Systems Program Workshop at IPDPS, 2006.
[Constantinides 07] Constantinides, K., et al., “Software-based online detection of hardware defects:
mechanisms, architectural support, and evaluation”, in Proc. Intl. Symp. on Microarchitecture, 2007, pp. 97-108.
[Ernst 07] Ernst M.D., “The Daikon system for dynamic
detection of likely invariants”, Science
of Computer Programming. Vol. 69, no. 1-3, pp.35-45, Dec 2007.
[Franco 94] Franco, P., and E.J. McCluskey, “Online delay
testing of digital circuits”, in Proc. IEEE VLSI Test Symp., 1994.
[Gross 02] Gross, K.C. and W. Lu, “Early detection of signal and process anomalies in
enterprise computing systems”, in Proc. Intl. Conf. on Machine Learning and Applications, 2002, pp.204-210.
[Gross 06] Gross, K.C., K. Whisnant and A. Urmanov,
“Electronic Prognostics through continuous system telemetry”, Proc. Meeting of the Society for Machinery
Failure Prevention Technology, 2006.
[Hamzaoglu 99] Hamzaoglu, I., J.H. Patel, “Reducing Test Application
Time for Full Scan Embedded Cores”, in Proc.
Intl. Symp. on Fault-Tolerant Computing, 1999, pp. 260-267.
[Hangal 05] Hangal S., et
al. “IODINE: a tool to automatically infer dynamic invariants for hardware
designs”, in Proc. IEEE/ACM Design
Automation Conference, 2005.
[Heragu 96] Heragu, K., J.H. Patel, V.D. Agrawal, “Segment
delay faults: a new fault model”, in Proc.
VLSI Test Symp., 1996, pp.32-39.
[Huang 84] Huang, K.H., and J.A. Abraham, “Algorithm-based
fault tolerance for matrix operations”, in IEEE
Trans. on Computers, vol. C-33, no. 6, June 1984, pp.518-528.
[Inoue 08] Inoue, H., Y. Li and S. Mitra, “VAST:
Virtualization-assisted Concurrent Autonomous Self-Test”, in Proc. Intl. Test Conference, 2008, pp. 1-10.
[Li 07] Li, X., and D. Yeung, “Application-level correctness
and its impact on fault tolerance”, in Proc.
Intl. Symp. on High Performance Computer Architecture, 2007, pp. 181-192.
[Li 08] Li Y., S. Makar, and S. Mitra, “CASP: Concurrent
Autonomous Chip Self-Test using Stored Test Patterns”, Proc. Design, Automation and Test in Europe, 2008, pp. 885-890.
[Lovellette 02] Lovellette, M.N., et al., “Strategies for fault-tolerant, space-based computing:
lessons learned from the ARGOS testbed”, in Proc.
Aerospace Conf., 2002, pp.2109-2119.
[Rudnick 91] Rudnick, E.M., T.M. Niermann, J.H. Patel, "Methods for
reducing events in sequential circuit fault simulation", in Proc. Intl. Conf. on Computer-Aided Design,
1991, pp. 546-549.
[Mahmood 88] Mahmood A., E.J. McCluskey, “Concurrent error detection using watchdog
processors-a survey”, IEEE Trans. On
Computers, vol. 37, no. 2, Feb 1988, pp.160-174.
[Meaney 05] Meaney, P.J. et
al., “IBM z990 soft error detection and recovery”, in IEEE Trans. on Device and Materials Reliability, vol. 5, no.3, Sept
2005, pp.419-427.
[Mitra 00] Mitra, S., E.J. McCluskey, “Which concurrent error detection scheme to
choose?”, in Proc. Intl. Test. Conf., 2000,
pp.985-994.
[Mitra 04b] Mitra, S., and K.S. Kim, “X-compact: An Efficient
Response Compaction Technique”, in IEEE
TCAD, vol.23, no.3, pp.421-432, Mar 2004.
[Mitra 04a] Mitra, S., S.S. Lumetta, and M. Mitzenmacher,
“X-tolerant Signature Analysis”, in Proc.
Intl. Test Conf. 2004, pp.432-441.
[Mukherjee 02] Mukherjee, S.S., M. Kontz, and S.K. Reinhardt,
“Detailed design and evaluation of redundant multi-threading alternatives”, in Proc. Intl. Symp. on Computer Architecture, 2002,
pp. 99-110.
[Mukherjee 03] Mukherjee, S.S., et al., “A systematic methodology to compute the architectural
vulnerability factors for a high-performance microprocessor”, in Proc. Intl. Symp. on Microarchitecture, 2003,
pp.29-41.
[Nicolaidis
[Niermann 92] Niermann, T.M., W-T. Cheng, J.H. Patel, "PROOFS: a fast,
memory-efficient sequential circuit fault simulator" in IEEE Trans. On Computer-Aided Design of
Integrated Circuits and Systems, vol.11,
no.2, pp.198-207, Feb 1992.
[Oh 02a] Oh, N., P.P. Shirvani, E.J. McCluskey, “Control-flow
checking by software signatures”, in IEEE
Trans. on Reliability, vol. 51, no. 1, Mar 2002, pp.111-122.
[Oh 02b] Oh, N., P.P. Shirvani, E.J. McCluskey, “Error
detection by duplicated instructions in super-scalar processors” in IEEE Trans. on Reliability, vol. 51, no.
1, Mar 2002, pp. 63-75.
[Patel 82] Patel, J.H., et al. “Concurrent error detection in ALUs by Recomputing with Shifted
Operands”, in IEEE Trans. Computers, vol. C-31, no. 7, Jul 1982, pp.589-595.
[Pattabiraman 07] Pattabiraman, K., Z. Kalbarczyk, R.K. Iyer,
"Automated Derivation of Application-aware Error Detectors Using Static
Analysis", in Proc. Intl. On-Line
Testing Symposium, 2007, pp.211-21.
[Pradhan 96] Pradhan, D.K., ed. “Fault tolerant computer
system design”, Prentice-Hall, 1996
[Rajski 04] Rajski, J., et al. “Embedded Deterministic Test”,
in IEEE TCAD, vol. 23, no. 5, pp.
776-792, May 2004.
[Rivers 08] Rivers, J.A., et
al. “Phaser: Phased methodology for modeling the system-level effects of
soft errors”, in IBM Journal of Research
and Development, vol. 52, no. 3, 2008, pp.293-306.
[Reinhardt 00] Reinhardt, S.K. “Transient fault detection via
simultaneous multithreading”, in Proc.
Intl. Symp. Computer Architecture, 2000, pp.25-36.
[Sanda 08] Sanda, P.N.
et al., “Soft-error resilience of the IBM POWER6 processor”, in IBM Journal of Research and Development, vol.
52, no. 3, 2008, pp.275-284.
[Seshia 07] Seshia S.A., W.Li, S.Mitra, “Verification-guided
soft error resilience”, in Proc. Conf. on
Design, Automation and Test in Europe, 2007, pp.1442-1447.
[Shyam 06] Shyam, S., et al., “Ultra low cost defect protection for microprocessor pipelines”,
in Proc. Intl. Conf. on Architectural Support
for Programming Languages and Operating Systems, 2006, pp. 73-82.
[Siewiorek 98] Siewiorek, D.P. and R.S. Swarz, “Reliable
Computer Systems – Design and Evaluation”, Digital Press (distributed by
Butterworth), 1998.
[Swift 03] Swift, M. M., et
al. “Improving the Reliability of Commodity Operating Systems”, in Proc. ACM Symp. on Operating Systems
Principles, 2003.
[Trivedi 01] Trivedi, K.S., “Probability and statistics with
reliability, queuing, and computer science applications”, 2nd ed,
Wiley-Interscience, 2001.
[Wang 04] Wang, N.J., et
al. “Characterizing the effects of transient faults on a high performance
processor pipeline”, in Proc. Intl. Conf.
on Dependable Systems and Networks, 2004, pp 61-70.
[Wang 05] Wang, L., et
al., “Modeling coordinated checkpointing for large-scale supercomputers”,
in Proc. Intl. Conf. on Dependable
Systems and Networks, 2005, pp. 812-821.
[Wang 07] Wang, L., et al., “An OS-level framework for providing application-aware reliability”,
in Proc. Pacific Rim Intl. Symp. on Dependable Computing, 2005.
[Zhang 06] Zhang, M., et al., “Sequential element design with built-in soft error resilience”,
in IEEE Trans. on VLSI Systems, vol. 14, no. 12, Dec 2006.
Project Group Formation:
· One person from each group should email the names of the group members to
the TA with the subject “ee386 group members”.
Seminar Topic Selection:
· Among the last eight topics listed on the course calendar, choose the ones
you like and list them in order of preference.
· Email the list to the TA with the subject “ee386 seminar preference”.
1st meeting:
· Discuss the seminar and project topics.
Project Proposal:
· Email the proposal to the TA with the subject “ee386 project proposal”.
· The proposal should contain: project description, motivation, related work,
methodology, expected outcome, and schedule.
· The proposal should be no longer than 2 pages, double column, font size 10.
Seminar:
· The first draft of the slides must be submitted at least two weeks before your seminar.
· Decide on the most representative set of papers (1~3) that you want the students to
review before your seminar. You do not have to select from the papers listed on the
course calendar. The decision must be made at least two weeks before your seminar.
Students are required to review only ONE PAPER PER WEEK, so do not assume that
everybody has read your selected paper.
· The final draft of the slides must be submitted at least two days before your seminar.
· Your seminar will last for the entire class period (1 hour 15 minutes).
2nd meeting:
· Discuss the project proposal.
Project Progress Slides:
· Email the slides to the TA with the subject “ee386 project progress”.
· The slides should contain: project description, methodology, preliminary results,
and remaining work.
3rd meeting:
· Present the project progress slides to the TA.
Final Report:
· Email the report to the TA with the subject “ee386 project report”.
· The report should contain: abstract, introduction & motivation, proposed design,
methodology, experiments, related work, conclusion, and references.
· The report should be 3~6 pages, double column, font size 10.
· Here are some sample project reports from the previous year: paper1, paper2
Last updated: May 27, 2011