EE386: Robust System Design

Stanford University, Spring Quarter 2010-2011

Professor Subhasish Mitra



Announcements


Course info

Lectures: Tuesdays & Thursdays, 2:15-3:30 p.m.    

Location: Building 200, Room 030

Instructor: Prof. Subhasish Mitra

 

Course description:

Electronic systems are an indispensable part of all our lives. Malfunctions in these systems have consequences ranging from annoying computer crashes, loss of data and services, to financial and productivity losses, or even loss of human life. For example, in 2009, a glitch in a single circuit board in the air-traffic control system resulted in hundreds of flights being canceled or delayed. Such impacts continue to increase as systems become more complex, interconnected, and pervasive. Malfunctions may be caused by hardware or software failures, design errors, malicious attacks, or incorrect human interactions. Robust system design is required to ensure that future systems perform correctly despite rising levels of complexity and increasing disturbances. Hardware failures are especially a growing concern because:

 

Robust system design is a new and exciting area of research. EE386 is a research-oriented advanced course that covers the unique challenges and opportunities in building robust systems, ranging from immediate concerns blocking progress today to major obstacles in the future. Topics include causes of system failures; state-of-the-art modeling and its inadequacies; effective post-silicon validation techniques; and ways to build robust systems that either avoid or are resilient to such failures through built-in error detection, failure prediction, self-recovery, and self-repair.

 

Prerequisites:

Students are expected to have the necessary background in digital design (EE108A, EE108B). Background in EE271 and EE282 is helpful but not strictly necessary.

 

Textbook and course materials:

There is no textbook.  Lecture notes and paper references used in the course will be available from the class web page.

 

Course requirements:

·        Seminar: After the 1st month of class, which consists mainly of traditional lecture, students (in groups of preferably 2) will be asked to select a topic from an upcoming lecture. The students are then expected to research the topic and to produce slides for leading class discussion on the selected topic.

 

·        Final project: In consultation with the course staff, students (in groups of preferably 2) will design and implement a research project relevant to the robust systems design topics taught in class. The projects may be related to ongoing robust systems research at Stanford or may be independently conceived, but should be of appropriate scale and interest for the research nature of this course.

 

·        Paper Review: Students, individually, are expected to submit a short review of assigned papers, identifying the following elements: strengths (1~3 sentences), weaknesses (1~3 sentences), and possible improvements and/or scope for future research (1~3 sentences). Reviews are due before each class and should be emailed to the TA with the subject “ee386 paper review”. The review should be included in the main body of the email, not as an attachment. If multiple papers are assigned for a day, choose only one of them to review.

 

Grading (tentative):

·        Class participation: 10%

·        Paper Review: 10%

·        Seminar: 30%

·        Project: 50%

 



Contact

Instructor: Subhasish Mitra (subh at stanford dot edu)

Gates 333, (650) 724-1915. 

Office hours: Tuesday & Thursday, 3:30pm-4:30pm, Gates 333

 

Administrative assistant: Peche Turner (peche at cs dot stanford dot edu)

  Gates 275, (650) 723-5396, Fax (650) 725-7411

 

Teaching assistant:


 



Course Calendar

The list of references can be found in the References section below.

Date | Location | Subject | Reviews | Optional Reading | Due

Tue 3/29

 

Introduction

 

 

Thu 3/31

 

Hardware Testing: Fault Models, Test Metrics

Review of electronic system failures

Fault_models*, 
[Agrawal 82]

Due end of Tuesday (4/5)

Mail the review to ee386@ee.stanford.edu

Tue 4/5

 

Design for Testability

 

DFT*

Thu 4/7

Test Compression & Built-in Self-test

[Hamzaoglu 99],
[Mitra 04a],
[Mitra 04b],
[Rajski 04]

Advanced DFT*


Due end of Tuesday (4/12)

Mail the review to ee386@ee.stanford.edu

Tue 4/12

No Class

Thu 4/14

 

Delay Testing (1)

Delay Testing* No readings due, but please prepare for your lecture

Tue 4/19

Delay Testing (2)

Thu 4/21

 Gates 459 2:15-4:45pm

Robust systems: models, metrics, redundancy

Modeling&Redundancy

[Siewiorek 98],

[Trivedi 01],

[Pradhan 96]

 

Tue 4/26

Error correcting codes

 

Coding

Thu 4/28

 

System-level effects of errors & evaluation - Slide

[Wang 04],

[Mukherjee 03],

[Seshia 07]

[Sanda 08],

[Rivers 08],

[Bender 08]

Due end of Tuesday (5/3)

Mail the review to ee386@ee.stanford.edu

Tue 5/3

No class

 

Due end of Tuesday (5/3)
Project plan: a short (one-paragraph) description of what you will be doing for your project

Thu 5/5

 Gates 392 2:15-4:45pm

Extended lecture:

Circuit- and logic-level techniques - Slide

Architectural techniques (by Claire) - Slide

[Mitra 00],

[Zhang 06]

[Sanda 08],

[Patel 82],

[Franco 94]

Due end of Tuesday (5/10)

Mail the review to ee386@ee.stanford.edu

Tue 5/10

Software techniques (by Hai) - Slide

[Mahmood 88],

[Oh 02b],

[Lovellette 02]

[Pattabiraman 07]

Due end of Tuesday (5/17)

Mail the review to ee386@ee.stanford.edu

Thu 5/12

Gates 392 2:15-4:45pm

Extended lecture:

Application-aware techniques (by Richard) - Slide

Checkpoint / Recovery (by Dan) - Slide

Tue 5/17

Gates 459 2:15-4:45pm

Extended lecture:

On-line Self-test & Diagnostics / Failure Prediction (by Chen) - Slide

Post-silicon validation (by Akshay and David) - Slide

[Meaney 05]

[Bronevetsky 04]

[Agarwal 07]

[Gross 06]

[Constantinides 07] 

[Inoue 08]

[Li 08]

Due end of Tuesday (5/24)

Mail the review to ee386@ee.stanford.edu

Thu 5/19
  No Class

Tue 5/24

Gates 104 2:30-3:30 pm

Variations -- guest lecture by Jim Tschanz & Recovery

Thu 5/26

No Class

Tue 5/31

Gates 459 2:15-3:30pm

Project presentations

·          = Supplementary classes planned for missing classes

·        * = Incomplete book chapters that require registration to view.


 


References

[Agarwal 07] Agarwal, M., et al. “Circuit failure prediction and its application to transistor aging”, in Proc. VLSI Test Symp., 2007, pp.277-286.

[Agrawal 82] Agrawal, V.D., S.C. Seth, and P. Agrawal, "Fault Coverage Requirements in Production Testing of LSI Circuits," in IEEE JSSC, vol.SC-17, pp.57-61, Feb 1982.

[Austin 99] Austin, T.M., “DIVA: a reliable substrate for deep submicron microarchitecture design”, in Proc. Intl. Symp. on Microarchitecture, 1999, pp.196-207.

[Austin 04] Austin, T.M., “Making typical silicon matter with Razor”, in Computer, vol. 37, no. 3, Mar 2004, pp.57-65.

[Banerjee 84] Banerjee, P., and J.A. Abraham, “Fault-secure algorithms for multiple processor systems”, in Proc. Intl. Symp. on Computer Architecture, 1984, pp.279-287.

[Bender 08] Bender, C., et al. “Soft-error resilience of the IBM POWER6 processor input/output subsystem”, in IBM Journal of Research and Development, vol. 52, no. 3, 2008, pp.285-292.

[Bernick 05] Bernick, D., et al. “NonStop® advanced architecture”, in Proc. Intl. Conf. on Dependable Systems and Networks, 2005, pp.12-21.

[Bronevetsky 04] Bronevetsky, G., et al., “Application level checkpointing for shared memory programs”, in Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2004, pp. 235-247.

[Bronevetsky 06] Bronevetsky, G., et al., “Recent Advances in Checkpoint Recovery Systems”, in Next Generation Systems Program Workshop at IPDPS, 2006.

[Constantinides 07] Constantinides, K., et al., “Software-based online detection of hardware defects: mechanisms, architectural support, and evaluation”, in Proc. Intl. Symp. on Microarchitecture, 2007, pp. 97-108.

[Ernst 07] Ernst, M.D., “The Daikon system for dynamic detection of likely invariants”, in Science of Computer Programming, vol. 69, no. 1-3, pp.35-45, Dec 2007.

[Franco 94] Franco, P., and E.J. McCluskey, “Online delay testing of digital circuits”, in Proc. IEEE VLSI Test Symp., 1994.

[Gross 02] Gross, K.C. and W. Lu, “Early detection of signal and process anomalies in enterprise computing systems”, Proc. Intl. Conf. on Machine Learning and Applications, 2002, pp.204-210.

[Gross 06] Gross, K.C., K. Whisnant and A. Urmanov, “Electronic Prognostics through continuous system telemetry”, Proc. Meeting of the Society for Machinery Failure Prevention Technology, 2006.

[Hamzaoglu 99] Hamzaoglu, I., J.H. Patel, “Reducing Test Application Time for Full Scan Embedded Cores”, in Proc. Intl. Symp. on Fault-Tolerant Computing, 1999, pp. 260-267.

[Hangal 05] Hangal, S., et al. “IODINE: a tool to automatically infer dynamic invariants for hardware designs”, in Proc. IEEE/ACM Design Automation Conference, 2005.

[Heragu 96] Heragu, K., J.H. Patel, V.D. Agrawal, “Segment delay faults: a new fault model”, in Proc. VLSI Test Symp., 1996, pp.32-39.

[Huang 84] Huang, K.H., and J.A. Abraham, “Algorithm-based fault tolerance for matrix operations”, in IEEE Trans. on Computers, vol. C-33, no. 6, June 1984, pp.518-528.

[Inoue 08] Inoue, H., Y. Li and S. Mitra, “VAST: Virtualization-assisted Concurrent Autonomous Self-Test”, in Proc. Intl. Test Conference, 2008, pp. 1-10.

[Li 07] Li, X., and D. Yeung, “Application-level correctness and its impact on fault tolerance”, in Proc. Intl. Symp. on High Performance Computer Architecture, 2007, pp. 181-192.

[Li 08] Li, Y., S. Makar, and S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test using Stored Test Patterns”, Proc. Design, Automation and Test in Europe, 2008, pp. 885-890.

[Lovellette 02] Lovellette, M.N. et al., “Strategies for fault-tolerant, space-based computing: lessons learned from the ARGOS testbed”, in Proc. Aerospace Conf., 2002, pp.2109-2119.

[Rudnick 91] Rudnick, E.M., T.M. Niermann, J.H. Patel, "Methods for reducing events in sequential circuit fault simulation", in Proc. Intl. Conf. on Computer-Aided Design, 1991, pp. 546-549.

[Mahmood 88] Mahmood, A., E.J. McCluskey, “Concurrent error detection using watchdog processors-a survey”, IEEE Trans. on Computers, vol. 37, no. 2, Feb 1988, pp.160-174.

[Meaney 05] Meaney, P.J. et al., “IBM z990 soft error detection and recovery”, in IEEE Trans. on Device and Materials Reliability, vol. 5, no.3, Sept 2005, pp.419-427.

[Mitra 00] Mitra, S., E.J. McCluskey, “Which concurrent error detection scheme to choose?”, in Proc. Intl. Test Conf., 2000, pp.985-994.

[Mitra 04b] Mitra, S., and K.S. Kim, “X-compact: An Efficient Response Compaction Technique”, in IEEE TCAD, vol.23, no.3, pp.421-432, Mar 2004.

[Mitra 04a] Mitra, S., S.S. Lumetta, and M. Mitzenmacher, “X-tolerant Signature Analysis”, in Proc. Intl. Test Conf., 2004, pp.432-441.

[Mukherjee 02] Mukherjee, S.S., M. Kontz, and S.K. Reinhardt, “Detailed design and evaluation of redundant multi-threading alternatives”, in Proc. Intl. Symp. on Computer Architecture, 2002, pp. 99-110.

[Mukherjee 03] Mukherjee, S.S., et al., “A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor”, in Proc. Intl. Symp. on Microarchitecture, 2003, pp.29-41.

[Nicolaidis

[Niermann 92] Niermann, T.M., W-T. Cheng, J.H. Patel, "PROOFS: a fast, memory-efficient sequential circuit fault simulator", in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol.11, no.2, pp.198-207, Feb 1992.

[Oh 02a] Oh, N., P.P. Shirvani, E.J. McCluskey, “Control-flow checking by software signatures”, in IEEE Trans. on Reliability, vol. 51, no. 1, Mar 2002, pp.111-122.

[Oh 02b] Oh, N., P.P. Shirvani, E.J. McCluskey, “Error detection by duplicated instructions in super-scalar processors”, in IEEE Trans. on Reliability, vol. 51, no. 1, Mar 2002, pp. 63-75.

[Patel 82] Patel, J.H., et al. “Concurrent error detection in ALUs by Recomputing with Shifted Operands”, in IEEE Trans. Computers, vol. C-31, no. 7, Jul 1982, pp.589-595.

[Pattabiraman 07] Pattabiraman, K., Z. Kalbarczyk, R.K. Iyer, "Automated Derivation of Application-aware Error Detectors Using Static Analysis", in Proc. Intl. On-Line Testing Symposium, 2007, pp.211-21.

[Pradhan 96] Pradhan, D.K., ed. “Fault tolerant computer system design”, Prentice-Hall, 1996.

[Rajski 04] Rajski, J., et al. “Embedded Deterministic Test”, in IEEE TCAD, vol. 23, no. 5, pp. 776-792, May 2004.

[Rivers 08] Rivers, J.A., et al. “Phaser: Phased methodology for modeling the system-level effects of soft errors”, in IBM Journal of Research and Development, vol. 52, no. 3, 2008, pp.293-306.

[Reinhardt 00] Reinhardt, S.K. “Transient fault detection via simultaneous multithreading”, in Proc. Intl. Symp. Computer Architecture, 2000, pp.25-36.

[Sanda 08] Sanda, P.N. et al., “Soft-error resilience of the IBM POWER6 processor”, in IBM Journal of Research and Development, vol. 52, no. 3, 2008, pp.275-284.

[Seshia 07] Seshia, S.A., W. Li, S. Mitra, “Verification-guided soft error resilience”, in Proc. Conf. on Design, Automation and Test in Europe, 2007, pp.1442-1447.

[Shyam 06] Shyam, S., et al., “Ultra low cost defect protection for microprocessor pipelines”, in Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 73-82.

[Siewiorek 98] Siewiorek, D.P. and R.S. Swarz, “Reliable Computer Systems – Design and Evaluation”, Digital Press (distributed by Butterworth), 1998.

[Swift 03] Swift, M. M., et al. “Improving the Reliability of Commodity Operating Systems”, in Proc. ACM Symp. on Operating Systems Principles, 2003.

[Trivedi 01] Trivedi, K.S., “Probability and statistics with reliability, queuing, and computer science applications”, 2nd ed, Wiley-Interscience, 2001.

[Wang 04] Wang, N.J., et al. “Characterizing the effects of transient faults on a high performance processor pipeline”, in Proc. Intl. Conf. on Dependable Systems and Networks, 2004, pp. 61-70.

[Wang 05] Wang, L., et al., “Modeling coordinated checkpointing for large-scale supercomputers”, in Proc. Intl. Conf. on Dependable Systems and Networks, 2005, pp. 812-821.

[Wang 07] Wang, L., et al., “An OS-level framework for providing application-aware reliability”, in Proc. Pacific Rim Intl. Symp. on Dependable Computing, 2005.

[Zhang 06] Zhang, M., et al., “Sequential element design with built-in soft error resilience”, in IEEE Trans. on VLSI Systems, vol. 14, no. 12, Dec 2006.

 


 

Project/Seminar Guidelines

Project Group Formation:

·        One person from each group should email the TA his/her group members with the subject “ee386 group members”   

 

Seminar Topic Selection:

·        Among the last eight topics listed on the course calendar, choose the ones you like and list them in the order of preference.

·        Email the list to the TA with the subject “ee386 seminar preference”.

 

1st meeting:

·        Discuss the seminar and project topics. 

 

Project Proposal:

·        Email the proposal to the TA with the subject “ee386 project proposal”.

·        Proposal should contain: project description, motivation, related work, methodology, expected outcome and schedule.

·        Proposal should be no longer than 2 pages, double column, font size 10.

 

Seminar:

·        First draft of the slides must be submitted at least two weeks before your seminar.

·        Decide on the most representative set of papers (1~3) that you want the students to review before your seminar. You do not have to use the papers listed on the course calendar for the selection. The decision must be made at least two weeks before your seminar. Students will be required to review ONE PAPER PER WEEK, thus do not assume that everybody has read your selected paper. 

·        The final draft of the slides must be submitted at least two days before your seminar.

·        Your seminar will last for the entire class period (1 hour 15 minutes).

 

2nd meeting:

·        Discuss the project proposal.

 

Project Progress Slides:

·        Email the slides to the TA with the subject “ee386 project progress”.

·        The slides should contain: project description, methodology, preliminary results, and remaining work.

 

3rd meeting:

·        Present the project progress slides to the TA.

 

Final Report:

·        Email the report to the TA with the subject “ee386 project report”.

·        The report should contain: abstract, introduction and motivation, proposed design, methodology, experiments, related work, conclusion, and references.

·        The report should be 3~6 pages, double column, font size 10.

·        Here are some sample project reports from the previous year: paper1, paper2

 


 

Last updated: May 27, 2011