EE386: Robust System Design

Stanford University, Spring Quarter 2008-2009

Professor Subhasish Mitra



Announcements


Course info

Lectures: Tuesdays & Thursdays, 2:15-3:30 p.m.

Location: Gates 260 or 359 (see the course calendar for the exact location)

 

Course description:

This is an advanced course that covers the unique challenges and opportunities in building robust systems. Major topics include: causes of system failures; state-of-the-art modeling techniques and their limitations; and techniques for building robust systems that either avoid such failures or are resilient to them through built-in error detection, failure prediction, self-recovery, and self-repair. Robust system design is a new, exciting area of research. EE386 will be a research-oriented course and will explore new research problems.

 

Prerequisites:

Students are expected to have the necessary background in digital design (EE108A, EE108B). Background in EE271 and EE282 is helpful but not strictly necessary.

 

Textbook and course materials:

There is no textbook.  Lecture notes and paper references used in the course will be available from the class web page.

 

Course requirements:

·        Seminar: After the 1st month of class, which consists mainly of traditional lectures, students (preferably in groups of 2) will be asked to select a topic from an upcoming lecture. The students are then expected to research the topic and to produce slides for leading the class discussion on it.

 

·        Final project: In consultation with the course staff, students (preferably in groups of 2) will design and implement a research project relevant to the robust system design topics taught in class. The projects may be related to ongoing robust systems research at Stanford or may be independently conceived, but should be of appropriate scale and interest for the research nature of this course.

 

·        Paper Review: Students, individually, are expected to submit a short review of assigned papers, identifying the following elements: strengths (1~3 sentences), weaknesses (1~3 sentences), and possible improvements and/or scope for future research (1~3 sentences). Reviews are due before each class and should be emailed to the TA with the subject “ee386 paper review”. The review should be included in the main body of the email, not as an attachment. If multiple papers are assigned for a day, choose only one of them to review.

 

Grading (tentative):

·        Class participation: 10%

·        Paper Review: 10%

·        Seminar: 30%

·        Project: 50%

 



Contact

Instructor: Subhasish Mitra (subh at stanford dot edu)

Gates 333, (650) 724-1915. 

Office hours: Tuesday & Thursday 3:30pm-4:30pm

 

Administrative assistant: Uma Mulukutla (uma at cs dot stanford dot edu)

  Gates 303, (650) 725-3726, Fax (650) 725-6949

 

Teaching assistant: Sung-Boem Park (sbpark84 at stanford dot edu)

  Gates 239

Office hours: Monday 2:30pm-3:30pm, Thursday 1pm-2pm

 



Course Calendar

The list of references can be found in the References section below.

| Date | Location | Subject | Reviews | Optional Reading | Due |
| Tue 3/31 | | Introduction | | Testing_basics* | |
| Thu 4/2 | | Hardware Testing: Fault Models, Test Metrics | | Fault_models*, [Agrawal 82] | |
| Tue 4/7 | | Design for Testability | | DFT* | |
| Thu 4/9 | Gates 260 | Built In Self Test | | | |
| Fri 4/10 | Gates 260 | Delay testing (1) | [Hamzaoglu 99], [Mitra 04a], [Mitra 04b], [Rajski 04] | Delay Testing* | Paper review; Group formation; Seminar topic selection; 1st meeting signup |
| Tue 4/14 | | No class | | | |
| Thu 4/16 | Gates 260 | Delay testing (2) | | | |
| Tue 4/21 | | No class | | | |
| Thu 4/23 | | No class | | | |
| Tue 4/28 | Gates 359 | Robust System Design – Models, Metrics & Redundancy (1) | | [Siewiorek 98], [Trivedi 01], [Pradhan 96] | Project proposal; 2nd meeting signup |
| Thu 4/30 | | No class | | | |
| Tue 5/5 | | No class | | | |
| Thu 5/7 | Gates 260 | Robust System Design – Models, Metrics & Redundancy (2) | | Modeling&Redundancy | |
| Fri 5/8 | Gates 300 | Error Correcting Codes | | Coding | |
| Mon 5/11 | Gates 260 | System Effects of Errors & Evaluation | [Wang 04], [Mukherjee 03], [Seshia 07] | [Sanda 08], [Rivers 08], [Bender 08] | |
| Tue 5/12 | Gates 359 | Circuit & Logic-level Techniques | [Mitra 00], [Zhang 06] | [Sanda 08], [Patel 82], [Franco 94] | Paper review |
| Thu 5/14 | Gates 260 | Architectural-level Techniques | [Austin 04], [Reinhardt 00] | [Austin 99], [Mukherjee 02], [Bernick 05] | |
| Mon 5/18 | Gates 300 | Software Techniques | [Mahmood 88], [Oh 02b], [Lovellette 02] | [Oh 02a] | Paper review |
| Tue 5/19 | Gates 359 | Application Dependent Techniques | [Li 07], [Pattabiraman 07] | [Banerjee 84], [Huang 84], [Swift 03], [Wang 07], [Hangal 05], [Ernst 07] | |
| Thu 5/21 | Gates 260 | Checkpointing & Recovery | [Meaney 05], [Bronevetsky 04] | [Wang 05], [Shyam 06], [Bronevetsky 06] | Project progress slides; 3rd meeting signup |
| Tue 5/26 | Gates 104 | Process variations - Guest Lecture: Jim Tschanz | | | Paper review |
| Thu 5/28 | Gates 260 | System and Circuit Failure Prediction | [Agarwal 07], [Gross 06] | [Gross 02] | |
| Mon 6/1 (5:30pm) | Gates 300 | On-line Self-Test & Validation | [Constantinides 07], [Inoue 08], [Li 08] | | Paper review |
| Tue 6/2 | Gates 159 | Project Presentation | | | Project presentation |
| Tue 6/9 | | | | | Project final report |

·        = Supplementary classes planned to make up for missed classes

·        * = Incomplete book chapters that require registration to view.


 


References

[Agarwal 07] Agarwal, M., et al. “Circuit failure prediction and its application to transistor aging”, in Proc. VLSI Test Symp., 2007, pp.277-286.

[Agrawal 82] Agrawal, V.D., S.C. Seth, and P. Agrawal, “Fault Coverage Requirements in Production Testing of LSI Circuits”, in IEEE JSSC, vol. SC-17, pp.57-61, Feb 1982.

[Austin 99] Austin, T.M., “DIVA: a reliable substrate for deep submicron microarchitecture design”, in Proc. Intl. Symp. on Microarchitecture, 1999, pp.196-207.

[Austin 04] Austin, T.M., “Making typical silicon matter with Razor”, in Computer, vol. 37, no. 3, Mar 2004, pp.57-65.

[Banerjee 84] Banerjee, P., and J.A. Abraham, “Fault-secure algorithms for multiple processor systems”, in Proc. Intl. Symp. on Computer Architecture, 1984, pp.279-287.

[Bender 08] Bender, C., et al. “Soft-error resilience of the IBM POWER6 processor input/output subsystem”, in IBM Journal of Research and Development, vol. 52, no. 3, 2008, pp.285-292.

[Bernick 05] Bernick, D., et al. “NonStop® advanced architecture”, in Proc. Intl. Conf. on Dependable Systems and Networks, 2005, pp.12-21.

[Bronevetsky 04] Bronevetsky, G., et al., “Application level checkpointing for shared memory programs”, in Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2004, pp. 235-247.

[Bronevetsky 06] Bronevetsky, G., et al., “Recent Advances in Checkpoint Recovery Systems”, in Next Generation Systems Program Workshop at IPDPS, 2006.

[Constantinides 07] Constantinides, K., et al., “Software-based online detection of hardware defects: mechanisms, architectural support, and evaluation”, in Proc. Intl. Symp. on Microarchitecture, 2007, pp. 97-108.

[Ernst 07] Ernst, M.D., “The Daikon system for dynamic detection of likely invariants”, in Science of Computer Programming, vol. 69, no. 1-3, pp.35-45, Dec 2007.

[Franco 94] Franco, P., and E.J. McCluskey, “Online delay testing of digital circuits”, in Proc. IEEE VLSI Test Symp., 1994.

[Gross 02] Gross, K.C. and W. Lu, “Early detection of signal and process anomalies in enterprise computing systems”, Proc. Intl. Conf. on Machine Learning and Applications, 2002, pp.204-210.

[Gross 06] Gross, K.C., K. Whisnant and A. Urmanov, “Electronic Prognostics through continuous system telemetry”, Proc. Meeting of the Society for Machinery Failure Prevention Technology, 2006.

[Hamzaoglu 99] Hamzaoglu, I., J.H. Patel, “Reducing Test Application Time for Full Scan Embedded Cores”, in Proc. Intl. Symp. on Fault-Tolerant Computing, 1999, pp. 260-267.

[Hangal 05] Hangal S., et al. “IODINE: a tool to automatically infer dynamic invariants for hardware designs”, in Proc. IEEE/ACM Design Automation Conference, 2005.

[Heragu 96] Heragu, K., J.H. Patel, V.D. Agrawal, “Segment delay faults: a new fault model”, in Proc. VLSI Test Symp., 1996, pp.32-39.

[Huang 84] Huang, K.H., and J.A. Abraham, “Algorithm-based fault tolerance for matrix operations”, in IEEE Trans. on Computers, vol. C-33, no. 6, June 1984, pp.518-528.

[Inoue 08] Inoue, H., Y. Li and S. Mitra, “VAST: Virtualization-assisted Concurrent Autonomous Self-Test”, in Proc. Intl. Test Conference, 2008, pp. 1-10.

[Li 07] Li, X., and D. Yeung, “Application-level correctness and its impact on fault tolerance”, in Proc. Intl. Symp. on High Performance Computer Architecture, 2007, pp. 181-192.

[Li 08] Li, Y., S. Makar, and S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test using Stored Test Patterns”, Proc. Design, Automation and Test in Europe, 2008, pp. 885-890.

[Lovellette 02] Lovellette, M.N. et al., “Strategies for fault-tolerant, space-based computing: lessons learned from the ARGOS testbed”, in Proc. Aerospace Conf., 2002 pp.2109-2119.

[Rudnick 91] Rudnick, E.M., T.M. Niermann, J.H. Patel, “Methods for reducing events in sequential circuit fault simulation”, in Proc. Intl. Conf. on Computer-Aided Design, 1991, pp. 546-549.

[Mahmood 88] Mahmood, A., E.J. McCluskey, “Concurrent error detection using watchdog processors-a survey”, IEEE Trans. on Computers, vol. 37, no. 2, Feb 1988, pp.160-174.

[Meaney 05] Meaney, P.J. et al., “IBM z990 soft error detection and recovery”, in IEEE Trans. on Device and Materials Reliability, vol. 5, no.3, Sept 2005, pp.419-427.

[Mitra 00] Mitra, S., E.J. McCluskey, “Which concurrent error detection scheme to choose?”, in Proc. Intl. Test Conf., 2000, pp.985-994.

[Mitra 04b] Mitra, S., and K.S. Kim, “X-compact: An Efficient Response Compaction Technique”, in IEEE TCAD, vol.23, no.3, pp.421-432, Mar 2004.

[Mitra 04a] Mitra, S., S.S. Lumetta, and M. Mitzenmacher, “X-tolerant Signature Analysis”, in Proc. Intl. Test Conf. 2004, pp.432-441.

[Mukherjee 02] Mukherjee, S.S., M. Kontz, and S.K. Reinhardt, “Detailed design and evaluation of redundant multi-threading alternatives”, in Proc. Intl. Symp. on Computer Architecture, 2002, pp. 99-110.

[Mukherjee 03] Mukherjee, S.S., et al., “A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor”, in Proc. Intl. Symp. on Microarchitecture, 2003, pp.29-41.

[Nicolaidis

[Niermann 92] Niermann, T.M., W-T. Cheng, J.H. Patel, “PROOFS: a fast, memory-efficient sequential circuit fault simulator”, in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol.11, no.2, pp.198-207, Feb 1992.

[Oh 02a] Oh, N., P.P. Shirvani, E.J. McCluskey, “Control-flow checking by software signatures”, in IEEE Trans. on Reliability, vol. 51, no. 1, Mar 2002, pp.111-122.

[Oh 02b] Oh, N., P.P. Shirvani, E.J. McCluskey, “Error detection by duplicated instructions in super-scalar processors”, in IEEE Trans. on Reliability, vol. 51, no. 1, Mar 2002, pp. 63-75.

[Patel 82] Patel, J.H., et al. “Concurrent error detection in ALUs by Recomputing with Shifted Operands”, in IEEE Trans. Computers, vol. C-31, no. 7, Jul 1982, pp.589-595.

[Pattabiraman 07] Pattabiraman, K., Z. Kalbarczyk, R.K. Iyer, “Automated Derivation of Application-aware Error Detectors Using Static Analysis”, in Proc. Intl. On-Line Testing Symposium, 2007, pp.211-21.

[Pradhan 96] Pradhan, D.K., ed. “Fault tolerant computer system design”, Prentice-Hall, 1996.

[Rajski 04] Rajski, J., et al. “Embedded Deterministic Test”, in IEEE TCAD, vol. 23, no. 5, pp. 776-792, May 2004.

[Rivers 08] Rivers, J.A., et al. “Phaser: Phased methodology for modeling the system-level effects of soft errors”, in IBM Journal of Research and Development, vol. 52, no. 3, 2008, pp.293-306.

[Reinhardt 00] Reinhardt, S.K. “Transient fault detection via simultaneous multithreading”, in Proc. Intl. Symp. Computer Architecture, 2000, pp.25-36.

[Sanda 08] Sanda, P.N. et al., “Soft-error resilience of the IBM POWER6 processor”, in IBM Journal of Research and Development, vol. 52, no. 3, 2008, pp.275-284.

[Seshia 07] Seshia, S.A., W. Li, S. Mitra, “Verification-guided soft error resilience”, in Proc. Conf. on Design, Automation and Test in Europe, 2007, pp.1442-1447.

[Shyam 06] Shyam, S., et al., “Ultra low cost defect protection for microprocessor pipelines”, in Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 73-82.

[Siewiorek 98] Siewiorek, D.P. and R.S. Swarz, “Reliable Computer Systems – Design and Evaluation”, Digital Press (distributed by Butterworth), 1998.

[Swift 03] Swift, M. M., et al. “Improving the Reliability of Commodity Operating Systems”, in Proc. ACM Symp. on Operating Systems Principles, 2003.

[Trivedi 01] Trivedi, K.S., “Probability and statistics with reliability, queuing, and computer science applications”, 2nd ed, Wiley-Interscience, 2001.

[Wang 04] Wang, N.J., et al. “Characterizing the effects of transient faults on a high performance processor pipeline”, in Proc. Intl. Conf. on Dependable Systems and Networks, 2004, pp. 61-70.

[Wang 05] Wang, L., et al., “Modeling coordinated checkpointing for large-scale supercomputers”, in Proc. Intl. Conf. on Dependable Systems and Networks, 2005, pp. 812-821.

[Wang 07] Wang, L., et al., “An OS-level framework for providing application-aware reliability”, in Proc. Pacific Rim Intl. Symp. on Dependable Computing, 2005.

[Zhang 06] Zhang, M., et al., “Sequential element design with built-in soft error resilience”, in IEEE Trans. on VLSI Systems, vol. 14, no. 12, Dec 2006.

 


 

Project/Seminar Guidelines

Project Group Formation:

·        One person from each group should email the names of the group members to the TA with the subject “ee386 group members”.

 

Seminar Topic Selection:

·        From the last eight topics listed on the course calendar, choose the ones you like and list them in order of preference.

·        Email the list to the TA with the subject “ee386 seminar preference”.

 

1st meeting: 4/16, 4/17

·        Discuss the seminar and project topics. 

 

Project Proposal:

·        Email the proposal to the TA with the subject “ee386 project proposal”.

·        Proposal should contain: project description, motivation, related work, methodology, expected outcome and schedule.

·        Proposal should be no longer than 2 pages, double column, font size 10.

 

Seminar:

·        First draft of the slides must be submitted at least two weeks before your seminar.

·        Decide on the most representative set of papers (1~3) that you want the students to review before your seminar. You do not have to select from the papers listed on the course calendar. The decision must be made at least two weeks before your seminar. Students are required to review only ONE PAPER PER WEEK, so do not assume that everybody has read every paper you selected.

·        The final draft of the slides must be submitted at least two days before your seminar.

·        Your seminar will last for the entire class period (1 hour 15 minutes).

 

2nd meeting: 5/1, 5/2

·        Discuss the project proposal.

 

Project Progress Slides

·        Email the slides to the TA with the subject “ee386 project progress”.

·        The slides should contain: project description, methodology, preliminary results, and remaining work.

 

3rd meeting: 5/21, 5/22

·        Present the project progress slides to the TA.

 

Final Report:

·        Email the report to the TA with the subject “ee386 project report”.

·        The report should contain: abstract, introduction & motivation, proposed design, methodology, experiments, related work, conclusion, and references.

·        The report should be 3~6 pages, double column, font size 10.

·        Here are some sample project reports from the previous year: paper1, paper2

 


 

Last updated: May 31, 2009