Lecture 7: May 12, 2009
Prognostics Monitoring in Data Centers
Dr. Aleksey Urmanov and Dr. Anton Bougaev, Sun Microsystems
Bios
Aleksey Urmanov is a research scientist at
Sun Microsystems. He earned his doctoral degree in
Nuclear Engineering at the University of Tennessee in 2002.
Dr. Urmanov’s research activities are centered around his
interest in pattern recognition, statistical learning
theory and ill-posed problems in engineering. His most
recent activities at Sun focus on developing health monitoring
and prognostics methods for EP-enabled computer servers.
He is a founder and an Editor of the Journal of Pattern
Recognition Research (www.JPRR.org).
Anton Bougaev holds M.S. and Ph.D. degrees in
Nuclear Engineering from Purdue University.
Before joining Sun Microsystems Inc. in 2007,
he was a lecturer in the Nuclear Engineering Department
and a member of the Applied Intelligent Systems Laboratory
(AISL) at Purdue University, West Lafayette, USA.
Dr. Bougaev is a founder and the Editor-in-Chief of
the Journal of Pattern Recognition Research.
His current focus is on reliability physics,
with emphasis on complex
system analysis and the physics of failure,
both based on data-driven pattern recognition
techniques.
Abstract
Prognostics solutions for mission-critical systems
require a comprehensive methodology for proactively
detecting and isolating failures, recommending and
guiding condition-based maintenance actions, and
estimating in real time the remaining useful life
of critical components and associated subsystems.
A major challenge has been to extend the benefits
of prognostics to include computer servers and other
electronic components. The key enabler for prognostics
capabilities is monitoring time series signals
relating to the health of executing components
and subsystems. Time series signals are processed
in real time using pattern recognition for proactive
anomaly detection and for remaining useful life
estimation. Examples will be presented of the use
of pattern recognition techniques for early detection
of a number of mechanisms that are known to cause
failures in electronic systems,
including environmental issues; software aging;
degraded or failed sensors; degradation of hardware
components; and degradation of mechanical, electronic,
and optical interconnects. Prognostics pattern
classification is helping to substantially increase
component reliability margins and meet system availability
goals while reducing costly "no trouble found"
events, which have become a significant warranty-cost issue.
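To make the idea of processing time series signals for anomaly detection and remaining useful life estimation concrete, here is a minimal illustrative sketch in Python. It is not the speakers' method: the trailing-window z-score detector, the straight-line trend extrapolation, and the names detect_anomalies, estimate_rul, window, z_threshold, and failure_threshold are all assumptions chosen for simplicity.

```python
from statistics import mean, stdev


def detect_anomalies(signal, window=20, z_threshold=4.0):
    """Flag samples that deviate strongly from a trailing baseline.

    Hypothetical detector: a sample is flagged when it lies more than
    z_threshold sample standard deviations from the mean of the
    preceding `window` samples.
    """
    flagged = []
    for i in range(window, len(signal)):
        baseline = signal[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Skip perfectly flat baselines to avoid dividing by zero.
        if sigma > 0 and abs(signal[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


def estimate_rul(signal, failure_threshold):
    """Estimate remaining useful life, in samples, by linear extrapolation.

    Fits a least-squares line to the signal (needs at least two samples)
    and returns how many samples remain after the last one until the line
    reaches failure_threshold. Returns None when the fitted trend is not
    rising, since no crossing can then be projected.
    """
    n = len(signal)
    t_mean = (n - 1) / 2
    y_mean = mean(signal)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(signal))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    if slope <= 0:
        return None
    intercept = y_mean - slope * t_mean
    t_cross = (failure_threshold - intercept) / slope
    return max(0.0, t_cross - (n - 1))
```

In this sketch, detect_anomalies would be applied to a monitored signal such as a temperature reading, and estimate_rul to a slowly drifting degradation indicator with a known failure limit. A production system would likely replace both with a trained model of normal behavior and a statistically grounded decision test.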
Lecture Notes
Lecture 7 Charts in PDF