Y. Li and Z. Lan
Failure prediction, proactive fault management, high performance computing
Unlike rollback-recovery, proactive fault management takes preventive actions before failures occur. In this survey paper, we classify current research on proactive fault management into two broad categories: failure analysis and prediction, and proactive techniques. Analytical methods have been widely used to analyse and forecast continuous values, while data mining or machine learning methods are mostly suited to categorical data. Various proactive fault management systems have recently been developed, each exploring different proactive techniques to achieve its specific design goal. Our investigation shows that further research should be conducted in the context of high performance computing to enable efficient proactive fault management for emerging large-scale supercomputers.