Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems

Saurabh Jha, Shengkun Cui, Subho S. Banerjee, Tianyin Xu, Jeremy Enos, Mike Showerman, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer

Supercomputing 2020

Abstract

Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash) and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.

Awards

Best Paper & Best Student Paper Finalist

Citation

@inproceedings{Jha2020_SC,
  author = {Jha, Saurabh and Cui, Shengkun and Banerjee, Subho S. and Xu, Tianyin and Enos, Jeremy and Showerman, Mike and Kalbarczyk, Zbigniew T. and Iyer, Ravishankar K.},
  title = {Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems},
  year = {2020},
  isbn = {9781728199986},
  publisher = {IEEE Press},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  articleno = {65},
  numpages = {16},
  location = {Atlanta, Georgia},
  series = {SC '20}
}

Related Projects

Intelligence Augmented Computer Systems

Subho Sankar Banerjee