SIParCS 2024 - Analiese Gonzalez
Improving Data Center Visibility with Machine Learning
NSF NCAR’s supercomputers, Casper and Derecho, are powerful machines that perform extensive calculations for researchers, often on large datasets. These systems house high-performance nodes essential for complex computational tasks. Although the nodes are reliable, failures occasionally occur. This project applies AI and machine learning techniques to predict when a node may fail, with the aim of reducing troubleshooting time and preventing future issues. Neural networks are particularly useful for this task because, once trained, they can make predictions automatically, reducing the need for intervention from NSF NCAR staff. Real-time data center metrics are stored in a timescale database comprising 20 tables of time-series data, which cover Cheyenne, Casper, Gust, and Derecho. The student analyzed several variables, including CPU usage and memory usage, using a K-Means clustering model developed in JupyterHub with Python to explore patterns in the data. After examining the clusters, the student used neural networks with sliding windows to predict node unavailability. The models revealed both similarities and differences between the two machines: the time at which a node changes state strongly affects node failure on both, whereas CPU and memory usage have a greater impact on Derecho than on Casper.
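The sliding-window step mentioned above can be illustrated with a minimal sketch. It is not the project's actual code: the function name `make_windows`, the `window` and `horizon` parameters, the array layout, and the toy data are all assumptions made for illustration. Each window of consecutive metric samples becomes one training example, labeled with whether the node is unavailable shortly after the window ends.

```python
import numpy as np

def make_windows(metrics, down, window=3, horizon=1):
    """Build sliding-window examples from one node's time series.

    metrics: (T, F) array of per-timestep features (e.g. CPU, memory usage).
    down:    (T,) array of 0/1 flags, 1 meaning the node was unavailable.
    Returns X of shape (N, window, F) and y of shape (N,), where y[i] is the
    flag `horizon` steps after the last sample in window i.
    """
    metrics = np.asarray(metrics, dtype=float)
    down = np.asarray(down, dtype=int)
    X, y = [], []
    last_start = len(metrics) - window - horizon
    for start in range(last_start + 1):
        end = start + window              # index just past the window
        X.append(metrics[start:end])      # the window's feature rows
        y.append(down[end + horizon - 1]) # label taken after the window
    return np.array(X), np.array(y)

# Toy data (hypothetical): 6 timesteps, 2 features, node down at step 4.
toy_metrics = np.arange(12).reshape(6, 2)
toy_down = [0, 0, 0, 0, 1, 0]
X, y = make_windows(toy_metrics, toy_down, window=3, horizon=1)
print(X.shape, y)
```

Arrays shaped like `X` and `y` could then be passed to a neural network as inputs and targets; the window length and prediction horizon would need to be tuned against the real data.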
Mentors: Ben Matthews, Jenett Tillotson
Slides and poster