SIParCS 2019 - Broday Walker
Deploying File System Performance Metrics Through XSEDE Metrics on Demand (XDMoD)
The National Center for Atmospheric Research (NCAR) – Wyoming Supercomputing Center provides users with high-performance computing capabilities through Cheyenne, a supercomputer composed of 4,032 nodes that run thousands of jobs per day. To monitor and maintain Cheyenne, system administrators need detailed statistics on how the supercomputer is used.
Open XSEDE Metrics on Demand (XDMoD) was selected to characterize resource usage on Cheyenne. Alongside XDMoD, the Job Performance module (SUPReMM) summarizes job-level performance data collected by Ganglia, a distributed monitoring system running on each node. XDMoD/SUPReMM ingests the Ganglia data from Cheyenne, and the resulting job-level summaries are made available online through NCAR’s XDMoD portal. The data collected and displayed by XDMoD can be used to monitor the overall efficiency of Cheyenne and to diagnose issues with individual jobs. Given a user name and a job ID number, XDMoD can quickly locate the relevant records and metrics for debugging.
In the process of installing and upgrading XDMoD 8.1, we added four new filesystem metrics to monitor data transfer rates on Cheyenne. These metrics will be included in a weekly report that is automatically generated and distributed to system administrators. In addition, the standard procedures for installing, configuring, and upgrading NCAR’s XDMoD instance were documented step by step in a wiki for future reference.
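As a rough illustration of what a data transfer rate metric involves, the sketch below derives mean read and write rates from two or more samples of cumulative byte counters taken during a job. The abstract does not say how the four new metrics are defined or what form the Ganglia data takes, so the `CounterSample` type, its field names, and the assumption of monotonically increasing counters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical sample of cumulative filesystem counters for one job."""
    timestamp: float    # seconds since some fixed reference time
    bytes_read: int     # cumulative bytes read so far
    bytes_written: int  # cumulative bytes written so far


def mean_transfer_rates(samples):
    """Return (read_rate, write_rate) in bytes per second.

    Assumes the counters only increase over the sampled window;
    counter resets are not handled in this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")

    # Order by time so the first and last samples bound the window.
    ordered = sorted(samples, key=lambda s: s.timestamp)
    first, last = ordered[0], ordered[-1]

    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")

    read_rate = (last.bytes_read - first.bytes_read) / elapsed
    write_rate = (last.bytes_written - first.bytes_written) / elapsed
    return read_rate, write_rate
```

Averaging per-job rates like these across all jobs in a week is one plausible way a weekly summary could be built, though the report described here may aggregate differently.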
Mentor: Shiquan Su
Slides