SIParCS 2024 - Thomas Sorkin
Analysis of the Slingshot High-Speed Network
NCAR’s flagship supercomputer, Derecho, uses the Hewlett Packard Enterprise (HPE) Slingshot interconnect to link its 2570 nodes and 200 switches. Slingshot, an Ethernet-based high-speed network built on a Dragonfly topology, offers higher bisection bandwidth and a lower network diameter than other interconnects. However, the complexity of Slingshot’s design makes it difficult to reason intuitively about the network. This project builds visualizations of the network fabric in Grafana and in Python (using matplotlib) to analyze Slingshot’s non-deterministic congestion control behavior and to assess the impact of network variability on job performance. The visualizations also give a clear representation of the underlying Dragonfly topology, which guarantees a maximum of three hops between any two endpoints and is difficult to depict efficiently because it is so highly connected. Understanding and optimizing this novel interconnect is crucial because many Top500 systems, such as Frontier, use Slingshot. The visualizations draw on network counters gathered by a collection pipeline that runs under systemd on Derecho’s network fabric manager and is built with Bash, Telegraf, and the networkx Python package. These counters are used to derive switch performance metrics such as bandwidth usage, congestion frequency, and network idle time. By combining the visualizations with these metrics, the project quantifies Slingshot’s performance attributes and behavior. This work helps inform Derecho’s network tuning and job placement policies, and contributes to maximizing the computational efficiency of supercomputers.
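The three-hop property mentioned in the abstract can be illustrated with a small sketch. It is not the project's code: the group count, the switches per group, and the round-robin rule for placing global links are all hypothetical choices made for illustration, and Derecho's real wiring may differ. The sketch counts switch-to-switch hops only and uses the Python standard library so it runs anywhere; the same graph could be handed to networkx for drawing.

```python
from collections import deque
from itertools import combinations


def build_dragonfly(num_groups, switches_per_group):
    """Return an adjacency dict for a toy Dragonfly of switches.

    Nodes are (group, index) tuples. Sizes and wiring are illustrative
    assumptions, not Derecho's real configuration.
    """
    adj = {(g, s): set()
           for g in range(num_groups)
           for s in range(switches_per_group)}

    # Local links: every switch in a group connects to every other one.
    for g in range(num_groups):
        for a, b in combinations(range(switches_per_group), 2):
            adj[(g, a)].add((g, b))
            adj[(g, b)].add((g, a))

    # Global links: one link per pair of groups, spread round-robin
    # over the switches of each group (an assumed wiring rule).
    for g1, g2 in combinations(range(num_groups), 2):
        u = (g1, g2 % switches_per_group)
        v = (g2, g1 % switches_per_group)
        adj[u].add(v)
        adj[v].add(u)
    return adj


def eccentricity(adj, source):
    """Longest shortest-path distance (in hops) from source, via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())


def diameter(adj):
    """Maximum switch-to-switch hop count over all pairs."""
    return max(eccentricity(adj, node) for node in adj)


fabric = build_dragonfly(num_groups=9, switches_per_group=4)
print(len(fabric), "switches, diameter", diameter(fabric), "hops")
```

Because every group is fully connected internally and every pair of groups shares a global link, the worst-case route in this model is one local hop, one global hop, and one more local hop, so the computed diameter never exceeds three regardless of the sizes chosen.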
Mentors: William Shanks, Storm Knight
Slides and poster