SIParCS 2024 - Tri Nguyen

Autoscaling for HPC Runners

In this project, we aim to enhance the CI/CD infrastructure within NCAR by exploring the scalability and workflow efficiency of organizational runners. Although repository-specific self-hosted runners were created for earlier initiatives, they are difficult to deploy widely across an organization. The long-term vision is to develop a centralized CI/CD server, similar to the Exascale Computing Project's CI/CD system, featuring autoscaling, PBS scheduler integration, and robust authentication security.

To prepare future proposals, we will prototype the server and collect data on scalability, usage limitations, and hardware requirements. To that end, we investigated two options for deploying an autoscaling runner cluster. The first uses container-in-container technology, which enables portable rootless container deployments, to enhance security and avoid the complexity of installing tools such as Kubernetes and Helm that may require root access. After encountering significant roadblocks with this approach, we investigated a second solution: using webhooks to autoscale the runners with direct calls to the GitHub API. This method creates just-in-time runners that execute jobs and then terminate immediately upon completion.
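
The webhook-driven approach can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the function names, the `hpc` label, the runner group ID, and the `example-org` organization are all hypothetical. It assumes the service receives GitHub `workflow_job` webhook events and, for each newly queued job, prepares a request to GitHub's organization-level just-in-time runner configuration endpoint. No network call is made here; the sketch only shows the decision and the request that would be sent.

```python
GITHUB_API = "https://api.github.com"


def should_launch_runner(event_name, payload, required_label="hpc"):
    """Return True for a newly queued workflow job that targets our label.

    `required_label` is a hypothetical label chosen for this sketch.
    """
    if event_name != "workflow_job":
        return False
    if payload.get("action") != "queued":
        return False
    labels = payload.get("workflow_job", {}).get("labels", [])
    return required_label in labels


def build_jit_request(org, job_id, runner_group_id=1,
                      labels=("self-hosted", "hpc")):
    """Build the URL and JSON body for a just-in-time runner config request.

    The runner name, group ID, and labels are illustrative assumptions.
    """
    url = f"{GITHUB_API}/orgs/{org}/actions/runners/generate-jitconfig"
    body = {
        "name": f"jit-runner-{job_id}",
        "runner_group_id": runner_group_id,
        "labels": list(labels),
        "work_folder": "_work",
    }
    return url, body


# Example: a trimmed-down webhook payload for a queued job.
payload = {"action": "queued",
           "workflow_job": {"id": 42, "labels": ["self-hosted", "hpc"]}}
if should_launch_runner("workflow_job", payload):
    url, body = build_jit_request("example-org", payload["workflow_job"]["id"])
```

In a full service, the request would be sent as an authenticated POST, and the encoded configuration in the response would be passed to the runner's `run.sh --jitconfig` option. On an HPC system that launch could plausibly be wrapped in a PBS job, although how the prototype does this is not specified here.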

Mentors: Haiying Xu, Brian Vanderwende

Slides and poster

Nguyen_Tri_slides

Nguyen_Tri_Poster
