SIParCS 2024 - Tri Nguyen
Autoscaling for HPC Runners
In this project, we aim to enhance the CI/CD infrastructure within NCAR by exploring the scalability and workflow efficiency of organizational runners. Although repository-specific self-hosted runners were created for earlier initiatives, they are difficult to deploy widely at the organization level. The long-term vision is to develop a centralized CI/CD server, similar to the Exascale Computing Project's CI/CD system, featuring autoscaling, PBS scheduler integration, and robust authentication security.
To prepare future proposals, we will prototype the server and collect data on scalability, usage limitations, and hardware requirements. To that end, we investigated two options for deploying an autoscaling runner cluster. The first uses container-in-container technology, which enables portable rootless container deployments, to enhance security and avoid the complexities of installing tools such as Kubernetes and Helm that may require root access. After encountering significant roadblocks with this approach, we investigated a second solution: using webhooks to autoscale the runners with direct calls to the GitHub API. This method creates just-in-time runners that execute a job and then terminate immediately upon completion.
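As a rough illustration of the webhook approach, the Python sketch below shows the decision logic a webhook receiver might use. It is a minimal sketch under stated assumptions, not the project's actual implementation: the function names, the runner naming scheme, the default runner group ID, and the `_work` folder are all hypothetical. It relies only on documented GitHub behavior: webhook deliveries are signed in the `X-Hub-Signature-256` header, a `workflow_job` event with action `queued` signals a waiting job, and `POST /orgs/{org}/actions/runners/generate-jitconfig` creates a just-in-time runner configuration.

```python
import hashlib
import hmac

GITHUB_API = "https://api.github.com"


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header GitHub attaches to each delivery."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected, signature_header)


def should_scale_up(event: str, payload: dict) -> bool:
    """A queued workflow job is the trigger for starting one new runner."""
    return event == "workflow_job" and payload.get("action") == "queued"


def build_jit_request(org: str, job_id: int, labels: list,
                      runner_group_id: int = 1) -> tuple:
    """Build the URL and JSON body for the just-in-time runner request.

    The runner name, group ID default, and work folder are illustrative
    assumptions, not values taken from the project.
    """
    url = f"{GITHUB_API}/orgs/{org}/actions/runners/generate-jitconfig"
    body = {
        "name": f"jit-{job_id}",
        "runner_group_id": runner_group_id,
        "labels": labels,
        "work_folder": "_work",
    }
    return url, body
```

In a full deployment, the receiver would send this request with an authenticated token and pass the `encoded_jit_config` value from the response to the runner's `run.sh --jitconfig` option. How that runner process is launched (for example, inside a PBS job) is left out here because it depends on the site configuration.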
Mentors: Haiying Xu, Brian Vanderwende
Slides and poster