SIParCS 2017 - Hoang Nguyen
Improving HPC Scheduler Through Machine Learning and Statistics
The Portable Batch System (PBS) currently running on the Cheyenne and Laramie systems at NCAR is a powerful workload manager for scheduling, managing, and monitoring HPC jobs. However, its scheduler is prone to delaying job execution because the running times of submitted jobs are overestimated. This not only wastes hardware resources but also extends the time some users must wait for their jobs to execute. This project focused on reducing the unnecessary user wait times caused by mispredicted job running times. The data for this research consisted of Cheyenne accounting logs from January through July 2017. To address the problem, we employed Deep Neural Network (DNN) and Random Forest (RF) algorithms to predict and classify jobs' actual running times more accurately: the DNN model predicts the actual running time, and the RF model classifies jobs into groups by time-estimation error. We developed the models and evaluated them against incoming data from the Cheyenne system using a PBS simulator, which reproduces scheduler behavior from input accounting logs. The optimized scheduler demonstrated cumulative savings of up to 24.7 node-hours for the Cheyenne system (22 seconds per day per node). By combining the RF and DNN models on accounting logs, the optimized PBS scheduler will be able to execute jobs much more efficiently.
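The sketch below is a minimal illustration, not the project's actual code, of the two-model approach described above: a Random Forest classifier that bins jobs by walltime-estimation error and a small neural-network regressor that predicts actual running time. The features (user, queue, requested nodes, requested walltime) and the synthetic data are hypothetical stand-ins for fields that would be parsed from PBS accounting logs.

```python
# Hedged sketch of the RF + DNN approach; synthetic data stands in for
# features parsed from Cheyenne PBS accounting logs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_jobs = 1000

# Hypothetical job features: encoded user, encoded queue,
# requested node count, requested walltime in hours.
X = np.column_stack([
    rng.integers(0, 50, n_jobs),
    rng.integers(0, 5, n_jobs),
    rng.integers(1, 128, n_jobs),
    rng.uniform(0.5, 12.0, n_jobs),
])
# Actual runtime is some fraction of the requested walltime.
actual_hours = X[:, 3] * rng.uniform(0.1, 1.0, n_jobs)

# RF target: coarse bins of estimation error (actual / requested walltime).
error_ratio = actual_hours / X[:, 3]
error_class = np.digitize(error_ratio, bins=[0.25, 0.5, 0.75])

X_tr, X_te, y_reg_tr, y_reg_te, y_cls_tr, y_cls_te = train_test_split(
    X, actual_hours, error_class, test_size=0.2, random_state=0)

# Random Forest classifies jobs into groups of time-estimation error.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_cls_tr)

# Neural-network regressor predicts the actual running time directly.
dnn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
dnn.fit(X_tr, y_reg_tr)

print("RF classification accuracy:", rf.score(X_te, y_cls_te))
print("DNN regression R^2:", dnn.score(X_te, y_reg_te))
```

In practice, the classifier's error group and the regressor's predicted runtime could be fed back to the scheduler (or a simulator replaying accounting logs) to replace the user-supplied walltime estimate when backfilling jobs.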
Mentors: Ben Matthews, Tom Kleespies