Leone, Davide
(2025)
Data-Informed Workload Dispatching in HPC Systems.
[Master's degree thesis], Università di Bologna, Degree Programme in
Artificial intelligence [LM-DM270], Full-text document not available
The full text is not available by the author's choice.
Abstract
High-Performance Computing (HPC) systems are critical to scientific discovery and large-scale data processing, yet efficient workload dispatching remains a persistent challenge, particularly under high system load. Traditional scheduling policies, such as Shortest Job First (SJF) and EASY Backfilling (EASY-BF), rely on user-provided runtime estimates, which are often inaccurate and lead to suboptimal resource utilization.
This thesis explores the integration of machine learning (ML) models into HPC workload dispatching by replacing static runtime estimates with predictive models trained on historical job data. We evaluate both regression-based and classification-based approaches, using enriched feature sets that incorporate user behavior and job descriptors. In addition, we propose a novel scheduling policy, Smallest Energy First (SEF), which prioritizes jobs based on predicted energy consumption, aligning dispatching with energy-efficiency goals.
Using the PM100 dataset from the CINECA Marconi100 supercomputer and simulating workloads via the AccaSim framework, we conduct extensive experiments under both light-load and high-load conditions. Experimental results show that ML-driven dispatching substantially reduces average waiting time and job slowdown, particularly under heavy system load. Moreover, the proposed SEF policy outperforms prediction-enhanced versions of traditional schedulers in minimizing waiting time and achieves comparable or better performance in terms of average slowdown. Our findings highlight the potential of predictive scheduling to enhance performance and sustainability in HPC systems.
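As an illustration of the Smallest Energy First idea described above, the following minimal sketch orders a queue by predicted energy and, for comparison, by predicted runtime as in SJF. It is not the thesis implementation: the `Job` class, its field names, and the sample values are hypothetical, and the predictions are assumed to come from models that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; field names are assumptions for illustration.
    job_id: str
    predicted_runtime_s: float  # assumed output of a runtime prediction model
    predicted_energy_j: float   # assumed output of an energy prediction model


def sjf_order(queue):
    """Shortest Job First: the job with the smallest predicted runtime goes first."""
    return sorted(queue, key=lambda job: job.predicted_runtime_s)


def sef_order(queue):
    """Smallest Energy First: the job with the smallest predicted energy goes first."""
    return sorted(queue, key=lambda job: job.predicted_energy_j)


# Made-up sample queue, not taken from the PM100 dataset.
queue = [
    Job("a", predicted_runtime_s=3600.0, predicted_energy_j=2.0e6),
    Job("b", predicted_runtime_s=600.0, predicted_energy_j=9.0e6),
    Job("c", predicted_runtime_s=1800.0, predicted_energy_j=5.0e5),
]

print([job.job_id for job in sjf_order(queue)])  # ['b', 'c', 'a']
print([job.job_id for job in sef_order(queue)])  # ['c', 'a', 'b']
```

The two policies differ only in the sorting key, which is why a short but power-hungry job such as "b" can be first under SJF and last under SEF.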
Document type
Degree thesis
(Master's degree)
Thesis author
Leone, Davide
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
High-Performance Computing, scheduling policies, runtime prediction, machine learning
Thesis defense date
22 July 2025
URI