Tagliani, Michele
(2025)
Resource Management of HPC Infrastructures based on Kubernetes.
[Master's degree thesis (Laurea magistrale)], Università di Bologna, Degree Programme in
Computer Engineering (Ingegneria informatica) [LM-DM270]
Abstract
With the increasing adoption of AI applications, researchers require different types of computing resources depending on their workload. High Performance Computing (HPC) infrastructure is typically favoured for tasks such as model training and classic computational applications. In contrast, Cloud environments, particularly those built on Kubernetes, are preferred for data processing tasks, inference services, and databases. Currently, HPC and Cloud clusters often run on separate infrastructures and use distinct cluster management tools. This segregation poses two main problems for administrators. First, it increases the operational burden, since different systems must be managed with different tools. Second, it leads to inefficient use of valuable resources such as GPUs, which cannot be dynamically transferred between physically separate clusters; this limits scalability and leaves resources underutilized.
In this thesis, we propose a new method that converges the management of HPC and Cloud environments at the node level, targeting Slurm and Kubernetes as the workload managers of choice. To achieve this goal, we extend the Kubernetes management tool Cluster API (CAPI) with support for Virtual Kubelets, enabling the provisioning and bootstrapping of Slurm clusters. Our solution highlights the benefits of adopting Cluster API as a unifying interface to set up and scale Kubernetes and Slurm clusters while ensuring dedicated access to assigned computing resources, thereby reducing the risk of contention.
Document type
Degree thesis
(Master's degree)
Thesis author
Tagliani, Michele
Thesis supervisor
School
Degree programme
Track
Computer Engineering curriculum
Degree programme regulations
DM270
Keywords
HPC, Slurm, Kubernetes, Cluster API, Metal3, infrastructure, Converged Computing
Thesis defence date
4 December 2025
URI