Workload-Aware Autoscaling Support to LLM Inference in Cloud-Native Environments

Sgreccia, Tommaso (2026) Workload-Aware Autoscaling Support to LLM Inference in Cloud-Native Environments. [Master's degree thesis], Università di Bologna, Degree Programme in Computer Engineering [LM-DM270], restricted-access document.

Available full-text documents:

PDF document (Thesis)
Full text not accessible until 1 October 2026.
Available under license: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal purposes of study, research, and teaching; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.

Abstract

The growing adoption of Large Language Models (LLMs) in production environments and the emerging Model-as-a-Service paradigm pose new challenges for cloud infrastructures. Unlike traditional web services, generative inference workloads exhibit highly heterogeneous request profiles, a two-phase execution model with distinct computational and memory demands, and resource constraints that general-purpose autoscaling mechanisms are not designed to handle. This thesis investigates workload-aware autoscaling strategies for LLM inference in cloud-native environments. After reviewing the foundations of cloud computing, modern orchestration platforms, and the current state of the art in large-scale model serving, we present the design and implementation of a custom autoscaling controller that integrates two complementary scaling approaches: a reactive mode based on real-time resource saturation indicators, and a proactive mode that uses a queueing-theory model to guarantee defined service level objectives. An experimental evaluation conducted on production-grade clusters confirms that the proposed strategies improve both resource efficiency and latency compliance compared to standard cloud-native autoscaling technologies.

Document type

Degree thesis (Master's degree)

Thesis author

Sgreccia, Tommaso

Thesis supervisor

Corradi, Antonio

Thesis co-supervisor

Sabbioni, Andrea

School

Engineering and Architecture

Degree programme