Optimizing Small Language Models: An Experimental Investigation in Compressing Distilled LLaMA Architectures

Galavotti, Angelo (2025) Optimizing Small Language Models: An Experimental Investigation in Compressing Distilled LLaMA Architectures. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence [LM-DM270], restricted-access document.
Full-text documents available:
PDF document (Thesis)
Full text accessible only to institutional users of the University
Available under licence: unless the author has granted broader permissions, the thesis may be freely consulted, and a copy may be saved and printed strictly for personal study, research, and teaching purposes; any direct or indirect commercial use is expressly forbidden. All other rights to the material are reserved.

Download (3MB) | Contact the author

Abstract

Large Language Models have transformed our interaction with technology, yet their massive scale introduces major drawbacks. Their substantial energy consumption and water usage create environmental pressures, while cloud-based architectures limit user autonomy and enable pervasive data collection. Their resource demands also pose a distinct set of challenges for edge deployment, where memory and compute constraints make standard LLMs impractical. This thesis presents a comprehensive compression pipeline that systematically applies multiple optimization strategies to the compact LLaMA 3.2 1B Instruct model: depth pruning removes redundant transformer layers based on importance metrics, width pruning applies structured sparsity to the attention matrices using the WANDA algorithm, Low-Rank Adaptation (LoRA) recovers performance while enabling task-specific fine-tuning, 4-bit quantization with GPTQ further reduces the memory footprint, and Eigenspace Low-Rank Approximation provides training-free accuracy recovery. The experimental evaluation demonstrates that combining these techniques can substantially reduce the memory footprint while preserving core language functionality, and that LoRA adaptation is particularly effective at recovering capabilities lost during aggressive pruning. However, compression beyond certain thresholds leads to rapid performance deterioration that recovery procedures cannot fully mitigate, revealing fundamental architectural limits in transformer compression.
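The record contains only the abstract, not the thesis code, so the following is a minimal PyTorch sketch of the width-pruning step described above: a WANDA-style importance score (weight magnitude scaled by the L2 norm of the corresponding input activation over a calibration set) combined with a generic N:M structured mask. The function names, the 2:4 sparsity pattern, and the toy dimensions are illustrative assumptions, not the author's implementation.

import torch

def wanda_scores(weight: torch.Tensor, calib_inputs: torch.Tensor) -> torch.Tensor:
    # weight: (out_features, in_features); calib_inputs: (num_tokens, in_features).
    # WANDA importance of W[i, j] = |W[i, j]| * ||X[:, j]||_2 over the calibration tokens.
    act_norm = calib_inputs.float().norm(p=2, dim=0)      # (in_features,)
    return weight.abs() * act_norm.unsqueeze(0)           # (out_features, in_features)

def apply_nm_sparsity(weight: torch.Tensor, scores: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    # Structured N:M sparsity: within every group of m consecutive input weights,
    # keep the n highest-scoring entries and zero out the rest.
    out_f, in_f = weight.shape
    assert in_f % m == 0, "in_features must be divisible by the group size m"
    grouped = scores.view(out_f, in_f // m, m)
    prune_idx = grouped.topk(m - n, dim=-1, largest=False).indices
    mask = torch.ones_like(grouped)
    mask.scatter_(-1, prune_idx, 0.0)
    return weight * mask.view(out_f, in_f)

if __name__ == "__main__":
    # Toy stand-in for one attention projection and its calibration activations.
    torch.manual_seed(0)
    w = torch.randn(8, 16)       # small q_proj-like weight matrix (hypothetical sizes)
    x = torch.randn(128, 16)     # hidden states collected from a calibration set
    pruned = apply_nm_sparsity(w, wanda_scores(w, x))
    print(f"sparsity: {(pruned == 0).float().mean().item():.2f}")  # ~0.50 for a 2:4 pattern

In the pipeline the abstract describes, a mask of this kind would be applied to the attention projection matrices of LLaMA 3.2 1B Instruct before the LoRA-based recovery, GPTQ quantization, and Eigenspace Low-Rank Approximation compensation steps.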

Document type
Degree thesis (Laurea magistrale, Master's degree)
Thesis author
Galavotti, Angelo
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Artificial intelligence [LM-DM270]
Degree programme regulations
DM270
Keywords
Large Language Model, LLM, pruning, optimization, compression, LoRA, LLaMA, PyTorch, Python, quantization, Small Language Model, SLM, edge devices
Thesis defence date
22 July 2025
URI

Other metadata

Download statistics

