Arienti, Giacomo
(2025)
Controllable Question Answering through Learned Representations: A Sparse Autoencoder Approach to LLM Steering.
[Laurea], Università di Bologna, Corso di Studio in Ingegneria e scienze informatiche [L-DM270] - Cesena.
Full text not available by the author's choice (contact the author).
Abstract
Large Language Models (LLMs) are central to open-domain Question Answering (QA), but their lack of interpretability and controllability poses critical challenges. Steering methods such as prompting, fine-tuning, and representation engineering have been explored, while Sparse Autoencoders (SAEs) offer a way to uncover latent features that can be directly manipulated. However, SAE-based approaches still require manual work in feature selection and factor tuning, and often yield unreliable interventions. Our KL-divergence analysis shows that steering multiple features without coordination causes activations to diverge from the base model distribution, explaining the focus on single-feature interventions in prior work.
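The divergence effect described above can be illustrated with a minimal sketch (not the thesis's actual analysis): we compare the KL divergence of the next-token distribution after a single-feature perturbation versus an uncoordinated sum of several feature directions. The logits, directions, and scaling factor here are synthetic placeholders.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a logit vector.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two next-token distributions.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
vocab = 1000
base_logits = rng.normal(size=vocab)

# Hypothetical feature directions: each steered feature perturbs the
# logits; summing them without coordination compounds the shift.
directions = rng.normal(size=(5, vocab))

p_base = softmax(base_logits)
single = kl_divergence(p_base, softmax(base_logits + 2.0 * directions[0]))
multi = kl_divergence(p_base, softmax(base_logits + 2.0 * directions.sum(axis=0)))
assert multi > single  # uncoordinated multi-feature steering diverges more
```

The summed perturbation has a larger norm than any single direction, so the steered distribution drifts further from the base model — the instability that motivates the calibrated factors introduced next.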
We propose a training pipeline that automates feature selection and enables coordinated multi-feature steering. SAE features are annotated with concise descriptions via logit-lens analysis, and a similarity model aligns QA prompts with these descriptions for automatic retrieval. Calibrated intervention factors then balance contributions across features, avoiding instability and mitigating divergence. Reinforcement-guided optimization further balances fluency, factuality, and concept coverage while avoiding costly fine-tuning, enabling scalable and interpretable control over LLM behavior.
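The retrieval-and-calibration steps above can be sketched as follows. This is a toy illustration, not the thesis's pipeline: the feature descriptions and decoder directions are invented, and a bag-of-words embedding stands in for the learned similarity model.

```python
import numpy as np

# Hypothetical SAE feature descriptions (in the thesis, obtained via
# logit-lens annotation).
feature_descriptions = {
    0: "medical terminology and diseases",
    1: "geography and country names",
    2: "sports events and athletes",
}

def embed(text, vocab):
    # Toy bag-of-words embedding, L2-normalised; a stand-in for the
    # similarity model used in the actual pipeline.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

vocab = {w: i for i, w in enumerate(
    "medical terminology diseases geography country names "
    "sports events athletes which borders france".split())}

prompt = "Which country borders France"
p = embed(prompt, vocab)

# Automatic retrieval: score each feature description against the prompt.
scores = {fid: float(p @ embed(desc, vocab))
          for fid, desc in feature_descriptions.items()}
selected = [fid for fid, s in scores.items() if s > 0]

# Calibrated intervention factors: rescale each selected feature's decoder
# direction to a common norm so no single feature dominates the residual
# stream when several are steered together.
decoder_dirs = np.random.default_rng(1).normal(size=(3, 16))
target_norm = 1.0
steering = sum(target_norm * decoder_dirs[f] / np.linalg.norm(decoder_dirs[f])
               for f in selected)
```

Here only the geography feature matches the prompt, so it alone contributes to the steering vector; with several matches, the per-feature normalisation keeps their contributions balanced.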
Experiments show that the steered model achieves higher truthfulness, stronger concept alignment, and better containment of reference information compared to the baseline, while maintaining fluency with only minor reductions in lexical diversity. Qualitative analysis demonstrates that multi-feature steering corrects baseline failures by aligning generations with intended semantics, an effect rarely achieved by single-feature interventions. These results confirm that SAE-based representation steering, combined with automated selection and calibrated reinforcement, provides a principled path toward controllable QA.
Document type
Bachelor's thesis (Laurea)
Thesis author
Arienti, Giacomo
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Sparse Autoencoder, Question Answering, Controllable Text Generation, Large Language Model, Natural Language Processing
Thesis defence date
2 October 2025
URI