Tedeschini, Luca
(2026)
Investigating Mechanistic Interpretability in Large Language Models through Model Stitching Across Architectures.
[Laurea magistrale], Università di Bologna, Degree programme in Artificial intelligence [LM-DM270]
Full-text documents available:
Abstract
Mechanistic interpretability studies how Large Language Models reason by reverse engineering their internal representations. Because it is not feasible to produce a complete mathematical explanation of how LLMs operate, research focuses on testing smaller hypotheses that gradually build a broader understanding.
Previous work by Chen et al. demonstrated model stitching within the same model family, showing that an affine mapping can exist between the residual streams of language models. This mapping has practical applications such as transferring the weights of sparse autoencoders. This thesis extends that idea to inter-family stitching, where the models have different architectures and tokenizers.
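Such an affine mapping can be estimated, for example, by ordinary least squares on paired residual-stream activations. The sketch below is purely illustrative and is not the thesis's actual procedure; the variable names, shapes, and choice of a closed-form solver are assumptions.

```python
# Minimal sketch: fit an affine map between residual-stream activations of a
# source and a target model using least squares. Shapes and names are
# illustrative only; the thesis may use regularisation or a different solver.
import numpy as np

def fit_affine_map(X_src: np.ndarray, X_tgt: np.ndarray):
    """Solve min_{W,b} ||X_src @ W + b - X_tgt||^2 in closed form.

    X_src: (n_tokens, d_src) residual-stream activations of the source model
    X_tgt: (n_tokens, d_tgt) activations of the target model at the aligned layer
    """
    # Append a bias column so the affine map reduces to a single linear solve.
    X_aug = np.concatenate([X_src, np.ones((X_src.shape[0], 1))], axis=1)
    sol, *_ = np.linalg.lstsq(X_aug, X_tgt, rcond=None)
    W, b = sol[:-1], sol[-1]
    return W, b

def apply_affine_map(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Map a source-model activation into the target model's residual space.
    return x @ W + b
```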
Model stitching is used as an interpretability tool: two models are connected and the behavior of the resulting system is studied to analyze how they differ. The work focuses on open-weight models including Llama 3, Gemma 2, and Qwen 2.5, and explores the alignment of their latent spaces using affine transformations and sparse autoencoders. The methodology addresses vocabulary mismatches through dataset alignment and direct latent mapping.
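Conceptually, a stitched forward pass runs the lower layers of one model, translates the residual stream with the fitted affine map, and finishes the computation with the upper layers of the other model. The helpers below are hypothetical wrappers, not APIs from the thesis or any specific library.

```python
# Sketch of a stitched forward pass, assuming hypothetical callables that expose
# the lower layers of model A and the upper layers (plus unembedding) of model B.
import numpy as np

def stitched_forward(tokens_a, lower_layers_a, upper_layers_b, W, b):
    """tokens_a: input tokenised with model A's tokenizer.
    lower_layers_a: callable returning (seq_len, d_a) residual-stream activations.
    upper_layers_b: callable consuming (seq_len, d_b) activations, returning logits.
    W, b: affine map fitted between the two residual spaces.
    """
    h_a = lower_layers_a(tokens_a)   # residual stream of model A at the cut layer
    h_b = h_a @ W + b                # affine translation into model B's space
    return upper_layers_b(h_b)       # model B's remaining layers and unembedding
```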
Experiments on benchmarks such as HellaSwag and MMLU show that the hybrid models function with performance degradation comparable to intra-family stitching. In particular, the Llama 3 to Qwen 2.5 configuration shows strong semantic coherence. The research also identifies the Single Tokenizer phenomenon, in which one model's tokenizer can decode the outputs of another, without modifying either model's weights, while still producing meaningful results. This suggests that strong semantic alignment between models may allow them to operate within a foreign token space.
Overall, the findings support the hypothesis that different LLM architectures converge toward internal representations that can be mapped linearly, indicating the existence of a shared semantic space.
Document type
Degree thesis
(Laurea magistrale)
Thesis author
Tedeschini, Luca
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Large Language Models, Machine learning, Interpretability, Natural Language Processing
Thesis defence date
26 March 2026
URI