Tedeschini, Luca
(2026)
Investigating Mechanistic Interpretability in Large Language Models through Model Stitching Across Architectures.
[Laurea magistrale], Università di Bologna, Degree programme in Artificial intelligence [LM-DM270]
Full-text documents available:
Abstract
Mechanistic interpretability studies how Large Language Models reason by reverse engineering their internal representations. Because it is not feasible to produce a complete mathematical explanation of how LLMs operate, research focuses on testing smaller hypotheses that gradually build a broader understanding.
Previous work by Chen et al. demonstrated model stitching within the same model family, showing that an affine mapping can exist between the residual streams of language models. This mapping has practical applications such as transferring the weights of sparse autoencoders. This thesis extends that idea to inter-family stitching, where the models have different architectures and tokenizers.
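Such an affine mapping can be estimated, for example, by ordinary least squares on paired residual-stream activations. The sketch below is purely illustrative and is not the thesis's actual procedure; the variable names, shapes, and choice of a closed-form solver are assumptions.

```python
# Minimal sketch: fit an affine map between residual-stream activations of a
# source and a target model using least squares. Shapes and names are
# illustrative only; the thesis may use regularisation or a different solver.
import numpy as np

def fit_affine_map(X_src: np.ndarray, X_tgt: np.ndarray):
    """Solve min_{W,b} ||X_src @ W + b - X_tgt||^2 in closed form.

    X_src: (n_tokens, d_src) residual-stream activations of the source model
    X_tgt: (n_tokens, d_tgt) activations of the target model at the aligned layer
    """
    # Append a bias column so the affine map reduces to a single linear solve.
    X_aug = np.concatenate([X_src, np.ones((X_src.shape[0], 1))], axis=1)
    sol, *_ = np.linalg.lstsq(X_aug, X_tgt, rcond=None)
    W, b = sol[:-1], sol[-1]
    return W, b

def apply_affine_map(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Map a source-model activation into the target model's residual space.
    return x @ W + b
```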
Model stitching is used as an interpretability tool: two models are connected and the behavior of the resulting system is studied to analyze how they differ. The work focuses on open-weight models including Llama 3, Gemma 2, and Qwen 2.5, and explores the alignment of their latent spaces using affine transformations and sparse autoencoders. The methodology addresses vocabulary mismatches through dataset alignment and direct latent mapping.
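Conceptually, a stitched forward pass runs the lower layers of one model, translates the residual stream with the fitted affine map, and finishes the computation with the upper layers of the other model. The helpers below are hypothetical wrappers, not APIs from the thesis or any specific library.

```python
# Sketch of a stitched forward pass, assuming hypothetical callables that expose
# the lower layers of model A and the upper layers (plus unembedding) of model B.
import numpy as np

def stitched_forward(tokens_a, lower_layers_a, upper_layers_b, W, b):
    """tokens_a: input tokenised with model A's tokenizer.
    lower_layers_a: callable returning (seq_len, d_a) residual-stream activations.
    upper_layers_b: callable consuming (seq_len, d_b) activations, returning logits.
    W, b: affine map fitted between the two residual spaces.
    """
    h_a = lower_layers_a(tokens_a)   # residual stream of model A at the cut layer
    h_b = h_a @ W + b                # affine translation into model B's space
    return upper_layers_b(h_b)       # model B's remaining layers and unembedding
```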
Experiments on benchmarks such as HellaSwag and MMLU show that the hybrid models function with performance degradation comparable to intra-family stitching. In particular, the Llama 3 to Qwen 2.5 configuration shows strong semantic coherence. The research also identifies the Single Tokenizer phenomenon, in which one model's tokenizer can decode the outputs of another, without modifying either model's weights, while still producing meaningful results. This suggests that strong semantic alignment between models may allow them to operate within a foreign token space.
Overall, the findings support the hypothesis that different LLM architectures converge toward internal representations that can be mapped linearly, indicating the existence of a shared semantic space.
Document type
Degree thesis
(Laurea magistrale)
Thesis author
Tedeschini, Luca
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Large Language Models, Machine learning, Interpretability, Natural Language Processing
Thesis defence date
26 March 2026
URI