Judging the Judges: A Case Study on LLM-as-a-Judge for Retrieval-Augmented Generation

Adragna, Giorgio (2025) Judging the Judges: A Case Study on LLM-as-a-Judge for Retrieval-Augmented Generation. [Master's degree thesis], Università di Bologna, Degree Programme in Artificial intelligence [LM-DM270]. Full-text document not available.
The full text is not available by the author's choice.

Abstract

Large Language Models are increasingly used to evaluate other LLMs (“LLM-as-a-Judge”), yet concerns remain about bias, stability, and transparency. This thesis introduces a systematic, task-agnostic LLM-as-a-Judge pipeline and validates it on a document-grounded Retrieval-Augmented Generation use case in an enterprise setting. The pipeline is evidence-first and evaluate-then-score: judges decompose questions, cite snippets from provided passages, give short rationales, and then assign categorical/ordinal labels for Truthfulness, Relevance, Completeness, Conciseness, plus a lexicographic Overall. We implement single and pairwise protocols, fix decoding to reduce variance, and audit three biases (position, tie, verbosity). We curate two bilingual (EN/IT) validation sets (single and pairwise). Six state-of-the-art judges from multiple vendors are evaluated using quadratically weighted Cohen’s κ, with rank correlations and macro-F1 as supporting checks. The proposed methodology, datasets, and analysis provide a reproducible blueprint for deploying LLM-as-a-Judge in production RAG evaluation and a foundation for future work on multi-annotator calibration, dataset scaling, and broader domain/language coverage.
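The abstract names quadratically weighted Cohen's κ as the primary agreement metric between judges and reference labels. The thesis's own implementation is not available in this record, so the following is only a minimal sketch of the standard definition, assuming ordinal labels encoded as integers from 0 to k-1. The function name and the example labels are hypothetical and chosen for illustration.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Cohen's kappa with quadratic weights for ordinal labels 0..num_categories-1.

    Illustrative sketch only: not taken from the thesis.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    if num_categories < 2:
        raise ValueError("need at least two categories")

    n = len(rater_a)
    k = num_categories

    # Observed joint counts: observed[i][j] = items labelled i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal counts per rater, used for the chance-agreement term.
    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: 0 on the diagonal, 1 for the farthest pair.
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0.0:
        # Both raters used one identical category: kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical example: a human annotator versus one LLM judge on a 3-point scale.
human = [0, 1, 2, 1]
judge = [0, 1, 2, 1]
print(quadratic_weighted_kappa(human, judge, num_categories=3))
```

Because the weight grows with the squared distance between labels, a judge that misses by one ordinal step is penalised less than one that confuses the two extremes, which is why this variant suits ordinal rubrics better than unweighted κ.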

Document type
Degree thesis (Master's degree)
Thesis author
Adragna, Giorgio
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
LLM-as-a-Judge, LLM, Judge, Natural Language Processing, NLP, GenAI, Generative AI
Thesis defence date
7 October 2025