Evaluating Visual Grounding in Multimodal Large Language Models for Medical Question Answering

Balzani, Riccardo (2026) Evaluating Visual Grounding in Multimodal Large Language Models for Medical Question Answering. [Laurea], Università di Bologna, Corso di Studio in Ingegneria e scienze informatiche [L-DM270] - Cesena. Full text not available.
The full text is not available at the author's request. (Contact the author)

Abstract

Vision-Language Models (VLMs) for medical Visual Question Answering (VQA) are often limited by "black-box" reasoning and hallucinations. Moreover, standard string-matching metrics such as Exact Match fail to capture medical semantic nuance. This thesis introduces a modular Visual Grounding pipeline that localizes Regions of Interest (RoIs) before answer generation. The pipeline integrates biomedical Named Entity Recognition (NER), cross-modal attention (BiomedCLIP + gScoreCAM), and Intelligent Prompt Injection to supply spatial context to the VLM. Evaluated on the MIMIC-CXR-Ext and GEMeX datasets with the MedGemma and OctoMed models, the framework improved diagnostic accuracy by 21 percentage points and ROUGE-1 by 5.2×. An LLM-as-a-Judge evaluation further showed that 66.81% of responses were clinically correct, far exceeding what string-based metrics suggest. These results point toward more transparent and trustworthy AI-assisted diagnostics.
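The RoI-localization and prompt-injection steps summarized above can be sketched in a few lines. This is a minimal illustration, not the thesis's actual implementation: the heatmap is a toy array standing in for a gScoreCAM attention map over a chest X-ray, and the function names (`heatmap_to_roi`, `inject_spatial_context`) are hypothetical.

```python
import numpy as np

def heatmap_to_roi(heatmap, threshold=0.5):
    """Threshold a cross-modal attention heatmap (relative to its peak)
    and return the bounding box (x0, y0, x1, y1) of the activated region."""
    mask = heatmap >= threshold * heatmap.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def inject_spatial_context(question, entity, roi):
    """Prepend the localized RoI and the NER-extracted entity to the
    question, giving the VLM explicit spatial grounding."""
    x0, y0, x1, y1 = roi
    return (f"Focus on the region ({x0},{y0})-({x1},{y1}) "
            f"containing '{entity}'. {question}")

# Toy 8x8 heatmap with one hot spot, standing in for a real attention map.
hm = np.zeros((8, 8))
hm[2:5, 3:6] = 1.0
roi = heatmap_to_roi(hm)                     # (3, 2, 5, 4)
prompt = inject_spatial_context("Is there consolidation?", "opacity", roi)
```

In the real pipeline the entity would come from the biomedical NER stage and the heatmap from BiomedCLIP + gScoreCAM; only the thresholding and prompt-assembly logic is shown here.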

Document type
Thesis (Laurea, Bachelor's degree)
Thesis author
Balzani, Riccardo
Degree programme regulations (Ordinamento CdS)
DM270
Keywords
Visual Prompting, Medical Image Segmentation, Clinical AI, Multimodal Language Models, Natural Language Processing
Thesis defense date
13 March 2026