Evaluating Visual Grounding in Multimodal Large Language Models for Medical Question Answering

Balzani, Riccardo (2026) Evaluating Visual Grounding in Multimodal Large Language Models for Medical Question Answering. [Laurea], Università di Bologna, Corso di Studio in Ingegneria e scienze informatiche [L-DM270] - Cesena. Full text not available.
The full text is not available at the author's request. (Contact the author)

Abstract

Vision-Language Models (VLMs) for medical Visual Question Answering (VQA) are often limited by "black-box" reasoning and hallucinations. Moreover, standard string-matching metrics such as Exact Match fail to capture medical semantic nuance. This thesis introduces a modular Visual Grounding pipeline that localizes Regions of Interest (RoIs) before answer generation. The pipeline integrates biomedical Named Entity Recognition (NER), cross-modal attention (BiomedCLIP + gScoreCAM), and Intelligent Prompt Injection to supply spatial context to the VLM. Evaluated on the MIMIC-CXR-Ext and GEMeX datasets with the MedGemma and OctoMed models, the framework improved diagnostic accuracy by 21 percentage points and ROUGE-1 by 5.2×. An LLM-as-a-Judge evaluation further showed that 66.81% of responses were clinically correct, far exceeding what string-based metrics suggest. These results point toward more transparent and trustworthy AI-assisted diagnostics.
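The RoI-localization and prompt-injection steps summarized above can be sketched in a few lines. This is a minimal illustration, not the thesis's actual implementation: the heatmap is a toy array standing in for a gScoreCAM attention map over a chest X-ray, and the function names (`heatmap_to_roi`, `inject_spatial_context`) are hypothetical.

```python
import numpy as np

def heatmap_to_roi(heatmap, threshold=0.5):
    """Threshold a cross-modal attention heatmap (relative to its peak)
    and return the bounding box (x0, y0, x1, y1) of the activated region."""
    mask = heatmap >= threshold * heatmap.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def inject_spatial_context(question, entity, roi):
    """Prepend the localized RoI and the NER-extracted entity to the
    question, giving the VLM explicit spatial grounding."""
    x0, y0, x1, y1 = roi
    return (f"Focus on the region ({x0},{y0})-({x1},{y1}) "
            f"containing '{entity}'. {question}")

# Toy 8x8 heatmap with one hot spot, standing in for a real attention map.
hm = np.zeros((8, 8))
hm[2:5, 3:6] = 1.0
roi = heatmap_to_roi(hm)                     # (3, 2, 5, 4)
prompt = inject_spatial_context("Is there consolidation?", "opacity", roi)
```

In the real pipeline the entity would come from the biomedical NER stage and the heatmap from BiomedCLIP + gScoreCAM; only the thresholding and prompt-assembly logic is shown here.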

Document type
Thesis (Laurea, Bachelor's degree)
Thesis author
Balzani, Riccardo
Degree programme regulations (Ordinamento CdS)
DM270
Keywords
Visual Prompting, Medical Image Segmentation, Clinical AI, Multimodal Language Models, Natural Language Processing
Thesis defense date
13 March 2026