Buzzoni, Mattia
(2026)
Graph-of-Mark: Graph-Based Visual Prompting for Enhanced Spatial Reasoning in Multimodal Language Models.
[Master's degree thesis (Laurea magistrale)], Università di Bologna, Degree programme in
Artificial Intelligence [LM-DM270], restricted-access document.
Abstract
In recent years, the widespread adoption of Multimodal Language Models (MLMs) has driven growing interest in integrating visual information into language models. Despite notable progress, current models still face significant limitations in understanding and performing spatial reasoning over complex visual scenes. Several visual prompting approaches have been proposed to mitigate these challenges: Set-of-Mark, for example, partitions an image into multiple regions and assigns numerical identifiers before passing the modified input to the MLM. However, such methods often neglect the spatial relationships between objects, restricting the model's ability to capture the global context of the scene.

This thesis presents Graph-of-Mark, a hybrid framework that combines traditional computer vision techniques with multimodal language models to construct enriched scene graphs for structured visual representation. The proposed method integrates multiple object detectors (OWLv2, YOLOv8, Detectron2), an automatic segmentation model (SAM), and modules for extracting spatial and semantic relations. The resulting scene graphs capture both the detected objects and their interconnections, providing the MLM with the original image augmented with visual annotations (bounding boxes, segmentation masks, identifiers) as well as structured textual descriptions that explicitly encode spatial and semantic relationships.

Experiments were conducted on widely used datasets such as RefCOCOg, GQA, VQAv1, and VQAv2, employing state-of-the-art models including Gemma-3, Qwen2.5-VL, Qwen3-VL, and LlamaV-o1. The evaluation focused on assessing spatial reasoning and relational understanding. Results demonstrate that integrating structural information through scene graphs, in combination with enriched visual and textual prompts, leads to significant improvements in answer accuracy, particularly on tasks requiring complex spatial reasoning.
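The pipeline described above (detections → pairwise spatial relations → scene graph → textual prompt) can be sketched in a few lines. This is a minimal illustrative example, not the thesis's actual implementation: detections would normally come from OWLv2, YOLOv8, or Detectron2, but here they are hard-coded, and the coarse center-based relation rules and the serialization format are assumptions made for illustration.

```python
# Sketch of scene-graph construction from object detections.
# Each detection is (label, (x1, y1, x2, y2)); relations are derived
# from box centers with simple, illustrative geometric rules.

def center(box):
    """Center (x, y) of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def spatial_relation(box_a, box_b):
    """Coarse spatial relation of box_a with respect to box_b."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):                     # mostly horizontal offset
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"      # image y grows downward

def build_scene_graph(detections):
    """Nodes = numbered objects; edges = pairwise spatial relations."""
    nodes = [(i, label) for i, (label, _) in enumerate(detections, 1)]
    edges = []
    for i, (_, ba) in enumerate(detections):
        for j, (_, bb) in enumerate(detections):
            if i != j:
                edges.append((i + 1, spatial_relation(ba, bb), j + 1))
    return nodes, edges

def to_text(nodes, edges):
    """Serialize the graph as a structured textual prompt for the MLM."""
    lines = [f"[{i}] {label}" for i, label in nodes]
    lines += [f"[{a}] is {rel} [{b}]" for a, rel, b in edges]
    return "\n".join(lines)

detections = [("dog", (10, 120, 90, 200)), ("ball", (150, 160, 180, 190))]
nodes, edges = build_scene_graph(detections)
print(to_text(nodes, edges))
# → [1] dog
#   [2] ball
#   [1] is left of [2]
#   [2] is right of [1]
```

In the full framework, this textual description accompanies the annotated image (boxes, masks, numeric identifiers), so the MLM receives the scene's structure both visually and verbally.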
Document type
Thesis (Laurea magistrale)
Author
Buzzoni, Mattia
Degree programme regulation (CdS)
DM270
Keywords
Visual Prompting, Spatial Reasoning, Scene Graphs, Multimodal Language Models, Natural Language Processing
Thesis defense date
6 February 2026