Graph-of-Mark: Graph-Based Visual Prompting for Enhanced Spatial Reasoning in Multimodal Language Models

Buzzoni, Mattia (2026) Graph-of-Mark: Graph-Based Visual Prompting for Enhanced Spatial Reasoning in Multimodal Language Models. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence [LM-DM270], Restricted-access document.
Full-text documents available:
PDF document (Thesis); full text not accessible until 31 January 2027.
Available under license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)


Abstract

In recent years, with the widespread adoption of Multimodal Language Models (MLMs), there has been growing interest in integrating visual information into language models. Despite notable progress, current models still struggle to understand and reason spatially over complex visual scenes. Several visual prompting approaches have been proposed to mitigate these challenges: Set-of-Mark, for example, partitions an image into regions and assigns each a numerical identifier before passing the modified input to the MLM. However, such methods often neglect the spatial relationships between objects, restricting the model's ability to capture the global context of the scene. This thesis presents Graph-of-Mark, a hybrid framework that combines traditional computer vision techniques with multimodal language models to construct enriched scene graphs for structured visual representation. The proposed method integrates multiple object detectors (OWLv2, YOLOv8, Detectron2), an automatic segmentation model (SAM), and modules for extracting spatial and semantic relations. The resulting scene graphs capture both the detected objects and their interconnections, providing the MLM with the original image augmented with visual annotations (bounding boxes, segmentation masks, identifiers) as well as structured textual descriptions that explicitly encode spatial and semantic relationships. Experiments were conducted on widely used datasets, including RefCOCOg, GQA, VQAv1, and VQAv2, with state-of-the-art models such as Gemma-3, Qwen2.5-VL, Qwen3-VL, and LlamaV-o1. The evaluation focused on spatial reasoning and relational understanding. Results show that integrating structural information through scene graphs, together with enriched visual and textual prompts, yields significant improvements in answer accuracy, particularly on tasks requiring complex spatial reasoning.
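As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds a toy textual scene-graph prompt from bounding-box detections. It is a minimal sketch under assumptions, not the thesis's implementation: the DetectedObject container, the center-based spatial_relation heuristic, and the prompt format are all hypothetical stand-ins for the detector outputs (OWLv2, YOLOv8, Detectron2), the SAM masks, and the relation-extraction modules named above.

from dataclasses import dataclass

@dataclass
class DetectedObject:
    # Hypothetical unified record for one detection (e.g. from OWLv2, YOLOv8, or Detectron2).
    mark_id: int  # numerical identifier drawn on the image, Set-of-Mark style
    label: str    # predicted class name
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

def spatial_relation(a: DetectedObject, b: DetectedObject) -> str:
    # Coarse relation between box centers; the thesis's relation modules are richer than this.
    ax, ay = (a.box[0] + a.box[2]) / 2, (a.box[1] + a.box[3]) / 2
    bx, by = (b.box[0] + b.box[2]) / 2, (b.box[1] + b.box[3]) / 2
    horizontal = "left of" if ax < bx else "right of"
    vertical = "above" if ay < by else "below"
    # Report the axis along which the centers are farther apart.
    return horizontal if abs(ax - bx) >= abs(ay - by) else vertical

def scene_graph_prompt(objects: list[DetectedObject]) -> str:
    # Render marked objects plus pairwise relations as a structured textual prompt.
    lines = [f"[{o.mark_id}] {o.label}" for o in objects]
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            lines.append(f"[{a.mark_id}] {a.label} is {spatial_relation(a, b)} [{b.mark_id}] {b.label}")
    return "\n".join(lines)

if __name__ == "__main__":
    detections = [DetectedObject(1, "dog", (40, 200, 180, 320)),
                  DetectedObject(2, "ball", (220, 260, 270, 310))]
    print(scene_graph_prompt(detections))
    # Prints:
    # [1] dog
    # [2] ball
    # [1] dog is left of [2] ball

In a full pipeline, the same graph would also drive the visual annotations (bounding boxes, masks, identifiers) overlaid on the image, so that the MLM receives the annotated image and the textual relations together.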

Document type: Degree thesis (Laurea magistrale)
Thesis author: Buzzoni, Mattia
Degree programme: Artificial intelligence [LM-DM270]
Degree programme regulations: DM270
Keywords: Visual Prompting, Spatial Reasoning, Scene Graphs, Multimodal Language Models, Natural Language Processing
Thesis defence date: 6 February 2026