Evaluating Large Language Models for Dimensional Fact Model Design with Automated Pipelines

Rubboli, Luca (2025) Evaluating Large Language Models for Dimensional Fact Model Design with Automated Pipelines. [Master's degree thesis], Università di Bologna, Degree programme in Ingegneria e scienze informatiche [LM-DM270] - Cesena

Full-text documents available:

PDF document (Thesis)
Available under license: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
Download (2MB)

Abstract

This work investigates the use of large language models (LLMs) for the conceptual design of multidimensional data warehouses, comparing supply-driven and demand-driven approaches. In the supply-driven approach, Dimensional Fact Model schemata are generated from source relational schemas, whereas in the demand-driven approach, schemata are generated from textual end-user requirements. Multiple LLMs are evaluated, including GPT, LLaMA, Falcon and Mistral, using automated pipelines for YAML-based schema extraction, metrics computation and visualization. Evaluation metrics include node- and edge-level precision, recall and F1-score, as well as custom error metrics reflecting domain-specific schema errors. Experiments are run in CPU and GPU environments, with automated scripts ensuring reproducibility and consistent execution across multiple runs. Results show that prompt engineering significantly improves model performance: for supply-driven design, average F1-scores nearly double, while for demand-driven design, careful prompt design increases scores by up to 20%. GPT-5 demonstrates slight improvements over GPT-4, particularly in capturing relational dependencies. The study also highlights practical limitations, including memory constraints with larger import models, variability in execution times and the need for manual post-processing rules. Future work includes expanding the exercise dataset, developing automated alignment strategies, exploring interactive multi-turn schema design and experimenting with fine-tuning large import models to enhance both accuracy and efficiency. These results provide a systematic foundation for leveraging LLMs in automated data warehouse conceptual design, balancing effectiveness and computational resources.
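As an illustration of the node- and edge-level metrics mentioned in the abstract, the following minimal Python sketch computes set-based precision, recall and F1-score for a generated schema against a reference one. It is not the thesis's pipeline: the schema layout (a dictionary with "nodes" and "edges" keys, as might be obtained after parsing a YAML file), the function names prf1 and score_schema, and the toy SALE schema are all assumptions made for this example, and exact-match comparison of names is assumed where the thesis may apply alignment or post-processing rules.

```python
from typing import Hashable, Iterable


def prf1(predicted: Iterable[Hashable], reference: Iterable[Hashable]) -> dict:
    """Set-based precision, recall and F1 of predicted items against a reference."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)  # items present in both sets (exact match)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def score_schema(generated: dict, reference: dict) -> dict:
    """Score nodes and edges separately (hypothetical schema layout)."""
    return {
        "nodes": prf1(generated["nodes"], reference["nodes"]),
        # Edges are (source, target) pairs; tuple() makes them hashable.
        "edges": prf1(map(tuple, generated["edges"]),
                      map(tuple, reference["edges"])),
    }


# Toy example (invented data): a reference schema and a model output that
# misses the "category" attribute and adds a spurious "store" dimension.
reference = {
    "nodes": ["SALE", "quantity", "product", "category", "date"],
    "edges": [("SALE", "quantity"), ("SALE", "product"),
              ("product", "category"), ("SALE", "date")],
}
generated = {
    "nodes": ["SALE", "quantity", "product", "date", "store"],
    "edges": [("SALE", "quantity"), ("SALE", "product"),
              ("SALE", "date"), ("SALE", "store")],
}

scores = score_schema(generated, reference)
print(scores)
```

In this toy case four of the five generated nodes and three of the four generated edges match the reference, so the node-level scores are 0.8 and the edge-level scores are 0.75. Scoring nodes and edges separately makes it possible to tell a model that finds the right concepts but links them incorrectly from one that misses the concepts altogether.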

Document type

Degree thesis (Master's degree)

Thesis author

Rubboli, Luca

Thesis supervisor

Gallinucci, Enrico

School

Ingegneria e Architettura

Degree programme

Ingegneria e scienze informatiche [LM-DM270] - Cesena

Degree programme regulations