Abstract
This work investigates the use of large language models for conceptual design of
multidimensional data warehouses, comparing supply-driven and demand-driven
approaches. In the supply-driven approach, Dimensional Fact Model schemata are
generated from source relational schemas, whereas in the demand-driven approach,
schemata are generated from textual end-user requirements. Multiple LLMs are
evaluated, including GPT, LLaMA, Falcon and Mistral, using automated pipelines
for YAML-based schema extraction, metrics computation and visualization.
Evaluation metrics include node- and edge-level precision, recall and F1-score,
as well as custom error metrics reflecting domain-specific schema errors.
Experiments are run on CPU and GPU environments, with automated scripts ensuring
reproducibility and consistent execution across multiple runs. Results show that
prompt engineering significantly improves model performance: for supply-driven
design, average F1-scores nearly double, while for demand-driven design, careful
prompt design increases scores by up to 20%. GPT-5 demonstrates slight
improvements over GPT-4, particularly in capturing relational dependencies. The
study also highlights practical limitations, including memory constraints with
larger import models, variability in execution times and the need for manual
post-processing rules. Future work includes expanding the exercise dataset,
developing automated alignment strategies, exploring interactive multi-turn
schema design and experimenting with fine-tuning large import models to enhance
both accuracy and efficiency. These results provide a systematic foundation for
leveraging LLMs in automated data warehouse conceptual design, balancing
effectiveness and computational resources.
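As an illustration of the node- and edge-level metrics mentioned in the abstract, the following is a minimal sketch and not the thesis's own pipeline. It assumes that a generated schema and a reference schema are each reduced to a set of node names and a set of directed edges, and that elements match only when they are exactly equal. The function name `prf1` and the toy SALES schema are hypothetical; the matching and alignment rules actually used in the thesis are not stated here.

```python
def prf1(predicted, gold):
    """Return (precision, recall, F1) for two collections of schema elements.

    Assumption: an element counts as correct only if it appears, exactly
    equal, in both the predicted set and the gold (reference) set.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference schema: a SALES fact with two short hierarchies.
gold_nodes = {"SALES", "date", "month", "product", "category"}
gold_edges = {
    ("SALES", "date"),
    ("date", "month"),
    ("SALES", "product"),
    ("product", "category"),
}

# Hypothetical model output: misses two attributes and adds a spurious one.
pred_nodes = {"SALES", "date", "product", "brand"}
pred_edges = {
    ("SALES", "date"),
    ("SALES", "product"),
    ("product", "brand"),
}

node_scores = prf1(pred_nodes, gold_nodes)
edge_scores = prf1(pred_edges, gold_edges)

print("node P/R/F1:", node_scores)
print("edge P/R/F1:", edge_scores)
```

In this toy case three of the four predicted nodes are correct, so node precision is 0.75 and node recall is 0.6; edge-level scores are computed in the same way over the edge sets.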
Document type
Master's thesis
(Laurea magistrale)
Thesis author
Rubboli, Luca
Thesis supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
LLMs, DFM, Conceptual Modeling, Business Intelligence, Supply driven, demand driven, prompt engineering
Thesis defense date
2 October 2025
URI