Conca, Edoardo
(2025)
Evaluation of Synthetic Data's Impact on Financial Predictive Models.
[Master's degree thesis], Università di Bologna, Degree Programme in
Artificial intelligence [LM-DM270], Full-text document not available
The full text is not available by the author's choice.
Abstract
Synthetic data generation has emerged as a promising solution to overcome data scarcity, privacy constraints, and class imbalance in high-stakes domains such as finance. This thesis investigates the use of generative models to create synthetic tabular data tailored for predictive tasks including credit scoring, fraud detection, and loan default classification. The study focuses on evaluating whether synthetic data can effectively substitute or complement real datasets in machine learning pipelines without compromising performance or regulatory compliance.
The methodology involves training and tuning state-of-the-art generative models across three experimental scenarios: synthetic-only training, minority class augmentation, and statistical-predictive benchmarking. Each scenario is assessed using a combination of traditional classification metrics (Precision, Recall, F1-score, ROC AUC, etc.) and synthetic-specific utility and similarity metrics (e.g., KSComplement, TVComplement, BinaryClassifierEfficacy).
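To make the similarity metrics concrete, the following is a minimal sketch of how KSComplement and TVComplement are commonly defined: one minus the two-sample Kolmogorov-Smirnov statistic for a numerical column, and one minus the total variation distance for a categorical column. The metric names match those used in the SDMetrics library, but this sketch re-implements the definitions in plain Python and is not the thesis's code; the function names and the toy columns are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """Return 1 - KS statistic for two numerical columns (1.0 = identical ECDFs)."""
    r, s = sorted(real), sorted(synthetic)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    points = sorted(set(r) | set(s))
    ks = max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in points
    )
    return 1.0 - ks


def tv_complement(real, synthetic):
    """Return 1 - total variation distance for two categorical columns."""
    freq_real, freq_syn = Counter(real), Counter(synthetic)
    categories = set(freq_real) | set(freq_syn)
    # Counter returns 0 for categories missing from one of the columns.
    tvd = 0.5 * sum(
        abs(freq_real[c] / len(real) - freq_syn[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


# Hypothetical toy columns, not data from the thesis.
real_amounts = [10.0, 20.0, 30.0, 40.0]
synthetic_amounts = [12.0, 18.0, 33.0, 41.0]
real_status = ["ok", "ok", "default", "default"]
synthetic_status = ["ok", "ok", "ok", "default"]

ks_score = ks_complement(real_amounts, synthetic_amounts)
tv_score = tv_complement(real_status, synthetic_status)
```

Both scores lie in [0, 1], with higher values indicating that the synthetic column's marginal distribution is closer to the real one.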
Results show that, under carefully optimized conditions, synthetic data can achieve predictive performance comparable to that of real data while enhancing fairness and protecting privacy. Augmenting real datasets with targeted synthetic samples proves especially effective in mitigating class imbalance. However, the quality of the synthetic data is highly sensitive to the structure and pre-processing of the original dataset.
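The augmentation idea can be illustrated with a short sketch: synthetic minority-class rows are appended to the real training set until the two classes are balanced or the synthetic pool is exhausted. This assumes a binary label and rows stored as dictionaries; the function name, the "label" key, and the balancing rule are hypothetical choices for illustration and are not taken from the thesis.

```python
def augment_minority(real_rows, synthetic_rows, label_key="label", minority_label=1):
    """Append synthetic minority rows to real rows until classes are balanced.

    Assumes a binary label. Only synthetic rows carrying the minority label
    are used, and no more than needed to match the majority count.
    """
    n_minority = sum(1 for row in real_rows if row[label_key] == minority_label)
    n_majority = len(real_rows) - n_minority
    needed = max(0, n_majority - n_minority)
    candidates = [row for row in synthetic_rows if row[label_key] == minority_label]
    return list(real_rows) + candidates[:needed]


# Hypothetical toy data: 4 majority rows, 1 minority row, 5 synthetic minority rows.
real = [{"amount": a, "label": 0} for a in (10, 20, 30, 40)] + [{"amount": 99, "label": 1}]
synthetic = [{"amount": a, "label": 1} for a in (95, 97, 101, 103, 105)]

augmented = augment_minority(real, synthetic)
```

In this toy case three synthetic rows are appended, giving four rows per class; evaluation would still be carried out on held-out real data only.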
The thesis concludes that while synthetic data is not a universal substitute, it is a valuable and increasingly mature tool for enriching real-world financial datasets, enabling responsible and scalable machine learning applications in regulated environments.
Document type
Degree thesis
(Master's degree)
Thesis author
Conca, Edoardo
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
synthetic data, GANs, TVAE
Thesis defence date
22 July 2025
URI