Conditional flow matching for synthetic medical data generation

Bonazzi, Matteo (2025) Conditional flow matching for synthetic medical data generation. [Master's thesis], Università di Bologna, Degree programme in Physics [LM-DM270], restricted-access document.
Full-text documents available:
PDF document (Thesis)
Full text not accessible until 29 April 2026.
Available under license: Creative Commons: Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0)

Abstract

Synthetic data plays an increasingly important role in machine learning, particularly in the medical field, where it helps address data scarcity, class imbalance and privacy constraints; its use for data augmentation and class balancing enables the development of more robust models for downstream tasks. The main goal of this thesis is to explore Conditional Flow Matching (CFM) as a training paradigm for generating two types of data: 3D CT scans of patients with Non-Small Cell Lung Cancer (NSCLC-Radiomics dataset) and tabular hematological biomarkers from Alzheimer’s patients (ADNI dataset). For tabular synthesis, an XGBoost-based conditional flow model was trained to map Gaussian noise to the biomarker distributions. Two model variants were developed: a base model generating all the features in ADNI, and a biomarkers-only model restricted to the features used by the PhenoAge algorithm for biological age estimation. Privacy and quality were assessed with the Syndat framework; the biomarkers-only model outperformed the base model on distribution similarity, discrimination and correlation scores, while privacy-risk metrics remained comparable. Clinical validation was performed by computing biological age with PhenoAge: synthetic data from both models reproduced the expected increase in biological-age advance across diagnosis groups. For CT synthesis, a 3D U-Net was trained to approximate the conditional vector field mapping noise to the target distribution, and CT scans were generated by numerically solving the resulting neural ordinary differential equation (NODE). The generated volumes include paired semantic masks and were evaluated with MS-SSIM = 0.54 ± 0.09, PSNR = 12.0 ± 1.2 dB, and FID = 0.009. The MS-SSIM and PSNR values match those obtained when comparing training and test data, suggesting that they reflect natural variability in the dataset. These results show that Conditional Flow Matching is a promising approach for producing high-quality synthetic medical data across multiple modalities.
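
As an illustration of the training paradigm described in the abstract, the following is a minimal sketch of flow matching on a toy problem, not the thesis implementation. It assumes the straight-line interpolation path between noise and data that is common in the flow matching literature (the thesis may use a different path), a hypothetical 1-D Gaussian target in place of real biomarkers or CT volumes, and a linear least-squares regressor standing in for the XGBoost and 3D U-Net models. Fitting one regressor per time level and integrating with fixed-step Euler are choices made for this sketch; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "data" distribution (hypothetical values, not thesis data).
x1 = rng.normal(loc=3.0, scale=0.5, size=20_000)
# Source samples: standard Gaussian noise, paired at random with the data.
x0 = rng.normal(size=x1.shape)

n_steps = 50
times = np.arange(n_steps) / n_steps  # t = 0, 1/n, ..., (n - 1)/n

# Training: at each time level, regress the conditional target velocity
# u = x1 - x0 on the interpolated point x_t = (1 - t) * x0 + t * x1.
# The least-squares fit approximates the marginal vector field at that t.
coefs = []
for t in times:
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    design = np.stack([xt, np.ones_like(xt)], axis=1)  # columns: [x_t, 1]
    w, *_ = np.linalg.lstsq(design, u, rcond=None)
    coefs.append(w)


def sample(n, sample_rng):
    """Draw noise and integrate dx/dt = v(x, t) from t = 0 to 1 with Euler steps."""
    x = sample_rng.normal(size=n)
    dt = 1.0 / n_steps
    for slope, intercept in coefs:
        x = x + dt * (slope * x + intercept)
    return x


synthetic = sample(20_000, np.random.default_rng(1))
print(synthetic.mean(), synthetic.std())
```

In this toy setting the mean and standard deviation of `synthetic` should land close to those of the target samples `x1`, with a small residual error from the Euler discretization. A higher-order ODE solver or more time levels would reduce that error at extra cost.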

Document type
Degree thesis (Master's degree)
Thesis author
Bonazzi, Matteo
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
Applied Physics
Degree programme regulations
DM270
Keywords
Conditional Flow Matching, Synthetic data generation, Synthetic data, CFM, Lung cancer, Tabular data generation, CT generation
Thesis defence date
29 October 2025
URI
