Abstract
Traditional information retrieval methods have relied on similarity matching, where representations generated by encoder-only models are used to rank and retrieve relevant documents.
Recent advances in pre-trained language models have enabled a new paradigm: generative information retrieval.
This approach leverages the generative capabilities of large language models (LLMs) to produce relevant document identifiers directly.
However, generative techniques remain unexplored in medicine.
Within this domain, physicians often need to quickly find relevant radiographic images and reports of previous cases to assist in diagnosis and treatment.
For these reasons, we present GENerative Information Retrieval of Chest X-Rays (GenIrCxr).
We assign a numerical identifier to each report by applying hierarchical k-means on top of PubMedBERT semantics-aware representations.
Then, we train a decoder-only LLM from scratch to generate report identifiers in response to queries from a medical expert.
We also developed optimization techniques to improve model performance, including custom output vocabularies and constrained beam search at inference time.
We train GenIrCxr using a custom dataset built on top of MIMIC-CXR-JPG v2.0.0, where the input query describes medical concepts of interest and the output is the semantic identifier of the target report computed offline.
Retrieval performance is measured using Recall@K and Mean Reciprocal Rank as automatic metrics.
In comparative experiments, GenIrCxr surpasses several encoder-based baselines and also outperforms a sequence-to-sequence model designed specifically for generative information retrieval, accepted at NeurIPS 2022.
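As an illustration of the two automatic metrics named above, the following is a minimal sketch of Recall@K and Mean Reciprocal Rank, assuming each query has exactly one target report (the abstract speaks of "the target report"). The function names and the identifier strings such as "3-1-4" are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence, Tuple

# One evaluation item: (ranked list of generated identifiers, target identifier).
# Assumption: a single relevant report per query.
Run = Tuple[Sequence[str], str]


def recall_at_k(ranked_ids: Sequence[str], target_id: str, k: int) -> float:
    """Return 1.0 if the target appears among the first k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], target_id: str) -> float:
    """Return 1/rank of the target (1-based), or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0


def mean_recall_at_k(runs: List[Run], k: int) -> float:
    """Average Recall@K over all queries."""
    return sum(recall_at_k(ranked, target, k) for ranked, target in runs) / len(runs)


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(ranked, target) for ranked, target in runs) / len(runs)


# Hypothetical identifiers, for illustration only.
runs: List[Run] = [
    (["3-1-4", "3-1-2", "0-2-2"], "3-1-4"),  # target at rank 1
    (["1-0-0", "2-2-1", "2-2-0"], "2-2-1"),  # target at rank 2
    (["0-0-1", "0-0-2", "0-1-0"], "4-4-4"),  # target not retrieved
]

recall_1 = mean_recall_at_k(runs, 1)
recall_3 = mean_recall_at_k(runs, 3)
mrr = mean_reciprocal_rank(runs)
```

With a generative retriever, the ranked list for each query would typically be the beam search outputs ordered by score; that mapping is an assumption of this sketch rather than a detail stated in the abstract.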
Document type
Degree thesis
(Laurea)
Thesis author
Mazzi, Riccardo
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulation
DM270
Keywords
Generative Information Retrieval, Large Language Model From Scratch, Medical Report Embedding, Hierarchical Clustering, Chest X-Ray
Thesis defence date
28 November 2024
URI