Subtopic-oriented biomedical summarization using pretrained language models

Xia, Tian Cheng (2023) Subtopic-oriented biomedical summarization using pretrained language models. [Bachelor's thesis (Laurea)], Università di Bologna, degree programme in Informatica [L-DM270]
Full-text documents available:
PDF document (Thesis), 672 kB
Available under license: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

Abstract

The ever-growing number of publications in the biomedical field makes it increasingly difficult to find insightful knowledge. In this work, we propose a subtopic-oriented summarization framework that aims to provide an overview of the state of the art on a given subject. The proposed method clusters the papers retrieved by a query and then, for each cluster, extracts the subtopics and summarizes the abstracts. We conducted various experiments to select the most appropriate clustering approach and concluded that the best choices are MiniLM for text embedding, UMAP for dimensionality reduction, and OPTICS as the clustering algorithm. For summarization, we fine-tuned both general-domain and biomedical pretrained language models on the task of extractive summarization and selected Longformer as the most suitable model. Experimental results on multi-document summarization datasets show that the proposed framework improves the overall recall of the generated summary at the cost of a small decrease in precision, which corresponds to summaries that are slightly longer but closer to the ground truth.
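
The trade-off described in the last sentence of the abstract can be illustrated with a small worked example. This is a minimal sketch, assuming unigram-overlap precision and recall in the style of ROUGE-1; the abstract does not name the metric used in the thesis. The function name `unigram_overlap_scores` and the toy sentences are hypothetical and are not taken from the thesis.

```python
from __future__ import annotations

from collections import Counter


def unigram_overlap_scores(candidate: str, reference: str) -> tuple[float, float]:
    """Return (precision, recall) of candidate unigrams against the reference.

    Assumption: whitespace tokenization and lowercase matching, which is a
    simplification of what a real ROUGE implementation does.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Toy data invented for illustration only (not from the thesis).
reference = "statins lower cholesterol and reduce cardiovascular risk"
short_summary = "statins lower cholesterol"
long_summary = "statins lower cholesterol and reduce overall cardiovascular risk"

short_p, short_r = unigram_overlap_scores(short_summary, reference)
long_p, long_r = unigram_overlap_scores(long_summary, reference)

print(f"short: precision={short_p:.3f} recall={short_r:.3f}")  # 1.000, 0.429
print(f"long:  precision={long_p:.3f} recall={long_r:.3f}")   # 0.875, 1.000
```

In this toy case the longer summary covers every reference word, so recall rises from 3/7 to 1, while a single extra word lowers precision from 1 to 7/8. This is the same pattern the abstract reports: slightly longer summaries that sit closer to the ground truth.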

Document type
Bachelor's thesis (Laurea)
Thesis author
Xia, Tian Cheng
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
extractive summarization, biomedical summarization, large language models, pretrained language models, text clustering
Thesis defence date
11 October 2023
URI
