From Words to Codes: Large Language Models for ICD-9 Extraction in Clinical Documents

Addimando, Salvatore Antonio (2023) From Words to Codes: Large Language Models for ICD-9 Extraction in Clinical Documents. [Laurea], Università di Bologna, Degree programme in Ingegneria e scienze informatiche [L-DM270] - Cesena, restricted-access document.
Available full-text documents:
PDF document (Thesis)
Full text not accessible until 31 December 2024.
Available under licence: Creative Commons: Attribution - NonCommercial - NoDerivatives 4.0 (CC BY-NC-ND 4.0)

Abstract

In real-world named entity recognition and classification (NERC), ICD-9 codes are a valuable resource. Annotation scarcity and the need to generalize to unseen entity types are major obstacles in the field, and ICD-9 codes offer a way to address them: they provide a standardized system for categorizing medical diagnoses and procedures, and thus a well-established framework for classifying entities in the healthcare domain. By incorporating ICD-9 codes, NERC systems can draw on prior, domain-specific knowledge, which improves their accuracy and efficiency in identifying and classifying medical entities, ultimately benefiting both healthcare providers and patients. Although large language models (LLMs) hold great potential, their computational cost and inefficiency can limit their applicability, favoring smaller specialized networks. In this work, we introduce ICD-Juicer, a novel LLM distillation framework for improving NERC performance in resource-constrained environments, focused on extracting ICD-9 codes from clinical documents. ICD-Juicer transfers the clinical knowledge of an LLM to BERT-based models. Accuracy is further improved by including short textual descriptions of the target classes in the prompt. We conducted extensive prompt engineering with different LLMs on a very large dataset of almost 2 million medical reports covering a wide variety of domains. Notably, the prompt supplies the document-level annotations from the MIMIC-III dataset to restrict the output options of GPT-3.5-turbo, which yielded consistent outputs throughout the creation of the augmented dataset.
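
The prompt constraint described in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual prompt or code: the function names `build_prompt` and `parse_response`, the prompt wording, and the `<span> | <code>` answer format are all assumptions made for illustration. The sketch only shows the general idea of listing the document-level ICD-9 codes together with their short descriptions, and then discarding any answer that falls outside that list.

```python
from typing import Dict, List, Tuple


def build_prompt(report: str, candidates: Dict[str, str]) -> str:
    """Build a prompt that restricts the LLM to the document-level ICD-9 codes.

    `candidates` maps each allowed code to its short description.
    The wording below is hypothetical, not the prompt used in the thesis.
    """
    options = "\n".join(f"- {code}: {desc}" for code, desc in candidates.items())
    return (
        "Extract the text spans in the clinical report that support each ICD-9 code.\n"
        "Use only the codes listed below.\n\n"
        f"Allowed codes:\n{options}\n\n"
        f"Report:\n{report}\n\n"
        "Answer with one line per entity in the form <span> | <code>."
    )


def parse_response(response: str, candidates: Dict[str, str]) -> List[Tuple[str, str]]:
    """Keep only well-formed lines whose code belongs to the allowed set."""
    pairs: List[Tuple[str, str]] = []
    for line in response.splitlines():
        if "|" not in line:
            continue  # skip lines that do not follow the assumed format
        span, code = (part.strip() for part in line.rsplit("|", 1))
        if span and code in candidates:
            pairs.append((span, code))
    return pairs


# Example document-level annotations (standard ICD-9-CM short descriptions).
candidates = {
    "428.0": "Congestive heart failure, unspecified",
    "401.9": "Unspecified essential hypertension",
}
prompt = build_prompt("Patient admitted with CHF and elevated blood pressure.", candidates)
```

Filtering the response against the same candidate set is one simple way to keep the augmented annotations consistent with the document-level labels; the resulting span and code pairs could then serve as training data for a smaller model.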

Document type
Degree thesis (Laurea)
Thesis author
Addimando, Salvatore Antonio
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Natural Language Processing, Named Entity Recognition, Knowledge Distillation, Large Language Models, ICD-9-CM
Thesis defence date
5 October 2023
URI
