From Words to Codes: Large Language Models for ICD-9 Extraction in Clinical Documents

Addimando, Salvatore Antonio (2023) From Words to Codes: Large Language Models for ICD-9 Extraction in Clinical Documents. [Laurea], Università di Bologna, Corso di Studio in Ingegneria e scienze informatiche [L-DM270] - Cesena, restricted-access document.
Full-text documents available:
PDF document (Thesis)
Full text not accessible until 31 December 2024.
Available under licence: Creative Commons: Attribution - NonCommercial - NoDerivatives 4.0 (CC BY-NC-ND 4.0)


Abstract

In real-world named entity recognition and classification (NERC), annotation scarcity and the need to generalize to unseen types present formidable obstacles. In such settings, leveraging ICD-9 codes becomes a pivotal strategy: they offer a standardized system for categorizing medical diagnoses and procedures, providing a well-established framework for classifying entities within the healthcare domain. By incorporating ICD-9 codes into NERC systems, practitioners can harness a wealth of prior knowledge and domain-specific information, enhancing the accuracy and efficiency of their models and ensuring the precise identification and classification of medical entities, to the benefit of both healthcare providers and patients. Although large language models (LLMs) hold great potential, their computational cost and inefficiency can limit their applicability, favoring smaller specialized networks. In this thesis, we introduce ICD-Juicer, a novel LLM distillation framework for improving NERC performance in resource-constrained environments, focusing specifically on the extraction of ICD-9 codes from clinical documents. At its core, ICD-Juicer transfers clinical knowledge from an LLM to BERT-based models; accuracy is further enhanced by including short textual descriptions of the target classes in the prompt. We conducted extensive prompt engineering with different LLMs on an extremely large dataset comprising almost 2 million medical reports covering a wide variety of domains. Notably, the prompt provides document-level annotations from the MIMIC-III dataset to restrict GPT-3.5-turbo's output options; in this way we obtained consistent outputs throughout the creation of the augmented dataset.
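To make the prompting step described above more concrete, the sketch below shows one plausible way to build such a restricted prompt and query GPT-3.5-turbo. It is a minimal illustration only: the function name `build_icd9_prompt`, the instruction wording, and the example codes are assumptions, not the thesis's actual prompts; the only elements taken from the abstract are the idea of listing candidate ICD-9 codes with their short descriptions (derived from MIMIC-III document-level annotations) and constraining the model to choose among them.

```python
# Hypothetical sketch of the prompt-construction step described in the abstract.
# The candidate ICD-9 codes and their short descriptions come from document-level
# MIMIC-III annotations, so the LLM's output is restricted to that set.
# Assumes the OpenAI Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment.

from openai import OpenAI


def build_icd9_prompt(report_text: str, candidate_codes: dict[str, str]) -> str:
    """Compose a prompt that restricts the model's choices to the given ICD-9 codes."""
    code_lines = "\n".join(f"- {code}: {desc}" for code, desc in candidate_codes.items())
    return (
        "You are a clinical coding assistant. For each medical entity mentioned in the "
        "report, assign exactly one ICD-9 code from the candidate list below. "
        "Answer with lines of the form `<entity span> -> <ICD-9 code>` and nothing else.\n\n"
        f"Candidate ICD-9 codes (code: short description):\n{code_lines}\n\n"
        f"Clinical report:\n{report_text}"
    )


if __name__ == "__main__":
    # Illustrative candidates; in the thesis these would be the document-level annotations.
    candidates = {
        "428.0": "Congestive heart failure, unspecified",
        "584.9": "Acute kidney failure, unspecified",
    }
    prompt = build_icd9_prompt(
        "Patient admitted with acute decompensated heart failure and rising creatinine...",
        candidates,
    )

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding helps keep the augmented annotations consistent
    )
    print(response.choices[0].message.content)
```

Constraining the answer format and the candidate set in this way is one straightforward means of obtaining the consistent, machine-parsable outputs that the abstract reports for the augmented-dataset creation.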

Document type: Degree thesis (Laurea)
Thesis author: Addimando, Salvatore Antonio
Thesis supervisor:
Thesis co-supervisor:
School:
Degree programme:
Degree programme regulations: DM270
Keywords: Natural Language Processing, Named Entity Recognition, Knowledge Distillation, Large Language Models, ICD-9-CM
Thesis defence date: 5 October 2023
URI:

Other metadata
