Leveraging LLMs as noisy label generators for named entity recognition

Lopez, Antonio (2024) Leveraging LLMs as noisy label generators for named entity recognition. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree Programme in Artificial Intelligence [LM-DM270]
Full-text documents available:
PDF document (Thesis)
Available under license: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal purposes of study, research, and teaching; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.

Download (954kB)

Abstract

Large Language Models (LLMs) have proven effective on a wide range of tasks thanks to their adaptability. Their reasoning and in-context learning abilities allow them to tackle several Natural Language Processing (NLP) tasks with good results and without any task-specific training. One of the best-known and most important NLP tasks is Named Entity Recognition (NER), a subtask of information extraction that seeks to locate named entities mentioned in unstructured text and classify them into predefined categories. LLMs can address this task effectively when given a clear description and relevant examples. However, despite their strong performance, running these models requires substantial computational resources. When computational cost is a concern, smaller encoder-only models such as GLiNER or fine-tuned BERT-like models can deliver better performance at a lower cost. The main limitation of these smaller models is that they are usually bound to the data seen during training, whereas LLMs adapt easily to different scenarios. This work proposes a distillation pipeline that leverages the ability of LLMs to act as label generators, creating synthetic data from unlabeled sources. This synthetic data is then used to distill smaller models capable of effectively replacing their teacher. The task addressed is NER on the BUSTER dataset, which contains manually annotated financial transactions.
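
As a rough illustration of the label-generation step described in the abstract, the sketch below shows how mentions proposed by a teacher model might be projected onto tokens as BIO tags to form training data for a smaller student. It is not taken from the thesis: the function names, the `(mention, label)` output format, the whitespace tokenization, and the `BUYER`/`SELLER` labels are all assumptions made for the example, and `fake_teacher` is a stand-in for an actual LLM call.

```python
from typing import Callable

# Assumed teacher output format: a list of (entity_text, label) pairs.
Annotation = tuple[str, str]


def spans_to_bio(tokens: list[str], annotations: list[Annotation]) -> list[str]:
    """Project teacher-proposed entity mentions onto tokens as BIO tags.

    Mentions that cannot be found verbatim in the token sequence are
    dropped, which acts as a crude filter for hallucinated entities.
    """
    tags = ["O"] * len(tokens)
    for mention, label in annotations:
        words = mention.split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            matches = tokens[start:start + n] == words
            free = all(t == "O" for t in tags[start:start + n])
            if matches and free:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{label}"
                break  # tag only the first free occurrence of each mention
    return tags


def build_silver_dataset(
    texts: list[str],
    teacher: Callable[[str], list[Annotation]],
) -> list[tuple[list[str], list[str]]]:
    """Label unannotated texts with the teacher to obtain student training data."""
    dataset = []
    for text in texts:
        tokens = text.split()  # whitespace tokenization, for illustration only
        dataset.append((tokens, spans_to_bio(tokens, teacher(text))))
    return dataset


def fake_teacher(text: str) -> list[Annotation]:
    # Hypothetical stand-in for an LLM prompted with a task description and
    # examples. The labels are invented and are not BUSTER's tag set.
    return [("Acme Corp", "BUYER"), ("Globex", "SELLER"), ("Initech", "BUYER")]


silver = build_silver_dataset(
    ["Acme Corp agreed to acquire Globex today"], fake_teacher
)
print(silver[0][1])
```

In this toy run, the mention "Initech" does not occur in the text and is therefore discarded. The resulting token and tag sequences are the kind of noisy, automatically produced annotations on which a student such as GLiNER or a BERT-like tagger could then be fine-tuned.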

Document type
Master's thesis (Laurea magistrale)
Thesis author
Lopez, Antonio
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations (Ordinamento Cds)
DM270
Keywords
Natural Language Processing, Named Entity Recognition, Distillation, GLiNER, BUSTER, Passage retrieval
Thesis defense date
5 December 2024