Leveraging LLMs as noisy label generators for named entity recognition

Lopez, Antonio (2024) Leveraging LLMs as noisy label generators for named entity recognition. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree Programme in Artificial intelligence [LM-DM270]

Full-text documents available:

PDF document (Thesis)
Available under license: Subject to any broader permissions granted by the author, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal purposes of study, research, and teaching; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.

Abstract

Large Language Models (LLMs) have proven effective on a wide range of tasks thanks to their adaptability. Their reasoning and in-context learning abilities allow them to tackle many Natural Language Processing (NLP) tasks and reach good results without any task-specific training. One of the best-known and most important NLP tasks is Named Entity Recognition (NER), a subtask of information extraction that seeks to locate named entities mentioned in unstructured text and classify them into predefined categories. LLMs can address this task effectively when given a clear description and relevant examples. Despite their strong performance, however, running these models requires substantial computational resources. In this setting, smaller encoder-only models such as GLiNER or fine-tuned BERT-like models can deliver better performance at a lower cost. The main limitation of these smaller models is that they are usually bound to the data seen during training, whereas LLMs can easily adapt to different scenarios. This work proposes a distillation pipeline that leverages the ability of LLMs to act as label generators, creating synthetic data from unlabeled sources. The synthetic data is then used to distill smaller models capable of effectively replacing their teacher. The task addressed is NER on the BUSTER dataset, which contains manually annotated financial transactions.
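
The labeling step of such a pipeline can be illustrated with a minimal sketch. This is not the thesis implementation: the `toy_teacher` function is a hypothetical rule-based stand-in for an LLM call, the `COMPANY` label is illustrative and not taken from the BUSTER tag set, and the names `spans_to_bio` and `build_silver_dataset` are assumptions made for this example. It shows how noisy teacher spans could be turned into BIO-tagged training data for a smaller student model.

```python
from typing import Callable, List, Tuple

# (start_token, end_token_exclusive, label); a format assumed for this sketch
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert entity spans into BIO tags, dropping invalid or overlapping spans."""
    tags = ["O"] * len(tokens)
    for start, end, label in sorted(spans):
        # Teacher output is noisy: skip spans that fall outside the sentence.
        if not (0 <= start < end <= len(tokens)):
            continue
        # Skip spans that overlap an entity that was already tagged.
        if any(tag != "O" for tag in tags[start:end]):
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def build_silver_dataset(
    sentences: List[List[str]],
    teacher: Callable[[List[str]], List[Span]],
) -> List[Tuple[List[str], List[str]]]:
    """Label unlabeled sentences with the teacher and return (tokens, tags) pairs."""
    return [(tokens, spans_to_bio(tokens, teacher(tokens))) for tokens in sentences]


def toy_teacher(tokens: List[str]) -> List[Span]:
    """Hypothetical stand-in for an LLM: tags '<Capitalised> Inc.' as COMPANY."""
    spans = []
    for i, tok in enumerate(tokens[:-1]):
        if tok[:1].isupper() and tokens[i + 1] == "Inc.":
            spans.append((i, i + 2, "COMPANY"))
    return spans


unlabeled = [["Acme", "Inc.", "acquired", "Globex", "Inc.", "yesterday"]]
silver = build_silver_dataset(unlabeled, toy_teacher)
```

The resulting `silver` pairs would then serve as training data for a token-classification student; in a real setting `toy_teacher` would be replaced by a prompted LLM whose output is parsed into the same span format.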

Document type

Master's thesis (Laurea magistrale)

Thesis author

Lopez, Antonio

Thesis supervisor

Torroni, Paolo

Thesis co-supervisor

Zugarini, Andrea

School

Engineering and Architecture (Ingegneria e Architettura)

Degree programme