Named Entity Recognition for Historical Italian Texts: Overcoming Data Limitations through Strategic Annotation and Model Adaptation

Esposito, Simone (2025) Named Entity Recognition for Historical Italian Texts: Overcoming Data Limitations through Strategic Annotation and Model Adaptation. [Master's thesis], Università di Bologna, Degree Programme in Artificial Intelligence [LM-DM270]

Full-text documents available:

	PDF document (Thesis). Available under licence: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0). Download (4MB)
	PDF document (Supplementary file). Available under licence: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0). Download (144kB)

Abstract

This thesis addresses Named Entity Recognition (NER) in historical Italian (Volgare) texts from the 13th to the 16th century, a period marked by linguistic variability and non-standardized orthography. NER approaches optimized for modern languages fail to capture the features of Volgare, which combines inconsistent spelling, regional variation, and complex naming patterns. To overcome these limitations, we developed an annotation framework with nine entity categories (PER, LOC, FAM, POP, OPR, DAT, EVE, DOC, and ORG) and built custom annotation tools that balance manual effort with semi-automated techniques. We compared the zero-shot and few-shot performance of large language models (Claude 3.7 Sonnet, SLIMER-IT) against fine-tuned smaller models (DistilBERT). The experiments yielded four key findings: (1) smaller context windows (128 characters) consistently outperformed larger ones across all models; (2) Claude achieved strong zero-shot performance, with recall exceeding 0.85; (3) fine-tuned models developed specialized abilities to recognize certain entity subtypes; and (4) ensemble approaches combining complementary models showed significant performance gains. Testing on a fully annotated text of Angelo Poliziano's "Orfeo" confirmed these findings while highlighting opportunities for continued development. This research establishes methodological foundations for processing historical Italian texts and offers insights for similar challenges in other historical languages.
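
Findings (1) and (4) can be illustrated with a minimal sketch. This record does not say how the thesis cuts its 128-character windows or how it combines models, so everything below is an assumption made for illustration: the function names `split_into_windows`, `union_ensemble`, and `recall` are hypothetical, windows are packed greedily at whitespace boundaries, the ensemble is a plain union of predicted spans, and recall uses exact span matching.

```python
# Illustrative sketch only: the windowing rule, the union ensemble, and the
# exact-match recall are assumptions, not the method described in the thesis.

Span = tuple[int, int, str]  # (start offset, end offset, entity label)


def split_into_windows(text: str, max_chars: int = 128) -> list[str]:
    """Greedily pack whitespace-separated tokens into windows of at most max_chars."""
    windows: list[str] = []
    current = ""
    for token in text.split():
        candidate = token if not current else current + " " + token
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                windows.append(current)
            # A single token longer than max_chars is kept whole rather than cut.
            current = token
    if current:
        windows.append(current)
    return windows


def union_ensemble(*predictions: set[Span]) -> set[Span]:
    """Combine models by keeping every span that at least one model predicted."""
    merged: set[Span] = set()
    for spans in predictions:
        merged |= spans
    return merged


def recall(predicted: set[Span], gold: set[Span]) -> float:
    """Fraction of gold spans recovered, using exact (start, end, label) matches."""
    if not gold:
        return 0.0
    return len(predicted & gold) / len(gold)


# Hypothetical spans, not data from the thesis.
gold_spans = {(0, 6, "PER"), (10, 17, "LOC")}
model_a = {(0, 6, "PER")}
model_b = {(10, 17, "LOC"), (20, 25, "ORG")}
ensemble = union_ensemble(model_a, model_b)
```

A union favours recall, since a span found by either model is kept, but it also keeps each model's false positives. Voting or per-category model selection are alternative ways to combine models.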

Document type

Degree thesis (Master's degree)

Thesis author

Esposito, Simone

Thesis supervisor

Torroni, Paolo

Thesis co-supervisors

Donati, Nicolò ; D'Addio, Ciro

School

Engineering and Architecture

Degree programme

Artificial intelligence [LM-DM270]

Degree programme regulations

DM270

Keywords

Named Entity Recognition, Historical Italian, Volgare, Annotation Framework, Zero-shot Learning, Few-shot Learning, Fine-tuning, Ensemble Models, Digital Humanities, Context Windows, BERT, Claude, SLIMER-IT, Bootstrapping

Thesis defence date

25 March 2025

URI

https://amslaurea.unibo.it/id/eprint/35312
