Named Entity Recognition for Historical Italian Texts: Overcoming Data Limitations through Strategic Annotation and Model Adaptation

Esposito, Simone (2025) Named Entity Recognition for Historical Italian Texts: Overcoming Data Limitations through Strategic Annotation and Model Adaptation. [Master's degree thesis (Laurea magistrale)], Università di Bologna, Degree programme in Artificial intelligence [LM-DM270]
Full-text documents available:
PDF document (Thesis), 4MB
Available under license: Creative Commons: Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0)

PDF document (Supplementary file), 144kB
Available under license: Creative Commons: Attribution - NonCommercial - NoDerivatives 4.0 (CC BY-NC-ND 4.0)

Abstract

This thesis addresses Named Entity Recognition (NER) in historical Italian (Volgare) texts from the 13th to the 16th century, a period characterized by linguistic variability and non-standardized orthography. NER approaches optimized for modern languages fail to capture the features of Volgare, which presents inconsistent spelling, regional variation, and complex naming patterns. To overcome these limitations, we developed an annotation framework with nine entity categories (PER, LOC, FAM, POP, OPR, DAT, EVE, DOC, and ORG) and built custom annotation tools that balance manual effort with semi-automated techniques. We compared the zero-shot and few-shot performance of large language models (Claude 3.7 Sonnet, SLIMER-IT) against fine-tuned smaller models (DistilBERT). Our experiments produced four key findings: (1) smaller context windows (128 characters) consistently outperformed larger ones across all models; (2) Claude showed strong zero-shot capabilities, with recall exceeding 0.85; (3) fine-tuned models developed specialized abilities to recognize certain entity subtypes; and (4) ensemble approaches combining complementary models showed significant performance gains. Testing on a fully annotated text of Angelo Poliziano's "Orfeo" confirmed these findings while highlighting opportunities for further development. This research establishes methodological foundations for processing historical Italian texts and offers insights for similar challenges in other historical languages.
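As an illustration of two ideas mentioned in the abstract, 128-character context windows and ensembles of complementary models, the following is a minimal Python sketch and not the thesis's implementation. The function names (`split_windows`, `tag_text`, `ensemble_union`, `gazetteer_tagger`), the whitespace-based cutting rule, and the choice to merge model outputs by set union are all assumptions made for this example; the abstract does not say how windows are cut or how predictions are combined. The toy gazetteer taggers merely stand in for real models.

```python
from typing import Callable, Dict, List, Set, Tuple

# The nine entity categories listed in the abstract.
LABELS = {"PER", "LOC", "FAM", "POP", "OPR", "DAT", "EVE", "DOC", "ORG"}

Span = Tuple[int, int, str]            # (start, end, label)
Tagger = Callable[[str], List[Span]]   # hypothetical: spans relative to the chunk


def split_windows(text: str, size: int = 128) -> List[Tuple[int, str]]:
    """Cut text into consecutive chunks of at most `size` characters.

    Assumption: chunks end at the last space inside the limit when one
    exists, so that words are not split across windows.
    """
    windows = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space at the end of this chunk
        windows.append((start, text[start:end]))
        start = end
    return windows


def tag_text(text: str, tagger: Tagger, size: int = 128) -> Set[Span]:
    """Run one tagger window by window and map spans back to full-text offsets."""
    spans = set()
    for offset, chunk in split_windows(text, size):
        for s, e, label in tagger(chunk):
            if label in LABELS:
                spans.add((offset + s, offset + e, label))
    return spans


def ensemble_union(text: str, taggers: List[Tagger], size: int = 128) -> List[Span]:
    """Combine several taggers by taking the union of their spans (an assumption)."""
    merged: Set[Span] = set()
    for tagger in taggers:
        merged |= tag_text(text, tagger, size)
    return sorted(merged)


def gazetteer_tagger(entries: Dict[str, str]) -> Tagger:
    """Toy stand-in for a model: exact string lookup of known names."""
    def tag(chunk: str) -> List[Span]:
        found = []
        for name, label in entries.items():
            pos = chunk.find(name)
            while pos != -1:
                found.append((pos, pos + len(name), label))
                pos = chunk.find(name, pos + 1)
        return found
    return tag


if __name__ == "__main__":
    sample = "Orfeo va a Firenze. " * 10
    people = gazetteer_tagger({"Orfeo": "PER"})
    places = gazetteer_tagger({"Firenze": "LOC"})
    for span in ensemble_union(sample, [people, places]):
        print(span, sample[span[0]:span[1]])
```

A union favors recall over precision, and an entity that straddles a window boundary would be missed by this sketch; a real pipeline would need overlap between windows or a voting rule, neither of which is specified in the abstract.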

Document type: Master's degree thesis (Laurea magistrale)
Thesis author: Esposito, Simone
Degree programme: Artificial intelligence [LM-DM270]
Degree regulations (Ordinamento Cds): DM270
Keywords: Named Entity Recognition, Historical Italian, Volgare, Annotation Framework, Zero-shot Learning, Few-shot Learning, Fine-tuning, Ensemble Models, Digital Humanities, Context Windows, BERT, Claude, SLIMER-IT, Bootstrapping
Thesis defense date: 25 March 2025
