Esposito, Simone
(2025)
Named Entity Recognition for Historical Italian Texts: Overcoming Data Limitations through Strategic Annotation and Model Adaptation.
[Master's degree thesis], Università di Bologna, Degree Programme in
Artificial intelligence [LM-DM270]
Abstract
This thesis addresses the challenge of Named Entity Recognition (NER) in historical Italian (Volgare) texts from the 13th to 16th centuries, a period characterized by linguistic variability and non-standardized orthography. Traditional NER approaches optimized for modern languages fail to capture the unique features of Volgare Italian, which presents unstandardized spelling, regional variations, and complex naming patterns. To overcome these limitations, we developed a comprehensive annotation framework with nine entity categories (PER, LOC, FAM, POP, OPR, DAT, EVE, DOC, and ORG) and created customized annotation tools that balance manual effort with semi-automated techniques. We evaluated multiple approaches, comparing zero-shot and few-shot performance of large language models (Claude 3.7 Sonnet, SLIMER-IT) against fine-tuned smaller models (DistilBERT). Our experiments revealed several key findings: (1) smaller context windows (128 characters) consistently outperformed larger ones across all models; (2) Claude demonstrated impressive zero-shot capabilities with recall exceeding 0.85; (3) fine-tuned models developed specialized abilities to recognize certain entity subtypes; and (4) ensemble approaches combining complementary models showed significant performance gains. Testing on a completely annotated text of Angelo Poliziano's "Orfeo" confirmed these findings while highlighting opportunities for continued development. This research establishes methodological foundations for processing historical Italian texts and offers insights for similar challenges in other historical languages.
Document type
Degree thesis
(Master's degree)
Thesis author
Esposito, Simone
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Named Entity Recognition, Historical Italian, Volgare, Annotation Framework, Zero-shot Learning, Few-shot Learning, Fine-tuning, Ensemble Models, Digital Humanities, Context Windows, BERT, Claude, SLIMER-IT, Bootstrapping
Thesis defense date
25 March 2025