ROMAGNA M-IA: Language Models for Autocompletion and Clustering in Italian Cuisine

Strocchi, Alessandro (2025) ROMAGNA M-IA: Language Models for Autocompletion and Clustering in Italian Cuisine. [Master's thesis], Università di Bologna, Degree programme in Specialized Translation [LM-DM270] - Forlì, Restricted-access document.
Full-text documents available:
PDF document (Thesis)
Full text accessible only to institutional users of the University
Available under licence: Unless the author grants broader permissions, the thesis may be freely consulted, and a single copy may be saved and printed strictly for personal purposes of study, research, and teaching; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.


Abstract

This dissertation contributes to domain-specific natural language processing by fine-tuning language models for two tasks in a closed, specialized domain: autocompletion and clustering. The work is motivated by the practical need to handle the specialized terminology, regional variation, and informal writing styles found in Italian menus. First, three models (Word2Vec, LSTM, and GePpeTto) are assessed on their ability to autocomplete menu items. Performance is measured with perplexity and the Jaccard coefficient; GePpeTto, despite a higher perplexity, consistently produces more contextually relevant completions, a result attributed to its autoregressive transformer architecture. Second, BERT (Italian) and UmBERTo are compared on clustering a corpus of Italian dishes and ingredients. HDBSCAN is chosen for its robustness to clusters of varying density, with the Davies-Bouldin Index as the key performance criterion. Both models achieve a Davies-Bouldin Index below 1, indicating meaningful clusters, although BERT attains better separation and less overlap. The findings demonstrate that transformer-based models can be fine-tuned to address the challenges of domain-specific language, including irregular vocabulary and regionally influenced spelling. By highlighting the strengths and limitations of each model, the dissertation points to potential improvements in both autocompletion and clustering, such as more extensive domain-focused training and hybrid model architectures, and so contributes to better language technology for the culinary domain.
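The three metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's code: the abstract does not say how the metrics were computed, so the token-level Jaccard coefficient, the exclusion of HDBSCAN noise points (label -1), and the function names `jaccard`, `perplexity`, and `davies_bouldin` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def jaccard(predicted: str, reference: str) -> float:
    """Jaccard coefficient between the token sets of two strings.

    Assumption: whitespace tokenisation and lowercasing.
    """
    a, b = set(predicted.lower().split()), set(reference.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def perplexity(token_probs: list[float]) -> float:
    """Perplexity: exp of the mean negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def davies_bouldin(points, labels) -> float:
    """Davies-Bouldin Index (lower is better); needs at least two clusters.

    For each cluster, take the worst ratio of summed within-cluster scatter
    to centroid distance against any other cluster, then average the ratios.
    """
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:  # assumption: drop points that HDBSCAN labels as noise
            groups[label].append(point)
    centroids = {k: [sum(c) / len(v) for c in zip(*v)] for k, v in groups.items()}
    scatter = {
        k: sum(math.dist(p, centroids[k]) for p in v) / len(v)
        for k, v in groups.items()
    }
    keys = list(groups)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
        for i in keys
    ]
    return sum(worst) / len(worst)
```

For instance, `jaccard("tagliatelle al ragù", "tagliatelle al tartufo")` gives 0.5, since two of the four distinct tokens are shared. Two tight clusters whose members lie 1 unit from their centroids, with centroids 10 units apart, give a Davies-Bouldin Index of 0.2, well under the threshold of 1 mentioned in the abstract.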

Document type
Thesis (Master's degree)
Thesis author
Strocchi, Alessandro
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
CURRICULUM TRANSLATION AND TECHNOLOGY
Degree regulations
DM270
Keywords
NLP, Language Models, Autocompletion, Clustering
Thesis defence date
18 March 2025
URI
