ROMAGNA M-IA: Language Models for Autocompletion and Clustering in Italian Cuisine

Strocchi, Alessandro (2025) ROMAGNA M-IA: Language Models for Autocompletion and Clustering in Italian Cuisine. [Master's degree thesis], Università di Bologna, Degree programme in Specialized translation [LM-DM270] - Forlì, Restricted-access document.

Full-text documents available:

PDF document (Thesis)
Full text accessible only to institutional users of the University
Available under licence: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed for strictly personal purposes of study, research, and teaching; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.
Abstract

This dissertation contributes to domain-specific natural language processing by fine-tuning language models for two tasks in a narrow, closed domain: autocompletion and clustering. The work is motivated by the practical need to handle the specialized terminology, regional variation, and informal writing style found in Italian menus. First, several models (Word2Vec, LSTM, and GePpeTto) are assessed for their ability to autocomplete menu items. Performance is measured with perplexity and the Jaccard coefficient; the results show that GePpeTto, despite its higher perplexity, consistently produces more contextually relevant completions, owing to its autoregressive transformer architecture. Second, BERT (Italian) and UmBERTo are compared on clustering a corpus of Italian dishes and ingredients. HDBSCAN is chosen for its robustness to clusters of varying density, with the Davies-Bouldin Index as the key performance criterion. Both models achieve a Davies-Bouldin Index below 1, indicating meaningful clusters, although BERT attains better separation and less overlap. The findings demonstrate that transformer-based models can be fine-tuned to address the challenges of domain-specific language, including irregular vocabulary and regionally influenced spelling. By highlighting the strengths and limitations of each model, the dissertation points to potential improvements in both autocompletion and clustering, such as more extensive domain-focused training and hybrid model architectures, and so contributes to better language technology for the culinary domain.
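The three evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's code: the function names, the whitespace tokenization, and the toy inputs are all assumptions made for illustration, and the Davies-Bouldin function assumes Euclidean distance and at least two clusters with distinct centroids.

```python
import math


def jaccard(predicted_tokens, reference_tokens):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over two token collections."""
    a, b = set(predicted_tokens), set(reference_tokens)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(a | b)


def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability per token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def davies_bouldin(points, labels):
    """Davies-Bouldin Index with Euclidean distance; lower means better separation.

    Assumes noise points (HDBSCAN labels them -1) were filtered out beforehand.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Centroid of each cluster: coordinate-wise mean of its points.
    centroids = {
        label: [sum(coords) / len(members) for coords in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter of each cluster: mean distance of its points to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    # For each cluster, take its worst (largest) similarity ratio to any other.
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


# Hypothetical menu-item completion compared with a reference string.
score = jaccard("tagliatelle al ragù".split(), "tagliatelle al pesto".split())
print(score)  # 0.5 (two shared tokens out of four distinct tokens)
```

A lower perplexity means the model assigns higher probability to the observed tokens, whereas the Jaccard coefficient compares the generated completion with a reference directly; the two can therefore disagree, as the abstract reports for GePpeTto.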

Document type

Degree thesis (Master's degree)

Thesis author

Strocchi, Alessandro

Thesis supervisor

Barron Cedeno, Luis Alberto

Thesis co-supervisor

Garcea, Federico

School

Lingue e Letterature, Traduzione e Interpretazione

Degree programme