Ciminari, Debora
(2025)
A Tough Row to Hoe: Instruction Fine-Tuning LLaMA 3.2 for Multilingual Sentence Disambiguation and Idiom Identification.
[Master's thesis], Università di Bologna, Degree programme in
Specialized translation [LM-DM270] - Forlì
Abstract
Idiomatic expressions (IEs) are a fundamental aspect of language, traditionally
defined as expressions whose meanings cannot be inferred from their
individual components. However, modern linguistic theories propose a more
complex definition of idiomaticity, which is now understood as a continuum
where IEs can be placed depending on multiple factors. This complexity
poses challenges for natural language processing (NLP) applications, where
effective handling of IEs can improve performance in various tasks, including
sentiment analysis, question answering, text summarisation, and machine
translation. This thesis contributes to the study of IEs in NLP by instruction
fine-tuning LLaMA 3.2 1B on two tasks: sentence disambiguation and idiom
identification. To this end, a multilingual instruction-formatted dataset was
created, incorporating English, Italian, and Portuguese as both instruction
and input languages. This made it possible to investigate the interaction between the
instruction and input language and examine the model’s performance when
they match and when they differ. The findings showed that aligning instruction
and input languages does not always improve performance, highlighting
complex cross-linguistic interactions. Moreover, while fine-tuning improved
idiom identification, it led to slight declines in sentence disambiguation, possibly
due to dataset limitations and the lack of hyperparameter tuning. Future
work could expand language diversity, refine fine-tuning strategies, and explore
other LLM architectures for better performance.
Document type
Degree thesis
(Master's degree)
Thesis author
Ciminari, Debora
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
CURRICULUM TRANSLATION AND TECHNOLOGY
Degree programme regulations
DM270
Keywords
natural language processing, large language models, LLaMA, instruction fine-tuning, idiomatic expressions, multilingual.
Thesis defence date
18 March 2025
URI