Abstract
Automated image captioning with artificial neural networks allows for applications that go beyond producing a natural-language description of the visual information contained in an image. This work explores the use of image captioning to generate the instructions for the gastronomic procedure depicted in an input picture. To do this, the model must learn to focus on the appropriate visual elements of the image and to mimic the required style of the captions. A multilingual dataset of recipes is used to fine-tune an English vision encoder-decoder model and to prefix-tune an Italian model built with CLIP as the image encoder, mGPT as the linguistic decoder, and a lightweight network bridging the two modalities. The lack of context beyond the individual image causes the most issues, but both resulting models perform well overall, the English fine-tuned model especially so. However, most of the automated metrics used struggle to evaluate the quality of the results reliably. BERTScore fares best among them, both with the baseline BERT model and with a version adapted to the domain. Noisy references probably contribute to the problems encountered during evaluation, but they are certainly not the only factor. In short, while this kind of non-standard application of image captioning can be modeled successfully, selecting appropriate evaluation metrics is non-trivial, and a time-consuming manual evaluation may be necessary for a fully informed assessment.
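As a concrete illustration of the first approach, the sketch below fine-tunes an off-the-shelf vision encoder-decoder captioner on image/recipe-step pairs. This is a minimal sketch, assuming the HuggingFace VisionEncoderDecoderModel API and a commonly used ViT+GPT-2 checkpoint as the English base model; the abstract does not name the actual model, learning rate, or training loop.

```python
# Minimal fine-tuning sketch; checkpoint and hyperparameters are
# assumptions for illustration, not details taken from the thesis.
import torch
from transformers import (AutoImageProcessor, AutoTokenizer,
                          VisionEncoderDecoderModel)

ckpt = "nlpconnect/vit-gpt2-image-captioning"  # assumed base checkpoint
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = AutoImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(image, step_text: str) -> float:
    """One update: the dish photo is the input, the recipe step the target."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    labels = tokenizer(step_text, return_tensors="pt").input_ids
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```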
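The Italian pipeline can be sketched in a ClipCap-style fashion: CLIP and mGPT stay frozen, and only a lightweight mapping network that projects the CLIP image embedding into a sequence of prefix embeddings for the decoder is trained. The checkpoint names, prefix length, and MLP shape below are assumptions, not details taken from the thesis.

```python
# Hedged sketch of the prefix-tuning bridge between CLIP and mGPT.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPModel

class PrefixMapper(nn.Module):
    """Maps one CLIP image embedding to `prefix_len` decoder embeddings."""
    def __init__(self, clip_dim: int, gpt_dim: int, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, image_embed: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_len, gpt_dim)
        return self.mlp(image_embed).view(-1, self.prefix_len, self.gpt_dim)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
gpt = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")
clip.requires_grad_(False)  # frozen image encoder
gpt.requires_grad_(False)   # frozen language decoder

mapper = PrefixMapper(clip.config.projection_dim, gpt.config.hidden_size)

def caption_prefix(pixel_values: torch.Tensor) -> torch.Tensor:
    """Build the prefix prepended to the caption's token embeddings
    before the sequence is fed to mGPT."""
    with torch.no_grad():
        image_embed = clip.get_image_features(pixel_values=pixel_values)
    return mapper(image_embed)
```

In such a setup, training would concatenate this prefix with the embeddings of the target recipe step and compute the language-modeling loss on the text tokens only, so that the mapper alone learns to steer the frozen decoder.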
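Finally, since BERTScore is singled out as the most reliable metric, the snippet below shows how generated steps might be scored with the `bert-score` package, first with the default model and then with a hypothetical domain-adapted checkpoint. The sentences and the checkpoint path are illustrative only; just the use of BERTScore itself comes from the abstract.

```python
# Hedged BERTScore example; sentences and model path are invented.
from bert_score import score

candidates = ["Saute the chopped onions until golden."]
references = ["Fry the onions in a pan until they turn golden brown."]

# Baseline scoring with the default English model
P, R, F1 = score(candidates, references, lang="en")

# Scoring with a (hypothetical) domain-adapted encoder; a custom
# model_type requires choosing which layer embeddings are taken from
P_d, R_d, F1_d = score(candidates, references,
                       model_type="path/to/recipe-adapted-bert",
                       num_layers=9)
print(F1.mean().item(), F1_d.mean().item())
```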
Document type: Degree thesis (Master's degree, Laurea magistrale)
Thesis author: Benini, Elena
Thesis supervisor:
Thesis co-supervisor:
School:
Degree programme:
Curriculum: CURRICULUM TRANSLATION AND TECHNOLOGY
Degree programme regulations: DM270
Keywords: NLP, image captioning, artificial neural networks, Transformer, fine-tuning, prefix-tuning, encoder-decoder model
Thesis defence date: 20 March 2026
URI: