Favale, Stefano
(2024)
Zero-Shot capabilities of Multi-Modal Large Language Models for Creative-oriented Tasks.
[Master's degree thesis], Università di Bologna, degree programme in
Ingegneria informatica [LM-DM270]. Full-text document not available.
The full text is not available, by the author's choice.
Abstract
This thesis explores the zero-shot capabilities of multimodal large language models (MLLMs) in creative-oriented tasks. The first objective is to demonstrate that multimodal integration, encompassing text, images, audio, and video, can be effectively achieved by using text as an intermediate language. This objective is accomplished by analysing the architecture and training process of several state-of-the-art MLLMs. The second objective is to show the considerable potential of MLLMs in the artistic field; in particular, examples are given of possible applications of the analysed models to painting, poetry, film, and video game creation. The last objective is to understand whether MLLMs are ready to be used by artists. To this end, the two most versatile models among those presented, ImageBind LLM and Video-LLaMA, were tested on multimodal art-related classification tasks. The results reveal the models’ strengths and limitations, suggesting that while these technologies show promise, they require further improvement for more complex artistic tasks.
In summary, this thesis demonstrates the significant potential of MLLMs in creative fields, particularly through the use of text as an intermediate language across multiple modalities. While the analysed models exhibit promising zero-shot capabilities, especially in tasks related to painting, poetry, film, and video game creation, the tests of ImageBind LLM and Video-LLaMA also highlight that further refinement is needed. Nevertheless, these technologies show potential as artistic tools: their performance on simple tasks is already good without any task-specific fine-tuning, so it is reasonable to expect that, with additional training or in-context learning, these models could become a very useful and versatile aid to the creative process in the artistic field.
Document type
Degree thesis
(Master's degree)
Thesis author
Favale, Stefano
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
CURRICULUM INGEGNERIA INFORMATICA
Degree programme regulations (Ordinamento Cds)
DM270
Keywords
Multimodal Large Language Models (MLLMs), Multimodal Artificial Intelligence, Zero-shot, ImageBind LLM, Video-LLaMA, Text as intermediate language, Modality integration, Creative processes, Artistic applications, Multimodal classification tasks, Multimodal creativity, Creative AI, Artistic AI
Thesis defence date
8 October 2024
URI