Zero-Shot capabilities of Multi-Modal Large Language Models for Creative-oriented Tasks

Favale, Stefano (2024) Zero-Shot capabilities of Multi-Modal Large Language Models for Creative-oriented Tasks. [Master's thesis], Università di Bologna, Degree programme in Computer Engineering [LM-DM270], full-text document not available

The full text is not available, by the author's choice.

Abstract

This thesis explores the zero-shot capabilities of multimodal large language models (MLLMs) in creative-oriented tasks. The first objective is to demonstrate that multimodal integration, encompassing text, images, audio, and video, can be effectively achieved by using text as an intermediate language. This objective is accomplished by analysing the architecture and training process of several state-of-the-art MLLMs. The second objective is to show the considerable potential of MLLMs in the artistic field; in particular, examples are given of possible applications of the analysed models to painting, poetry, film, and video game creation. The final objective is to understand whether MLLMs are ready to be used by artists. To this end, the two most versatile models among those presented, ImageBind LLM and Video-LLaMA, were tested on multimodal, art-related classification tasks. The results reveal the models’ strengths and limitations, suggesting that while these technologies show promise, they require further improvement for more complex artistic tasks. In summary, this thesis demonstrates the significant potential of MLLMs in creative fields, particularly through the use of text as an intermediate language across multiple modalities. While the analysed models exhibit promising zero-shot capabilities, especially in tasks related to painting, poetry, film, and video game creation, the tests of ImageBind LLM and Video-LLaMA also show that further refinement is needed. Nevertheless, these technologies show potential as artistic tools: since their performance on simple tasks is already good without any task-specific fine-tuning, it is reasonable to expect that, with additional training or in-context learning, these models could become very useful and versatile tools for supporting the creative process in the artistic field.

Document type

Degree thesis (Master's degree)

Thesis author

Favale, Stefano

Thesis supervisor

De Filippo, Allegra

Thesis co-supervisors

Stacchio, Lorenzo ; Borghesi, Andrea

School

Engineering and Architecture

Degree programme

Computer Engineering [LM-DM270]

Curriculum

Computer Engineering curriculum

Degree programme regulations

DM270

Keywords

Multimodal Large Language Models (MLLMs), Multimodal Artificial Intelligence, Zero-shot, ImageBind LLM, Video-LLaMA, Text as intermediate language, Modality integration, Creative processes, Artistic applications, Multimodal classification tasks, Multimodal creativity, Creative AI, Artistic AI

Thesis defence date

8 October 2024

URI

https://amslaurea.unibo.it/id/eprint/32977
