Nazerzadeh, Mohammad Amin
(2023)
Multi-grained attention over query-scoring for dialogue-to-video retrieval.
[Master's degree thesis], Università di Bologna, Degree Programme in
Artificial intelligence [LM-DM270], Full-text document not available
The full text is not available by the author's choice.
Abstract
Text-video retrieval is an important subcategory of multi-modal learning. Despite its importance, progress on this topic has been slower than in other multi-modal areas because of several obstacles: first, the high computational cost of processing videos effectively in deep neural networks and, second, the semantic mismatch between the visual and textual modalities. Training for this task is particularly challenging on small-scale datasets, where only a few hundred to a few thousand videos are available. To address the first issue, we extend text-image retrieval models to this domain on a small-scale text-video dataset, namely AVSD. Specifically, we extend CLIP, a text-image model, to the video domain with the query-scoring approach, including dual-softmax and querybank normalization. In this way, we transfer the pretrained text-image multimodal knowledge in CLIP to the video domain efficiently, without the need for large-scale and/or costly text-video fine-tuning. To address the second issue, we add dialogue as the input query and apply multi-grained attention over query-scoring, which further improves retrieval performance, lessens the semantic mismatch between the textual and visual modalities, and achieves state-of-the-art results among previous dialogue-based video retrieval methods on the AVSD dataset. Moreover, we evaluate our approach on MSRVTT-QA and MSVD-QA to show the effectiveness of dialogue-based approaches in improving video retrieval outcomes.
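The abstract names dual-softmax as one of the query-scoring components. Since the full text is not available, the following is only a minimal sketch of the commonly used dual-softmax rescoring step, not the thesis's implementation: the function names, the toy similarity matrix, and the temperature value are all hypothetical. Each query-video similarity is multiplied by a prior obtained from a softmax over all queries for that video, which penalizes videos that score highly for many queries.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax of a list of floats."""
    scaled = [v * temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(sim, temperature=10.0):
    """Rescale a query-by-video similarity matrix (illustrative only).

    sim[i][j] is the similarity of query i to video j.
    The temperature is an assumed value, not taken from the thesis.
    """
    n_queries, n_videos = len(sim), len(sim[0])
    # Prior: for each video (column), a softmax across all queries.
    columns = [[sim[i][j] for i in range(n_queries)] for j in range(n_videos)]
    priors = [softmax(col, temperature) for col in columns]
    # Weight each similarity by how strongly its video prefers that query.
    return [
        [sim[i][j] * priors[j][i] for j in range(n_videos)]
        for i in range(n_queries)
    ]


def rank_videos(scores_row):
    """Return video indices sorted from best to worst score."""
    return sorted(range(len(scores_row)), key=lambda j: scores_row[j], reverse=True)


# Toy similarities: video 0 is the raw top hit for both query 0 and query 1.
sim = [
    [0.30, 0.28, 0.10],
    [0.32, 0.20, 0.12],
    [0.15, 0.10, 0.25],
]
rescored = dual_softmax(sim)
top_before = [rank_videos(row)[0] for row in sim]
top_after = [rank_videos(row)[0] for row in rescored]
```

In this toy case the raw scores send both query 0 and query 1 to video 0, while the rescored matrix assigns each query a distinct top video. Querybank normalization follows a related idea but computes the normalizing term from a separate bank of queries; it is not shown here.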
Document type
Degree thesis
(Master's degree)
Thesis author
Nazerzadeh, Mohammad Amin
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
text-video retrieval, dialogue-based video retrieval, AVSD dataset, query-scoring, dual-softmax loss, querybank normalization, multi-grained
Thesis defence date
21 October 2023
URI