Reinforcement learning on emerging referential communication in Atari games

Song, Zhaohui (2024) Reinforcement learning on emerging referential communication in Atari games. [Laurea], Università di Bologna, Corso di Studio in Ingegneria e scienze informatiche [L-DM270] - Cesena. Restricted-access document.
Full-text documents available:
PDF document (Thesis)
Full text not accessible until 31 July 2025.
Available under licence: Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 (CC BY-NC-ND 4.0)


Abstract

In the process of learning new tasks, humans do not rely on random action sequences; instead, they draw on prior knowledge and experience, often guided by instructional materials such as videos, books, or online tutorials. Policy-gradient methods, by contrast, are known to struggle with exploration and typically resort to random interactions with the environment until a reward signal is received. In a more complex learning environment this approach may never converge, so communicating and interpreting prior knowledge becomes essential to accelerate learning. Addressing this challenge, instruction-following reinforcement learning has emerged as a popular way to augment exploration in policy-gradient methods: by combining reward shaping with instruction following, these methods optimize not only for task reward but also for alignment between actions and human intentions and instructions. In this work we investigate the feasibility of training two agents to play video games collaboratively in a referential-game setting, where one agent describes the game observation and the other interprets the description to choose a valid action, and we study the communication protocol that emerges between the two agents as they pursue higher reward in Atari games. To this end, we propose an Explainer-Interpreter framework in which a large Vision-Language Model generates instructions and a frozen pre-trained Large Language Model interprets and executes them. We employ the Proximal Policy Optimization (PPO) algorithm to optimize instruction generation, and we study the implementation details in depth to assess their impact on performance. Despite our efforts, the findings show that training the Vision-Language Model for text generation purely on policy, without leveraging downstream datasets, fails to converge and yields only a negligible improvement in reward.
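The abstract describes the Explainer-Interpreter setup only at a high level. The following is a minimal, illustrative sketch of that kind of loop, not the thesis code: the Explainer and Interpreter classes are toy stand-ins for the large Vision-Language Model and the frozen pre-trained Language Model, and the instruction vocabulary, observation features, advantages, and hyperparameters are all invented for illustration. Only the clipped PPO surrogate applied to the instruction-generation policy reflects the algorithm named in the abstract.

# Minimal sketch (assumption, not the thesis code) of an Explainer-Interpreter loop.
# The "explainer" stands in for a Vision-Language Model that emits a short
# instruction from a game frame; the "interpreter" stands in for a frozen
# pre-trained Language Model that maps the instruction to a game action.
# Only the explainer is updated, with a PPO-style clipped objective.

import torch
import torch.nn as nn

VOCAB = ["move up", "move down", "stay"]   # toy instruction vocabulary (assumption)
N_ACTIONS = 3                              # toy action space (assumption)

class Explainer(nn.Module):
    """Stand-in for the VLM: frame features -> distribution over instructions."""
    def __init__(self, obs_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, len(VOCAB)))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Interpreter(nn.Module):
    """Stand-in for the frozen LLM: instruction index -> distribution over actions."""
    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(len(VOCAB), N_ACTIONS)
        for p in self.parameters():
            p.requires_grad_(False)          # the interpreter stays frozen
    def forward(self, instr_idx):
        return torch.distributions.Categorical(logits=self.table(instr_idx))

def ppo_explainer_update(explainer, optimizer, obs, instr, old_logp, advantage,
                         clip_eps=0.2, epochs=4):
    """PPO clipped-surrogate update on the instruction-generation policy only."""
    for _ in range(epochs):
        logp = explainer(obs).log_prob(instr)
        ratio = torch.exp(logp - old_logp)
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
        loss = -torch.min(unclipped, clipped).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy rollout: random "frames" and advantages replace the Atari environment;
# in the real setting the advantage would come from the game reward obtained
# after the interpreter's action is executed.
explainer, interpreter = Explainer(), Interpreter()
opt = torch.optim.Adam(explainer.parameters(), lr=3e-4)
obs = torch.randn(32, 128)                  # batch of frame features (placeholder)
with torch.no_grad():
    instr = explainer(obs).sample()         # explainer describes the observation
    action = interpreter(instr).sample()    # frozen interpreter picks an action
    old_logp = explainer(obs).log_prob(instr)
advantage = torch.randn(32)                 # placeholder advantage estimates
ppo_explainer_update(explainer, opt, obs, instr, old_logp, advantage)

The key design point mirrored here is that gradients flow only through the instruction-generation policy, while the interpreter's parameters remain frozen, matching the abstract's description of optimizing instruction generation with PPO against a fixed pre-trained interpreter.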

Document type: Bachelor's degree thesis (Laurea)
Thesis author: Song, Zhaohui
Thesis supervisor:
Thesis co-supervisor:
School:
Degree programme:
Degree programme regulations (Ordinamento CdS): DM270
Keywords: Reinforcement Learning from Human Feedback, Natural Language Processing, Vision-Language Models, Computer Vision, Deep Learning, Proximal Policy Optimization, AI Explainability
Thesis defence date: 15 March 2024
URI:
