Abstract
In the process of learning new tasks, humans do not rely on random action sequences; instead, they draw on prior knowledge and experience, often guided by instructional materials such as videos, books, or online tutorials. Policy gradient-based methods, by contrast, are known to struggle with exploration and typically resort to random environmental interactions until a reward signal is received. In more complex learning environments this approach may never converge, so communicating and interpreting prior knowledge is essential to accelerate the learning process. To address this challenge, instruction-following reinforcement learning has emerged as a popular way to augment exploration in policy gradient methods. By integrating reward shaping with instruction following, these methods optimize not only for task reward but also for the alignment of actions with human intentions and instructions.
In this paper, we investigate the feasibility of training two agents to play video games collaboratively in a referential game setting, where one agent explains the game observation and the other interprets the explanation to choose a valid action. We search for an emergent communication protocol between the two agents that achieves higher reward in Atari games. To this end, we propose an Explainer-Interpreter framework that harnesses a large Vision-Language Model to generate instructions and a frozen pre-trained Large Language Model to interpret and execute them. We employ the Proximal Policy Optimization (PPO) algorithm to optimize instruction generation. Through detailed studies of the implementation details, we assess their impact on performance. Despite our efforts, the findings show that training the vision-language model for text generation with the policy objective alone, without leveraging downstream datasets, fails to converge and yields only a negligible improvement in reward.
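As a rough illustration only, the sketch below outlines the Explainer-Interpreter loop described in the abstract under heavily simplified assumptions. Every name in it (ToyEnv, Explainer, Interpreter, ppo_update, the toy reward) is a hypothetical placeholder rather than the thesis implementation: the real Explainer is a trainable Vision-Language Model conditioned on game frames, the real Interpreter is a frozen pre-trained Large Language Model, and PPO is used to optimize instruction generation.

# Minimal sketch of the Explainer-Interpreter loop; all names are placeholders.
import random


class ToyEnv:
    """Stand-in for an Atari environment; the observation is a dummy step counter."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == "FIRE" else 0.0  # arbitrary toy reward
        done = self.t >= 10
        return self.t, reward, done


class Explainer:
    """Placeholder for the trainable VLM: maps an observation to a textual instruction."""

    def generate_instruction(self, observation):
        # A real VLM would condition on the image observation; here we sample randomly.
        return random.choice(["move left", "move right", "shoot", "wait"])


class Interpreter:
    """Placeholder for the frozen LLM: maps an instruction to a valid action."""

    def choose_action(self, instruction):
        mapping = {"move left": "LEFT", "move right": "RIGHT",
                   "shoot": "FIRE", "wait": "NOOP"}
        return mapping.get(instruction, "NOOP")


def ppo_update(explainer, trajectory):
    # Stub for the PPO step: only the Explainer's parameters would be updated here;
    # the Interpreter stays frozen throughout training.
    pass


def rollout(env, explainer, interpreter, steps=50):
    trajectory, obs = [], env.reset()
    for _ in range(steps):
        instruction = explainer.generate_instruction(obs)
        action = interpreter.choose_action(instruction)
        obs, reward, done = env.step(action)
        trajectory.append((obs, instruction, action, reward))
        if done:
            obs = env.reset()
    return trajectory


if __name__ == "__main__":
    env, explainer, interpreter = ToyEnv(), Explainer(), Interpreter()
    traj = rollout(env, explainer, interpreter)
    ppo_update(explainer, traj)
    print(f"collected {len(traj)} transitions, total reward {sum(r for *_, r in traj):.1f}")

The point the sketch tries to capture is the division of roles: the Interpreter's mapping from instructions to valid actions stays fixed, while only the Explainer would receive PPO updates computed from the collected trajectories.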
Document type
Degree thesis (Laurea)
Thesis author
Song, Zhaohui
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Reinforcement Learning from Human Feedback, Natural Language Processing, Vision-Language Models, Computer Vision, Deep Learning, proximal policy optimization, ai explainability
Date of thesis defence
15 March 2024
URI