Cecchi, Lorenzo
(2025)
Reinforcement Learning for non-deterministic problems - exploration of Chance Nodes.
[Laurea magistrale], Università di Bologna, Corso di Studio in
Matematica [LM-DM270]
Abstract
The quest to teach computers how to make decisions in complex and chaotic environments has been a central research topic since their invention. Among the frameworks proposed for this purpose, Reinforcement Learning stands out. Reinforcement Learning algorithms are often characterized by the need to balance exploration and exploitation. A simplified but fundamental instance of this dilemma is the multi-armed bandit problem. Over the years, a rich theory has been developed around this problem, including performance bounds and optimal strategies under various assumptions. Classical algorithms such as the ε-greedy strategy and Upper Confidence Bounds have been widely studied and applied. However, in many realistic settings, some form of prior knowledge or initial preference is available regarding possible actions. This leads to extensions of the classical bandit model that incorporate Bayesian priors, like the Thompson sampling algorithm, or externally provided preferences, such as those employed in PUCT: a key component of the Monte Carlo Tree Search algorithm used in AlphaZero. The aim of this thesis is to investigate the relationships between bandit strategies and tree search methods for decision-making, especially in the presence of stochasticity.
Document type
Degree thesis (Laurea magistrale)
Thesis author
Cecchi, Lorenzo
Curriculum
CURRICULUM ADVANCED MATHEMATICS FOR APPLICATIONS
Degree programme regulation
DM270
Keywords
Reinforcement Learning, Chance Nodes, Monte Carlo Tree Search
Thesis defence date
25 July 2025