GRPO4Chess: Improving Next Move Generation in Chess Language Models via Reinforcement Learning

Calzolari, Francesco Teo (2025) GRPO4Chess: Improving Next Move Generation in Chess Language Models via Reinforcement Learning. [Laurea], Università di Bologna, Corso di Studio in Ingegneria e scienze informatiche [L-DM270] - Cesena. Restricted-access document.
Full-text documents available:
PDF document (Thesis)
Full text not accessible until 17 September 2026.
Available under licence: Creative Commons: Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0)


Abstract

In recent decades, chess has served as a cornerstone for artificial intelligence research, from Deep Blue's 1997 victory over Garry Kasparov to the dominance of engines such as Stockfish and Leela Chess Zero. While these systems rely on brute-force search and carefully engineered evaluations, a new line of research treats chess as a symbolic language to be modeled and generated. However, current chess language models frequently produce illegal Standard Algebraic Notation (SAN), such as invalid moves or false checkmates, which undermines their reliability. This thesis addresses this challenge by training a chess language model with reinforcement learning (RL) using Group Relative Policy Optimization (GRPO). The training corpus consists of elite over-the-board and online games by grandmasters including Magnus Carlsen and Hikaru Nakamura. We design and evaluate reward functions at two levels: syntax-focused rewards, which only assess the formal correctness of SAN, and board-based rewards, which verify whether moves are executable within the actual game state. Our experiments show that GRPO substantially reduces the frequency of illegal moves, with the best-performing models being those trained under board-based rewards. These models exhibit improved board consistency but also a higher draw rate and a reduced win rate. Compared to a model trained with supervised fine-tuning (SFT), GRPO training achieves superior legality but lower playing strength. Applying GRPO on top of SFT further increases legality, though the win rate remains below that of the base model. Overall, this work demonstrates that RL can significantly improve the syntactic and semantic validity of chess language models, though a trade-off emerges between move legality and competitive performance.
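To illustrate the distinction between the two reward levels described above, the following is a minimal Python sketch: a syntax-focused check that only inspects the SAN string, and a board-based check that verifies the move against the current position using the python-chess library. The regex, reward values, and function names are illustrative assumptions, not the thesis's actual implementation.

```python
import re
import chess  # python-chess

# Rough SAN pattern (assumption): piece moves, pawn moves/captures, promotions,
# castling, optional check/checkmate suffix. Real SAN grammars are stricter.
SAN_PATTERN = re.compile(
    r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$"
)

def syntax_reward(move_text: str) -> float:
    """Syntax-level reward: does the generated string look like well-formed SAN?"""
    return 1.0 if SAN_PATTERN.match(move_text.strip()) else 0.0

def board_reward(move_text: str, board: chess.Board) -> float:
    """Board-level reward: is the move actually playable in the current position?"""
    try:
        board.parse_san(move_text.strip())  # raises ValueError if invalid or illegal
        return 1.0
    except ValueError:
        return 0.0

# Example: after 1.e4, "e5" passes both checks, "Ke5" is well-formed SAN but
# illegal on the board, and "xyz" fails both.
if __name__ == "__main__":
    board = chess.Board()
    board.push_san("e4")
    for candidate in ["e5", "Ke5", "xyz"]:
        print(candidate, syntax_reward(candidate), board_reward(candidate, board))
```

In a GRPO setup, per-move scores of this kind would be aggregated into a scalar reward for each sampled completion in a group; the exact weighting used in the thesis is not reproduced here.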

Document type: Bachelor's thesis (Laurea)
Thesis author: Calzolari, Francesco Teo
Thesis supervisor:
Thesis co-supervisor:
School:
Degree programme: Ingegneria e scienze informatiche [L-DM270] - Cesena
Degree programme regulations: DM270
Keywords: Chess Language Model, Reinforcement Learning, Group Relative Policy Optimization, Reward Function Engineering, Natural Language Processing
Thesis defence date: 2 October 2025
URI:
