Fine-Tuning Neural Codec Language Models from Feedback with Reinforcement Learning

Pratesi, Lorenzo (2024) Fine-Tuning Neural Codec Language Models from Feedback with Reinforcement Learning. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree programme in Artificial Intelligence [LM-DM270]
Full-text documents available:
PDF document (Thesis)
Available under license: Creative Commons: Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0)

Download (840kB)

Abstract

Neural codec language models (NCLMs) are speech synthesizers that address the text-to-speech (TTS) task as a language modeling task rather than as continuous signal regression, as in previous work. They have shown impressive generalization capability, surpassing previous state-of-the-art zero-shot TTS models in terms of speaker similarity and naturalness. Although framing speech synthesis as a language modeling task makes it possible, in part, to train on large and diverse speech data crawled from the Internet, it also brings issues similar to those of large language models (LLMs) for text generation. While LLMs may generate outputs with made-up facts or biased and toxic content, neural codec language models suffer from limited synthesis robustness and expressiveness. Reinforcement Learning from Human Feedback (RLHF) has emerged to tackle these issues in LLMs by using human feedback to align the generated responses with user preferences. RLHF has helped reduce the amount of toxic content and false facts that LLMs generate. Motivated by the success of RLHF in the text generation domain, this work proposes to fine-tune NCLMs from feedback with reinforcement learning, following the RLHF training pipeline. We conduct a series of experiments with VALL-E pretrained on LibriTTS, fine-tuning it to optimize different kinds of feedback: intelligibility, naturalness, speaker similarity and waveform duration. Our results show that fine-tuning increases the intelligibility of the model, with a WER reduction of up to 20.954%, and can also change the speech duration according to the reward signal. Finally, we delineate limitations of our experimental setup and propose practical mitigations, to be explored in future work.
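As an illustration of the intelligibility feedback mentioned in the abstract, the following minimal sketch computes a word error rate (WER) between a reference text and an ASR transcript of the synthesized speech, and turns it into a scalar reward. It is a sketch under stated assumptions, not the thesis implementation: the function names `word_error_rate`, `intelligibility_reward` and `relative_reduction` are hypothetical, the negative-WER reward shape is an assumption, and the abstract does not say whether the reported 20.954% reduction is relative or absolute (the helper below computes a relative reduction purely as an example).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def intelligibility_reward(reference: str, transcript: str) -> float:
    """Assumed reward shape: lower WER gives a higher (less negative) reward."""
    return -word_error_rate(reference, transcript)


def relative_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Relative WER reduction in percent (illustrative definition only)."""
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer
```

In an RLHF-style loop, the transcript would come from an ASR model applied to the generated waveform, and the resulting reward would be passed to the policy-gradient update; both components are outside the scope of this sketch.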

Document type
Master's thesis (Laurea magistrale)
Thesis author
Pratesi, Lorenzo
Thesis supervisor
School
Degree programme
Degree programme regulations (Ordinamento CdS)
DM270
Keywords
reinforcement learning, RLHF, feedback, language models, neural codec language models, neural audio codecs, TTS, speech, speech quantization, ASR
Thesis defence date
19 March 2024
URI
