The Fate of Interaction after Whisper Decoding Optimization: a Case Study on the KIParla Corpus

Simonotti, Martina (2026) The Fate of Interaction after Whisper Decoding Optimization: a Case Study on the KIParla Corpus. [Master's degree thesis (Laurea magistrale)], Università di Bologna, Degree Program in Specialized translation [LM-DM270] - Forlì
Full-text documents available:
PDF document (Thesis)
Available under license: Unless the author has granted broader permissions, the thesis may be freely consulted, and a single copy may be saved and printed strictly for personal study, research, and teaching purposes. Any direct or indirect commercial use is expressly prohibited. All other rights to the material are reserved.


Abstract

End-to-end Automatic Speech Recognition (ASR) systems such as Whisper achieve high transcription accuracy; however, they are designed to prioritize semantically informative content and consequently suppress short interactional phenomena such as backchannels, repair sequences and filled pauses. These elements are brief, prosodically subtle, may contain truncations or repeated elements and are frequently produced in overlap, making them especially vulnerable to omission or normalization. This thesis examines how spontaneous conversational features are treated in automatic transcriptions of Italian speech, and whether different system configurations affect their representation in the resulting output. To this end, the acoustic model is kept fixed, while decoding parameters are optimized using two objective functions: a standard Word Error Rate (WER)-based pipeline and an Interaction-aware pipeline incorporating event-level weights. A subset of the KIParla corpus was manually annotated to create a gold standard, and ASR outputs were evaluated in terms of global WER, event-level match ratios, substitution and omission patterns, as well as overlap effects. Results show that decoding optimization exerts only a limited influence on overall accuracy. The Interaction-aware configuration does not substantially increase event preservation, but it does not degrade global performance and slightly stabilizes error dispersion in some cases. Recognition patterns emerge as strongly phenomenon-dependent: self-repairs are often preserved in linearized form, whereas backchannels are particularly vulnerable to overlap. Overlapping speech consistently reduces recognition probability across configurations, suggesting that parameter adjustments alone are insufficient to counteract the normalization bias of large-scale ASR systems.
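The abstract does not give the formulas behind the two objective functions, so the following is only a minimal sketch of how a WER-based objective and an event-weighted objective might differ. The word-level WER is the standard edit-distance definition. Everything else is an assumption made for illustration: the function names `weighted_event_miss` and `interaction_aware_score`, the event labels, the weights, the mixing factor `alpha`, and the bag-of-words rule for deciding whether an annotated event survives in the hypothesis. None of these should be read as the thesis's actual pipeline.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Standard Word Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def weighted_event_miss(events: List[Tuple[str, str]],
                        hyp: Sequence[str],
                        weights: Dict[str, float]) -> float:
    """Hypothetical event-level penalty: weighted share of gold events
    whose token cannot be found (and consumed) in the hypothesis."""
    remaining = Counter(hyp)
    total = 0.0
    missed = 0.0
    for label, token in events:
        w = weights.get(label, 1.0)
        total += w
        if remaining[token] > 0:
            remaining[token] -= 1  # each hypothesis token matches one event
        else:
            missed += w
    return missed / total if total else 0.0


def interaction_aware_score(ref: Sequence[str],
                            hyp: Sequence[str],
                            events: List[Tuple[str, str]],
                            weights: Dict[str, float],
                            alpha: float = 0.5) -> float:
    """Hypothetical objective mixing global WER with the event penalty."""
    return (1 - alpha) * wer(ref, hyp) + alpha * weighted_event_miss(
        events, hyp, weights)


# Invented toy utterance, not taken from KIParla.
ref = "mh mh eh allora io vado vado a casa".split()
hyp = "mh allora io vado a casa".split()
events = [("backchannel", "mh"), ("backchannel", "mh"), ("filled_pause", "eh")]
weights = {"backchannel": 2.0, "filled_pause": 1.0}

print(round(wer(ref, hyp), 3))
print(round(interaction_aware_score(ref, hyp, events, weights), 3))
```

In this toy case the hypothesis drops one backchannel, the filled pause, and one repeated word. Plain WER treats all three deletions alike, whereas the weighted score penalizes the lost interactional tokens more heavily, which is the kind of contrast the two pipelines are meant to probe.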

Document type
Degree thesis (Laurea magistrale)
Thesis author
Simonotti, Martina
Thesis supervisor
Thesis co-supervisor
School
Degree program
Curriculum
CURRICULUM TRANSLATION AND TECHNOLOGY
Degree program regulations
DM270
Keywords
Automatic Speech Recognition, Whisper, KIParla, Conversation Analysis, spoken Italian, decoding optimization
Thesis defense date
19 March 2026
URI
