Automated Classification of Multilingual User Feedback in Fitness Applications Through Hybrid Retrieval and Large Language Models

Isomurodov, Javokhir (2026) Automated Classification of Multilingual User Feedback in Fitness Applications Through Hybrid Retrieval and Large Language Models. [Laurea magistrale (master's degree thesis)], Università di Bologna, degree programme in Digital transformation management [LM-DM270] - Cesena.
The full text is not available, by the author's choice.

Abstract

The proliferation of connected fitness applications generates large volumes of unstructured, multilingual user feedback that must be classified in real time. Traditional approaches such as manual review and keyword-based rules fail to scale with growing user bases and cannot capture the semantic nuance of text in which users describe software bugs without explicit technical terminology. This thesis presents a hybrid classification pipeline combining vector database retrieval with Large Language Models (LLMs) to automatically detect software bugs in user-submitted feedback. The architecture follows a Retrieval-Augmented Generation (RAG) paradigm: incoming feedback is vectorised and queried against a Pinecone vector database of historically classified entries; if a sufficiently similar entry is found, its label is inherited directly; otherwise, the feedback is routed to an LLM ensemble (Gemini and GPT-4o) for context-aware semantic classification. A multilingual data engineering pipeline handles normalisation, language detection, and deduplication of the raw corpus. Extensive experiments compare four embedding models, multiple k-NN voting strategies, two distance metrics, and two LLM backends on a production dataset of 17,822 feedback entries from the Technogym App. Results demonstrate that the hybrid pipeline achieves an F1 score of 0.74 and an Average Precision of 0.84, outperforming standalone LLM classification by 0.54 F1 points (F1: 0.74 vs. 0.20) and improving recall by 0.11 over standalone vector search (recall: 0.70 vs. 0.58). Because 69% of feedback is resolved through the retrieval path, which requires no LLM call, LLM API calls are reduced by the same 69%.
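The routing rule described in the abstract (inherit the label of a sufficiently similar historical entry, otherwise fall back to an LLM) can be illustrated with a minimal sketch. It is not the thesis implementation: the full text is unavailable, so the in-memory list standing in for the Pinecone index, the single-nearest-neighbour rule (the thesis compares several k-NN voting strategies), the cosine metric, the 0.9 threshold, and all names (`LabeledEntry`, `classify`, `llm_fallback`) are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class LabeledEntry:
    """A historically classified feedback entry (hypothetical structure)."""
    vector: Sequence[float]
    is_bug: bool


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(
    text: str,
    vector: Sequence[float],
    index: List[LabeledEntry],
    llm_fallback: Callable[[str], bool],
    threshold: float = 0.9,  # assumed value, not taken from the thesis
) -> Tuple[bool, str]:
    """Return (is_bug, route), where route is 'retrieval' or 'llm'."""
    best = None
    best_score = -1.0
    for entry in index:
        score = cosine_similarity(vector, entry.vector)
        if score > best_score:
            best, best_score = entry, score
    # Retrieval path: inherit the label of a sufficiently similar entry.
    if best is not None and best_score >= threshold:
        return best.is_bug, "retrieval"
    # Fallback path: delegate to the (stubbed) LLM classifier.
    return llm_fallback(text), "llm"


# Toy two-dimensional "embeddings"; real embeddings have hundreds of dimensions.
index = [LabeledEntry([1.0, 0.0], True), LabeledEntry([0.0, 1.0], False)]
near = classify("app crashes on login", [0.99, 0.05], index, lambda t: False)
far = classify("nice workout plans", [0.7, 0.7], index, lambda t: False)
print(near, far)
```

In this toy run the first query lies close to a stored bug report and inherits its label without calling the fallback, while the second is not close enough to either stored entry and is passed to the stubbed LLM function. Raising the threshold sends more traffic to the LLM path; lowering it increases the share of feedback resolved by retrieval alone.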

Document type: Degree thesis (Laurea magistrale)
Thesis author: Isomurodov, Javokhir
Degree programme: Digital transformation management [LM-DM270] - Cesena
Degree regulations: DM270
Keywords: Large Language Models, Retrieval-Augmented Generation, Vector Databases, Feedback Classification, Natural Language Processing
Thesis defence date: 19 March 2026