Automated Classification of Multilingual User Feedback in Fitness Applications Through Hybrid Retrieval and Large Language Models

Isomurodov, Javokhir (2026) Automated Classification of Multilingual User Feedback in Fitness Applications Through Hybrid Retrieval and Large Language Models. [Master's degree thesis], Università di Bologna, Degree programme in Digital transformation management [LM-DM270] - Cesena, Full-text document not available

The full text is not available by the author's choice. (Contact the author)

Abstract

The proliferation of connected fitness applications generates large volumes of unstructured, multilingual user feedback that must be classified in real time. Traditional approaches such as manual review and keyword-based rules fail to scale with growing user bases and cannot capture the semantic nuance of text in which users describe software bugs without explicit technical terminology. This thesis presents a hybrid classification pipeline that combines vector database retrieval with Large Language Models (LLMs) to automatically detect software bugs in user-submitted feedback. The architecture follows a Retrieval-Augmented Generation (RAG) paradigm: incoming feedback is vectorised and queried against a Pinecone vector database of historically classified entries; if a sufficiently similar entry is found, its label is inherited directly; otherwise, the feedback is routed to an LLM ensemble (Gemini and GPT-4o) for context-aware semantic classification. A multilingual data engineering pipeline handles normalisation, language detection, and deduplication of the raw corpus. Extensive experiments compare four embedding models, multiple k-NN voting strategies, two distance metrics, and two LLM backends on a production dataset of 17,822 feedback entries from the Technogym App. Results demonstrate that the hybrid pipeline achieves an F1 score of 0.74 and an Average Precision of 0.84, outperforming standalone LLM classification by 0.54 F1 points (F1: 0.74 vs. 0.20) and improving recall by 0.11 over standalone vector search (recall: 0.70 vs. 0.58). Because 69% of feedback is resolved through the cost-free retrieval path, LLM API calls are reduced by the same 69%.
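The routing rule described in the abstract (inherit the label of a sufficiently similar historical entry, otherwise fall back to an LLM) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for the Pinecone index, `stub_llm` stands in for the Gemini/GPT-4o ensemble, and the threshold of 0.8 is a hypothetical value.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(
    feedback: str,
    index: List[Tuple[str, str]],
    llm_fallback: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical similarity cut-off
) -> Tuple[str, str]:
    """Return (label, path), where path is 'retrieval' or 'llm'."""
    query = embed(feedback)
    best_label, best_score = None, 0.0
    # Linear scan over labelled history (stand-in for a vector database query).
    for text, label in index:
        score = cosine(query, embed(text))
        if score > best_score:
            best_label, best_score = label, score
    # Inherit the nearest neighbour's label only if it is similar enough.
    if best_label is not None and best_score >= threshold:
        return best_label, "retrieval"
    # Otherwise route the feedback to the (stubbed) LLM classifier.
    return llm_fallback(feedback), "llm"


def stub_llm(text: str) -> str:
    """Placeholder for an LLM ensemble call; always answers 'bug' here."""
    return "bug"


history = [
    ("app crashes when I start a workout", "bug"),
    ("love the new training plans", "not_bug"),
]

print(classify("app crashes when I start a workout", history, stub_llm))
print(classify("the screen freezes after login", history, stub_llm))
```

In this sketch the first call matches a stored entry and takes the retrieval path, while the second has no close neighbour and is sent to the fallback. The share of inputs resolved on the retrieval path is what determines how many LLM calls are avoided.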

Document type

Degree thesis (Master's degree)

Thesis author

Isomurodov, Javokhir

Thesis supervisor

Moro, Gianluca

Thesis co-supervisor

Molfetta, Lorenzo

School

Engineering and Architecture

Degree programme