PDF document (Thesis)
Full text not accessible until 5 May 2028. Available under licence: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal study, research, and teaching purposes; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved. File size: 734 kB.
Abstract
This thesis addresses a key challenge in enterprise Retrieval-Augmented Generation: how to combine the scalability and robustness of lexical search with the semantic understanding required by natural-language interaction, without the infrastructural cost of full dense retrieval. The work is conducted in collaboration with Musixmatch on a large-scale music catalog. The main contribution is the design and validation of a hybrid lexical–semantic retrieval architecture in which semantic reranking is reframed not as a simple post-processing improvement, but as the core component that bridges the gap between lexical retrieval and user intent. The system combines query translation, lexical candidate generation, dense-signal document construction, semantic reranking, and offline semantic enrichment modules such as Named Entity Recognition. The thesis also introduces a scalable evaluation methodology based on synthetic queries over real enterprise documents with LLM-based relevance annotation. Experimental results show that semantic reranking significantly improves retrieval quality over both lexical and dense baselines. In particular, lightweight pointwise rerankers retain strong effectiveness with low latency and zero variable cost, while avoiding the aggregate context-window constraints of listwise reranking. The results further show that document construction is a critical architectural factor: structured semantic representations improve reranking and, in some downstream RAG settings, outperform configurations including full lyrical content. Reranked retrieval more than doubles the proportion of retrieved documents effectively used during answer generation. Overall, the thesis proposes a practical, modular blueprint for enterprise-grade AI search systems, showing that semantic capability can be added to existing lexical infrastructures through reranking, while preserving scalability, improving effectiveness, and enabling zero-migration semantic modernization.
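The retrieve-then-rerank flow named in the abstract (structured document construction, lexical candidate generation, pointwise semantic reranking) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the catalog records, field names, IDF-weighted overlap scorer, and synonym-based `toy_semantic_score` are all hypothetical stand-ins chosen so the example runs without external libraries. In a real system the lexical stage would typically be a search engine and the scorer a trained reranking model.

```python
import math
import re

# Hypothetical stopword list, synonym map, and catalog: illustrative only.
STOPWORDS = {"a", "about", "of", "the"}
SYNONYMS = {"melancholy": {"sad"}, "sea": {"ocean"}}
CATALOG = [
    {"id": 1, "title": "Song of the Harbor", "artist": "The Tidewalkers",
     "mood": ["sad"], "theme": ["ocean"]},
    {"id": 2, "title": "Song About Nothing", "artist": "Plain Static",
     "mood": ["happy"], "theme": ["city"]},
    {"id": 3, "title": "Sea Legs", "artist": "Mara Voss",
     "mood": ["upbeat"], "theme": ["sailing"]},
]


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_document(record):
    """Document construction: flatten structured fields into one text."""
    return (f"title: {record['title']} | artist: {record['artist']} | "
            f"mood: {' '.join(record['mood'])} | theme: {' '.join(record['theme'])}")


def lexical_candidates(query, docs, k):
    """Stage 1: IDF-weighted term overlap (a simple stand-in for BM25)."""
    n = len(docs)
    doc_tokens = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
    df = {}
    for tokens in doc_tokens.values():
        for t in tokens:
            df[t] = df.get(t, 0) + 1
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, tokens in doc_tokens.items():
        score = sum(math.log(1 + n / df[t]) for t in query_terms if t in tokens)
        if score > 0:
            scored.append((doc_id, score))
    scored.sort(key=lambda pair: -pair[1])  # stable: ties keep catalog order
    return [doc_id for doc_id, _ in scored[:k]]


def toy_semantic_score(query, doc_text):
    """Stand-in for a learned reranker: synonym-expanded overlap ratio."""
    terms = set(tokenize(query))
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    if not expanded:
        return 0.0
    return len(expanded & set(tokenize(doc_text))) / len(expanded)


def pointwise_rerank(query, candidate_ids, docs, score_fn):
    """Stage 2: score each (query, document) pair independently, then sort."""
    return sorted(candidate_ids, key=lambda doc_id: -score_fn(query, docs[doc_id]))


docs = {rec["id"]: build_document(rec) for rec in CATALOG}
query = "melancholy song about the sea"
lexical = lexical_candidates(query, docs, k=3)
reranked = pointwise_rerank(query, lexical, docs, toy_semantic_score)
print("lexical order: ", lexical)
print("reranked order:", reranked)
```

In this toy setup the lexical stage ranks "Sea Legs" first because it shares the rare term "sea" with the query, while the reranker promotes the record whose mood and theme fields match the query's intent. Because each candidate is scored independently, the reranker never needs all candidates in a single context, which is the property the abstract attributes to pointwise (as opposed to listwise) reranking.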
