Bonetti, Lorenzo
(2021)
Design and implementation of a real-world search engine based on Okapi BM25 and SentenceBERT.
[Master's degree thesis], Università di Bologna, Degree Programme in
Artificial intelligence [LM-DM270]
Full-text documents available:
PDF document (Thesis)
Available under licence: Unless the author grants broader permissions, the thesis may be freely consulted, and a single copy may be saved and printed strictly for personal purposes of study, research and teaching; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.
Abstract
This thesis presents a hybrid model for a search engine in a real-world application. The project was part of an internship carried out at a startup working on Knowledge Management and Artificial Intelligence. The aim of the internship was to improve the existing search engine and build a new system for a future web application use case. An in-depth study of the limitations of keyword search alone, and of semantic search, revealed the need to move from a purely keyword-based information retrieval system to a hybrid model that uses both keyword search and semantic search. In particular, the old system relied on a TF-IDF-based algorithm, while the final model tries to overcome the limits of keyword search by combining Okapi BM25, a probabilistic information retrieval approach, with newer semantic search models based on SentenceBERT. The models and the implemented algorithm make extensive use of recent Information Retrieval techniques such as lexical search, similarity search, query expansion, document expansion and automatic question generation. The data used to test the models came from a banking dataset belonging to one of the company's clients, previously created for an Information Retrieval chatbot. The experiments led to a final model that improves search performance over both keyword search alone and pure semantic search.
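As an illustration only, the idea of combining a lexical ranker with a semantic one can be sketched as a weighted sum of normalized scores. This is not taken from the thesis: the weighted-sum fusion, the min-max normalization, the `alpha` weight, the toy documents and the hard-coded semantic similarities (standing in for the output of a sentence-embedding model) are all assumptions made for the example. Only the BM25 formula itself, with the commonly used `k1` and `b` parameters and a smoothed IDF, is standard.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document in `docs` for `query`."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(lexical, semantic, alpha=0.5):
    """Weighted sum of normalized lexical and semantic scores (assumed fusion)."""
    lex, sem = min_max(lexical), min_max(semantic)
    return [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]


# Toy corpus, hypothetical and unrelated to the thesis dataset.
docs = [
    "how to open a bank account",
    "credit card fees and limits",
    "reset your online banking password",
]
lexical = bm25_scores("open account", docs)
# Hypothetical cosine similarities, standing in for a sentence-embedding model.
semantic = [0.82, 0.10, 0.35]
combined = hybrid_scores(lexical, semantic, alpha=0.5)
best = max(range(len(docs)), key=combined.__getitem__)
print(docs[best])
```

With `alpha` closer to 1 the ranking leans on exact keyword matches, and closer to 0 it leans on the semantic similarities; how the two signals are actually weighted or merged in the thesis is not stated in this record.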
Document type
Degree thesis
(Master's degree)
Thesis author
Bonetti, Lorenzo
Thesis supervisor
School
Degree programme
Degree programme regulation
DM270
Keywords
Okapi BM25, SentenceBERT, Keyword search, Semantic search, Question generation, Information Retrieval, Document expansion
Thesis defence date
3 December 2021
URI