Design and implementation of a real-world search engine based on Okapi BM25 and SentenceBERT

Bonetti, Lorenzo (2021) Design and implementation of a real-world search engine based on Okapi BM25 and SentenceBERT. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence [LM-DM270]
This thesis presents a hybrid model for a real-world search engine application. The project was part of an internship carried out at a start-up working in Knowledge Management and Artificial Intelligence. The aim of the internship was to improve the existing search engine and build a new system for a future web application use case. An in-depth study of the limitations of keyword search alone, and of semantic search, revealed the need to move from a purely keyword-based information retrieval system to a hybrid model that uses both keyword search and semantic search. In particular, the old system relied on a TF-IDF-based algorithm, while the final model tries to overcome the limits of keyword search by combining Okapi BM25, a probabilistic information retrieval approach, with newer semantic search models based on SentenceBERT. The models and the implemented algorithm draw heavily on recent Information Retrieval techniques such as lexical search, similarity search, query expansion, document expansion and automatic question generation. The data used to test the models came from a banking dataset belonging to one of the company's clients, previously created for an Information Retrieval chatbot. Different experiments led to a final model that improves search performance, showing substantial advantages over both keyword search alone and pure semantic search.
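
As an illustration of the general idea of combining a lexical score with a semantic one, the following is a minimal Python sketch and not the implementation described in the thesis. It assumes a standard Okapi BM25 formulation with the common defaults k1 = 1.5 and b = 0.75, min-max normalization of both score lists, and a linear interpolation weight alpha; none of these choices is taken from the thesis. The function names are hypothetical, and the dense vectors stand in for embeddings that a SentenceBERT model would produce in practice.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Rank document indices by an interpolated lexical + semantic score.

    alpha weights the BM25 part; (1 - alpha) weights the embedding part.
    The fusion rule is an assumption, not the thesis's algorithm.
    """
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Toy data: the 2-d vectors are placeholders for real sentence embeddings.
docs = [["open", "bank", "account"], ["credit", "card", "fees"], ["close", "account"]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = hybrid_rank(["credit", "card"], [0.0, 1.0], docs, doc_vecs)
```

With this toy input, the second document ranks first because it wins on both the keyword and the vector side, while the third document is ranked above the first only through its vector similarity, which is the kind of match a purely keyword-based system would miss.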

