Development of an Artificial Intelligence-based Solution for Document Processing Automation Using Machine Learning and NLP Techniques

Masa, Biniam Abraha (2023) Development of an Artificial Intelligence-based Solution for Document Processing Automation Using Machine Learning and NLP Techniques. [Master's degree thesis], Università di Bologna, Degree programme in Artificial intelligence [LM-DM270]

Full-text documents available:

PDF document (Thesis)
Available under license: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed for strictly personal purposes of study, research, and teaching; any direct or indirect commercial use is expressly prohibited. All other rights to the material are reserved.

Abstract

This thesis focuses on Intelligent Document Processing (IDP), which aims to automate document-processing activities using Artificial Intelligence technologies, particularly Machine Learning and Natural Language Processing techniques. The proposed solution seeks to improve the efficiency and quality of document processing in many business and organizational contexts by automating tasks such as classification, information extraction, validation, and verification of consistency between documents. The work comprises the following phases: text identification, OCR, invoice data extraction, and quality assurance. For document files, data extraction is done in the first phase. The thesis details the IDP solution developed, analyses the processing results and the quality of the extracted information, and evaluates the accuracy and efficiency of the system. It concentrates on extracting information from key fields of invoices using two different methods based on sequence labeling. Invoices are unstructured documents in which data can be located based on context. Both methods are expected to perform well on the document templates they were trained on, but processing new templates often requires new manual annotations, made for example with the Prodigy tool, and producing such labeled data is tedious and time-consuming. The thesis presents a set of experiments with neural network methods that examine the trade-off between data requirements and performance when extracting key invoice fields (such as invoice date, invoice number, order number, amount, and supplier's name). The main contribution is a system that achieves competitive results with a small amount of data, compared with state-of-the-art systems that need to be trained on large datasets, by using a custom Named Entity Recognition (NER) model to extract the relevant information from non-uniform commercial invoice formats.
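
To illustrate what sequence labeling means for invoice fields, the sketch below turns per-token BIO tags into field values. It is not taken from the thesis: the function name `decode_bio`, the label names (`INVOICE_NUMBER`, `INVOICE_DATE`, `AMOUNT`), and the sample tokens are all hypothetical, and the tags are assumed to come from some already trained NER model.

```python
from typing import Dict, List


def decode_bio(tokens: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group tokens carrying B-/I- tags into field values (illustrative only)."""
    fields: Dict[str, List[str]] = {}
    label, span = None, []
    # A trailing "O" sentinel guarantees that the last open span is closed.
    for token, tag in zip(tokens + [""], tags + ["O"]):
        if tag.startswith("I-") and tag[2:] == label:
            span.append(token)  # continue the currently open field
            continue
        if label is not None:
            # Any other tag ends the open field, so store its text.
            fields.setdefault(label, []).append(" ".join(span))
        if tag.startswith("B-"):
            label, span = tag[2:], [token]  # start a new field
        else:
            # "O", or an I- tag with no matching open field, is dropped.
            label, span = None, []
    return fields


# Hypothetical tokens and tags, as a trained tagger might produce them.
example_tokens = ["Invoice", "No", "INV-1042", "dated", "12", "March", "2023",
                  "Total", "1,250.00", "EUR"]
example_tags = ["O", "O", "B-INVOICE_NUMBER", "O", "B-INVOICE_DATE",
                "I-INVOICE_DATE", "I-INVOICE_DATE", "O", "B-AMOUNT", "I-AMOUNT"]

print(decode_bio(example_tokens, example_tags))
```

The decoding step is independent of the model that predicts the tags, so the same function could sit behind either of two sequence-labeling methods when their outputs are compared field by field.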

Document type

Degree thesis (Master's degree)

Thesis author

Masa, Biniam Abraha

Thesis supervisor

Sartori, Claudio

Thesis co-supervisor

Magrini, Alex ; Aspromonte, Marco

School

Engineering and Architecture

Degree programme