Smart Extraction of Structured Data from Unstructured Documents

Friberg, Elia (2024) Smart Extraction of Structured Data from Unstructured Documents. [Bachelor's thesis (Laurea)], Università di Bologna, Degree Programme in Computer Science (Informatica) [L-DM270]
Full-text documents available:
PDF document (Thesis)
Available under license: Unless the author has granted broader permissions, the thesis may be freely consulted, and a single copy may be saved and printed strictly for personal purposes of study, research, and teaching; any direct or indirect commercial use is expressly prohibited. All other rights to the material are reserved.

Abstract

This thesis presents DataDig, a mobile application designed to address the growing need for efficient and accurate data extraction from documents and images. DataDig combines Optical Character Recognition (OCR) with Large Language Models (LLMs) to automate the extraction process, while letting users define fully customizable templates that retrieve specific information without the need for model training or example documents. The app’s key features include support for various data types, dynamic table extraction, context-aware interpretation, and direct integration. Testing and user feedback indicate that DataDig extracts information accurately from diverse document types. The mobile-first design, together with the range of options for both the extraction process and the output format and the added cost transparency, makes it a useful tool for individuals, small businesses, and professionals looking to streamline document processing and data management workflows. This research contributes to the field of data extraction by providing a practical solution that uses LLMs to automate a previously time-consuming and error-prone process. The thesis also explores the broader context of today’s AI-powered data extraction: it analyzes ethical implications and the current market landscape, compares DataDig with its main competitors, and discusses potential future developments.
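The template idea described above can be illustrated with a minimal sketch. It is not DataDig's actual implementation: the `Field` class, the `build_prompt` and `parse_response` helpers, the field types, and the prompt wording are all hypothetical, and the LLM reply is hard-coded rather than fetched from a real service. It only shows how a user-defined template could drive a zero-shot extraction prompt built from OCR text, with no training or example documents.

```python
import json
from dataclasses import dataclass


@dataclass
class Field:
    """One entry of a hypothetical extraction template."""
    name: str
    type: str            # "text" or "number" (illustrative types only)
    description: str = ""


def build_prompt(fields, ocr_text):
    """Turn a template and OCR output into a zero-shot extraction prompt."""
    lines = [
        "Extract the following fields from the document text.",
        "Answer with a single JSON object; use null for missing values.",
        "",
        "Fields:",
    ]
    for f in fields:
        lines.append(f"- {f.name} ({f.type}): {f.description}")
    lines += ["", "Document text:", ocr_text]
    return "\n".join(lines)


def parse_response(fields, raw):
    """Parse the model's JSON reply and coerce values to the template types."""
    data = json.loads(raw)
    result = {}
    for f in fields:
        value = data.get(f.name)
        if value is not None and f.type == "number":
            value = float(value)
        result[f.name] = value
    return result


template = [
    Field("invoice_number", "text", "identifier printed on the invoice"),
    Field("total", "number", "total amount due"),
]
prompt = build_prompt(template, "Invoice INV-001\nTotal due: 42.50 EUR")

# Assumption: a real app would send `prompt` to an LLM; the reply is hard-coded here.
reply = '{"invoice_number": "INV-001", "total": "42.50"}'
extracted = parse_response(template, reply)
print(extracted)
```

Keeping the template as plain data means the same two helpers work for any document type; only the list of fields changes.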

Document type
Bachelor's thesis (Laurea)
Thesis author
Friberg, Elia
Thesis supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Data Extraction, Document Processing, Optical Character Recognition, OCR, Large Language Models, LLM, Artificial Intelligence, AI, Mobile Application, Information Retrieval, Document Digitalization, Data Management, Kotlin, Python, Chaquopy, OpenAI, Azure, Azure Document Intelligence, Prompt Engineering
Thesis defense date
30 October 2024
URI
