Vannucchi, Matteo
(2024)
End-to-End Vision Language Processing of Identity Documents: Detection, Classification and Extraction.
[Master's degree thesis], Università di Bologna, Degree Programme in
Artificial intelligence [LM-DM270], Full-text document not available
The full text is not available by the author's choice.
Abstract
The efficient processing of identity documents is essential in sectors such as finance, healthcare and government. Traditional manual methods are often slow and error-prone. This thesis develops an end-to-end pipeline that integrates object detection, image classification and vision-language models to automate identity document processing: detecting documents in images, classifying them and extracting their data. The work was conducted in two phases. The first, carried out during an industry internship, focused on detection and classification using state-of-the-art models such as RTMDet, YOLOX and EfficientNet, achieving near-perfect accuracy across various document types. The second, carried out in an academic setting, used vision-language models such as Donut, Qwen2 and LayoutLMv3 to extract structured information directly from document images. These models demonstrated remarkable accuracy in text extraction tasks, even in challenging real-world scenarios. By integrating object detection, classification and vision-language approaches, this thesis highlights the potential of combining visual and linguistic processing for scalable, efficient and precise document analysis.
Document type
Degree thesis
(Master's degree)
Thesis author
Vannucchi, Matteo
Thesis supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Document Analysis, Document Understanding, Object Detection, Image Classification, Vision Language Model
Thesis defense date
5 December 2024
URI