End-to-End Vision Language Processing of Identity Documents: Detection, Classification and Extraction

Vannucchi, Matteo (2024) End-to-End Vision Language Processing of Identity Documents: Detection, Classification and Extraction. [Master's degree thesis], Università di Bologna, Degree programme in Artificial intelligence [LM-DM270], Full-text document not available

The full text is not available, by the author's choice.

Abstract

The efficient processing of identity documents is essential in sectors such as finance, healthcare and government, yet traditional manual methods are often slow and error-prone. This thesis develops an end-to-end pipeline that combines object detection, image classification and vision-language models to automate the processing of identity document images, covering detection, classification and data extraction. The work was carried out in two phases. The first, conducted during an industry internship, focused on detection and classification using state-of-the-art models such as RTMDet, YOLOX and EfficientNet, achieving near-perfect accuracy across various document types. The second, carried out in an academic setting, used vision-language models such as Donut, Qwen2 and LayoutLMv3 to extract structured information directly from document images. These models demonstrated high accuracy in text extraction tasks, even in challenging real-world scenarios. By integrating object detection, classification and vision-language approaches, this thesis highlights the potential of combining visual and linguistic processing for scalable, efficient and precise document analysis.

Document type

Degree thesis (Master's degree)

Thesis author

Vannucchi, Matteo

Thesis supervisor

Moro, Gianluca

School

Ingegneria e Architettura

Degree programme

Artificial intelligence [LM-DM270]

Degree programme regulations