A Comparison between LLMs and SLMs for Document Processing in the Insurance Sector

Turrini, Alice (2025) A Comparison between LLMs and SLMs for Document Processing in the Insurance Sector. [Laurea magistrale], Università di Bologna, Degree Programme in Artificial Intelligence [LM-DM270]
Full-text document available:
PDF document (Thesis), 6 MB. Available under licence: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

Abstract

This thesis compares the feasibility and performance of state-of-the-art large language models (LLMs) and smaller language models (SLMs) for document classification and data extraction in a real-world scenario. The research focuses on the development of a robust document processing pipeline that starts from the raw PDF and encompasses all the steps needed to obtain a structured format suitable for classification and subsequent metadata extraction. Modern techniques are integrated throughout the pipeline to ensure efficiency and scalability. The project leverages a dataset of over 8,000 documents, including both labeled and pseudo-labeled data, in the medical and administrative domains. Specifically, the study compares an advanced LLM, GPT-4o, against smaller language models, BERT and LLaMA 3.2, for document classification and key metadata extraction. Key challenges addressed include the efficient extraction of meaningful information from complex domain documents, the optimization of model performance for both classification and extraction tasks, and the scalability of the proposed methods. A central focus of this research is identifying the optimal balance between model size and performance, explored by fine-tuning smaller models, applying techniques such as knowledge distillation and model quantization, and comparing their results with those of larger models. The results suggest that fine-tuning small language models for specific tasks can achieve performance comparable to, and in some cases surpassing, that of LLMs, especially when model size and computational efficiency are taken into account. These findings offer valuable insights for the increasingly relevant choice between LLM-based and SLM-based solutions, taking into consideration aspects such as performance, deployment, privacy, personalization, and cost.
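As an illustration of the SLM side of the comparison described above, the following minimal sketch shows how a small model such as BERT could be fine-tuned for document-type classification with the Hugging Face transformers library. This is not the author's code: the checkpoint, label set, and dataset files are assumptions made purely for the example.

    # Illustrative sketch only: fine-tuning a small model (BERT) for
    # document classification. Checkpoint, labels, and CSV files are
    # hypothetical placeholders, not artefacts from the thesis.
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    labels = ["medical_report", "invoice", "claim_form"]  # hypothetical classes
    checkpoint = "bert-base-multilingual-cased"           # assumed checkpoint
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=len(labels))

    # Hypothetical dataset: "text" holds text extracted from the PDFs,
    # "label" holds an integer class index.
    ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
    ds = ds.map(lambda b: tok(b["text"], truncation=True,
                              padding="max_length", max_length=512),
                batched=True)

    args = TrainingArguments(output_dir="doc-clf", num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=ds["train"],
            eval_dataset=ds["test"]).train()

The same fine-tuned classifier could then be compared against a prompted LLM on the held-out split, which is the kind of size-versus-performance trade-off the thesis investigates.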

Document type: Thesis (Laurea magistrale)
Thesis author: Turrini, Alice
Thesis supervisor:
Thesis co-supervisor:
School:
Degree programme: Artificial Intelligence [LM-DM270]
Degree programme regulations: DM270
Keywords: Document Classification, Metadata Extraction, Large Language Models (LLMs), Small Language Models (SLMs), GPT-4o, Document Processing Pipeline, Model Fine-Tuning, Knowledge Distillation, Model Quantization, Computational Efficiency, Scalability, Medical and Administrative Documents, Real-World Case Scenario, Insurance Sector
Thesis defence date: 25 March 2025