Hierarchical Multi-Label Text Classification in a Low-Resource Setting

Lavista, Andrea (2022) Hierarchical Multi-Label Text Classification in a Low-Resource Setting. [Master's degree thesis], Università di Bologna, Degree Programme in Artificial intelligence [LM-DM270]

Full-text documents available:

PDF document (Thesis)
Available under license: Creative Commons Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0)
Download (1MB)

Abstract

In this thesis we address a hierarchical multi-label text classification problem in a low-resource setting and compare several approaches to identify the one best suited to our case. The goal is to train a model that classifies English school exercises according to a hierarchical taxonomy using only a small amount of labeled data. The experiments employ different machine learning models and text representation techniques: CatBoost with tf-idf features, classifiers based on pre-trained models (mBERT, LASER), and SetFit, a framework for few-shot text classification. SetFit proved to be the most promising approach, performing better than the alternatives when only a few labeled examples per class are available during training. However, this thesis does not cover the entire hierarchical taxonomy, only its first two levels. Classifying at the third level would require further experiments exploring zero-shot text classification methods, data augmentation, and strategies that exploit the hierarchical structure of the taxonomy during training.

Document type

Degree thesis (Master's degree)

Thesis author

Lavista, Andrea

Thesis supervisor

Torroni, Paolo

Thesis co-supervisor

Savino, Giuseppe

School

Engineering and Architecture

Degree programme

Artificial intelligence [LM-DM270]

Degree programme regulations

DM270

Keywords

natural language processing, text classification, multi-label classification, hierarchical classification, multi-label text classification, hierarchical text classification, few-shot learning, few-shot text classification, low-resource setting, pre-trained models, contextual embedding, sentence embedding, task-adaptive pre-training, domain adaptation, multilingual, BERT, SetFit, LASER, SHAP

Thesis defense date

6 December 2022

URI

https://amslaurea.unibo.it/id/eprint/27453
