Hierarchical Multi-Label Text Classification in a Low-Resource Setting

Lavista, Andrea (2022) Hierarchical Multi-Label Text Classification in a Low-Resource Setting. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence [LM-DM270]
Full-text document available: PDF (Thesis), released under a Creative Commons Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0) licence.

Abstract

In this thesis we address a hierarchical multi-label text classification problem in a low-resource setting and explore different approaches to identify the best one for our case. The goal is to train a model that classifies English school exercises according to a hierarchical taxonomy using only a small amount of labeled data. The experiments in this work employ different machine learning models and text representation techniques: CatBoost with tf-idf features, classifiers based on pre-trained models (mBERT, LASER), and SetFit, a framework for few-shot text classification. SetFit proved to be the most promising approach, achieving better performance than the other methods when only a few labeled examples per class are available during training. However, this thesis does not consider the full hierarchical taxonomy, only its first two levels: to address classification at the third level, further experiments should be carried out, exploring methods for zero-shot text classification, data augmentation, and strategies that exploit the hierarchical structure of the taxonomy during training.
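
For illustration only, the sketch below shows one way the few-shot multi-label setup mentioned in the abstract could look with the SetFit library (circa-2022 API). The checkpoint name, the two toy exercise sentences, and the three-class multi-hot label vectors are placeholders chosen for this example, not the configuration used in the thesis.

    from datasets import Dataset
    from sentence_transformers.losses import CosineSimilarityLoss
    from setfit import SetFitModel, SetFitTrainer

    # Hypothetical few-shot training set: a handful of labelled school exercises,
    # each tagged with a multi-hot vector over (here, three) taxonomy classes.
    train_ds = Dataset.from_dict({
        "text": [
            "Fill in the blanks with the correct past tense of the verb.",
            "Read the passage and answer the comprehension questions.",
        ],
        "label": [
            [1, 0, 0],
            [0, 1, 1],
        ],
    })

    # "one-vs-rest" attaches an independent binary head per class,
    # which is what multi-label classification requires.
    model = SetFitModel.from_pretrained(
        "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        multi_target_strategy="one-vs-rest",
    )

    trainer = SetFitTrainer(
        model=model,
        train_dataset=train_ds,
        loss_class=CosineSimilarityLoss,  # contrastive fine-tuning of the sentence encoder
        num_iterations=20,                # text pairs generated per labelled example
        batch_size=16,
        num_epochs=1,
    )
    trainer.train()

    # Predict multi-hot label vectors for unseen exercises.
    print(model.predict(["Match each word with its definition."]))

The same toy data could instead be fed to a tf-idf + CatBoost baseline through a one-vs-rest wrapper; as the abstract notes, SetFit's advantage shows up mainly when only a few labelled examples per class are available.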

Document type: Master's thesis (Laurea magistrale)
Thesis author: Lavista, Andrea
Degree programme: Artificial Intelligence [LM-DM270]
Programme regulation (Ordinamento CdS): DM270
Keywords: natural language processing, text classification, multi-label classification, hierarchical classification, multi-label text classification, hierarchical text classification, few-shot learning, few-shot text classification, low-resource setting, pre-trained models, contextual embedding, sentence embedding, task-adaptive pre-training, domain adaptation, multilingual, BERT, SetFit, LASER, SHAP
Thesis defence date: 6 December 2022