Small transformers for Bioinformatics tasks

Lorello, Luca Salvatore (2021) Small transformers for Bioinformatics tasks. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree programme in Artificial intelligence [LM-DM270]

Full-text documents available:

PDF document (Thesis)
Available under license: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
Download (355kB)

Abstract

Recent work in bioinformatics is moving toward more modern approaches based on statistical natural language processing and deep learning, yet state-of-the-art neural natural language processing techniques remain relatively unexplored in this domain. Large models can achieve state-of-the-art (SOTA) performance, but a typical bioinformatics lab has limited hardware resources. For this reason, this thesis focuses on small architectures that can be trained in a reasonable amount of time, while trying to limit or even eliminate the performance loss compared to SOTA. In particular, sparse attention mechanisms (such as the one proposed by Longformer) and parameter sharing techniques (such as the one proposed by ALBERT) are jointly explored on two genetic languages: the human genome and the eukaryotic mitochondrial genomes of more than 2000 different species. Contextual embeddings for each token are learned by pretraining on a language understanding task, in both RoBERTa and ALBERT styles, to highlight differences in performance and training efficiency. The learned contextual embeddings are then used to fine-tune one localization task (transcription start sites in human promoters) and two sequence classification tasks (12S metagenomics in fishes and chromatin profile prediction, single-class and multi-class respectively). With the smaller architectures, near-SOTA performance is achieved on all tasks already explored in the literature, and a new SOTA is established for the other tasks. Further experiments with larger architectures consistently improved on the previous SOTA for every task.
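The two ideas named in the abstract, sparse attention and parameter sharing, can be illustrated with a minimal sketch. It is not taken from the thesis: the function names, the window size, and the per-layer parameter count are hypothetical, chosen only for illustration. It covers only the local sliding-window part of Longformer-style attention (no global tokens) and only the cross-layer sharing part of ALBERT (no embedding factorization).

```python
def sliding_window_mask(seq_len, window):
    """Boolean mask: entry [i][j] is True when token i may attend to token j.

    Each token sees at most `window // 2` neighbours on each side (plus itself),
    so the number of allowed pairs grows linearly with seq_len, not quadratically.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


def encoder_params(layers, params_per_layer, shared):
    """Parameter count of the encoder stack.

    With cross-layer sharing, every layer reuses one set of weights,
    so the stack costs as much as a single layer.
    """
    return params_per_layer if shared else layers * params_per_layer


# Hypothetical toy sizes, for illustration only.
mask = sliding_window_mask(seq_len=8, window=2)
print(attended_pairs(mask), "allowed pairs out of", 8 * 8)
print(encoder_params(12, 100, shared=False), "vs", encoder_params(12, 100, shared=True))
```

Under these toy assumptions, the mask keeps a narrow band around the diagonal, and the shared stack stores one layer's weights regardless of depth; both effects reduce memory use, which is the motivation the abstract gives for small architectures.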

Document type

Master's thesis (Laurea magistrale)

Thesis author

Lorello, Luca Salvatore

Thesis supervisor

Torroni, Paolo

Thesis co-supervisor

Galassi, Andrea

School

Engineering and Architecture

Degree programme