Small transformers for Bioinformatics tasks

Lorello, Luca Salvatore (2021) Small transformers for Bioinformatics tasks. [Laurea magistrale], Università di Bologna, Corso di Studio in Artificial intelligence [LM-DM270]
Full-text documents available:
PDF document (Thesis)
Available under licence: Creative Commons: Attribution - ShareAlike 4.0 (CC BY-SA 4.0)

Download (355kB)

Abstract

Recent trends in bioinformatics seek to align the field with modern approaches based on statistical natural language processing and deep learning; however, state-of-the-art neural natural language processing techniques remain relatively unexplored in this domain. Large models can achieve state-of-the-art performance, but a typical bioinformatics lab has limited hardware resources. For this reason, this thesis focuses on small architectures, whose training can be performed in a reasonable amount of time, while trying to limit or even eliminate the performance loss with respect to the state of the art (SOTA). In particular, sparse attention mechanisms (such as the one proposed by Longformer) and parameter-sharing techniques (such as the one proposed by ALBERT) are jointly explored on two genetic languages: the human genome and the eukaryotic mitochondrial genome of more than 2000 species. Contextual embeddings for each token are learned by pretraining on a language-understanding task, in both RoBERTa and ALBERT styles, to highlight differences in performance and training efficiency. The learned contextual embeddings are then exploited for fine-tuning on a localization task (transcription start site in human promoters) and two sequence-classification tasks (12S metagenomics in fishes and chromatin profile prediction, single-class and multi-class respectively). With smaller architectures, near-SOTA performance is achieved on all tasks already explored in the literature, and a new SOTA is established for the remaining tasks. Further experiments with larger architectures consistently improved on the previous SOTA for every task.
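The two efficiency mechanisms named in the abstract can be illustrated with a minimal NumPy sketch (illustrative only, not the thesis code; the function and variable names here are our own): a Longformer-style sliding-window attention mask restricts each token to a local neighbourhood, and ALBERT-style cross-layer parameter sharing reuses one set of layer weights at every depth.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # Longformer-style local attention: token i may attend only to
    # tokens j with |i - j| <= window (itself included).
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(seq_len=8, window=2)
# Dense attention would score 8 * 8 = 64 pairs; the sparse mask keeps
# only the local band, so the cost grows linearly in sequence length.
print(mask.sum())  # -> 34 attended pairs instead of 64

# ALBERT-style cross-layer parameter sharing: a single weight matrix is
# applied at every "layer", so depth no longer multiplies parameter count.
rng = np.random.default_rng(0)
shared_layer = rng.standard_normal((4, 4))
x = np.ones(4)
for _ in range(12):            # 12 layers, one shared weight matrix
    x = np.tanh(x @ shared_layer)
```

In the actual models, the masked positions would receive a score of minus infinity before the softmax, and the shared layer would be a full transformer block rather than a single matrix.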

Document type: Master's thesis (Laurea magistrale)
Thesis author: Lorello, Luca Salvatore
Degree programme regulations (Ordinamento CdS): DM270
Keywords: natural language processing, bioinformatics, metagenomics, transformers, promoter regions, chromatin profiles, BERT, sparse attention, neural networks
Thesis defence date: 21 July 2021