Predicting protein functions with ensemble deep learning and Protein Language Models

Fuschi, Marcello (2023) Predicting protein functions with ensemble deep learning and Protein Language Models. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree Programme in Artificial intelligence [LM-DM270], full-text document not available
The full text is not available, by the author's choice. (Contact the author)

Abstract

The exponential growth in the number of discovered protein sequences has made large-scale automated protein function prediction (AFP) increasingly challenging. Since a protein is typically associated with many Gene Ontology terms, AFP can be framed as a complex, large-scale multi-label classification problem. In this work, we propose a solution based on recently developed Protein Language Models (PLMs), specifically ESM-2, a state-of-the-art model by Meta AI that had not previously been used for the AFP task. PLMs are able to generate highly informative protein sequence embeddings. Our study also fills a gap in the literature regarding the performance comparison between neural network approaches and more efficient methods, such as those based on the cosine similarity between embeddings. We propose a Stacked Ensemble model that combines predictions from five classifiers, revealing insights into the capabilities of different predictive techniques. The proposed method achieves results that surpass the winning solutions of the CAFA3 challenge: the best CAFA3 method achieves F-max values of (0.580, 0.370, 0.687) on the three sub-ontologies of Gene Ontology, while our method scores (0.594, 0.493, 0.722) when evaluated on the same dataset. A comparative analysis is included to assess the contribution of the various components to the overall result.
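
The F-max values quoted in the abstract refer to the protein-centric metric used in the CAFA challenges: the maximum, over all score thresholds, of the harmonic mean of average precision and average recall. The following is a minimal sketch of that metric, not the thesis's evaluation code. The function name `f_max`, the threshold grid in steps of 0.01, and the dense NumPy array layout are illustrative assumptions, and the sketch assumes that annotations and predictions have already been propagated through the Gene Ontology hierarchy.

```python
import numpy as np


def f_max(y_true, y_score, thresholds=None):
    """Protein-centric F-max in the style of the CAFA evaluation.

    y_true:  (n_proteins, n_terms) binary array of annotated GO terms.
    y_score: (n_proteins, n_terms) array of predicted scores in [0, 1].
    Returns the best F1 over the threshold grid (0.0 if nothing scores).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 0.99.
        thresholds = np.arange(0.01, 1.0, 0.01)

    # Only proteins with at least one true annotation are benchmark proteins.
    keep = y_true.sum(axis=1) > 0
    y_true, y_score = y_true[keep], y_score[keep]
    if y_true.shape[0] == 0:
        return 0.0
    n_true = y_true.sum(axis=1)

    best = 0.0
    for t in thresholds:
        y_pred = y_score >= t
        n_pred = y_pred.sum(axis=1)
        n_hit = (y_pred & y_true).sum(axis=1)

        # Precision is averaged only over proteins with >= 1 prediction at t.
        covered = n_pred > 0
        if not covered.any():
            continue
        precision = np.mean(n_hit[covered] / n_pred[covered])
        # Recall is averaged over all benchmark proteins.
        recall = np.mean(n_hit / n_true)

        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

In practice the metric would be computed separately for each of the three sub-ontologies, which is why the abstract reports a triple of values for each method.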

Document type
Master's thesis (Laurea magistrale)
Thesis author
Fuschi, Marcello
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Protein Function Prediction, Machine Learning, Deep Neural Networks, Protein Language Embedding, Gene Ontology
Thesis defence date
21 October 2023
URI
