Seraj, Kiana (2024)
Optimization of deep learning-based protein representations.
[Laurea magistrale], Università di Bologna, Corso di Studio in Physics [LM-DM270]
Full-text document not available: the full text is withheld at the author's request (contact the author).
Abstract
Proteins are essential macromolecules that play an important role in living organisms. Their diverse roles make the understanding and annotation of protein functions a critical area of research. However, traditional experimental approaches to annotating protein functions require significant time and resources. Recent advances in large pretrained protein language models (PLMs) have improved protein function and structure prediction from sequences via transfer learning, in which representations from PLMs are repurposed for downstream tasks. The representations derived from PLMs are high-dimensional vectors that, for some downstream training tasks, require dimensionality reduction. Currently, the predominant methods for this purpose are mean and max pooling; however, their effectiveness lacks a clear justification. This thesis explores multiple strategies for compactly representing protein sequences and evaluates their performance on the downstream task of predicting Gene Ontology (GO) functions. The experiments demonstrate that GO prediction benefits from protein language model-based representations. In particular, these representations maintain consistent evaluation results even when models are trained on only 60% of the protein sequence representation rather than the entire sequence. The study finds that mean pooling outperforms max pooling in this setting. However, when feature reduction layers are integrated into the downstream model architecture, the input representations become fine-tuned to the task, leading to equal performance for models trained on representations from either pooling method. Interestingly, representations derived from an improved protein language model architecture do not enhance downstream performance; in fact, they perform worse than the low-level features learned early in pre-training, indicating the need for more effective pre-training of PLMs.
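To make the pooling step concrete, the following is a minimal sketch of how per-residue PLM embeddings are reduced to a single fixed-size vector per protein by mean or max pooling. The array shapes, function names, and the NumPy implementation are illustrative assumptions, not the code used in the thesis.

```python
# Minimal sketch of the two pooling strategies compared in the abstract,
# assuming a per-residue embedding matrix of shape (seq_len, dim) has already
# been extracted from a protein language model. Names and shapes are
# illustrative, not the thesis implementation.
import numpy as np

def mean_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Average over the sequence axis: one fixed-size vector per protein."""
    return residue_embeddings.mean(axis=0)

def max_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Element-wise maximum over the sequence axis."""
    return residue_embeddings.max(axis=0)

# Example: a hypothetical protein of 350 residues embedded in 1280 dimensions
# (1280 is a typical per-residue embedding size for ESM-style PLMs).
per_residue = np.random.rand(350, 1280)
print(mean_pool(per_residue).shape)  # (1280,)
print(max_pool(per_residue).shape)   # (1280,)
```

Either way, a variable-length sequence of per-residue vectors becomes one fixed-length vector that can be fed to a downstream GO-prediction classifier.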
Document type: Degree thesis (Laurea magistrale)
Thesis author: Seraj, Kiana
Thesis supervisor:
Thesis co-supervisor:
School:
Degree programme:
Curriculum: Applied Physics
Degree programme regulation (Ordinamento CdS): DM270
Keywords: Protein language models, Pooling, Transfer learning, Computational biology, Gene Ontology
Thesis defence date: 20 December 2024
URI: