Cichetti, Federico
(2023)
Language Modelling of Source Code using Masked Graph Autoencoders and Graph Neural Networks.
[Master's thesis (Laurea magistrale)], Università di Bologna, degree programme in
Artificial intelligence [LM-DM270], restricted-access document.
Abstract
In a landscape of constantly evolving software and hardware, the mediating role played by compilers has become increasingly complex. Deep learning (DL)-based source code analysis has proven beneficial in supporting compile-time decisions that impact performance on heterogeneous devices. Graph-based representations of source code are particularly appealing, as they express code properties that would otherwise be challenging to identify.
In this thesis, I develop DeepCodeGraph (DCG), a technique for constructing a general graph-based language model (LM), which learns patterns to identify better compilation strategies, optimal hardware configurations and software transformations.
DCG comprises: (i) a large-scale dataset of over 100k graph-based representations of compilable source code files; (ii) a Graph Neural Network (GNN) implementing a flexible graph-based LM; and (iii) a self-supervised training procedure based on the Masked Graph AutoEncoding (MGAE) framework, which provides the LM with general and transferable knowledge.
DCG is evaluated on two complex tasks: heterogeneous device mapping and thread block size prediction. It outperforms previous state-of-the-art graph-based approaches on both tasks, improving on previous results by 3% and 5% respectively.
Document type
Degree thesis
(Laurea magistrale)
Thesis author
Cichetti, Federico
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Masked Graph Autoencoders, Graph Neural Networks, Language Modelling, Source Code Analysis, Heterogeneous Devices, Big Code
Thesis defence date
16 December 2023
URI