Abstract
Machine learning for malware classification achieves promising performance, but models are prone to degradation as malware evolves. New malware families emerge every year, and attackers continually adapt their techniques to evade detection by these systems. Because malware detection is a critical application, it is essential to keep these models up to date and maintain their performance despite the rapid evolution and breadth of malware. In the machine learning literature, this challenge corresponds to concept drift, a phenomenon that degrades classifier performance over time.
In parallel, while classification is widely used for malware detection, clustering methods provide a complementary approach by uncovering hidden structures in the data and identifying emerging malware families. More importantly, clustering can serve as a tool to track changes in malware feature distributions over time, offering a way to detect and analyze data distribution shifts.
This thesis explores different ML techniques for mining malware samples based on static features. First, after an initial phase of dataset creation, clustering approaches are assessed for identifying groups of Windows malware. The features that characterize each cluster are computed using hierarchical clustering algorithms and explainable AI (XAI) techniques. Empirical experiments are conducted to study how these clusters relate to current family labeling systems and to packing algorithms.
Additionally, concept drift detection is applied to malware family labels using a state-of-the-art technique, with some proposed modifications to the existing project. This analysis enables the study of temporal changes in family assignments: it highlights how malware families evolve, reveals potential inconsistencies in existing labeling systems, and provides a deeper understanding of the dynamics of malware ecosystems over time.
Document type
Degree thesis
(Master's degree)
Thesis author
Fabri, Luca
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Clustering, Malware, Static Features, ML, Concept Drift, XAI, Malware Family, Conformal Evaluation, Windows PE
Thesis defense date
14 March 2025
URI