Clustering Analysis of Windows Malware using Static Features and Concept Drift Detection

Fabri, Luca (2025) Clustering Analysis of Windows Malware using Static Features and Concept Drift Detection. [Master's degree thesis], Università di Bologna, Degree Programme in Ingegneria e scienze informatiche [LM-DM270] - Cesena

Full-text documents available:

PDF document (Thesis)
Available under license: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
Download (3MB)

Abstract

Machine learning for malware classification achieves promising performance, but models are prone to degradation as malware evolves. New malware families emerge every year, and attackers continually adapt their techniques to evade detection. Because malware detection is a critical application, it is essential to keep these models up to date and maintain their performance despite the fast evolution and broad nature of malware. In the machine learning literature, this challenge is linked to concept drift, a phenomenon that degrades classifier performance over time. In parallel, while classification is widely used for malware detection, clustering methods provide a complementary approach by uncovering hidden structures in the data and identifying emerging malware families. More importantly, clustering can serve as a tool to track changes in malware feature distributions over time, offering a way to detect and analyze data distribution shifts. This thesis explores different ML techniques for mining malware samples based on static features. First, after an initial phase of dataset creation, clustering approaches are assessed for identifying groups of Windows malware. The features that characterize each cluster are computed using hierarchical clustering algorithms and explainable AI (XAI) techniques. Empirical experiments are conducted to study the relationship between these clusters and both current family labeling systems and packing algorithms. Additionally, concept drift detection is applied with respect to malware family labels using a state-of-the-art technique, with some modifications proposed to the existing project. This analysis enables the study of temporal changes in family assignments, highlights how malware families evolve, reveals potential inconsistencies in existing labeling systems, and provides a deeper understanding of the dynamics of malware ecosystems over time.

Document type

Degree thesis (Master's degree)

Thesis author

Fabri, Luca

Thesis supervisor

Melis, Andrea

Thesis co-supervisors

Han, Yufei ; Aonzo, Simone ; Dambra, Savino

School

Ingegneria e Architettura

Degree programme