Unsupervised Anomaly Detection in Large-Scale System Logs

Aminraoufpour, Shahab (2026) Unsupervised Anomaly Detection in Large-Scale System Logs. [Master's degree thesis (Laurea magistrale)], Università di Bologna, Degree programme in Digital transformation management [LM-DM270] - Cesena, full-text document not available
The full text is not available by the author's choice.

Abstract

Application logs are fundamental for monitoring and maintaining large-scale information systems, because they record operational behavior and help engineers diagnose failures. In modern distributed platforms, however, log streams grow rapidly in volume and diversity, making manual inspection time-consuming and often ineffective. Since anomalous executions can lead to performance degradation, service disruption, or costly downtime, reliable automated anomaly detection is increasingly necessary. This thesis develops an end-to-end unsupervised framework for detecting anomalies in the HDFS log dataset, aiming to minimize dependence on labeled training data. Raw log messages are first parsed into stable event templates to convert unstructured text into consistent discrete events. The logs are then sessionized by BlockId to reconstruct block-level execution traces, which represent the system’s behavior for each data block. To enable machine learning, each trace is encoded into a high-dimensional feature representation that captures event-template occurrence patterns; TF–IDF weighting is applied to emphasize rare but informative templates while down-weighting frequent background events. On top of these representations, the thesis studies multiple unsupervised detection strategies to capture different notions of abnormality, including isolation-based approaches, reconstruction-error approaches, and a hierarchical design that clusters traces into dominant execution modes and performs anomaly detection within each mode. Finally, ground-truth labels are used only for post-hoc validation, and the proposed workflow is designed for reproducibility and deployment by persisting all required preprocessing artifacts and inference settings.
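
The sessionization and TF-IDF steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the log lines are made up for illustration, `to_template` is a crude regex-masking stand-in for a real log parser such as Drain, the function names are hypothetical, and the smoothed IDF formula is one common variant chosen here as an assumption, since the abstract does not state the exact weighting.

```python
import math
import re
from collections import Counter, defaultdict

# HDFS block identifiers look like blk_<signed integer>.
BLOCK_ID = re.compile(r"blk_-?\d+")

def to_template(message):
    """Crude stand-in for a log parser: mask block ids and numeric fields."""
    masked = BLOCK_ID.sub("<BLK>", message)
    return re.sub(r"\d+(?:\.\d+)*(?::\d+)?", "<NUM>", masked)

def sessionize(lines):
    """Group event templates into one trace per BlockId, in arrival order."""
    traces = defaultdict(list)
    for line in lines:
        template = to_template(line)
        for block_id in set(BLOCK_ID.findall(line)):
            traces[block_id].append(template)
    return dict(traces)

def tfidf(traces):
    """Weight template counts per trace by a smoothed inverse document frequency."""
    n = len(traces)
    df = Counter()
    for events in traces.values():
        df.update(set(events))  # count each template once per trace
    # Assumed variant: idf = ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {
        block_id: {t: c * idf[t] for t, c in Counter(events).items()}
        for block_id, events in traces.items()
    }
    return vectors, idf

# Made-up log lines in the general style of HDFS messages (illustrative only).
LOGS = [
    "Receiving block blk_1 src: /10.0.0.1:5000 dest: /10.0.0.2:5001",
    "PacketResponder 1 for block blk_1 terminating",
    "Receiving block blk_2 src: /10.0.0.3:5000 dest: /10.0.0.4:5001",
    "PacketResponder 0 for block blk_2 terminating",
    "Receiving block blk_3 src: /10.0.0.5:5000 dest: /10.0.0.6:5001",
    "Exception in receiveBlock for block blk_3 java.io.IOException",
]

traces = sessionize(LOGS)
vectors, idf = tfidf(traces)
for block_id, vector in sorted(vectors.items()):
    print(block_id, vector)
```

In this toy input, the template that occurs in every trace receives the lowest weight and the template that occurs in a single trace receives the highest, which is the down-weighting of background events that the abstract describes. The resulting sparse vectors are the kind of per-block representation that an unsupervised detector would then consume.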

Document type
Degree thesis (Laurea magistrale, master's degree)
Thesis author
Aminraoufpour, Shahab
Thesis supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
Anomaly detection, HDFS logs, Unsupervised learning, Isolation Forest, PCA, Log parsing, Drain algorithm, TF-IDF, Block-level traces, System monitoring, Distributed, Machine
Thesis defence date
19 March 2026
URI
