Multi-node Fault Classification using Machine Learning

Covella, Vito Vincenzo (2021) Multi-node Fault Classification using Machine Learning. [Master's degree thesis], Università di Bologna, Degree Programme in Computer Science [LM-DM270]

Full-text documents available:

PDF document (Thesis)
Available under license: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
Download (2MB)

Abstract

An HPC system, which offers far more computational power than a general-purpose computing system, is a complex system made up of different sections and many compute nodes. In such systems, failures can arise for different reasons: interactions among components, the specific technologies used, or bugs in the software. To reach exascale performance while guaranteeing availability and reliability, it is important to detect and recover from these anomalies. In this thesis we propose a fault classification method based on machine learning. Other researchers have worked in this field, but their work mainly relies on per-node models. Per-node models, however, are impractical because they require too much data and the necessary fault injection would be hard to control. For this reason our research focuses on a single multi-node model: a single general model requires less operational effort to train and is easier to maintain over time. More specifically, our methodology focuses not only on metaparameter exploration, but also on understanding how many nodes are necessary for training and which specific nodes are the best candidates. To this end, we compare two approaches: incremental training with nodes selected randomly, and incremental training with nodes that are representative of a chosen number of clusters. In both cases the end result is a single general model that can be used on different nodes for fault detection. Using the dataset provided by LRZ, which covers about 32 compute nodes, we show that classification performance stabilizes when a small subset of compute nodes is used as the training set, and that both selection methods outperform node-specific classifiers when more than one training node is used. Finally, we show that the clustering approach is more reliable and stable when using more training nodes, while the random approach gives better performance when using fewer training nodes.
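
The two node-selection strategies compared in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not name the clustering algorithm or the per-node features, so the plain k-means routine, the two-dimensional per-node profiles, the node names, and the function names `select_random` and `select_representatives` are all assumptions made for illustration.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def select_random(node_ids, k, seed=0):
    """Pick k training nodes uniformly at random (random strategy)."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(node_ids), k))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over equal-length tuples; returns k centroids.

    Assumption: k-means stands in for whatever clustering the thesis uses.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def select_representatives(node_profiles, k, seed=0):
    """Pick one distinct node per cluster: the node closest to each centroid."""
    ids = sorted(node_profiles)
    points = [tuple(node_profiles[n]) for n in ids]
    chosen = []
    for centroid in kmeans(points, k, seed=seed):
        candidates = [n for n in ids if n not in chosen]
        chosen.append(min(candidates, key=lambda n: dist(node_profiles[n], centroid)))
    return sorted(chosen)


# Hypothetical per-node profiles (e.g. averaged, normalized sensor readings).
profiles = {
    "node01": (0.10, 0.20), "node02": (0.12, 0.18), "node03": (0.11, 0.22),
    "node04": (0.80, 0.90), "node05": (0.82, 0.88), "node06": (0.79, 0.91),
    "node07": (0.50, 0.10), "node08": (0.52, 0.12), "node09": (0.49, 0.09),
}

random_nodes = select_random(profiles, 3)
cluster_nodes = select_representatives(profiles, 3)
print("random selection: ", random_nodes)
print("cluster selection:", cluster_nodes)
```

In either case, the data of the selected nodes would be pooled to train one general classifier, and the number of selected nodes would be increased step by step to observe where classification performance stabilizes.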

Document type

Degree thesis (Master's degree)

Thesis author

Covella, Vito Vincenzo

Thesis supervisor

Kiziltan, Zeynep

Thesis co-supervisors

Sîrbu, Alina ; Netti, Alessio

School

Science

Degree programme