Covella, Vito Vincenzo
(2021)
Multi-node Fault Classification using Machine Learning.
[Master's degree thesis], Università di Bologna, Degree Programme in
Computer Science [LM-DM270]
Abstract
An HPC system, that is, a system with far more computational power than general-purpose computing systems, is a complex system made up of different sections and many computing nodes. In such systems failures can arise for several reasons: interactions among the components, the specific technologies used, or bugs in the software. To reach Exascale performance and guarantee availability and reliability, it is important to detect and recover from these anomalies. In this thesis we propose a fault classification method based on machine learning.
Other researchers have worked in this field, but their work relies mainly on per-node models. Per-node models are impractical, however, because they require too much data and fault injection would be hard to control. For this reason our research focuses on a single multi-node model: a single general model requires less operational effort to train and is easier to maintain over time. More specifically, our methodology covers not only metaparameter exploration but also how many nodes are necessary for training and which specific nodes are the best candidates. To this end we compare two approaches: incremental training with nodes selected randomly, and incremental training with nodes that are representative of a chosen number of clusters. In both cases the end result is a single general model that can be used on different nodes for fault detection.
Using the dataset provided by LRZ, which covers about 32 compute nodes, we show that classification performance stabilizes when a small subset of compute nodes is used as the training set, and that both selection methods outperform node-specific classifiers when more than one training node is used. Finally, we show that the clustering approach is more reliable and stable with more training nodes, while the random approach performs better with fewer training nodes.
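The two node-selection strategies compared in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not name the clustering algorithm or the per-node features, so plain k-means over hypothetical per-node summary vectors is assumed here, and the function names (`select_random`, `select_cluster_representatives`, `incremental_training_sets`) are invented for the example.

```python
import random
from math import dist


def select_random(node_ids, k, seed=0):
    """Pick k training nodes uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(sorted(node_ids), k)


def select_cluster_representatives(node_features, k, iters=20, seed=0):
    """Cluster nodes with a plain k-means (an assumption, see lead-in) and
    return, for each non-empty cluster, the node closest to its centroid.

    node_features maps a node id to a tuple of numeric summary features.
    """
    rng = random.Random(seed)
    ids = sorted(node_features)
    # Start from k distinct nodes as initial centroids.
    centroids = [node_features[i] for i in rng.sample(ids, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each node joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for i in ids:
            nearest = min(range(k), key=lambda c: dist(node_features[i], centroids[c]))
            clusters[nearest].append(i)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(members)
                  for col in zip(*(node_features[m] for m in members)))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return [
        min(members, key=lambda m: dist(node_features[m], centroids[c]))
        for c, members in enumerate(clusters)
        if members
    ]


def incremental_training_sets(ordered_nodes):
    """Yield growing training sets: the first node, the first two, and so on."""
    for n in range(1, len(ordered_nodes) + 1):
        yield ordered_nodes[:n]
```

In an experiment along these lines, one model would be trained per yielded training set and evaluated on the remaining nodes, which is how the point at which performance stabilizes could be observed. For the clustering variant, `select_cluster_representatives` would be called once per candidate number of clusters.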
Document type
Degree thesis
(Master's degree)
Thesis author
Covella, Vito Vincenzo
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Track
CURRICULUM A: SOFTWARE TECHNIQUES
Degree programme regulations
DM270
Keywords
machine learning, fault classification, multi-node, HPC systems, clustering
Thesis defence date
18 March 2021
URI