Covella, Vito Vincenzo (2021)
Multi-node Fault Classification using Machine Learning.
[Master's degree thesis (Laurea magistrale)], Università di Bologna, Degree programme in Informatica [LM-DM270]
   
  
  
        
        
	
  
  
  
  
  
  
  
    
  
    
      Abstract
      A high-performance computing (HPC) system offers far more computational power than a general-purpose system and is a complex assembly of different subsystems and many compute nodes. In such systems, failures can arise for several reasons: interactions among components, the specific technologies used, or bugs in the software. To reach exascale performance while guaranteeing availability and reliability, it is important to detect these anomalies and recover from them. In this thesis we propose a fault classification method based on machine learning.
      Other researchers have worked in this field, but their work mainly relies on per-node models. Per-node models are impractical, however, because they require too much data and the fault injection would be hard to control. For this reason our research focuses on a single multi-node model: one general model takes less operational effort to train and is easier to maintain over time. More specifically, our methodology addresses not only metaparameter exploration but also how many nodes are needed for training and which specific nodes are the best candidates. We therefore compare two approaches: incremental training with randomly selected nodes, and incremental training with nodes that are representative of a chosen number of clusters. In both cases the end result is a single general model that can be used on different nodes for fault detection.
      Using the dataset provided by LRZ, which covers about 32 compute nodes, we show that classification performance stabilizes when a small subset of compute nodes is used as the training set, and that both selection methods outperform node-specific classifiers when more than one training node is used. Finally, we show that the clustering approach is more reliable and stable with more training nodes, while the random approach performs better with fewer training nodes.
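The two node-selection strategies compared in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not name the clustering algorithm or the features used, so the plain k-means below, the per-node `profiles` (a short numeric summary of each node's metrics), and the function names `random_nodes`, `cluster_representatives` and `incremental_training_sets` are all assumptions made for illustration only.

```python
import math
import random


def random_nodes(profiles, k, seed=0):
    """Pick k training nodes uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(profiles), k))


def cluster_representatives(profiles, k, iters=20, seed=0):
    """Pick one node per cluster: the node closest to each k-means centroid.

    Assumption: plain k-means on hypothetical per-node profiles; the
    abstract does not say which clustering method the thesis uses.
    """
    names = sorted(profiles)
    rng = random.Random(seed)
    # Start from k distinct nodes as initial centroids.
    centroids = [profiles[n] for n in rng.sample(names, k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each node joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for n in names:
            nearest = min(range(k), key=lambda c: math.dist(profiles[n], centroids[c]))
            groups[nearest].append(n)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties
                dims = zip(*(profiles[m] for m in members))
                centroids[c] = tuple(sum(d) / len(members) for d in dims)
    # The representative of a cluster is its member nearest the centroid.
    reps = [
        min(members, key=lambda m: math.dist(profiles[m], centroids[c]))
        for c, members in enumerate(groups)
        if members
    ]
    return sorted(reps)


def incremental_training_sets(ordered_nodes):
    """Yield growing training sets: the first node, then the first two, and so on."""
    for i in range(1, len(ordered_nodes) + 1):
        yield ordered_nodes[:i]


# Hypothetical profiles: two visibly different groups of nodes.
profiles = {
    "node01": (0.10, 0.20), "node02": (0.12, 0.22), "node03": (0.11, 0.19),
    "node04": (0.90, 0.80), "node05": (0.88, 0.82), "node06": (0.91, 0.79),
}

by_cluster = cluster_representatives(profiles, k=2)
by_chance = random_nodes(profiles, k=2)
for training_set in incremental_training_sets(by_cluster):
    pass  # a single general model would be retrained and evaluated here
```

With the cluster-based strategy, each selected node stands in for one group of similar nodes, whereas the random strategy can pick two nodes from the same group. Model training and evaluation are deliberately left out, since the abstract gives no detail on the classifier.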
     
    
     
  
  
    
    
      Document type
      Degree thesis (Laurea magistrale)

          Thesis author
          Covella, Vito Vincenzo

          Degree programme
          Informatica [LM-DM270]

          Track
          CURRICULUM A: TECNICHE DEL SOFTWARE

          Degree regulations (Ordinamento Cds)
          DM270

          Keywords
          machine learning, fault classification, multi-node, HPC systems, clustering

          Thesis defence date
          18 March 2021
      
      
     
   
  
      
      
     
   
  
  
  
  
  
    