Modelling task execution time in Directed Acyclic Graphs for efficient distributed management

Chieregato, Federico (2022) Modelling task execution time in Directed Acyclic Graphs for efficient distributed management. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree programme in Ingegneria informatica [LM-DM270], Full-text document not available
The full text is not available, by the author's choice.


This thesis presents a framework that predicts the execution time of tasks in Directed Acyclic Graphs (DAGs). A task is the smallest unit of work that executes a function over a set of inputs; in this scenario it corresponds to a vertex of a DAG. The thesis includes an implementation for extracting profiling information from Apache Spark, as well as an evaluation of the framework on the Spark decision support benchmark TPC-DS and on an in-house, completely different DAG runtime system for real-world DAGs involving computational quantum chemistry applications. Speeding up execution in Spark or other workflows is an important problem for many real-time applications. Since it is impractical to build a predictive model that considers the actual values of the inputs to tasks, the thesis explores the use of surrogates such as the number of parents of a task and the mean duration of its parents. For this reason, the solution is named PRODIGIOUS (Performance modelling of DAGs via surrogate features). Since task duration is a floating-point value, several regression algorithms were studied, with hyperparameters tuned through GridSearchCV. The main objective of PRODIGIOUS is to understand not only whether surrogates, used in place of actual inputs, are enough to predict the execution time of tasks of the same DAG type, but also whether it is possible to predict the execution time of tasks of different DAG types, thereby creating a DAG-agnostic framework that could help scientists and computer engineers make their workflows more efficient. The other agnostic features chosen were the cores for each task, the RAM of the benchmark, the data access type, and the number of executors.
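
As an illustration of the two surrogate features named above (number of parents and mean parent duration), the following is a minimal sketch and not the thesis implementation, whose full text is not available. The function name `surrogate_features`, the dictionary-based DAG representation, the toy durations, and the choice of 0.0 as the mean parent duration for root tasks are all assumptions made for this example.

```python
from statistics import mean


def surrogate_features(parents, durations):
    """Return {task: (num_parents, mean_parent_duration)} for each task in a DAG.

    parents:   dict mapping task id -> list of parent task ids
    durations: dict mapping task id -> measured duration in seconds
    """
    features = {}
    for task, task_parents in parents.items():
        num_parents = len(task_parents)
        # Root tasks have no parents; 0.0 is an assumed placeholder value.
        mean_parent_duration = (
            mean(durations[p] for p in task_parents) if task_parents else 0.0
        )
        features[task] = (num_parents, mean_parent_duration)
    return features


# Hypothetical four-task DAG: a -> b, a -> c, (b, c) -> d
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
durations = {"a": 2.0, "b": 4.0, "c": 6.0, "d": 3.0}
features = surrogate_features(parents, durations)
```

Feature tuples of this kind depend only on the graph structure and on observed durations, not on the actual input values of a task, which is what would allow a regression model trained on them to be applied across different DAG types.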

Document type
Degree thesis (Laurea magistrale)
Thesis author
Chieregato, Federico
Thesis supervisor
Thesis co-supervisor
Degree programme
Degree programme regulations
Keywords
Spark, Kubernetes, Docker, cloud, Machine Learning, regression, object storage, CEPH, datashim, DAG, workflows, task
Thesis defence date
22 March 2022
