Murgia, Antonio
(2016)
Lightweight Internet Traffic Classification - A Subject Based Solution with Word Embeddings.
[Master's degree thesis (Laurea magistrale)], Università di Bologna, Degree Programme in
Computer Engineering (Ingegneria informatica) [LM-DM270], restricted-access document.
Abstract
Internet traffic classification is a relevant and mature research field, yet one of growing importance
and with technical challenges still open, partly because of the pervasive presence of Internet-connected devices in everyday life.
We argue that there is a need for innovative traffic classification solutions that are lightweight, that adopt a domain-based approach,
and that classify Internet traffic by subject rather than concentrating only on application-level protocol categorization.
To this end, this paper proposes an original classification solution that leverages
domain name information extracted from IPFIX summaries, DNS logs, and DHCP
leases, and that can be applied to any kind of traffic.
The proposed solution is based on an extension of Word2vec unsupervised learning techniques running
on a specialized Apache Spark cluster. In particular, these techniques are used to generate
word embeddings from a mixed dataset composed of domain names and natural language corpora,
in a lightweight way and with general applicability.
The paper also reports lessons learnt from our implementation
and deployment experience, which shows that the solution can process 5500
IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2 at a cost of $3860 per year.
The reported experimental results for Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa
show the feasibility and effectiveness of the proposal.
The experiments show that the words contained in domain names are related
to the kind of traffic directed towards them; therefore, using specifically trained word
embeddings, we are able to classify domain names into customizable categories.
We also show that training word embeddings on larger natural language corpora
leads to improvements in precision of up to 180%.
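
For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of how Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa can be computed from true and predicted labels. The function name `evaluate`, the category names, and the sample labels are illustrative assumptions; they are not taken from the thesis and do not reproduce its results.

```python
from collections import Counter


def evaluate(y_true, y_pred, positive):
    """Compute precision, recall and F-measure for the `positive` category,
    plus overall accuracy and Cohen's kappa across all categories."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    pairs = list(zip(y_true, y_pred))

    # One-vs-rest counts for the chosen category.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )

    # Observed agreement between true and predicted labels.
    accuracy = sum(1 for t, p in pairs if t == p) / n

    # Agreement expected by chance, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    # Kappa rescales accuracy so that chance-level agreement maps to 0.
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 1.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Invented labels: one subject category per classified domain name.
    truth = ["news", "news", "news", "sport", "sport", "social"]
    guess = ["news", "news", "sport", "sport", "sport", "social"]
    for name, value in evaluate(truth, guess, positive="news").items():
        print(f"{name}: {value:.3f}")
```

Cohen's Kappa is reported alongside Accuracy because it discounts the agreement a classifier would reach by chance, which matters when the categories are unevenly represented.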
Document type
Degree thesis
(Master's degree)
Thesis author
Murgia, Antonio
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Degree programme regulations
DM270
Keywords
internet traffic classification, machine learning, apache spark, hadoop, big data, word2vec
Thesis defence date
16 March 2016
URI