Lightweight Internet Traffic Classification - A Subject Based Solution with Word Embeddings

Murgia, Antonio (2016) Lightweight Internet Traffic Classification - A Subject Based Solution with Word Embeddings. [Laurea magistrale], Università di Bologna, Corso di Studio in Ingegneria informatica [LM-DM270], Documento ad accesso riservato.
Documenti full-text disponibili:
[img] Documento PDF
Full-text accessibile solo agli utenti istituzionali dell'Ateneo
Disponibile con Licenza: Creative Commons Attribuzione - Non commerciale - Non opere derivate 3.0

Download (3MB) | Contatta l'autore

Abstract

Internet traffic classification is a relevant and mature research field, anyway of growing importance and with still open technical challenges, also due to the pervasive presence of Internet-connected devices into everyday life. We claim the need for innovative traffic classification solutions capable of being lightweight, of adopting a domain-based approach, of not only concentrating on application-level protocol categorization but also classifying Internet traffic by subject. To this purpose, this paper originally proposes a classification solution that leverages domain name information extracted from IPFIX summaries, DNS logs, and DHCP leases, with the possibility to be applied to any kind of traffic. Our proposed solution is based on an extension of Word2vec unsupervised learning techniques running on a specialized Apache Spark cluster. In particular, learning techniques are leveraged to generate word-embeddings from a mixed dataset composed by domain names and natural language corpuses in a lightweight way and with general applicability. The paper also reports lessons learnt from our implementation and deployment experience that demonstrates that our solution can process 5500 IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2 at a cost of $ 3860 year. Reported experimental results about Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa show the feasibility and effectiveness of the proposal. The experiments prove that words contained in domain names do have a relation with the kind of traffic directed towards them, therefore using specifically trained word embeddings we are able to classify them in customizable categories. We also show that training word embeddings on larger natural language corpuses leads improvements in terms of precision up to 180%.

Abstract
Tipologia del documento
Tesi di laurea (Laurea magistrale)
Autore della tesi
Murgia, Antonio
Relatore della tesi
Correlatore della tesi
Scuola
Corso di studio
Ordinamento Cds
DM270
Parole chiave
internet traffic classification machine learning apache spark hadoop big data word2vec
Data di discussione della Tesi
16 Marzo 2016
URI

Altri metadati

Statistica sui download

Gestione del documento: Visualizza il documento

^