Lightweight Internet Traffic Classification - A Subject Based Solution with Word Embeddings

Murgia, Antonio (2016) Lightweight Internet Traffic Classification - A Subject Based Solution with Word Embeddings. [Laurea magistrale], Università di Bologna, Corso di Studio in Ingegneria informatica [LM-DM270], Documento ad accesso riservato.

Salva citazione

Documenti full-text disponibili:

[thumbnail of Lightweight_Internet_Traf.pdf]

Documento PDF
Full-text accessibile solo agli utenti istituzionali dell'Ateneo
Disponibile con Licenza: Creative Commons: Attribuzione - Non commerciale - Non opere derivate 3.0 (CC BY-NC-ND 3.0)
Download (3MB) | Contatta l'autore

Abstract

Internet traffic classification is a relevant and mature research field, anyway of growing importance and with still open technical challenges, also due to the pervasive presence of Internet-connected devices into everyday life. We claim the need for innovative traffic classification solutions capable of being lightweight, of adopting a domain-based approach, of not only concentrating on application-level protocol categorization but also classifying Internet traffic by subject. To this purpose, this paper originally proposes a classification solution that leverages domain name information extracted from IPFIX summaries, DNS logs, and DHCP leases, with the possibility to be applied to any kind of traffic. Our proposed solution is based on an extension of Word2vec unsupervised learning techniques running on a specialized Apache Spark cluster. In particular, learning techniques are leveraged to generate word-embeddings from a mixed dataset composed by domain names and natural language corpuses in a lightweight way and with general applicability. The paper also reports lessons learnt from our implementation and deployment experience that demonstrates that our solution can process 5500 IPFIX summaries per second on an Apache Spark cluster with 1 slave instance in Amazon EC2 at a cost of $ 3860 year. Reported experimental results about Precision, Recall, F-Measure, Accuracy, and Cohen's Kappa show the feasibility and effectiveness of the proposal. The experiments prove that words contained in domain names do have a relation with the kind of traffic directed towards them, therefore using specifically trained word embeddings we are able to classify them in customizable categories. We also show that training word embeddings on larger natural language corpuses leads improvements in terms of precision up to 180%.

Abstract

Tipologia del documento

Tesi di laurea (Laurea magistrale)

Autore della tesi

Murgia, Antonio

Relatore della tesi

Bellavista, Paolo

Correlatore della tesi

Ghidini, Giacomo ; Emmons, Stephen P.

Scuola

Ingegneria e Architettura

Corso di studio