Del Coco, Pierpaolo Elio Jr
(2018)
Temporal Text Mining: From Frequencies to Word Embeddings.
[Master's thesis (Laurea magistrale)], Università di Bologna, Degree programme in
Computer Science [LM-DM270]
Full-text documents available:
PDF document (Thesis)
Available under licence: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal study, research, and teaching purposes; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.
Download (3MB)
Abstract
The last decade has witnessed tremendous growth in the amount of textual data available from web pages and social media posts, as well as from digitized sources such as newspapers and books. However, as new data is continuously created to record the events of the moment, old data is archived day by day, for months, years, and decades. From this point of view, web archives play an important role not only as sources of data, but also as testimonies of history. In this respect, state-of-the-art machine learning models for word representations, namely word embeddings, are not able to capture the dynamic nature of semantics, since they represent a word as a single-state vector that does not take into account the different time spans of the corpus. Although diachronic word embeddings have started to appear in recent works, the literature is still very small and leaves several open questions that must be addressed. Moreover, these works model language evolution from a strongly linguistic perspective. We approach this problem from a slightly different perspective. In particular, we discuss temporal word embedding models trained on highly evolving corpora, in order to model the knowledge that textual archives have accumulated over the years. This allows us to discover the semantic evolution of words, and also to find temporal analogies and compute temporal translations. Moreover, we conducted experiments on word frequencies. The results of an in-depth temporal analysis of shifts in word semantics, in comparison with word frequencies, show that these two variations are related.
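As an illustration of the quantities the abstract mentions, the following is a minimal sketch, not the method used in the thesis. It assumes two embedding spaces trained on an earlier and a later time slice that are already comparable (for example, aligned beforehand), and it uses invented toy vectors and counts. The names `emb_early`, `emb_late`, `counts_early`, `counts_late`, `semantic_shift`, `frequency_change`, and `temporal_neighbor` are hypothetical. It measures the semantic shift of a word as the cosine distance between its two vectors, compares it with the change in relative frequency, and looks up the closest later-slice word to an earlier-slice vector as a simple stand-in for a temporal translation.

```python
import math

# Toy, invented data: 2-dimensional vectors for two time slices.
# Assumption: both spaces share the same coordinate system (already aligned).
emb_early = {"apple": [1.0, 0.0], "fruit": [0.9, 0.1], "phone": [0.1, 0.9]}
emb_late = {"apple": [0.2, 0.8], "fruit": [0.9, 0.1], "phone": [0.1, 0.9]}

# Toy, invented raw word counts for the same two time slices.
counts_early = {"apple": 10, "fruit": 60, "phone": 30}
counts_late = {"apple": 40, "fruit": 30, "phone": 30}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_shift(word):
    """Cosine distance between a word's early and late vectors."""
    return 1.0 - cosine_similarity(emb_early[word], emb_late[word])


def frequency_change(word):
    """Difference in relative frequency between the late and early slices."""
    early = counts_early.get(word, 0) / sum(counts_early.values())
    late = counts_late.get(word, 0) / sum(counts_late.values())
    return late - early


def temporal_neighbor(word):
    """Later-slice word whose vector is closest to the word's early vector."""
    source = emb_early[word]
    return max(emb_late, key=lambda w: cosine_similarity(source, emb_late[w]))


if __name__ == "__main__":
    for w in sorted(emb_early):
        print(w, round(semantic_shift(w), 3), round(frequency_change(w), 3))
    print("early 'apple' is closest to late:", temporal_neighbor("apple"))
```

In this toy setup the word whose vector moves the most is also the one whose relative frequency changes the most, which mirrors the kind of relationship the abstract reports. Real corpora would require training and aligning the embeddings first, which this sketch does not cover.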
Document type
Degree thesis
(Master's degree)
Thesis author
Del Coco, Pierpaolo Elio Jr
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Track
Curriculum C: Systems and networks
Degree programme regulation
DM270
Keywords
text mining, machine learning, word embeddings, temporal text mining
Thesis defence date
15 March 2018
URI