Pignedoli, Daniele
(2023)
Integrations of natural language processing and network observables for a classification task: vaccine perception in Twitter.
[Master's thesis (Laurea magistrale)], Università di Bologna, Degree Programme in
Physics [LM-DM270], restricted-access document.
Full-text documents available:
PDF document (Thesis)
Full text accessible only to institutional users of the University
Available under licence: Unless the author grants broader permissions, the thesis may be freely consulted, and one copy may be saved and printed strictly for personal purposes of study, research, and teaching; any directly or indirectly commercial use is expressly prohibited. All other rights to the material are reserved.
Abstract
The COVID-19 pandemic in 2020 caused significant distress and death worldwide, leading to unprecedented global efforts to develop effective vaccines rapidly. The safety of these vaccines and the execution of vaccination campaigns became widely debated topics in the media and among the general public, especially on social media platforms. In this thesis, we analyze over 5.5 million Italian Twitter messages collected during the initial COVID-19 vaccination campaign, 7,000 of which were manually classified into three categories based on attitudes towards vaccines: Pro-Vaccine, Anti-Vaccine, and Neutral. We employed a Word2Vec algorithm to perform text embedding. This allowed us to classify messages using a logistic regression approach, achieving a 59% accuracy rate for the three categories and a 76% accuracy rate when comparing only the two opposing classes (Pro-Vax and Anti-Vax). Additionally, we examined the social network formed by Twitter users through retweets, consisting of over 60,000 unique users and 390,000 links. We applied a community detection algorithm, revealing clusters of users with strongly connected opinions. The two largest clusters account for 58% and 26% of all users: the first consists of Neutral and Pro-Vax users, the second of Anti-Vax users. We assigned each user a proximity value to each significant community based on their neighbors' membership, creating a low-dimensional embedding of the network nodes. Using logistic regression on these embeddings, we achieved an 88% accuracy rate when distinguishing between Pro-Vax and Anti-Vax users and a 60% accuracy rate for all three classes. Finally, we combined the text and node embeddings, which led to a significant improvement in classification for the three classes, reaching a 67% accuracy rate. By analyzing the classifier's errors, we identified several areas in which text embedding struggled, such as processing irony, handling short messages, and integrating less typical language references.
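The node embedding described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: proximity is taken to be the fraction of a user's retweet neighbors that belong to each community, retweet links are treated as undirected, and text and node embeddings are combined by simple concatenation. The function name `community_proximity`, the toy users, the community labels, and the placeholder text vector are all hypothetical.

```python
from collections import defaultdict


def community_proximity(edges, community_of, communities):
    """Return {user: [share of neighbors in each listed community]}.

    Assumption: "proximity" is the share of a user's retweet neighbors
    belonging to each community; the thesis may define it differently.
    """
    # Treat retweet links as undirected (an assumption of this sketch)
    # and ignore self-loops.
    neighbors = defaultdict(set)
    for source, target in edges:
        if source != target:
            neighbors[source].add(target)
            neighbors[target].add(source)

    index = {label: i for i, label in enumerate(communities)}
    embedding = {}
    for user, nbrs in neighbors.items():
        counts = [0] * len(communities)
        for neighbor in nbrs:
            label = community_of.get(neighbor)
            if label in index:
                counts[index[label]] += 1
        # Neighbors outside the listed communities still count in the
        # denominator, so a row may sum to less than 1.
        embedding[user] = [count / len(nbrs) for count in counts]
    return embedding


# Hypothetical toy data: five users, two communities labeled "A" and "B".
edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u4"), ("u4", "u5")]
community_of = {"u1": "A", "u2": "A", "u3": "A", "u4": "B", "u5": "B"}
node_vectors = community_proximity(edges, community_of, ["A", "B"])

# Combining with a text embedding, sketched here as plain concatenation.
text_vector = [0.1, -0.3, 0.7]  # placeholder values, not real Word2Vec output
combined = text_vector + node_vectors["u3"]
```

In this toy graph, `u1` has only neighbors in community "A", while `u4` is split evenly between the two communities. Vectors of this kind, alone or concatenated with a text embedding, are the sort of input a logistic regression classifier would receive.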
Document type
Degree thesis
(Master's degree, Laurea magistrale)
Thesis author
Pignedoli, Daniele
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
Applied Physics
Degree programme regulation (Ordinamento CdS)
DM270
Keywords
Text embedding, Word2Vec, social network analysis, Classification task, Covid-19 vaccines
Thesis defence date
14 July 2023
URI