This Is My Cope: Identification and Forecasting of Hate Speech in Inceldom

Gajo, Paolo (2023) This Is My Cope: Identification and Forecasting of Hate Speech in Inceldom. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree programme in Specialized Translation [LM-DM270] - Forlì
Full-text documents available:
PDF document (Thesis)
Available under license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0)

Abstract

Identifying and moderating hate speech on social media platforms is crucial: it can make online interactions more civil and safeguard the well-being of all users. Although the NLP community has explored the topic thoroughly in recent years, many avenues of research remain open, especially for niche communities, whose language is often riddled with opaque jargon and for which little data is available. We introduce, for the first time, a multilingual corpus for the analysis and identification of hate speech in the domain of inceldom. It is built from incel Web forums in English and Italian and includes expert annotation at the post level for two kinds of hate speech: misogyny and racism. This resource paves the way for the development of mono- and cross-lingual models for (a) the identification of hateful (misogynous and racist) posts and (b) the forecasting of the number of hateful responses that a post is likely to trigger. For the identification tasks, our experiments aim to improve the performance of transformer-based models using masked language modeling (MLM) pretraining and dataset merging. These approaches are particularly effective in cross-lingual scenarios. Using multilingual MLM, we improve the performance of mBERT models on hate speech identification in a zero-shot cross-lingual scenario by 17 F1 points, while the boost is 34 and 18 points for misogyny and racism identification, respectively. Multilingual dataset merging also leads to a significant improvement in the binary classification setting in the cross-lingual scenario, with a gain of 22 F1 points over the baseline dataset we compiled. In the forecasting setting, we propose a simple and novel approach to the task, which allows us to beat our MSE baseline by 37% in the monolingual English setting.
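
The abstract reports gains in two different units: absolute F1 points for the identification tasks and a percentage improvement over an MSE baseline for forecasting. The minimal Python sketch below shows one way such figures could be computed. It rests on assumptions that the abstract does not confirm: F1 is taken on the positive (hateful) class, "points" means an absolute difference on a 0-100 scale, and the MSE improvement is read as a relative reduction. All function names and all data are hypothetical toy values, not results or code from the thesis.

```python
from typing import Sequence


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 of the positive class (1 = hateful) for a binary task.

    Assumption: the thesis may use a different averaging scheme.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mse(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Mean squared error between gold and predicted values."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def point_gain(baseline: float, improved: float) -> float:
    """Absolute gain in points on a 0-100 scale (assumed reading of 'points')."""
    return 100 * (improved - baseline)


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Error reduction as a percentage of the baseline error (assumed reading)."""
    return 100 * (baseline_error - model_error) / baseline_error


# Toy identification data (hypothetical): 1 = hateful post, 0 = not hateful.
gold_labels = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_pred = [1, 0, 0, 0, 1, 0, 0, 0]
improved_pred = [1, 1, 1, 0, 1, 0, 0, 0]

f1_gain = point_gain(
    f1_score(gold_labels, baseline_pred),
    f1_score(gold_labels, improved_pred),
)

# Toy forecasting data (hypothetical): hateful replies triggered by each post.
gold_counts = [2.0, 0.0, 5.0, 1.0]
mean_count = sum(gold_counts) / len(gold_counts)
baseline_counts = [mean_count] * len(gold_counts)  # always predict the mean
model_counts = [1.0, 0.0, 4.0, 2.0]

mse_gain = relative_reduction(
    mse(gold_counts, baseline_counts),
    mse(gold_counts, model_counts),
)

print(f"F1 gain: {f1_gain:.1f} points")
print(f"MSE reduction: {mse_gain:.1f}%")
```

The distinction matters when reading the reported numbers: under these assumptions, a gain of 17 F1 points is an absolute difference between two scores, whereas a 37% improvement in MSE is relative to the size of the baseline error.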

Document type
Master's thesis (Laurea magistrale)
Thesis author
Gajo, Paolo
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
CURRICULUM SPECIALIZED TRANSLATION
Degree programme regulations
DM270
Keywords
natural language processing, hate speech, masked language modeling, cross-lingual, bert, transformers, misogyny, racism, nlp
Thesis defense date
13 July 2023
URI
