Parallel Sentence Mining: A Semi-Automated Approach for the Creation of a Comparallel News Corpus in Greek and English

Pistolia, Elton (2024) Parallel Sentence Mining: A Semi-Automated Approach for the Creation of a Comparallel News Corpus in Greek and English. [Master's thesis (Laurea magistrale)], Università di Bologna, Degree programme in Specialized Translation [LM-DM270] - Forlì

Full-text documents available:

PDF document (Thesis)
Available under license: Creative Commons: Attribution - NonCommercial - ShareAlike 4.0 (CC BY-NC-SA 4.0)

Abstract

This thesis explores the opportunities presented by the large amount of multilingual comparable data on digital news platforms, focusing on the implications for multilingual news production and Translation Studies (TS). With the rise of online news consumption, as evidenced by the preference for digital platforms over print in Europe in recent years, there is a growing need for research in news translation. This study addresses the complexity of extracting parallel sentences from bilingual comparable news corpora in Greek and English, aiming to advance understanding and methodology within TS and Computational Linguistics (CL). The research investigates how effectively cosine similarity measures applied to sentence and word embeddings identify parallel (translated) sentences across languages on the basis of semantic similarity. It pays particular attention to the peculiarities of journalistic language and to the challenge of aligning sentences that involve not only direct translation but also cultural and contextual adaptation. Through a workflow comprising data collection, algorithm implementation, and performance evaluation, the thesis addresses three research questions concerning the automatic extraction of pairs of translated sentences and their classification into four categories that reflect their translation relationship: translated, partial translation, non-translation, and unrelated. The findings confirm that cosine similarity, in combination with sentence and word embeddings, can effectively identify semantically similar sentences across bilingual news corpora. The similarity scores also allow sentence pairs to be sorted into three categories (parallel, ambiguous, and unrelated), with ambiguous pairs further refined into partial translations or non-translations. This thesis contributes to the fields of Translation Studies and Computational Linguistics.
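The scoring-and-sorting step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the three-dimensional vectors stand in for real multilingual sentence embeddings, and the function names and the thresholds `parallel_min=0.8` and `unrelated_max=0.4` are hypothetical values chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


def categorize(score, parallel_min=0.8, unrelated_max=0.4):
    """Sort a similarity score into one of three bands.

    Both thresholds are hypothetical; in practice they would be tuned
    on manually annotated sentence pairs.
    """
    if score >= parallel_min:
        return "parallel"
    if score <= unrelated_max:
        return "unrelated"
    return "ambiguous"  # candidates for manual review


def best_matches(greek_vecs, english_vecs):
    """For each Greek sentence vector, find the most similar English one."""
    results = []
    for i, g in enumerate(greek_vecs):
        scores = [cosine_similarity(g, e) for e in english_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        results.append((i, j, scores[j], categorize(scores[j])))
    return results


# Toy stand-ins for sentence embeddings (assumed, not real model output).
greek_vecs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
english_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

matches = best_matches(greek_vecs, english_vecs)
for greek_idx, english_idx, score, label in matches:
    print(f"el[{greek_idx}] -> en[{english_idx}]: {score:.3f} ({label})")
```

Pairs that land in the ambiguous band are the ones the abstract describes as needing further refinement into partial translations or non-translations, which is where the semi-automated part of the workflow would come in.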

Document type

Master's thesis (Laurea magistrale)

Thesis author

Pistolia, Elton

Thesis supervisor

Garcea, Federico

Thesis co-supervisor

Bernardini, Silvia

School

Languages and Literatures, Translation and Interpreting

Degree programme