Pistolia, Elton
(2024)
Parallel Sentence Mining: A Semi-Automated Approach for the Creation of a Comparallel News Corpus in Greek and English.
[Master's thesis], Università di Bologna, Degree programme in
Specialized translation [LM-DM270] - Forlì
Abstract
This thesis explores the opportunities presented by the large amount of multilingual comparable data in digital news platforms, focusing on the implications for multilingual news production and Translation Studies. With the rise of online news consumption, as evidenced by the preference for digital platforms over print in Europe over the last few years, there is a growing need for research in news translation. This study addresses the complexity of extracting parallel sentences from bilingual comparable news corpora of Greek and English, aiming to enhance understanding and methodologies within Translation Studies (TS) and Computational Linguistics (CL).
The research investigates the efficacy of cosine similarity measures applied to sentence and word embeddings for identifying parallel (translated) sentences across languages based on semantic similarity, with a focus on the peculiarities of journalistic language and the challenges of aligning sentences that involve not just direct translation but also cultural and contextual adaptation. Through a comprehensive workflow that includes data collection, algorithm implementation, and performance evaluation, this thesis attempts to answer three critical research questions regarding the automatic extraction of pairs of translated sentences and their classification into four categories, namely, translated, partial translation, non-translation, and unrelated, reflecting their translation relationship.
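The similarity measure described above can be illustrated with a minimal sketch. It assumes each sentence has already been turned into a fixed-length vector by some multilingual embedding model; the toy three-dimensional vectors and the helper names `cosine_similarity` and `best_matches` are illustrative assumptions, not taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def best_matches(source_vecs, target_vecs):
    """For each source vector, return (index, similarity) of the closest target."""
    matches = []
    for u in source_vecs:
        scores = [cosine_similarity(u, v) for v in target_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy vectors standing in for real sentence embeddings (hypothetical data).
greek_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
english_vecs = [[0.0, 2.0, 2.0], [3.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

matches = best_matches(greek_vecs, english_vecs)
for i, (j, score) in enumerate(matches):
    print(f"Greek sentence {i} -> English sentence {j} (similarity {score:.2f})")
```

Because cosine similarity ignores vector length, the candidate `[3.0, 0.0, 0.0]` matches `[1.0, 0.0, 0.0]` perfectly even though the magnitudes differ; only the direction of the embedding matters.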
The findings confirm that cosine similarity in combination with sentence and word embeddings can effectively identify semantically similar sentences across bilingual news corpora. Moreover, they enable the categorization of sentence pairs into three categories, i.e., parallel, ambiguous, and unrelated, with further refinement into partial translations or non-translations for ambiguous pairs. This thesis contributes to the fields of Translation Studies and Computational Linguistics.
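One plausible way to obtain the three categories named above is to apply two cut-offs to the similarity score of each sentence pair. The sketch below assumes this threshold scheme; the function name `categorize`, the threshold values 0.8 and 0.5, and the sample scores are hypothetical and are not reported in the abstract.

```python
def categorize(score, parallel_min=0.8, unrelated_max=0.5):
    """Map a similarity score to 'parallel', 'ambiguous', or 'unrelated'.

    Both thresholds are illustrative assumptions, not values from the thesis.
    """
    if unrelated_max > parallel_min:
        raise ValueError("unrelated_max must not exceed parallel_min")
    if score >= parallel_min:
        return "parallel"
    if score < unrelated_max:
        return "unrelated"
    # Pairs in between would need further review to decide whether they
    # are partial translations or non-translations.
    return "ambiguous"


# Invented similarity scores for three candidate sentence pairs.
sample_scores = [0.93, 0.64, 0.21]
labels = [categorize(s) for s in sample_scores]
print(labels)
```

In practice the two cut-offs would have to be tuned on annotated sentence pairs, since suitable values depend on the embedding model and the language pair.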
Document type
Degree thesis
(Master's degree)
Thesis author
Pistolia, Elton
Thesis supervisor
Thesis co-supervisor
School
Degree programme
Curriculum
CURRICULUM TRANSLATION AND TECHNOLOGY
Degree programme regulations
DM270
Keywords
Comparallel Corpora, Parallel Sentence Mining, News Translation, Text Classification, Parallel Corpora
Thesis defence date
19 March 2024
URI