ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations

Times cited: 0
Authors
Al-Sabbagh, Rania [1]
Affiliations
[1] University of Sharjah, Department of Foreign Languages, Sharjah, United Arab Emirates
Keywords
Parallel datasets; Arabic dialects; Benchmarking datasets; Finetuning large-language models; Machine translation; Translation studies; Cross-linguistic analysis; Lexical semantics
DOI
10.1016/j.dib.2024.110271
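Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract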
ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate. Additionally, the dataset is a valuable resource for research in various disciplines, including translation studies, cross-linguistic analysis, and lexical semantics. The dataset can also serve pedagogical purposes by training translation students and aid professional translators as a translation memory. The contributions are twofold: first, the dataset features textual genres not found in existing parallel Egyptian Arabic and English datasets, and second, it is a gold-standard dataset that has been translated and aligned by human experts. © 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Pages: 9