Natural language processing for similar languages, varieties, and dialects: A survey

Cited by: 17
Authors
Zampieri, Marcos [1]
Nakov, Preslav [2]
Scherrer, Yves [3]
Affiliations
[1] Rochester Inst Technol, Rochester, NY 14623 USA
[2] HBKU, Qatar Comp Res Inst, Doha, Qatar
[3] Univ Helsinki, Helsinki, Finland
Keywords
Dialects; similar languages; language varieties; language identification; machine translation; parsing; MACHINE TRANSLATION; IDENTIFICATION; ADAPTATION
DOI
10.1017/S1351324920000492
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
Pages: 595-612
Page count: 18