Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text

Cited by: 6
Authors
Alshanqiti, Abdullah M. [1]
Albouq, Sami [1]
Alkhodre, Ahmad B. [1]
Namoun, Abdallah [1]
Nabil, Emad [1, 2]
Affiliations
[1] Islamic Univ Madinah, Fac Comp & Informat Syst, Madinah 42351, Saudi Arabia
[2] Cairo Univ, Fac Comp & Artificial Intelligence, Giza 12613, Egypt
Source
APPLIED SCIENCES-BASEL | 2022 / Vol. 12 / Issue 20
Keywords
text splitting; text tokenization; transfer learning; mask-fill prediction; NLP linguistic rules; missing punctuations; cross-lingual BERT model; Masked Language Modeling
DOI
10.3390/app122010559
Chinese Library Classification (CLC) number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Long unpunctuated texts containing complex sentences are a stumbling block for processing low-resource languages. Segmenting lengthy text that lacks proper punctuation into simple candidate sentences is therefore an important preprocessing step in many hard-to-solve NLP applications. In this paper, we propose PDTS, a punctuation detection approach for segmenting Arabic text, built on top of a multilingual BERT-based model and a set of generic linguistic rules. Furthermore, we show how PDTS can be employed as a text tokenizer for unpunctuated documents (i.e., documents that mimic audio-to-text transcripts). Experimental findings across two evaluation protocols (an ablation study and a human-based judgment) demonstrate that PDTS is practically effective in terms of both output quality and computational cost.
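The general idea of mask-fill punctuation detection named in the keywords can be sketched as follows. This is a minimal illustration under stated assumptions, not the PDTS implementation: the function names, the punctuation set, the threshold, and the toy predictor are all hypothetical. In practice, `predict_mask` would wrap a multilingual masked language model that returns candidate fillers for the masked position with their probabilities; here it is a hand-written stand-in so the sketch runs on its own.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"  # assumed mask symbol; real models define their own
# Assumed set of boundary marks (Latin and Arabic punctuation).
PUNCT = {".", "!", "?", "؟", "،", "؛"}

Predictor = Callable[[List[str]], Dict[str, float]]


def segment(tokens: List[str], predict_mask: Predictor,
            threshold: float = 0.5) -> List[List[str]]:
    """Split tokens at every gap where the predictor assigns at least
    `threshold` total probability to a punctuation mark."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i == len(tokens) - 1:
            break  # no gap after the last token
        # Place a mask in the gap after token i and query the predictor.
        probe = tokens[: i + 1] + [MASK] + tokens[i + 1:]
        scores = predict_mask(probe)
        p_punct = sum(p for cand, p in scores.items() if cand in PUNCT)
        if p_punct >= threshold:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def toy_predict_mask(probe: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a masked language model: predicts a
    full stop whenever the word after the mask opens a new clause."""
    nxt = probe[probe.index(MASK) + 1]
    if nxt in {"then", "after"}:
        return {".": 0.9, "and": 0.1}
    return {"and": 0.7, ".": 0.05}


if __name__ == "__main__":
    text = "we arrived late then we ate dinner after that we slept"
    for sentence in segment(text.split(), toy_predict_mask):
        print(" ".join(sentence))
```

Keeping the predictor as a plain callable separates the segmentation loop from the model, so linguistic rules of the kind the abstract mentions could be applied to the scores before thresholding. That separation is a design choice of this sketch only.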
Pages: 15