Towards Malay named entity recognition: an open-source dataset and a multi-task framework

Cited by: 1
Authors
Fu, Yingwen [1]
Lin, Nankai [2]
Yang, Zhihe [1]
Jiang, Shengyi [1,3]
Affiliations
[1] Guangdong University of Foreign Studies, School of Information Science and Technology, Guangzhou, People's Republic of China
[2] Guangdong University of Technology, School of Computer Science and Technology, Guangzhou, People's Republic of China
[3] Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangzhou, People's Republic of China
Keywords
Malay; named entity recognition; dataset; multi-task learning; Bi-revision
DOI
10.1080/09540091.2022.2159014
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Named entity recognition (NER) is a key component of many natural language processing (NLP) applications. The majority of advanced research, however, has not been widely applied to low-resource languages represented by Malay due to the data-hungry problem. In this paper, we present a system for building a Malay NER dataset (MS-NER) of 20,146 sentences through labelled datasets of homologous languages and iterative optimisation. Additionally, we propose a Multi-Task framework, namely MTBR, to integrate boundary information more effectively for NER. Specifically, boundary detection is treated as an auxiliary task and an enhanced Bidirectional Revision module with a gated ignoring mechanism is proposed to undertake conditional label transfer. This can reduce error propagation by the auxiliary task. We conduct extensive experiments on Malay, Indonesian, and English. Experimental results show that MTBR could achieve competitive performance and tends to outperform multiple baselines. The constructed dataset and model would be made available to the public as a new, reliable benchmark for Malay NER.
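The abstract treats boundary detection as an auxiliary task alongside NER. As a minimal sketch of what such an auxiliary target could look like, the snippet below derives type-free boundary tags from BIO-style entity tags. The BIO scheme, the function name `bio_to_boundary`, and the example sentence are assumptions made for illustration; the abstract does not specify the tagging scheme or the boundary label set that MTBR actually uses.

```python
from typing import List


def bio_to_boundary(tags: List[str]) -> List[str]:
    """Map BIO entity tags to type-free boundary tags (B, I, O).

    Hypothetical helper: it assumes BIO-style tags such as "B-PER"
    or "I-LOC", which the abstract does not confirm.
    """
    boundary = []
    for tag in tags:
        if tag == "O":
            boundary.append("O")
        else:
            # Keep only the position prefix: "B-PER" -> "B", "I-LOC" -> "I".
            boundary.append(tag.split("-", 1)[0])
    return boundary


# Illustrative sentence and labels (not taken from MS-NER).
tokens = ["Najib", "Razak", "melawat", "Kuala", "Lumpur"]
ner_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
boundary_tags = bio_to_boundary(ner_tags)

for token, ner, boundary in zip(tokens, ner_tags, boundary_tags):
    print(f"{token}\t{ner}\t{boundary}")
```

In a multi-task setup of this kind, the main tagger would be trained on `ner_tags` and the auxiliary head on `boundary_tags`. How the two predictions are reconciled (the Bidirectional Revision module with its gated ignoring mechanism) is described only at a high level in the abstract and is not modelled here.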
Pages: 23