Leveraging the meta-embedding for text classification in a resource-constrained language

Cited by: 14

Authors:

Hossain, Md. Rajib [1]

Hoque, Mohammed Moshiul [1]

Siddique, Nazmul [2]

Affiliations:

[1] Chittagong Univ Engn & Technol, Dept Comp Sci & Engn, Chittagong 4349, Bangladesh

[2] Ulster Univ, Sch Comp Engn & Intelligent Syst, Belfast, North Ireland

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2023 / Volume 124

Keywords:

Natural language processing; Text classification; Text corpora; Semantic feature extraction; Meta-embedding; Deep learning;

DOI:

10.1016/j.engappai.2023.106586

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

This paper proposes an intelligent text classification framework for a resource-constrained language such as Bengali, a task made challenging by the lack of standard corpora, appropriate hyperparameter tuning methods, and pre-trained language-specific embeddings. The proposed framework, called AVG-M+CNN, comprises an average meta-embedding feature fusion module and a convolutional neural network module. This work also proposes an algorithm for automatic hyperparameter tuning and selection to enhance the performance of the AVG-M+CNN technique. All meta-embedding models are evaluated using intrinsic evaluators (semantic, syntactic and relatedness word similarity, and analogy tasks) and extrinsic evaluators. The intrinsic evaluator evaluates 200 Bengali semantic, syntactic and relatedness word pairs. Spearman (ρ), Pearson (r) and cosine similarity correlations are used to evaluate 18 individual embedding and 9 meta-embedding models. The 3COSADD and 3COSMUL evaluators evaluate the 300 analogy tasks. The extrinsic evaluator evaluates a total of 156 classification models on four corpora: BARD, IndicNLP, Prothom-Alo and BTCC11 (a newly developed corpus with eleven distinct categories). Among these, the AVG-M+CNN model achieves the highest accuracy on all four Bengali corpora: 95.92 ± 0.001% for BARD, 93.10 ± 0.001% for Prothom-Alo, 90.07 ± 0.001% for BTCC11 and 87.44 ± 0.001% for IndicNLP.
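
As a rough illustration of two ideas named in the abstract, the Python sketch below averages several word-embedding sources into one meta-embedding and scores an analogy with 3CosAdd or 3CosMul. It is not the paper's implementation: the function names, the L2 normalization, the zero-padding of shorter vectors, and the `eps` smoothing value are all assumptions made for this example, and the CNN classifier and hyperparameter tuning are not covered.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit L2 norm (zero vectors are returned unchanged)."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def average_meta_embedding(sources):
    """Build an averaged meta-embedding from several word -> vector dicts.

    Assumption: each vector is L2-normalized and zero-padded to the largest
    source dimension before averaging; a word missing from a source is
    skipped for that source.
    """
    dim = max(len(vec) for src in sources for vec in src.values())
    vocab = set().union(*(src.keys() for src in sources))
    meta = {}
    for word in vocab:
        padded = []
        for src in sources:
            if word in src:
                v = unit(src[word])
                padded.append(np.pad(v, (0, dim - v.shape[0])))
        meta[word] = np.mean(padded, axis=0)
    return meta

def solve_analogy(emb, a, a_star, b, method="3cosadd", eps=1e-3):
    """Answer 'a is to a_star as b is to ?' using 3CosAdd or 3CosMul."""
    # The three query words are excluded from the candidate answers.
    words = [w for w in emb if w not in (a, a_star, b)]
    matrix = np.stack([unit(emb[w]) for w in words])

    def cos(w):
        # Cosine similarity of every candidate with word w.
        return matrix @ unit(emb[w])

    if method == "3cosadd":
        scores = cos(b) - cos(a) + cos(a_star)
    else:
        # 3CosMul: shift cosines into [0, 1] so the ratio stays well defined.
        def shifted(w):
            return (cos(w) + 1.0) / 2.0
        scores = shifted(b) * shifted(a_star) / (shifted(a) + eps)
    return words[int(np.argmax(scores))]
```

In practice the source dictionaries would be loaded from pre-trained embedding files, and the resulting meta-embedding vectors would feed a downstream classifier.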


Pages: 18

References: 49 in total
