HinPLMs: Pre-trained Language Models for Hindi

Cited by: 1
Authors
Huang, Xixuan [1 ]
Lin, Nankai [1 ]
Li, Kexin [1 ]
Wang, Lianxi [1 ,2 ]
Gan, Suifu [3 ]
Affiliations
[1] Guangdong Univ Foreign Studies, Sch Informat Sci & Technol, Guangzhou, Peoples R China
[2] Guangdong Univ Foreign Studies, Guangzhou Key Lab Multilingual Intelligent Proc, Guangzhou, Peoples R China
[3] Jinan Univ, Sch Management, Guangzhou, Peoples R China
Keywords
Hindi Language Processing; Pre-trained Models; Corpus Construction; Romanization;
DOI
10.1109/IALP54817.2021.9675194
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
It has been shown that pre-trained models (PTMs) can significantly improve the performance of natural language processing (NLP) tasks for resource-rich languages and reduce the amount of labeled data required for supervised learning. However, research and shared-task datasets for Hindi remain scarce, and PTMs for the Romanized Hindi script have rarely been released. In this work, we construct a Hindi pre-training corpus in both the Devanagari and Romanized scripts and train two versions of Hindi pre-trained models: Hindi-Devanagari-Roberta and Hindi-Romanized-Roberta. We evaluate our models on 5 types of downstream NLP tasks with 8 datasets and compare them with existing Hindi pre-trained models and commonly used methods. Experimental results show that the models proposed in this work achieve the best results on all tasks, especially Part-of-Speech Tagging and Named Entity Recognition, which demonstrates the validity and superiority of our Hindi pre-trained models. Specifically, the Devanagari Hindi pre-trained model outperforms the Romanized Hindi pre-trained model on single-label Text Classification, Part-of-Speech Tagging, Named Entity Recognition, and Natural Language Inference, whereas the Romanized Hindi pre-trained model performs better on multi-label Text Classification and Machine Reading Comprehension, which may indicate that a Romanized Hindi pre-trained model has advantages on such tasks. We will release our models to the community to promote the future development of Hindi NLP.
Pages: 241-246
Page count: 6
Related Papers
50 records in total (10 shown)
  • [1] Pre-Trained Language Models and Their Applications
    Wang, Haifeng
    Li, Jiwei
    Wu, Hua
    Hovy, Eduard
    Sun, Yu
    ENGINEERING, 2023, 25 : 51 - 65
  • [2] Aspect-Based Sentiment Analysis in Hindi Language by Ensembling Pre-Trained mBERT Models
    Pathak, Abhilash
    Kumar, Sudhanshu
    Roy, Partha Pratim
    Kim, Byung-Gyu
    ELECTRONICS, 2021, 10 (21)
  • [3] Annotating Columns with Pre-trained Language Models
    Suhara, Yoshihiko
    Li, Jinfeng
    Li, Yuliang
    Zhang, Dan
    Demiralp, Cagatay
    Chen, Chen
    Tan, Wang-Chiew
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22), 2022, : 1493 - 1503
  • [4] LaoPLM: Pre-trained Language Models for Lao
    Lin, Nankai
    Fu, Yingwen
    Yang, Ziyu
    Chen, Chuwei
    Jiang, Shengyi
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6506 - 6512
  • [5] PhoBERT: Pre-trained language models for Vietnamese
Nguyen, Dat Quoc
Nguyen, Anh Tuan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 1037 - 1042
  • [6] Deciphering Stereotypes in Pre-Trained Language Models
    Ma, Weicheng
    Scheible, Henry
    Wang, Brian
    Veeramachaneni, Goutham
    Chowdhary, Pratim
    Sung, Alan
    Koulogeorge, Andrew
    Wang, Lili
    Yang, Diyi
    Vosoughi, Soroush
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 11328 - 11345
  • [7] Knowledge Rumination for Pre-trained Language Models
    Yao, Yunzhi
    Wang, Peng
    Mao, Shengyu
    Tan, Chuanqi
    Huang, Fei
    Chen, Huajun
    Zhang, Ningyu
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 3387 - 3404
  • [8] Evaluating Commonsense in Pre-Trained Language Models
    Zhou, Xuhui
    Zhang, Yue
    Cui, Leyang
    Huang, Dandan
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 9733 - 9740
  • [9] Knowledge Inheritance for Pre-trained Language Models
    Qin, Yujia
    Lin, Yankai
    Yi, Jing
    Zhang, Jiajie
    Han, Xu
    Zhang, Zhengyan
    Su, Yusheng
    Liu, Zhiyuan
    Li, Peng
    Sun, Maosong
    Zhou, Jie
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3921 - 3937
  • [10] Code Execution with Pre-trained Language Models
    Liu, Chenxiao
    Lu, Shuai
    Chen, Weizhu
    Jiang, Daxin
    Svyatkovskiy, Alexey
    Fu, Shengyu
    Sundaresan, Neel
    Duan, Nan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 4984 - 4999