BERT-Based Sentiment Analysis for Low-Resourced Languages: A Case Study of Urdu Language

Cited by: 8

Authors:

Ashraf, Muhammad Rehan ^{[1,2]}

Jana, Yasmeen ^{[2]}

Umer, Qasim ^{[2,3]}

Jaffar, M. Arfan ^{[1]}

Chung, Sungwook ^{[4]}

Ramay, Waheed Yousuf ^{[5]}

Affiliations:

[1] Super Univ, Dept Comp Sci, Lahore 54000, Pakistan

[2] COMSATS Univ Islamabad, Dept Comp Sci, Vehari 61000, Pakistan

[3] Hanyang Univ, Dept Comp Sci, Seoul 04763, South Korea

[4] Changwon Natl Univ, Dept Comp Engn, Chang Won 51140, South Korea

[5] Air Univ, Dept Comp Sci, Multan 60000, Pakistan

Source:

IEEE ACCESS | 2023 / Vol. 11

Keywords:

Sentiment analysis; Support vector machines; Social networking (online); Sports; Blogs; Encoding; Natural language processing; Linguistics; Urdu; BERT; Classification; Roman Urdu; Machine

DOI:

10.1109/ACCESS.2023.3322101

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Sentiment analysis holds significant importance in research projects by providing valuable insights into public opinions. However, the majority of sentiment analysis studies focus on the English language, leaving a gap in research for low-resourced or regional languages, e.g., Persian, Pashto, and Urdu. Moreover, computational linguists face the challenge of developing lexical resources for these languages. In light of this, this paper presents a deep learning-based approach for Urdu text sentiment analysis (USA-BERT) that leverages Bidirectional Encoder Representations from Transformers (BERT), and introduces an Urdu Dataset for Sentiment Analysis-23 (UDSA-23). USA-BERT first preprocesses the Urdu reviews using the BERT tokenizer. Second, it creates BERT embeddings for each Urdu review. Third, given the BERT embeddings, it fine-tunes a deep learning classifier (BERT). Finally, it employs the Pareto principle on two datasets (the state-of-the-art UCSA-21 and UDSA-23) to assess USA-BERT. The assessment results demonstrate that USA-BERT significantly surpasses the existing methods, improving accuracy and F-measure by up to 26.09% and 25.87%, respectively.
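
The evaluation step in the abstract can be illustrated with a minimal sketch. It assumes that "employing the Pareto principle" means an 80/20 train/test split, which the abstract does not state explicitly, and it computes accuracy and F-measure for a binary labelling. The function names, the label strings, and the fixed seed are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def pareto_split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into train/test partitions.

    Assumption: the 80/20 ratio is our reading of the "Pareto principle"
    mentioned in the abstract; the paper may apply it differently.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_measure(y_true, y_pred, positive="positive"):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In such a setup, the classifier would be fine-tuned on the first partition returned by `pareto_split` and scored on the second with `accuracy` and `f_measure`.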


Pages: 110245-110259

Number of pages: 15
