Toward Machine Learning Based Binary Sentiment Classification of Movie Reviews for Resource Restraint Language (RRL)-Hindi

Cited by: 2
Authors
Sharma, Ankita [1]
Ghose, Udayan [1]
Affiliations
[1] Univ Sch Informat Commun & Technol, Guru Gobind Singh Indraprastha Univ, New Delhi 110078, India
Keywords
Hindi; machine learning; movie reviews; NLP; opinion mining; sentiment analysis; stacking ensemble; TF-ISF; SVM
DOI
10.1109/ACCESS.2023.3283461
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Sentiment analysis has progressed significantly in English, whereas Hindi research is still nascent. Despite being the third most spoken language worldwide, Hindi remains an RRL. Movie reviews are a treasure trove of opinionated content, fueled by people's passionate engagement with the film industry. The growing use of Hindi in written reviews motivated us to devise an approach for binary sentiment classification of movie reviews. We compiled and manually annotated a Hindi Language Movie Review (HLMR) dataset comprising 10K reviews for experiments, and we also identified challenges associated with Hindi. In addition to HLMR, two publicly available IIT-P datasets (movie reviews and product reviews) are used. Following dataset preprocessing, we explored TF-ISF with word-level N-gram features for text representation. Studies suggest that the performance of machine learning approaches can be enhanced by hyperparameter tuning and ensemble learning. Several baseline classifiers were applied first, and their hyperparameters were tuned using grid search. Subsequently, ensemble-based classifiers were applied individually. Lastly, we propose a simple yet powerful stacked ensemble-based architecture (SEBA), which effectively classifies Hindi reviews by leveraging the strengths of both approaches. Comprehensive experiments were conducted on all datasets. Empirical results demonstrate that SEBA outperformed the individual baselines and performed best with unigrams and TF-ISF as features across the datasets. SEBA achieved an accuracy, precision, and recall of 0.808 and an F1-score of 0.807 on the HLMR dataset. These findings support the effectiveness of the proposed solution and indicate its suitability for online deployment in binary review classification tasks.
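The abstract names TF-ISF over word-level N-grams as the text representation but does not give a formula. The following is a minimal sketch of one common formulation, assuming the weight of an N-gram in a sentence is its relative frequency in that sentence multiplied by log(N / sf), where N is the number of sentences and sf is the number of sentences containing the N-gram. The function name `tf_isf`, the whitespace tokenizer, and the toy romanized samples are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def tf_isf(sentences, n=1):
    """Return one {ngram: weight} dict per sentence (assumed TF-ISF variant)."""

    def ngrams(text):
        # Whitespace tokenization is an assumption; real Hindi text would
        # need the preprocessing described in the paper.
        tokens = text.split()
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    grams_per_sentence = [ngrams(s) for s in sentences]
    total = len(sentences)

    # Sentence frequency: number of sentences that contain each n-gram.
    sentence_freq = Counter()
    for grams in grams_per_sentence:
        sentence_freq.update(set(grams))

    weights = []
    for grams in grams_per_sentence:
        counts = Counter(grams)
        weights.append({
            g: (c / len(grams)) * math.log(total / sentence_freq[g])
            for g, c in counts.items()
        })
    return weights


# Hypothetical toy reviews (romanized for readability).
samples = ["film acchi hai", "film buri hai"]
unigram_weights = tf_isf(samples, n=1)
bigram_weights = tf_isf(samples, n=2)
```

N-grams that occur in every sentence receive a weight of zero under this formulation, so only the distinguishing terms contribute to the feature vector.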
Pages: 58546-58564
Page count: 19
Related papers
57 in total
[1]  
Akhtar MS, 2016, LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, P2703
[2]  
Akhtar MS., 2016, P COLING 2016 26 INT, P482
[3]  
Alotaibi SS, 2016, International Journal on Natural Language Computing, V5, P1, DOI 10.5121/ijnlc.2016.5301
[4]   Short text classification for Arabic social media tweets [J].
Alzanin, Samah M. ;
Azmi, Aqil M. ;
Aboalsamh, Hatim A. .
JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2022, 34 (09) :6595-6604
[5]   Sentence Classification Using N-Grams in Urdu Language Text [J].
Awan, Malik Daler Ali ;
Ali, Sikandar ;
Samad, Ali ;
Iqbal, Nadeem ;
Missen, Malik Muhammad Saad ;
Ullah, Niamat .
SCIENTIFIC PROGRAMMING, 2021, 2021
[6]  
Balaji P, 2023, INT J ADV COMPUT SC, V14, P185
[7]   Sentiment classification of Roman-Urdu opinions using Naive Bayesian, Decision Tree and KNN classification techniques [J].
Bilal, Muhammad ;
Israr, Huma ;
Shahid, Muhammad ;
Khan, Amin .
JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2016, 28 (03) :330-344
[8]   Sentiment analysis for customer relationship management: an incremental learning approach [J].
Capuano, Nicola ;
Greco, Luca ;
Ritrovato, Pierluigi ;
Vento, Mario .
APPLIED INTELLIGENCE, 2021, 51 (06) :3339-3352
[9]  
Ceyhan M, 2021, European Journal of Formal Sciences and Engineering, V4, P57, DOI 10.26417/328uno67t
[10]   Development of Sindhi text corpus [J].
Dootio, Mazhar Ali ;
Wagan, Asim Imdad .
JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2021, 33 (04) :468-475