Toward Machine Learning Based Binary Sentiment Classification of Movie Reviews for Resource Restraint Language (RRL)-Hindi

Cited by: 2

Authors:

Sharma, Ankita [1]

Ghose, Udayan [1]

Affiliations:

[1] Univ Sch Informat Commun & Technol, Guru Gobind Singh Indraprastha Univ, New Delhi 110078, India

Source:

IEEE ACCESS | 2023 / Vol. 11

Keywords:

Hindi; machine learning; movie reviews; NLP; opinion mining; sentiment analysis; stacking ensemble; TF-ISF; SVM;

DOI:

10.1109/ACCESS.2023.3283461

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Sentiment analysis has progressed significantly in English, whereas Hindi research is still nascent. Despite being the third most spoken language worldwide, Hindi remains a resource restraint language (RRL). Movie reviews are a treasure trove of opinionated content, fueled by people's passionate engagement with the film industry. The growing use of Hindi in review writing motivated us to devise an approach for binary sentiment classification of movie reviews. We compiled and manually annotated a Hindi Language Movie Review (HLMR) dataset comprising 10K reviews for experiments, and identified challenges associated with Hindi. In addition to HLMR, two publicly available IIT-P movie and product review datasets are used. Following dataset preprocessing, we explored TF-ISF with word-level N-gram features for text representation. Studies suggest that the performance of machine learning approaches can be enhanced by hyperparameter tuning and ensemble learning. Several baseline classifiers were initially applied, and their hyperparameters were tuned using grid search. Subsequently, ensemble-based classifiers were applied individually. Lastly, we propose a simple yet powerful stacked ensemble-based architecture (SEBA), which effectively classifies Hindi reviews by leveraging the strengths of both approaches. Comprehensive experiments were conducted on all deployed datasets. Empirical results demonstrate that SEBA outperformed the individual baselines and performed best with unigrams and TF-ISF as features across the deployed datasets. SEBA achieved an accuracy, precision, and recall of 0.808 and an F1-score of 0.807 on the HLMR dataset. These findings support the effectiveness of the proposed solution and indicate its suitability for online deployment in binary review classification tasks.
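The abstract names TF-ISF over word-level N-grams as the text representation but does not spell out the weighting. The sketch below illustrates one common reading of it, term frequency multiplied by inverse sentence frequency, under stated assumptions: each review is treated as a single "sentence", tokens are split on whitespace, and the formula `tf * log(N / sf)` is a textbook variant that may differ from the one the authors used. The function names and the three toy reviews are hypothetical and are not drawn from the HLMR dataset.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word-level n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_isf(sentences, n=1):
    """Score every n-gram of every sentence by TF * ISF.

    Assumed formulation (not taken from the paper):
      tf(t, s) = count of t in s / number of n-grams in s
      isf(t)   = log(N / sf(t)), where N is the number of sentences and
                 sf(t) is the number of sentences containing t
    """
    grams = [word_ngrams(s.split(), n) for s in sentences]
    total = len(grams)
    # Count each n-gram once per sentence to get its sentence frequency.
    sentence_freq = Counter(g for sent in grams for g in set(sent))
    scores = []
    for sent in grams:
        counts = Counter(sent)
        scores.append({
            g: (c / len(sent)) * math.log(total / sentence_freq[g])
            for g, c in counts.items()
        })
    return scores


# Hypothetical toy reviews, whitespace-tokenized.
reviews = [
    "फिल्म बहुत अच्छी है",
    "फिल्म बहुत खराब है",
    "कहानी अच्छी है",
]
unigram_scores = tf_isf(reviews, n=1)
bigram_scores = tf_isf(reviews, n=2)
```

Under this weighting, a token that occurs in every review (here "है") receives a score of zero, while a token unique to one review is weighted most heavily. The resulting dictionaries could be fed to any classifier that accepts sparse features; the grid search and stacking stages described in the abstract are not covered by this sketch.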

Pages: 58546-58564

Page count: 19
