Predicting Supervise Machine Learning Performances for Sentiment Analysis Using Contextual-Based Approaches

Cited by: 30

Authors:

Aziz, Azwa Abdul [1,2]

Starkey, Andrew [1]

Affiliations:

[1] Univ Aberdeen, Sch Engn, Aberdeen AB24 3FX, Scotland

[2] Univ Sultan Zainal Abidin UniSZA, Fac Informat & Comp, Tembila Campus, Kuala Terengganu 22200, Malaysia

Source:

IEEE ACCESS | 2020 / Vol. 8

Keywords:

Text analytics; sentiment analysis; contextual analysis; supervised machine learning; classification

DOI:

10.1109/ACCESS.2019.2958702

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Sentiment Analysis (SA) focuses on mining opinions (identification and classification) from unstructured text data such as product reviews or microblogs. It is widely used for brand reviews, political campaigns, marketing analysis, and gathering customer feedback. One of the prominent approaches to SA is supervised machine learning (SML), in which an algorithm learns a mathematical model from a training dataset with defined class labels. While the results are promising, especially for in-domain sentiment, there is no guarantee that the model will perform equally well on real-time data, owing to the diversity of new data. In addition, previous studies suggest that SML results degrade when applied to cross-domain datasets because new features appear in different domains. So far, studies in SA have emphasised improving sentiment results, whereas there has been little discussion of how to detect performance degradation in a proposed model. Therefore, we provide a method known as Contextual Analysis (CA), a mechanism that constructs relationships between words and sources in a tree structure called the Hierarchical Knowledge Tree (HKT). Then, the Tree Similarity Index (TSI) and Tree Differences Index (TDI), formulas generated from the tree structure, are proposed to measure similarity as well as changes between the training and actual datasets. Regression analysis of the datasets reveals a highly significant positive relationship between TSI and SML accuracy. As a result, the prediction model created showed estimation errors within 2.75 to 3.94 and 2.30 to 3.51 for average absolute differences. Moreover, this method can also cluster sentiment words into positive and negative without using any linguistic resources, while at the same time capturing changes in sentiment words when a new dataset is applied.


Pages: 17722-17733

Number of pages: 12

References (31 in total):

[1] A comprehensive survey of Arabic sentiment analysis [J].