Classification of Shopify App User Reviews Using Novel Multi Text Features

Cited by: 46
Authors
Rustam, Furqan [1]
Mehmood, Arif [4]
Ahmad, Muhammad [2, 3]
Ullah, Saleem [1]
Khan, Dost Muhammad [4]
Choi, Gyu Sang [5]
Affiliations
[1] Khwaja Fareed Univ Engn & Informat Technol, Dept Comp Sci, Rahim Yar Khan 64200, Pakistan
[2] Khwaja Fareed Univ Engn & Informat Technol, Dept Comp Engn, Rahim Yar Khan 64200, Pakistan
[3] Univ Messina, Dipartimento Matemat & Informat MIFT, I-98121 Messina, Italy
[4] Islamia Univ Bahawalpur, Dept Comp Sci & IT, Bahawalpur 63100, Pakistan
[5] Yeungnam Univ, Dept Informat & Commun Engn, Gyongsan 38541, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Feature engineering; feature extraction; feature selection; machine learning; review classification; text mining; RANDOM FOREST; CLASSIFIERS; REGRESSION; MODELS;
DOI
10.1109/ACCESS.2020.2972632
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
App stores usually let users post reviews and ratings, which developers use to resolve issues and plan future work on their apps. As a result, app stores accumulate large amounts of data for analysis. However, several challenges related to redundancy and data volume must first be addressed using machine learning. This study performs experiments on a dataset of reviews for Shopify apps. To overcome these challenges, we first categorize user reviews into two groups, happy and unhappy, and then preprocess the reviews to clean the data. Next, several feature engineering techniques, namely bag-of-words, term frequency-inverse document frequency (TF-IDF), and chi-square (Chi2), are applied singly and in combination to preserve meaningful information. Finally, random forest, AdaBoost, and logistic regression models are used to classify the reviews as happy or unhappy. The proposed pipeline was evaluated using average accuracy, precision, recall, and F1 score. The experiments show that combining features can improve the performance of machine learning models; in this study, logistic regression outperforms the other models and achieves an 83% true acceptance rate when combined with TF-IDF and Chi2.
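The feature step described in the abstract (TF-IDF weighting followed by Chi2 selection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy reviews, the labels, the helper names (`tfidf`, `chi2_scores`, `select_top_k`), the smoothed IDF formula, and the textbook 2x2 form of the chi-square statistic are all choices made here for illustration.

```python
import math
from collections import Counter

# Toy reviews and labels (1 = happy, 0 = unhappy); invented for illustration only.
reviews = [
    "great app easy setup",
    "love this app great support",
    "terrible app broken checkout",
    "broken sync terrible support",
]
labels = [1, 1, 0, 0]

docs = [r.split() for r in reviews]          # whitespace tokenization (assumption)
vocab = sorted({t for d in docs for t in d})  # bag-of-words vocabulary


def tfidf(docs, vocab):
    """TF-IDF rows using a smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    rows = []
    for d in docs:
        tf = Counter(d)  # raw term counts in this document
        rows.append([tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in vocab])
    return rows


def chi2_scores(docs, labels, vocab):
    """Chi-square of term presence vs. class from a 2x2 contingency table."""
    n = len(docs)
    scores = {}
    for t in vocab:
        a = sum(1 for d, y in zip(docs, labels) if t in d and y == 1)      # present, happy
        b = sum(1 for d, y in zip(docs, labels) if t in d and y == 0)      # present, unhappy
        c = sum(1 for d, y in zip(docs, labels) if t not in d and y == 1)  # absent, happy
        e = sum(1 for d, y in zip(docs, labels) if t not in d and y == 0)  # absent, unhappy
        denom = (a + b) * (c + e) * (a + c) * (b + e)
        scores[t] = n * (a * e - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(rows, vocab, scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    keep = sorted(vocab, key=lambda t: (-scores[t], t))[:k]
    idx = [vocab.index(t) for t in keep]
    return keep, [[row[i] for i in idx] for row in rows]


X = tfidf(docs, vocab)
scores = chi2_scores(docs, labels, vocab)
kept_terms, X_selected = select_top_k(X, vocab, scores, k=3)
print(kept_terms)
```

The reduced matrix `X_selected` is what a classifier such as logistic regression would then be trained on. Terms that appear equally often in both classes (here, "support") score zero and are dropped first.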
Pages: 30234-30244
Number of pages: 11