Lexicon-pointed hybrid N-gram Features Extraction Model (LeNFEM) for sentence level sentiment analysis

Citations: 10
Authors
Mutinda, James [1]
Mwangi, Waweru [2]
Okeyo, George [3]
Affiliations
[1] Kenya Sch Govt Embu Campus, Dept Informat & Commun Technol, POB 402, Embu 60100, Kenya
[2] Jomo Kenyatta Univ Agr & Technol, Sch Comp & Informat Technol, Nairobi, Kenya
[3] De Montfort Univ, Sch Comp Sci & Informat, Leicester, Leics, England
Keywords
feature selection; lexicon; minimum redundancy maximum relevance; N-gram2vec model; sentence level SA; sentiment classification; TF-IDF; CLASSIFICATION
DOI
10.1002/eng2.12374
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Sentiment analysis of social media posts can yield information and knowledge applicable to social settings, business intelligence, the evaluation of citizens' opinions in governance, and mood-triggered devices in the Internet of Things. Feature extraction and selection are key determinants of the accuracy and computational cost of machine learning models for such analysis. Most feature extraction and selection techniques rely on bag of words (BOW), N-grams, and frequency-based algorithms, especially Term Frequency-Inverse Document Frequency (TF-IDF). However, these approaches do not consider relationships between words, ignore the characteristics of words, and suffer from high feature dimensionality. In this paper we propose and evaluate a feature extraction and selection approach for sentence-level sentiment analysis that combines a fixed hybrid N-gram window for feature extraction with the minimum redundancy maximum relevance (MRMR) feature selection algorithm. The approach improves on existing feature extraction techniques, specifically the N-gram, by generating a hybrid vector from words, Part of Speech (POS) tags, and word semantic orientation. The vector is extracted using a static trigram window that a lexicon places where a sentiment word appears in a sentence. A blend of the words, POS tags, and sentiment orientations of the static trigram is used to build the feature vector. The optimal features are then selected from the vector using the MRMR algorithm. Experiments on the public Yelp dataset compared the proposed model with existing feature extraction models (BOW, normal N-grams, and lexicon-based bag of words semantic orientations). With supervised machine learning classifiers, the proposed model achieved the highest F-measure (88.64%), compared with the best baseline result of 83.55%. A Wilcoxon test confirmed that the proposed approach performed significantly better than the baseline approaches. Comparative performance analysis on other datasets further indicated that the proposed approach is generalizable.
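The lexicon-pointed trigram idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `LEXICON`, the `POS_TAGS` lookup table, the `<pad>` token, the function name `hybrid_trigrams`, and the choice to center the window on the sentiment word are all assumptions made for illustration. A real pipeline would use a full sentiment lexicon and a trained POS tagger, and would pass the resulting vectors to an MRMR selection step, which is omitted here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy resources (assumptions, not from the paper).
LEXICON: Dict[str, int] = {"great": 1, "good": 1, "terrible": -1, "bad": -1}
POS_TAGS: Dict[str, str] = {
    "the": "DT", "food": "NN", "service": "NN", "was": "VBD",
    "great": "JJ", "terrible": "JJ", "and": "CC", "<pad>": "PAD",
}
PAD = "<pad>"


def hybrid_trigrams(sentence: str) -> List[Tuple[str, ...]]:
    """Return one hybrid feature tuple per lexicon hit in the sentence.

    Each tuple holds the three window words, then their POS tags,
    then their sentiment orientations (as strings).
    """
    tokens = sentence.lower().split()
    features: List[Tuple[str, ...]] = []
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue  # only sentiment words anchor a window
        # Assumed window layout: previous token, sentiment word, next token.
        window = [
            tokens[j] if 0 <= j < len(tokens) else PAD
            for j in (i - 1, i, i + 1)
        ]
        tags = [POS_TAGS.get(word, "UNK") for word in window]
        orientations = [str(LEXICON.get(word, 0)) for word in window]
        features.append(tuple(window + tags + orientations))
    return features


if __name__ == "__main__":
    for feature in hybrid_trigrams("The service was terrible and the food was great"):
        print(feature)
```

Because only sentiment-bearing positions produce a window, a sentence yields at most as many trigrams as it has lexicon hits, which is one plausible reading of how the approach keeps feature dimensionality below that of plain N-grams.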
Pages: 17