Sentiment analysis based on improved pre-trained word embeddings

Cited by: 198
Authors
Rezaeinia, Seyed Mahdi [1]
Rahmani, Rouhollah [1]
Ghodsi, Ali [2]
Veisi, Hadi [1]
Affiliations
[1] Univ Tehran, Network Sci & Technol Dept, Tehran, Iran
[2] Univ Waterloo, Dept Stat & Actuarial Sci, Waterloo, ON, Canada
Keywords
Sentiment analysis; Deep learning; Word embeddings; Word2Vec; GloVe; Natural language processing; Neural networks; Machine
DOI
10.1016/j.eswa.2018.08.044
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Sentiment analysis is a fast-growing area of research in natural language processing (NLP) and text classification. This technique has become an essential part of a wide range of applications including politics, business, advertising and marketing. There are various techniques for sentiment analysis, but recently word embedding methods have been widely used in sentiment classification tasks. Word2Vec and GloVe are currently among the most accurate and usable word embedding methods which can convert words into meaningful vectors. However, these methods ignore the sentiment information of texts and need a large corpus of texts for training and generating exact vectors. As a result, because of the small size of some corpora, researchers often have to use pre-trained word embeddings which were trained on other large text corpora such as Google News with about 100 billion words. The increasing accuracy of pre-trained word embeddings has a great impact on sentiment analysis research. In this paper, we propose a novel method, Improved Word Vectors (IWV), which increases the accuracy of pre-trained word embeddings in sentiment analysis. Our method is based on Part-of-Speech (POS) tagging techniques, lexicon-based approaches, a word position algorithm and Word2Vec/GloVe methods. We tested the accuracy of our method via different deep learning models and benchmark sentiment datasets. Our experimental results show that Improved Word Vectors (IWV) are very effective for sentiment analysis. © 2018 Published by Elsevier Ltd.
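The abstract names the ingredients of IWV (a pre-trained Word2Vec/GloVe vector, POS tags, sentiment lexicons, and word position) but does not say how they are combined. As a rough illustration only, the sketch below assumes the simplest option, concatenating one feature block per ingredient. The toy vectors, tag set, lexicon scores, and the function names `pos_one_hot` and `improved_vector` are all hypothetical and are not taken from the paper.

```python
# Hypothetical toy resources. A real pipeline would load Word2Vec/GloVe
# vectors, run a POS tagger, and read one or more sentiment lexicons.
PRETRAINED = {"good": [0.2, 0.7, -0.1], "movie": [0.5, -0.3, 0.4]}
EMBED_DIM = 3
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "OTHER"]
LEXICON = {"good": 0.8, "bad": -0.8}


def pos_one_hot(tag):
    """Encode a POS tag as a one-hot list; unknown tags map to OTHER."""
    tag = tag if tag in POS_TAGS else "OTHER"
    return [1.0 if t == tag else 0.0 for t in POS_TAGS]


def improved_vector(word, tag, position, sentence_length):
    """Concatenate embedding, POS, lexicon, and position features.

    Concatenation is an assumption made for this sketch; the abstract
    does not specify the combination rule.
    """
    base = PRETRAINED.get(word, [0.0] * EMBED_DIM)
    lexicon_score = [LEXICON.get(word, 0.0)]
    # Relative position in [0, 1]; a single-word sentence gets 0.0.
    relative_position = [position / max(sentence_length - 1, 1)]
    return base + pos_one_hot(tag) + lexicon_score + relative_position


tagged = [("good", "ADJ"), ("movie", "NOUN")]
vectors = [
    improved_vector(word, tag, i, len(tagged))
    for i, (word, tag) in enumerate(tagged)
]
print(len(vectors[0]))  # 3 + 5 + 1 + 1 = 10 dimensions per word
```

Under these assumptions, each word vector grows from 3 to 10 dimensions, and the enlarged vectors would then be passed to a downstream sentiment classifier in place of the original pre-trained ones.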
Pages: 139-147
Number of pages: 9