A novel, gradient boosting framework for sentiment analysis in languages where NLP resources are not plentiful: A case study for modern Greek

Cited by: 29
Authors
Athanasiou V. [1]
Maragoudakis M. [1]
Affiliations
[1] Artificial Intelligence Laboratory, University of the Aegean, 2 Palama Street, Samos
Keywords
Gradient boosting machines; High-dimensional data; Modern Greek; Sentiment analysis
DOI
10.3390/a10010034
Abstract
Sentiment analysis has played a primary role in text classification. It is an undoubted fact that some years ago, textual information was spreading at manageable rates; nowadays, however, such information has surpassed even the most ambitious expectations and grows constantly within seconds. It is therefore quite complex to cope with the vast amount of textual data, particularly if we also take the increasing production speed into account. Social media posts, e-commerce content, news articles, comments and opinions are broadcast on a daily basis. A rational solution for handling the abundance of data would be to build automated information processing systems for analyzing and extracting meaningful patterns from text. The present paper focuses on sentiment analysis applied to Greek texts. Thus far, natural language processing tools for Modern Greek are not widely available. Hence, a thorough analysis of Greek, from the lexical to the syntactic level, is difficult to perform. This paper attempts a different approach, based on the proven capabilities of gradient boosting, a well-known technique for dealing with high-dimensional data. The main rationale is that, since English has dominated the area of preprocessing tools and quite reliable translation services exist, we can exploit them to transform Greek tokens into English; translating individual tokens helps ensure the precision of the translation, since the translation of large texts is not always reliable and meaningful. The new feature set of English tokens is augmented with the original set of Greek tokens, producing a high-dimensional dataset that poses certain difficulties for any traditional classifier. Accordingly, we apply gradient boosting machines, an ensemble algorithm that can learn with different loss functions and can work efficiently with high-dimensional data.
Moreover, for the task at hand, we deal with a class imbalance issue, since the distribution of sentiments in real-world applications is often unequal. For example, in political forums or electronic discussions about immigration or religion, negative comments outnumber positive ones. The class imbalance problem was addressed using a hybrid technique that performs a variation of under-sampling the majority class and over-sampling the minority class. Experimental results under different settings, such as translation of tokens versus translation of sentences, limited Greek text preprocessing, and omission of the translation phase, demonstrated that the proposed gradient boosting framework can effectively cope with both high-dimensional and imbalanced datasets and performs significantly better than a plethora of traditional machine learning classification approaches in terms of precision and recall. © 2016 by the authors.
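Illustrative sketch (not from the paper): the two data-preparation ideas described in the abstract, token-level feature augmentation and hybrid resampling, might look roughly like the following in Python. Everything here is an assumption made for illustration: the toy lexicon stands in for a real translation service, the `el:`/`en:` feature prefixes and the function names are hypothetical, and plain random under- and over-sampling is swapped in for the authors' own variation, which the abstract does not specify. The gradient boosting classifier itself is omitted.

```python
import random
from collections import Counter

# Hypothetical Greek-to-English token lexicon (assumption); a real system
# would query a translation service for each token instead.
TOY_LEXICON = {"καλό": "good", "κακό": "bad", "πολύ": "very"}


def augment_tokens(greek_tokens, lexicon):
    """Count the original Greek tokens plus English translations of known tokens."""
    features = Counter(f"el:{tok}" for tok in greek_tokens)
    # Tokens missing from the lexicon contribute only their Greek feature.
    features.update(f"en:{lexicon[tok]}" for tok in greek_tokens if tok in lexicon)
    return features


def hybrid_resample(samples, labels, target, seed=0):
    """Bring every class to `target` items: random under-sampling for larger
    classes, random over-sampling with replacement for smaller ones."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)  # under-sample without replacement
        else:
            # keep every original item, then draw duplicates to fill the gap
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

Under these assumptions, `augment_tokens(["πολύ", "καλό"], TOY_LEXICON)` yields one count for each Greek token and one for each English counterpart, and `hybrid_resample` returns a dataset with `target` items per class that could then be passed to any gradient boosting implementation.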