A CASE STUDY IN TEXT MINING OF DISCUSSION FORUM POSTS: CLASSIFICATION WITH BAG OF WORDS AND GLOBAL VECTORS

Cited by: 14
Author
Cichosz, Pawel [1]
Affiliation
[1] Warsaw Univ Technol, Inst Comp Sci, Nowowiejska 15-19, PL-00665 Warsaw, Poland
Keywords
text mining; discussion forums; text representation; document classification; word embedding; ONLINE
DOI
10.2478/amcs-2018-0060
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Despite the rapid growth of other types of social media, Internet discussion forums remain a highly popular communication channel and a useful source of text data for analyzing user interests and sentiments. Because they are suited to richer, deeper, and longer discussions than microblogging services, they reflect particularly well topics of long-term, persistent involvement and areas of specialized knowledge or experience. Discovering and characterizing such topics and areas with text mining algorithms is therefore an interesting and useful research direction. This work presents a case study in which selected classification algorithms are applied to posts from a Polish discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana. The utility of two vector text representations is examined: the simple bag of words representation and the more refined global vectors word embedding representation. While the former is found to work well for the multinomial naive Bayes algorithm, the latter turns out to be more useful for the other classification algorithms: logistic regression, SVMs, and random forests. The results suggest that post classification can be applied to measure the publication intensity of particular topics and, in the case of forums related to psychoactive substances, to monitor the risk of drug-related crime.
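As an illustration of the first combination named in the abstract (bag of words with multinomial naive Bayes), the following is a minimal sketch in plain Python. It is not the author's implementation: the whitespace tokenizer, the Laplace smoothing parameter `alpha`, the class name `MultinomialNB`, and the toy English posts and topic labels are all assumptions made for this example. The global vectors representation and the other classifiers are not covered.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Map a post to token counts (lowercased whitespace split; a simplification)."""
    return Counter(text.lower().split())


class MultinomialNB:
    """Multinomial naive Bayes over bag-of-words counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed value, not from the paper)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_docs)  # log prior
            for word, tf in bag_of_words(text).items():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                p = (self.word_counts[label][word] + self.alpha) / (
                    total + self.alpha * vocab_size
                )
                score += tf * math.log(p)  # log likelihood, weighted by term count
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy posts and topic labels, invented for this sketch.
train_texts = [
    "seeds soil light watering",
    "soil light fertilizer seeds",
    "court police law penalty",
    "law police fine court",
]
train_labels = ["cultivation", "cultivation", "legal", "legal"]

model = MultinomialNB().fit(train_texts, train_labels)
print(model.predict("which soil and light for seeds"))
print(model.predict("police and court penalty"))
```

Counting how many posts fall into each predicted topic over time would be one simple way to approximate the "publication intensity" the abstract mentions, though the paper's actual procedure may differ.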
Pages: 787-801
Page count: 15