Improving Classification of Tweets Using Linguistic Information from a Large External Corpus

Citations: 0
Authors
Hammer, Hugo Lewi [1 ]
Yazidi, Anis [1 ]
Bai, Aleksander [1 ]
Engelstad, Paal [1 ]
Affiliation
[1] Oslo & Akershus Univ, Coll Appl Sci, Dept Comp Sci, Oslo, Norway
Source
INDUSTRIAL NETWORKS AND INTELLIGENT SYSTEMS, INISCOM 2016 | 2017 / Vol. 188
Keywords
Classification; Co-occurrence information; Text mining; Tweets;
DOI
10.1007/978-3-319-52569-3_11
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The bag-of-words representation of documents is often unsatisfactory because it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words, such as synonyms. In this paper we use word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. We construct several methods for including the co-occurrence information and test them on the classification of real Twitter data. Our results show that using co-occurrence information reduces the number of erroneous classifications by 14%.
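The general idea of expanding a tweet's bag of words with co-occurrence information can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method evaluated in the paper: the function names `cooccurrence_counts` and `expand_bag_of_words`, the document-level co-occurrence window, and the `top_k` and `weight` parameters are all hypothetical choices made for the example, and the toy corpus is invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(corpus):
    """Count, for each word, how many documents it shares with every other word."""
    counts = defaultdict(Counter)
    for doc in corpus:
        # Use the set of words so a pair is counted once per document.
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def expand_bag_of_words(tweet, counts, top_k=1, weight=0.5):
    """Return the tweet's bag of words plus down-weighted co-occurring words.

    top_k and weight are illustrative assumptions, not values from the paper.
    """
    bag = Counter(tweet.lower().split())
    expanded = Counter({word: float(freq) for word, freq in bag.items()})
    for word, freq in bag.items():
        # Words unseen in the external corpus contribute no expansion terms.
        neighbours = counts.get(word, Counter())
        for related, _ in neighbours.most_common(top_k):
            if related not in bag:
                expanded[related] += weight * freq
    return expanded


# Toy external corpus (invented for illustration only).
external_corpus = [
    "football match goal",
    "football goal stadium",
    "goal keeper",
]
counts = cooccurrence_counts(external_corpus)
expanded = expand_bag_of_words("goal", counts, top_k=1, weight=0.5)
print(dict(expanded))
```

In this toy corpus "football" shares two documents with "goal", more than any other word, so it is the single term added to the tweet's representation, at half the weight of a literal occurrence. The expanded counts could then be passed to any standard text classifier in place of the plain bag of words.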
Pages: 122-134
Page count: 13