A tale of two epidemics: Contextual Word2Vec for classifying Twitter streams during outbreaks

Cited by: 120
Authors
Khatua, Aparup [1 ,2 ]
Khatua, Apalak [3 ]
Cambria, Erik [2 ]
Affiliations
[1] Univ Calcutta, Dept Comp Sci & Engn, Kolkata, W Bengal, India
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[3] XLRI Xavier Sch Management, Jamshedpur, Bihar, India
Keywords
Epidemics; Ebola; Zika; PubMed; Twitter; Text classification; Word Vectors; SOCIAL MEDIA; MISINFORMATION;
DOI
10.1016/j.ipm.2018.10.010
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Unstructured tweet feeds are becoming the source of real-time information for various events. However, extracting actionable information in real time from this unstructured text data is a challenging task. Hence, researchers are employing word embedding approaches to classify unstructured text data. We set our study in the contexts of the 2014 Ebola and 2016 Zika outbreaks and probed the accuracy of domain-specific word vectors for identifying crisis-related actionable tweets. Our findings suggest that relatively smaller domain-specific input corpora from the Twitter corpus are better at extracting meaningful semantic relationships than generic pre-trained Word2Vec (derived from Google News) or GloVe (from the Stanford NLP group). However, domain-specific, high-quality tweet corpora are normally scant during the early stages of outbreaks, and identifying actionable tweets during the early stages is crucial to stemming the proliferation of an outbreak. To overcome this challenge, we consider scholarly abstracts related to the Ebola and Zika viruses from PubMed and probe the efficiency of cross-domain resource utilization for word vector generation. Our findings demonstrate the relevance of PubMed abstracts for training purposes when Twitter data (as input corpus) is scant during the early stages of an outbreak. Thus, this approach can be implemented to handle future outbreaks in real time. We also explore the accuracy of our word vectors for various model architectures and hyper-parameter settings. We observe that Skip-gram accuracies are better than CBOW, and higher dimensions yield better accuracy.
Pages: 247-257
Number of pages: 11