HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research

Cited by: 58
Authors
Sedhai, Surendra [1]
Sun, Aixin [1]
Affiliation
[1] Nanyang Technological University, School of Computer Engineering, Singapore, Singapore
Source
SIGIR 2015: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015
Keywords
Twitter; tweets; hashtag; spam
DOI
10.1145/2766462.2767701
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Hashtags facilitate information diffusion on Twitter by creating dynamic, virtual communities that aggregate information from all Twitter users. Because hashtags serve as additional channels through which a user's tweets can reach users other than her own followers, hashtags, particularly popular and trending ones, are targeted for spamming (e.g., hashtag hijacking). Although much effort has been devoted to fighting email and web spam, few studies address hashtag-oriented spam in tweets. In this paper, we collected 14 million tweets that matched some trending hashtags over a two-month period and then systematically annotated the tweets as spam or ham (i.e., non-spam). We name the annotated dataset HSpam14. Our annotation process includes four major steps: (i) heuristic-based selection to search for tweets that are more likely to be spam, (ii) near-duplicate cluster-based annotation, which first groups similar tweets into clusters and then labels the clusters, (iii) reliable ham tweet detection to label tweets that are non-spam, and (iv) Expectation-Maximization (EM)-based label prediction to predict the labels of the remaining unlabeled tweets. One major contribution of this work is the creation of the HSpam14 dataset, which can be used for hashtag-oriented spam research in tweets. Another contribution is the set of observations made from a preliminary analysis of the HSpam14 dataset.
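Step (ii) of the abstract, grouping similar tweets and then labeling whole clusters, can be illustrated with a minimal sketch. The abstract does not say how near-duplicates are detected, so the normalization rule below (lowercasing and stripping URLs, mentions, and punctuation) is an assumption chosen for illustration, and the names `normalize`, `cluster_near_duplicates`, and `propagate_labels` are hypothetical.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize(text):
    """Reduce a tweet to a crude signature.

    Assumption for illustration only: the abstract does not specify
    the similarity measure used to find near-duplicates.
    """
    text = URL_RE.sub(" ", text.lower())       # drop URLs, which often vary per spam copy
    text = MENTION_RE.sub(" ", text)           # drop @mentions
    text = re.sub(r"[^a-z0-9#\s]", " ", text)  # keep letters, digits, and hashtags
    return " ".join(text.split())              # collapse whitespace


def cluster_near_duplicates(tweets):
    """Group tweets that share the same normalized signature."""
    clusters = defaultdict(list)
    for tweet in tweets:
        clusters[normalize(tweet)].append(tweet)
    return dict(clusters)


def propagate_labels(clusters, cluster_labels):
    """Give every tweet the label assigned to its cluster signature.

    Clusters without a label are skipped, i.e. their tweets stay unlabeled.
    """
    labels = {}
    for signature, members in clusters.items():
        if signature in cluster_labels:
            for tweet in members:
                labels[tweet] = cluster_labels[signature]
    return labels


tweets = [
    "Win a FREE iPhone http://a.co/1 #WorldCup",
    "win a free iphone http://b.co/2 #worldcup @bob",
    "Great match tonight #WorldCup",
]
clusters = cluster_near_duplicates(tweets)
labels = propagate_labels(clusters, {"win a free iphone #worldcup": "spam"})
```

Labeling one signature here labels both copies of the promotional tweet at once, which is the saving that cluster-level annotation aims for; the third tweet remains unlabeled and would be left to the later steps.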
Pages: 223-232
Page count: 10