HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research

Cited by: 58
Authors
Sedhai, Surendra [1]
Sun, Aixin [1]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Engn, Singapore, Singapore
Source
SIGIR 2015: PROCEEDINGS OF THE 38TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL | 2015
Keywords
Twitter; tweets; hashtag; spam;
DOI
10.1145/2766462.2767701
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Hashtags facilitate information diffusion on Twitter by creating dynamic, virtual communities that aggregate information from all Twitter users. Because hashtags serve as additional channels through which a user's tweets can reach users beyond her own followers, hashtags, particularly popular and trending ones, are targeted for spamming (e.g., hashtag hijacking). Although much effort has been devoted to fighting email and web spam, few studies address hashtag-oriented spam in tweets. In this paper, we collected 14 million tweets that matched some trending hashtags over a two-month period and then systematically annotated the tweets as spam or ham (i.e., non-spam). We name the annotated dataset HSpam14. Our annotation process includes four major steps: (i) heuristic-based selection to search for tweets that are more likely to be spam, (ii) near-duplicate cluster-based annotation to first group similar tweets into clusters and then label the clusters, (iii) reliable ham tweet detection to label tweets that are non-spam, and (iv) Expectation-Maximization (EM)-based label prediction to predict the labels of the remaining unlabeled tweets. One major contribution of this work is the creation of the HSpam14 dataset, which can be used for hashtag-oriented spam research in tweets. Another contribution is the set of observations made from the preliminary analysis of the HSpam14 dataset.
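Step (ii) of the abstract groups similar tweets into clusters before labeling them. The abstract does not say how similarity is computed, so the following is only a minimal sketch under stated assumptions: word 3-shingles, Jaccard similarity, a greedy single-pass clustering, and a threshold of 0.5. All function names and parameter values here are hypothetical illustrations, not the authors' procedure.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a lowercased tweet."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Short tweets become a single shingle made of all their tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(tweets: List[str],
                            threshold: float = 0.5) -> List[List[int]]:
    """Greedy single-pass clustering (an assumed, illustrative scheme).

    Each tweet joins the first cluster whose representative (the first
    tweet placed in it) has Jaccard similarity >= threshold; otherwise
    it starts a new cluster. Returns lists of tweet indices.
    """
    clusters: List[Tuple[Set[Shingle], List[int]]] = []
    for idx, tweet in enumerate(tweets):
        sh = shingles(tweet)
        for rep, members in clusters:
            if jaccard(rep, sh) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((sh, [idx]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample = [
        "win a free iphone now #news",
        "win a free iphone now #sports",
        "lovely weather in singapore today",
    ]
    print(cluster_near_duplicates(sample))
```

In this sketch an annotator would then label each cluster once, and the label would apply to every tweet index in that cluster. The pairwise comparison against representatives is quadratic in the worst case, so a collection of 14 million tweets would need a scalable scheme such as hashing-based candidate generation.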
Pages: 223-232
Number of pages: 10