Weakly Supervised Extraction of Computer Security Events from Twitter

Cited by: 78
Authors
Ritter, Alan [1, 4]
Wright, Evan [2 ]
Casey, William [2 ]
Mitchell, Tom [3 ]
Affiliations
[1] Ohio State Univ, Comp Sci & Engn, Columbus, OH 43210 USA
[2] Carnegie Mellon Univ, Software Engn Inst, Pittsburgh, PA 15213 USA
[3] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[4] Carnegie Mellon, Pittsburgh, PA USA
Source
PROCEEDINGS OF THE 24TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW 2015) | 2015
Funding
Andrew Mellon Foundation (USA)
DOI
10.1145/2736277.2741083
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Twitter contains a wealth of timely information; however, staying on top of breaking events requires that an information analyst constantly scan many sources, leading to information overload. For example, a user might wish to be made aware whenever an infectious disease outbreak takes place, when a new smartphone is announced, or when a distributed Denial of Service (DoS) attack might affect an organization's network connectivity. There are many possible event categories an analyst may wish to track, making it impossible to anticipate all those of interest in advance. We therefore propose a weakly supervised approach, in which extractors for new categories of events are easy to define and train by specifying a small number of seed examples. We cast seed-based event extraction as a learning problem where only positive and unlabeled data is available. Rather than assuming unlabeled instances are negative, as is common in previous work, we propose a learning objective which regularizes the label distribution towards a user-provided expectation. Our approach greatly outperforms heuristic negatives, used in most previous work, in experiments on real-world data. Significant performance gains are also demonstrated over two novel and competitive baselines: semi-supervised EM and one-class support-vector machines. We investigate three security-related events breaking on Twitter: DoS attacks, data breaches and account hijacking. A demonstration of security events extracted by our system is available at: http://kb1.cse.ohio-state.edu:8123/events/hacked
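The idea of learning from positive seeds plus unlabeled data, with the predicted label distribution pulled toward a user-provided expectation, can be illustrated with a minimal sketch. This is not the authors' implementation or exact objective: it assumes a plain logistic model, a Bernoulli KL penalty on the mean predicted probability over the unlabeled set, and fixed-step gradient ascent. The function names, hyperparameters (`lam`, `l2`, `lr`, `steps`), and the toy one-dimensional data are all hypothetical and invented for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def predict(w, x):
    # Probability that x is a positive (event) instance.
    return sigmoid(dot(w, x))


def kl_bernoulli(p, q, eps=1e-9):
    # KL(Bernoulli(p) || Bernoulli(q)); assumes 0 < p < 1.
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def train(positives, unlabeled, target, lam=5.0, l2=0.01, lr=0.1, steps=2000):
    """Maximize: mean log p(y=1|x) over seeds
                 - lam * KL(target || mean p(y=1|x) over unlabeled)
                 - (l2 / 2) * ||w||^2
    Unlabeled instances are never treated as negatives."""
    dim = len(positives[0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [-l2 * wj for wj in w]
        # Likelihood term on the positive seed examples.
        for x in positives:
            s = predict(w, x)
            for j in range(dim):
                grad[j] += (1.0 - s) * x[j] / len(positives)
        # Expectation term on the unlabeled examples.
        probs = [predict(w, x) for x in unlabeled]
        p_hat = sum(probs) / len(probs)
        p_hat = min(max(p_hat, 1e-9), 1.0 - 1e-9)
        dkl_dp = -target / p_hat + (1.0 - target) / (1.0 - p_hat)
        for x, s in zip(unlabeled, probs):
            for j in range(dim):
                grad[j] -= lam * dkl_dp * s * (1.0 - s) * x[j] / len(unlabeled)
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy data (invented): each instance is [bias, feature].
SEEDS = [[1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
UNLABELED = [[1.0, v] for v in
             (2.0, 2.5, -2.0, -2.5, -3.0, -1.5, -2.0, -2.5, -3.0, -1.5)]

if __name__ == "__main__":
    weights = train(SEEDS, UNLABELED, target=0.2)
    mean_prob = sum(predict(weights, x) for x in UNLABELED) / len(UNLABELED)
    print("weights:", weights)
    print("mean predicted probability on unlabeled data:", mean_prob)
```

In this sketch the `target` argument plays the role of the user-provided expectation: the penalty is zero only when the average predicted probability on the unlabeled pool matches it, so the model cannot label everything positive even though it never sees an explicit negative example.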
Pages: 896-905
Page count: 10