Growing Story Forest Online from Massive Breaking News

Cited by: 24
Authors
Liu, Bang [1]
Niu, Di [1]
Lai, Kunfeng [2]
Kong, Linglong [1]
Xu, Yu [2]
Affiliations
[1] Univ Alberta, Edmonton, AB, Canada
[2] Tencent Inc, Mobile Internet Grp, Shenzhen, Peoples R China
Source
CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT | 2017
Keywords
Text Clustering; Online Story Tree; Information Retrieval
DOI
10.1145/3132847.3132852
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We describe our experience implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements compared with previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically cluster streaming documents into events while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation, through detailed pilot user experience studies, on 60 GB of real-world Chinese news data; our ideas, however, are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks.
Pages: 777-785
Page count: 9