Extracting significant time varying features from text

Cited by: 37
Authors
Swan, R [1]
Allan, J [1]
Affiliations
[1] Univ Massachusetts, Dept Comp Sci, Ctr Intelligent Informat Retrieval, Amherst, MA 01003 USA
Source
PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM'99 | 1999
DOI
10.1145/319950.319956
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We propose a simple statistical model for the frequency of occurrence of features in a stream of text. Adoption of this model allows us to use classical significance tests to filter the stream for interesting events. We tested the model by building a system and running it on a news corpus. By a subjective evaluation, the system worked remarkably well: almost all of the groups of identified tokens corresponded to news stories and were appropriately placed in time. A preliminary objective evaluation was also used to measure the quality of the system and it showed some of the weaknesses and the power of our approach.
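The abstract does not say which classical significance test is used, so the following is only a minimal sketch of the general idea, assuming a Pearson chi-square test on a 2x2 contingency table that compares how often a feature occurs in one day's documents against all other days. The function names, the per-day granularity, and the 0.05 critical value are illustrative assumptions, not details taken from the paper.

```python
# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
# The choice of alpha is an assumption made for this illustration.
CRITICAL_0_05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero marginal means the table carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def significant_days(daily_hits, daily_docs, critical=CRITICAL_0_05):
    """Return indices of days on which a feature is significantly over-represented.

    daily_hits[i]: number of documents on day i that contain the feature
    daily_docs[i]: total number of documents on day i
    """
    total_hits = sum(daily_hits)
    total_docs = sum(daily_docs)
    flagged = []
    for day, (hits, docs) in enumerate(zip(daily_hits, daily_docs)):
        a = hits                        # this day, feature present
        b = docs - hits                 # this day, feature absent
        c = total_hits - hits           # other days, feature present
        d = (total_docs - docs) - c     # other days, feature absent
        # Flag only over-representation (a*d > b*c), not unusually quiet days.
        if a * d > b * c and chi_square_2x2(a, b, c, d) > critical:
            flagged.append(day)
    return flagged


if __name__ == "__main__":
    hits = [1, 0, 2, 30, 1]    # hypothetical counts for one token
    docs = [100] * 5           # hypothetical corpus size per day
    print(significant_days(hits, docs))
```

In this toy input the token is rare on most days and appears in 30 of 100 documents on one day, which is the kind of burst such a filter is meant to surface.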
Pages: 38-45
Page count: 8