ETM: Entity Topic Models for Mining Documents Associated with Entities

Citations: 27
Authors
Kim, Hyungsul [1]
Sun, Yizhou [1]
Hockenmaier, Julia [1]
Han, Jiawei [1]
Affiliations
[1] Univ Illinois, Urbana, IL USA
Source
12th IEEE International Conference on Data Mining (ICDM 2012) | 2012
Keywords
topic models; data mining; entity
DOI
10.1109/ICDM.2012.107
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Topic models, which factor each document into different topics and represent each topic as a distribution of terms, have been widely and successfully used to better understand collections of text documents. However, documents are also associated with further information, such as the set of real-world entities mentioned in them. For example, news articles are usually related to several people, organizations, countries or locations. Since those associated entities carry rich information, it is highly desirable to build more expressive, entity-based topic models, which can capture the term distributions for each topic, each entity, as well as each topic-entity pair. In this paper, we therefore introduce a novel Entity Topic Model (ETM) for documents that are associated with a set of entities. ETM not only models the generative process of a term given its topic and entity information, but also models the correlation of entity term distributions and topic term distributions. A Gibbs sampling-based algorithm is proposed to learn the model. Experiments on real datasets demonstrate the effectiveness of our approach over several state-of-the-art baselines.
Pages: 349-358
Page count: 10