Probabilistic latent semantic indexing

Cited by: 2588
Authors
Hofmann, T [1]
Affiliation
[1] Int Comp Sci Inst, Berkeley, CA 94704 USA
Source
SIGIR'99: PROCEEDINGS OF 22ND INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL | 1999
DOI
10.1145/312624.312649
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
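The fitting procedure the abstract describes, a latent class model for count data trained by a generalization of Expectation Maximization, can be illustrated with a minimal sketch. It assumes the aspect-model factorization P(d, w) = sum_z P(z) P(d|z) P(w|z) and a tempered E-step with exponent `beta`; neither formula appears in the abstract, and the function name `plsa` and all of its parameters are hypothetical choices for illustration, not the author's implementation.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=50, beta=1.0, seed=0):
    """Fit P(d, w) = sum_z P(z) P(d|z) P(w|z) to a count matrix by EM.

    counts : (n_docs, n_words) array of term counts n(d, w).
    beta   : tempering exponent (assumed form); beta=1.0 gives plain EM.
    Returns (p_z, p_d_z, p_w_z) with shapes (K,), (K, D), (K, W).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialization, normalized so each distribution sums to 1.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z) * [P(d|z) P(w|z)]^beta.
        joint = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)

        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w).
        weighted = counts[None, :, :] * post  # shape (K, D, W)
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True) + eps
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum() + eps

    return p_z, p_d_z, p_w_z


# Toy corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
toy_counts = np.array([[4, 3, 0, 0],
                       [5, 2, 0, 0],
                       [0, 0, 3, 4],
                       [0, 0, 2, 5]])
p_z, p_d_z, p_w_z = plsa(toy_counts, n_topics=2)
```

The dense (K, D, W) posterior array keeps the sketch short but is only practical for toy data; a realistic corpus would need a sparse formulation that visits only nonzero counts.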
Pages: 50-57
Number of pages: 8