A Hidden Topic-Based Framework toward Building Applications with Short Web Documents

Cited by: 83
Authors
Xuan-Hieu Phan [1]
Cam-Tu Nguyen
Dieu-Thu Le [2]
Le-Minh Nguyen [3]
Susumu Horiguchi [1, 5]
Quang-Thuy Ha [4]
Affiliations
[1] Tohoku Univ, Grad Sch Informat Sci, Dept Comp Sci, Sendai, Miyagi 980, Japan
[2] Univ Trent, Dept Informat Engn & Comp Sci, Trento, Italy
[3] Japan Adv Inst Sci & Technol, Grad Sch Informat Sci, Nomi, Ishikawa 9231292, Japan
[4] Vietnam Natl Univ, Coll Technol, Hanoi, Vietnam
[5] Tohoku Univ, Dept Informat Engn, Fac Engn, Sendai, Miyagi 980, Japan
Keywords
Web mining; hidden topic analysis; sparse data; classification; matching; ranking; contextual advertising; LATENT; MODEL;
DOI
10.1109/TKDE.2010.27
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework addresses two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to a lack of shared words and contexts among documents, while the latter are major linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when included, can make short documents less sparse and more topic-oriented. Furthermore, hidden topics from universal data sets help handle unseen data better. The proposed framework can also be applied to different natural languages and data domains. We carefully evaluated the framework by carrying out two experiments for two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets, and we achieved significant results.
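The core idea in the abstract, adding hidden topics to a short document so that it shares features with other documents on the same subject, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `TOPIC_WORD` table stands in for a topic model estimated on a universal data set, its values are invented, and the scoring rule is a crude simplification rather than the inference procedure used in the paper.

```python
# Hypothetical word-given-topic probabilities, standing in for a topic model
# estimated on a large universal data set. All values are made up.
TOPIC_WORD = {
    "topic_sports": {"match": 0.30, "team": 0.30, "goal": 0.25, "score": 0.15},
    "topic_finance": {"bank": 0.35, "stock": 0.30, "market": 0.25, "score": 0.10},
}


def topic_distribution(tokens, topic_word, smoothing=1e-3):
    """Crude topic proportions: sum each topic's probability over the
    observed words, add a small smoothing constant, then normalise."""
    weights = {topic: smoothing for topic in topic_word}
    for token in tokens:
        for topic, probs in topic_word.items():
            weights[topic] += probs.get(token, 0.0)
    total = sum(weights.values())
    return {topic: weight / total for topic, weight in weights.items()}


def enrich(tokens, topic_word, threshold=0.5):
    """Return the tokens plus a pseudo-word for every topic whose
    estimated proportion reaches the (assumed) threshold."""
    dist = topic_distribution(tokens, topic_word)
    extra = [topic for topic, p in sorted(dist.items()) if p >= threshold]
    return list(tokens) + extra


if __name__ == "__main__":
    # Two snippets with no words in common.
    print(enrich(["team", "goal"], TOPIC_WORD))
    print(enrich(["match", "score"], TOPIC_WORD))
```

In this toy setting the two snippets share no original words, yet both gain the same topic pseudo-word, which is the sense in which hidden topics make short documents less sparse.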
Pages: 961-976
Page count: 16