A Unified Probabilistic Framework for Name Disambiguation in Digital Library

Cited by: 177
Authors
Tang, Jie [1]
Fong, A. C. M. [2]
Wang, Bo [3]
Zhang, Jing [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[2] Auckland Univ Technol, Sch Comp & Math Sci, Auckland 1142, New Zealand
[3] Nanjing Univ Aeronaut & Astronaut, Dept Comp Sci, Nanjing 210016, Peoples R China
Keywords
Digital libraries; information search and retrieval; database applications; heterogeneous databases; MODEL
DOI
10.1109/TKDE.2011.13
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Despite years of research, the name ambiguity problem remains largely unresolved. Outstanding issues include how to capture all information for name disambiguation in a unified approach, and how to determine the number of people K in the disambiguation process. In this paper, we formalize the problem in a unified probabilistic framework, which incorporates both attributes and relationships. Specifically, we define a disambiguation objective function for the problem and propose a two-step parameter estimation algorithm. We also investigate a dynamic approach for estimating the number of people K. Experiments show that our proposed framework significantly outperforms four baseline methods of using clustering algorithms and two other previous methods. Experiments also indicate that the number K automatically found by our method is close to the actual number.
Pages: 975-987
Page count: 13