Incremental Author Name Disambiguation for Scientific Citation Data

Cited by: 10
Authors
Zhao, Zhengqiao [1]
Rollins, Jason [2]
Bai, Linge [2]
Rosen, Gail [1]
Affiliations
[1] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
[2] Clarivate Analyt, San Francisco, CA 94016 USA
Source
2017 IEEE INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA) | 2017
Funding
National Science Foundation (USA)
Keywords
DOI
10.1109/DSAA.2017.17
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Name disambiguation is a perennial challenge for any large and growing dataset, but it is particularly significant for scientific publication data, where documents and ideas are linked through citations and depend on highly accurate authorship. Differentiating personal names in scientific publications is a substantial problem because many names are not sufficiently distinct, given the large number of researchers active in most academic disciplines today. As more documents and citations are published every year, any system built on this data must be continually retrained and reclassified to remain relevant and helpful. Recently, some incremental learning solutions have been proposed, but most of these have been limited to small-scale simulations and do not exhibit the full heterogeneity of the millions of authors and papers in real-world data. In our work, we propose a probabilistic model that simultaneously uses a rich set of metadata and reduces the number of pairwise comparisons needed for new articles. We suggest an approach to disambiguation that classifies in an incremental fashion, which alleviates the need to retrain the model and re-cluster all papers, and that uses fewer parameters than other algorithms. Using a published dataset, we obtained the highest K-measure, which is the geometric mean of cluster purity and author-class purity. Moreover, on a difficult author block from the Clarivate Analytics Web of Science, we obtain higher precision than other algorithms.
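The K-measure mentioned in the abstract can be illustrated with a short sketch. The abstract only states that it is a geometric mean of cluster purity and author-class purity; the code below assumes the definitions commonly used in the author name disambiguation literature (average cluster purity, ACP, and average author purity, AAP), which may differ in detail from what the paper computes. The function name `k_metric` and its arguments are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import sqrt


def k_metric(predicted, truth):
    """Return (ACP, AAP, K) for two parallel label sequences.

    predicted[i] is the cluster assigned to paper i and truth[i] is its
    true author. Assumed definitions (not taken from the abstract):
      ACP = (1/N) * sum over clusters c, authors a of n_ca^2 / |c|
      AAP = (1/N) * sum over authors a, clusters c of n_ca^2 / |a|
      K   = sqrt(ACP * AAP)
    """
    if len(predicted) != len(truth):
        raise ValueError("label sequences must have the same length")
    n = len(predicted)
    if n == 0:
        raise ValueError("at least one paper is required")

    overlap = Counter(zip(predicted, truth))  # n_ca: papers shared by cluster c and author a
    cluster_sizes = Counter(predicted)        # |c|
    author_sizes = Counter(truth)             # |a|

    acp = sum(cnt * cnt / cluster_sizes[c] for (c, _), cnt in overlap.items()) / n
    aap = sum(cnt * cnt / author_sizes[a] for (_, a), cnt in overlap.items()) / n
    return acp, aap, sqrt(acp * aap)


# Merging two authors into one cluster keeps AAP at 1.0 but halves ACP.
acp, aap, k = k_metric([0, 0, 0, 0], ["a", "a", "b", "b"])
```

Under these assumed definitions, ACP penalizes clusters that mix several authors, AAP penalizes authors whose papers are split across clusters, and the geometric mean rewards a method only when both are high.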
Pages: 175-183
Number of pages: 9