On the combination of domain-specific heuristics for author name disambiguation: the nearest cluster method

Cited by: 26
Authors
Santana A.F. [1]
Gonçalves M.A. [1]
Laender A.H.F. [1]
Ferreira A.A. [2]
Affiliations
[1] Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, Belo Horizonte
[2] Departamento de Computação, Universidade Federal de Ouro Preto, Ouro Preto
Keywords
Heuristics; Name disambiguation; Supervised methods
DOI
10.1007/s00799-015-0158-y
Abstract
Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but with the burden of having to rely on manually labeled training sets for the learning process. Moreover, most supervised solutions just apply some type of generic machine learning solution and do not exploit specific knowledge about the problem. In this article, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize such parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive when compared to these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions. © 2015, Springer-Verlag Berlin Heidelberg.
Pages: 229-246
Number of pages: 17