Identification of Organization Name Variants in Large Databases using Rule-based Scoring and Clustering With a Case Study on the Web of Science Database

Cited by: 4
Authors
Caron, Emiel [1]
Daniels, Hennie [1, 2]
Affiliations
[1] Erasmus Univ, Erasmus Res Inst Management, POB 1738, Rotterdam, Netherlands
[2] Tilburg Univ, Ctr Econ Res, POB 90153, Tilburg, Netherlands
Source
PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOL 1 (ICEIS) | 2016
Keywords
Large Scale Databases; Data Warehousing; Database Integration; Data Cleaning; Data Mining; Clustering
DOI
10.5220/0005836701820187
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This research describes a general method for automatically cleaning organizational and business name variants in large databases, such as patent databases, bibliographic databases, databases in business information systems, or any other database containing organizational name variants. The method clusters name variants of organizations based on similarities in their associated metadata, for example postal code and email domain data. It consists of two parts: a rule-based scoring system and a clustering system. The method is tested on the cleaning of research organization names in the Web of Science database for the purpose of bibliometric analysis and scientific performance evaluation. The clustering results are evaluated with metrics such as precision and recall on a verified data set. The evaluation shows that the method performs well and is conservative: it values precision over recall, with on average 95% precision and 80% recall for clusters.
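The two-part design described in the abstract (rule-based scoring, then clustering) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the rule weights, the threshold, the connected-components clustering, the pairwise precision/recall definition, and all record names and metadata values are hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rule weights and link threshold; the abstract does not give the real ones.
RULE_WEIGHTS = {"postal_code": 2, "email_domain": 3}
THRESHOLD = 3


def pair_score(a, b):
    """Sum the weights of the metadata fields on which two records agree."""
    return sum(
        weight
        for field, weight in RULE_WEIGHTS.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def cluster(records):
    """Link pairs scoring >= THRESHOLD and return the connected components."""
    parent = {name: name for name in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if pair_score(records[a], records[b]) >= THRESHOLD:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for name in records:
        groups[find(name)].add(name)
    return sorted(sorted(group) for group in groups.values())


def _pairs(clusters):
    """All unordered pairs of names that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


def precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted clusters against verified ones."""
    pred, true = _pairs(predicted), _pairs(truth)
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(true) if true else 1.0
    return precision, recall


# Invented example records (name variant -> metadata), for illustration only.
records = {
    "Example Univ": {"postal_code": "1111 AA", "email_domain": "example.edu"},
    "Example University": {"postal_code": "1111 AA", "email_domain": "example.edu"},
    "Ex Univ": {"postal_code": None, "email_domain": "example.edu"},
    "Other Inst": {"postal_code": "2222 BB", "email_domain": "other.example"},
}
clusters = cluster(records)
```

In this sketch a shared email domain alone is enough to link two variants, while a shared postal code alone is not. Raising the threshold would make the clustering more conservative, trading recall for precision in the way the abstract describes.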
Pages: 182-187
Number of pages: 6