Machine learning for cross-gazetteer matching of natural features

Cited by: 18
Authors
Acheson, Elise [1]
Volpi, Michele [2, 3]
Purves, Ross S. [1]
Affiliations
[1] Univ Zurich, Dept Geog, Zurich, Switzerland
[2] Swiss Fed Inst Technol, Swiss Data Sci Ctr, Zurich, Switzerland
[3] EPFL Lausanne, Lausanne, Switzerland
Keywords
Gazetteer matching; record linking; random forest; natural features; feature types; ontology
DOI
10.1080/13658816.2019.1599123
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Defining and identifying duplicate records in a dataset is a challenging task which grows more complex when the modeled entities themselves are hard to delineate. In the geospatial domain, it may not be clear where a mountain, stream, or valley ends and begins, a problem carried over when such entities are catalogued in gazetteers. In this paper, we take two gazetteers, GeoNames and SwissNames3D, and perform matching - identifying records in each that are about the same entity - across a sample of natural feature records. We first perform rule-based matching, establishing competitive results, then apply machine learning using Random Forests, a method well-suited to the matching task. We report on the performance of a wider array of matching features than has been previously studied, including domain-specific ones such as feature type, land cover class, and elevation. Our results show an increase in performance using machine learning over rules, with a notable performance gain from considering feature types, but negligible gains from other specialized matching features. We argue that future work in this area should strive to be more reproducible and report results on a realistic testing pipeline including candidate selection, feature extraction, and classification.
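The pipeline named in the abstract (candidate selection, feature extraction, classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout, the sample records, the thresholds, and the function names are all assumptions made for illustration, and feature types are assumed to have been harmonised to a shared vocabulary beforehand. Only the rule-based step is shown; the same feature dictionaries could instead be passed to a Random Forest classifier.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def select_candidates(record, others, max_km=5.0):
    """Candidate selection: keep only records within max_km (assumed radius)."""
    return [
        o for o in others
        if haversine_km(record["lat"], record["lon"], o["lat"], o["lon"]) <= max_km
    ]


def extract_features(a, b):
    """Feature extraction: name similarity, distance, type agreement, elevation gap."""
    return {
        "name_sim": SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        "dist_km": haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
        "same_type": a["type"] == b["type"],
        "elev_diff_m": abs(a["elev"] - b["elev"]),
    }


def rule_match(f, min_name_sim=0.8, max_km=2.0):
    """Rule-based classification with illustrative, untuned thresholds."""
    return f["name_sim"] >= min_name_sim and f["dist_km"] <= max_km and f["same_type"]


# Invented sample records; values are approximate and for illustration only.
source_record = {"name": "Matterhorn", "lat": 45.9763, "lon": 7.6586,
                 "type": "mountain", "elev": 4478}
target_records = [
    {"name": "Matterhorn", "lat": 45.9766, "lon": 7.6585, "type": "mountain", "elev": 4477},
    {"name": "Matterhorngletscher", "lat": 45.9850, "lon": 7.6600, "type": "glacier", "elev": 3000},
    {"name": "Rigi", "lat": 47.0567, "lon": 8.4853, "type": "mountain", "elev": 1797},
]

candidates = select_candidates(source_record, target_records)
matches = [c for c in candidates if rule_match(extract_features(source_record, c))]
print([m["name"] for m in matches])
```

The elevation difference is computed but not used by the rule; it is included only to show where a domain-specific feature such as elevation or land cover class would enter the feature set.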
Pages: 708-734
Number of pages: 27