Machine learning innovations in address matching: A practical comparison of word2vec and CRFs

Cited by: 29
Authors
Comber, Sam [1]
Arribas-Bel, Daniel [1]
Affiliation
[1] Univ Liverpool, Sch Environm Sci, Dept Geog & Planning, Roxby Bldg,74 Bedford St, Liverpool L69 7ZT, Merseyside, England
Funding
Economic and Social Research Council (UK);
Keywords
RATES;
DOI
10.1111/tgis.12522
Chinese Library Classification
P9 [Physical Geography]; K9 [Geography];
Discipline classification codes
0705; 070501;
Abstract
Record linkage is a frequent obstacle to unlocking the benefits of integrated (spatial) data sources. In the absence of unique identifiers to directly join records, practitioners often rely on text-based approaches for resolving candidate pairs of records to a match. In geographic information science, spatial record linkage is a form of geocoding that pertains to the resolution of text-based linkage between pairs of addresses into matches and non-matches. These approaches link text-based address sequences, integrating sources of data that would otherwise remain in isolation. While recent innovations in machine learning have been introduced in the wider record linkage literature, there is significant potential to apply machine learning to the address matching sub-field of geographic information science. As a response, this paper introduces two recent developments in text-based machine learning (conditional random fields and word2vec) that have not been applied to address matching, evaluating their comparative strengths and drawbacks.
Pages: 334-348
Page count: 15