The Challenge of Pairing Big Datasets: Probabilistic Record Linkage Methods and Diagnosis of Their Empirical Viability

Cited by: 0
Authors
Peng Y. [1]
Mation L.F. [2]
Affiliations
[1] Brazilian Secretariat for Economic Policy, Esplanada dos Ministérios, Bloco P, 324, Brasília, 70048-900, DF
[2] Brazilian Institute of Applied Economic Research, Setor Bancário Sul Q. 1 Ed. BNDES, 1514, Brasília, 70076-900, DF
Keywords
Administrative records; Big data; Blocking; R; Record linkage
DOI
10.1007/s41549-020-00043-1
Abstract
In this paper, we evaluate the predictive performance of probabilistic record linkage algorithms and discuss how different configurations of blocking keys, string similarity functions and phonetic codes affect overall predictive performance and computational complexity. We also present a bibliographical survey of the main deterministic and probabilistic record linkage methods, of recent advances that incorporate machine learning techniques, and of the main packages and implementations available in the open-source R language. The results can provide heuristics for integrating administrative records at the national level and have potential value for the formulation and evaluation of public policies. © 2020, Springer Nature Switzerland AG.
Pages: 35-57
Page count: 22