When to conduct probabilistic linkage vs. deterministic linkage? A simulation study

Times cited: 67
Authors
Zhu, Ying [1]
Matsuyama, Yutaka [1]
Ohashi, Yasuo [1, 2]
Setoguchi, Soko [3, 4]
Affiliations
[1] Univ Tokyo, Grad Sch Med, Dept Biostat, Tokyo 1130033, Japan
[2] Chuo Univ, Dept Integrated Sci & Engn Sustainable Soc, Tokyo 112, Japan
[3] Duke Univ, Sch Med, Duke Clin Res Inst, Durham, NC USA
[4] Univ Tokyo, Grad Sch Med, Dept Pharmacoepidemiol, Tokyo 1130033, Japan
Keywords
Record linkage; Probabilistic linkage; Deterministic linkage; Simulation study; Comparative validity; Hospital discharge; Linking; Health; Registry; Identifiers; Transparent; Accuracy; Cohort; Claims
DOI
10.1016/j.jbi.2015.05.012
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Introduction: When unique identifiers are unavailable, successful record linkage depends greatly on data quality and the types of variables available. While probabilistic linkage theoretically captures more true matches than deterministic linkage by allowing for imperfection in identifiers, studies have shown inconclusive results, likely due to variations in data quality, implementation of the linkage methodology, and validation method. This simulation study aimed to understand the data characteristics that affect the performance of probabilistic vs. deterministic linkage.
Methods: We created ninety-six scenarios that represent real-life situations using non-unique identifiers. We systematically introduced a range of discriminative power, rate of missing and error, and file size to increase linkage patterns and difficulties. We assessed the performance difference between linkage methods using standard validity measures and computation time.
Results: Across scenarios, deterministic linkage showed an advantage in positive predictive value (PPV), while probabilistic linkage showed an advantage in sensitivity. Probabilistic linkage uniformly outperformed deterministic linkage, as the former generated linkages with a better trade-off between sensitivity and PPV regardless of data quality. However, with a low rate of missing and error in the data, deterministic linkage did not perform significantly worse. The implementation of deterministic linkage in SAS took less than 1 min, and probabilistic linkage took 2 min to 2 h depending on file size.
Discussion: Our simulation study demonstrated that the intrinsic rate of missing and error of the linkage variables was key to choosing between linkage methods. In general, probabilistic linkage was the better choice, but for exceptionally good quality data (<5% error), deterministic linkage was the more resource-efficient choice. (C) 2015 Elsevier Inc. All rights reserved.
Pages: 80-86
Number of pages: 7
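The contrast the abstract draws between the two linkage approaches, and the validity measures it reports (sensitivity and PPV), can be illustrated with a minimal Python sketch. This is not the authors' SAS implementation. The field names, the m- and u-probabilities, the threshold of 10, and the toy records are all hypothetical values chosen for illustration. The probabilistic rule follows the general Fellegi-Sunter idea of summing log-likelihood-ratio weights over fields, with missing values contributing nothing.

```python
import math

# Hypothetical agreement probabilities per field (illustrative assumptions):
# m = P(field agrees | true match), u = P(field agrees | non-match).
M_U = {
    "surname": (0.95, 0.01),
    "birth_date": (0.97, 0.003),
    "sex": (0.99, 0.5),
    "zip": (0.90, 0.05),
}


def deterministic_match(a, b, fields=tuple(M_U)):
    """Link only if every field is present and agrees exactly."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields)


def match_weight(a, b):
    """Sum of log2 agreement/disagreement weights; missing fields add 0."""
    total = 0.0
    for field, (m, u) in M_U.items():
        x, y = a.get(field), b.get(field)
        if x is None or y is None:
            continue  # a missing value is neither agreement nor disagreement
        if x == y:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def probabilistic_match(a, b, threshold=10.0):
    """Link if the total weight reaches a (hypothetical) threshold."""
    return match_weight(a, b) >= threshold


def sensitivity_ppv(links, truth):
    """Sensitivity = TP / true matches; PPV = TP / declared links."""
    tp = len(links & truth)
    sens = tp / len(truth) if truth else float("nan")
    ppv = tp / len(links) if links else float("nan")
    return sens, ppv


# Toy files: b2 has a missing ZIP code; b3 is a different person who
# shares surname, sex, and ZIP code with a3.
file_a = {
    "a1": {"surname": "sato", "birth_date": "1970-01-02", "sex": "F", "zip": "1130033"},
    "a2": {"surname": "suzuki", "birth_date": "1985-05-30", "sex": "M", "zip": "1120001"},
    "a3": {"surname": "tanaka", "birth_date": "1990-10-10", "sex": "F", "zip": "1130033"},
}
file_b = {
    "b1": {"surname": "sato", "birth_date": "1970-01-02", "sex": "F", "zip": "1130033"},
    "b2": {"surname": "suzuki", "birth_date": "1985-05-30", "sex": "M", "zip": None},
    "b3": {"surname": "tanaka", "birth_date": "1991-03-03", "sex": "F", "zip": "1130033"},
}
truth = {("a1", "b1"), ("a2", "b2")}

pairs = [(i, j) for i in file_a for j in file_b]
det_links = {(i, j) for i, j in pairs if deterministic_match(file_a[i], file_b[j])}
prob_links = {(i, j) for i, j in pairs if probabilistic_match(file_a[i], file_b[j])}

print("deterministic (sensitivity, PPV):", sensitivity_ppv(det_links, truth))
print("probabilistic (sensitivity, PPV):", sensitivity_ppv(prob_links, truth))
```

On this toy data the exact-match rule misses the pair with the missing ZIP code, while the weighted rule still links it. That mirrors the sensitivity advantage the abstract reports for probabilistic linkage, although a real comparison would need realistic error rates and file sizes.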