Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data

Cited by: 9
Authors
Waruru, Anthony [1]
Natukunda, Agnes [2]
Nyagah, Lilly M. [3]
Kellogg, Timothy A. [4]
Zielinski-Gutierrez, Emily [1]
Waruiru, Wanjiru [4]
Masamaro, Kenneth [1]
Harklerode, Richelle [4]
Odhiambo, Jacob [5]
Manders, Eric-Jan [6]
Young, Peter W. [1]
Affiliations
[1] Ctr Dis Control & Prevent, Div Global HIV & TB, POB 606, Nairobi 00621, Kenya
[2] Univ Calif San Francisco, Global Programs Res & Training, San Francisco, CA 94143 USA
[3] Minist Hlth, Natl AIDS & STI Control Program, Nairobi, Kenya
[4] Univ Calif San Francisco, Inst Global Hlth Sci, San Francisco, CA 94143 USA
[5] Palladium Grp, Nairobi, Kenya
[6] Ctr Dis Control & Prevent, Div Global HIV & TB, Atlanta, GA USA
Keywords
deterministic matching; score-based matching; HIV case-based surveillance; unique case identification; universal health care identifier;
DOI
10.2196/10436
Chinese Library Classification number
R1 [Preventive medicine, hygiene];
Discipline classification code
1004 ; 120402 ;
Abstract
Background: A universal health care identifier (UHID) facilitates the development of longitudinal medical records in health care settings where follow-up and tracking of persons across health care sectors are needed. HIV case-based surveillance (CBS) entails longitudinal follow-up of HIV cases from diagnosis through linkage to care and treatment, and is recommended for second-generation HIV surveillance. In the absence of a UHID, records matching, linking, and deduplication may be done using score-based persons matching algorithms. We present a stepwise process of score-based persons matching based on demographic data to improve HIV CBS and other longitudinal data systems.
Objective: The aim of this study is to compare deterministic and score-based persons matching algorithms for records linkage and matching using demographic data in settings without a UHID.
Methods: We used HIV CBS pilot data from 124 facilities in 2 high HIV-burden counties (Siaya and Kisumu) in western Kenya. For efficient processing, data were grouped into 3 scenarios: (1) within HIV testing services (HTS), (2) HTS-care, and (3) within care. In deterministic matching, we directly compared identifiers and pseudo-identifiers from medical records to determine matches. We used the R stringdist package for the Jaro and Jaro-Winkler score-based matching methods and for the Levenshtein and Damerau-Levenshtein string edit distance methods. For the Jaro-Winkler method, we used a penalty (p) of 0.1; for Levenshtein and Damerau-Levenshtein, we applied 4 weights (ω): deletion ω=0.8, insertion ω=0.8, substitution ω=1, and transposition ω=0.5.
Results: We abstracted 12,157 cases, of which 4073/12,157 (33.5%) were from HTS, 1091/12,157 (9.0%) from HTS-care, and 6993/12,157 (57.5%) from within care. The deterministic process identified 435/12,157 (3.6%) duplicate records, yielding 96.4% (11,722/12,157) unique cases. Overall, of the score-based methods, Jaro-Winkler yielded the most duplicate records (686/12,157, 5.6%), Jaro yielded the fewest (546/12,157, 4.5%), and Levenshtein and Damerau-Levenshtein each yielded 4.6% (563/12,157). By method, the duplicate records were: (1) Jaro, 5.7% (234/4073) within HTS, 0.4% (4/1091) in HTS-care, and 4.4% (308/6993) within care; (2) Jaro-Winkler, 7.4% (302/4073) within HTS, 0.5% (6/1091) in HTS-care, and 5.4% (378/6993) within care; (3) Levenshtein, 6.4% (262/4073) within HTS, 0.4% (4/1091) in HTS-care, and 4.2% (297/6993) within care; and (4) Damerau-Levenshtein, 6.4% (262/4073) within HTS, 0.4% (4/1091) in HTS-care, and 4.2% (297/6993) within care.
Conclusions: Without deduplication, overreporting occurs across the care and treatment cascade. Jaro-Winkler score-based matching performed best in identifying matches. A pragmatic estimate of duplicates in health care settings can provide a corrective factor for modeled estimates and for targeting and program planning. We propose that, even without a UHID, a standard national deduplication and persons-matching algorithm that uses demographic data would improve accuracy in monitoring HIV care clinical cascades.
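The scoring methods named in the Methods can be illustrated with a minimal Python sketch. This is not the study's code: the authors used the R stringdist package, and the function names here (`jaro`, `jaro_winkler`, `weighted_edit_distance`) are hypothetical. The sketch assumes that p=0.1 acts as the Jaro-Winkler common-prefix scaling factor (capped at 4 characters), that the four weights apply per edit operation, and that transpositions are handled by the restricted (optimal string alignment) variant of Damerau-Levenshtein. It returns similarities for Jaro and Jaro-Winkler and a weighted cost for the edit distances.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


def weighted_edit_distance(a, b, w_del=0.8, w_ins=0.8, w_sub=1.0, w_trans=None):
    """Weighted Levenshtein cost; pass w_trans to also allow adjacent
    transpositions (restricted Damerau-Levenshtein, an assumption here)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * w_del
    for j in range(1, cols):
        d[0][j] = j * w_ins
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            best = min(d[i - 1][j] + w_del,
                       d[i][j - 1] + w_ins,
                       d[i - 1][j - 1] + (0.0 if same else w_sub))
            if (w_trans is not None and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + w_trans)
            d[i][j] = best
    return d[-1][-1]
```

For example, `jaro_winkler("MARTHA", "MARHTA")` scores higher than `jaro("MARTHA", "MARHTA")` because the two strings share the prefix "MAR", and `weighted_edit_distance("MARTHA", "MARHTA", w_trans=0.5)` charges a single transposition rather than a deletion plus an insertion. Any cutoff for declaring two records a match is left to the caller; the abstract does not state the thresholds used.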
Pages: 205-216
Page count: 12