Iterative automated record linkage using mixture models

Cited by: 98
Authors
Larsen, MD [1]
Rubin, DB [2]
Affiliations
[1] Univ Chicago, Dept Stat, Chicago, IL 60637 USA
[2] Harvard Univ, Dept Stat, Cambridge, MA 02138 USA
Keywords
administrative records; census; expectation-maximization; expectation-conditional maximization; file matching; latent-class models; post-enumeration survey
DOI
10.1198/016214501750332956
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
The goal of record linkage is to link quickly and accurately records that correspond to the same person or entity. Whereas certain patterns of agreements and disagreements on variables are more likely among records pertaining to a single person than among records for different people, the observed patterns for pairs of records can be viewed as arising from a mixture of matches and nonmatches. Mixture model estimates can be used to partition record pairs into two or more groups that can be labeled as probable matches (links) and probable nonmatches (nonlinks). A method is proposed and illustrated that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data, as they become available, into estimates of model parameters, and classifies pairs as links, nonlinks, or in need of further clerical review. The procedure is illustrated with five datasets from the U.S. Bureau of the Census. It appears to be robust to variations in record-linkage sites. The clerical review corrects classifications of some pairs directly and leads to changes in classification of others through reestimation of mixture models.
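The mixture idea in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the simplest case, a two-class latent-class model in which binary agreement indicators are conditionally independent within each class, fitted by plain EM. The function names, the synthetic data, the starting values, and the 0.9 / 0.1 posterior cutoffs are all hypothetical choices made for illustration.

```python
import random


def fit_two_class_mixture(patterns, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-class mixture over binary agreement patterns.

    Assumption (not from the paper): fields agree independently within
    each class. p is the match proportion; m[k] and u[k] are the
    agreement probabilities on field k for matches and nonmatches.
    """
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    eps = 1e-6  # keeps estimated probabilities away from exactly 0 or 1
    weights = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        weights = []
        for g in patterns:
            like_m, like_u = p, 1.0 - p
            for k, agree in enumerate(g):
                like_m *= m[k] if agree else 1.0 - m[k]
                like_u *= u[k] if agree else 1.0 - u[k]
            weights.append(like_m / (like_m + like_u))
        # M-step: weighted proportions update p, m, and u.
        total_m = sum(weights)
        total_u = len(patterns) - total_m
        p = min(max(total_m / len(patterns), eps), 1.0 - eps)
        for k in range(n_fields):
            agree_m = sum(w for w, g in zip(weights, patterns) if g[k])
            agree_u = sum(1.0 - w for w, g in zip(weights, patterns) if g[k])
            m[k] = min(max(agree_m / total_m, eps), 1.0 - eps)
            u[k] = min(max(agree_u / total_u, eps), 1.0 - eps)
    # weights come from the final E-step of the loop.
    return p, m, u, weights


def classify(weight, upper=0.9, lower=0.1):
    """Three-way decision; the cutoffs are hypothetical, not from the paper."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "nonlink"
    return "clerical review"


# Synthetic illustration: 200 true matches, then 1800 true nonmatches,
# each compared on 5 fields.
rng = random.Random(0)
true_m, true_u = 0.9, 0.1
pairs = [tuple(int(rng.random() < true_m) for _ in range(5)) for _ in range(200)]
pairs += [tuple(int(rng.random() < true_u) for _ in range(5)) for _ in range(1800)]

p_hat, m_hat, u_hat, weights = fit_two_class_mixture(pairs)
labels = [classify(w) for w in weights]
```

Pairs labeled "clerical review" are those a clerk would inspect. In the approach the abstract describes, the reviewed pairs would then inform reestimation of the model parameters, a step this sketch leaves out.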
Pages: 32-41
Number of pages: 10