Identity matching and information acquisition: Estimation of optimal threshold parameters

Citations: 0
Authors
Alirezazadeh, Pantea [1]
Boylu, Fidan [2]
Garfinkel, Robert [2]
Gopal, Ram [2]
Goes, Paulo [3]
Affiliations
[1] Univ Connecticut, Dept Operat Management & Informat Syst, Storrs, CT 06269 USA
[2] Univ Connecticut, Dept Operat Management & Informat Syst, Sch Business, Storrs, CT USA
[3] Univ Arizona, Dept Management Informat Syst, Tucson, AZ 85721 USA
Keywords
Data quality; Statistical estimation; Sampling distributions; Record matching; Information acquisition; Type I and Type II errors; HETEROGENEOUS DATABASES; RECORD LINKAGE; DECISION-MODEL;
DOI
10.1016/j.dss.2013.08.014
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the growing volume of collected and stored data from customer interactions that have recently shifted towards online channels, an important challenge faced by today's businesses is appropriately dealing with data quality problems. A key step in the data cleaning process is the matching and merging of customer records to assess the identity of individuals. The practical importance of this research is exemplified by a large client firm that deals with private label credit cards. The firm needed to know whether histories of new customers already existed within the company, in order to decide on the appropriate parameters of possible card offerings. The company incurs substantial costs if it incorrectly "matches" an incoming application with an existing customer (Type I error), and also if it falsely assumes that there is no match (Type II error). While a good deal of generic identity matching software is available that provides a "strength" score for each potential match, the question of how to use the scores for new applications is of great interest and is addressed in this work. The academic significance lies in the analysis of the score thresholds that are typically used in decision making. That is, upper and lower thresholds are set, where matches are accepted above the former, rejected below the latter, and more information is gathered between the two. We show, for the first time, that the optimal thresholds can be considered to be parameters of a matching distribution, and a number of estimators of these parameters are developed and analyzed. Then extensive computations show the effects of various factors on the convergence rates of the estimates. © 2013 Elsevier B.V. All rights reserved.
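The two-threshold rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' estimation procedure; it only shows the accept / reject / gather-more-information decision once thresholds are given. The names `Decision` and `classify_match`, the score scale, the example threshold values, and the treatment of scores exactly equal to a threshold are all assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"                      # treat the records as the same person
    REJECT = "reject"                      # treat the records as different people
    GATHER_MORE_INFO = "gather_more_info"  # score is inconclusive


def classify_match(score: float, lower: float, upper: float) -> Decision:
    """Apply a two-threshold rule to a match-strength score.

    Assumption: a score exactly equal to either threshold falls in the
    middle band; the abstract does not specify boundary handling.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return Decision.ACCEPT
    if score < lower:
        return Decision.REJECT
    return Decision.GATHER_MORE_INFO


# Hypothetical thresholds on a hypothetical 0-to-1 score scale.
LOWER, UPPER = 0.3, 0.8
for s in (0.95, 0.50, 0.10):
    print(s, classify_match(s, LOWER, UPPER).value)
```

Raising the upper threshold or lowering the lower one widens the middle band, which trades fewer Type I and Type II errors for more frequent information acquisition. The paper's contribution concerns how to estimate the optimal values of these two thresholds, which this sketch simply takes as inputs.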
Pages: 160-171
Number of pages: 12