Your Best Guess When You Know Nothing: Identification and Mitigation of Selection Bias

Cited by: 5
Authors
Dost, Katharina [1]
Taskova, Katerina [1]
Riddle, Patricia [1]
Wicker, Jorg [1]
Affiliations
[1] Univ Auckland, Sch Comp Sci, Auckland, New Zealand
Source
20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2020) | 2020
DOI
10.1109/ICDM50108.2020.00115
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Machine Learning typically assumes that training and test sets are independently drawn from the same distribution, but this assumption is often violated in practice, which creates a bias. Many attempts to identify and mitigate this bias have been proposed, but they usually rely on ground-truth information. But what if the researcher is not even aware of the bias? In contrast to prior work, this paper introduces a new method, IMITATE, to identify and mitigate Selection Bias in the case that we may not know if (and where) a bias is present, and hence no ground-truth information is available. IMITATE investigates the dataset's probability density, then adds generated points in order to smooth out the density and have it resemble a Gaussian, the most common density occurring in real-world applications. If the artificial points focus on certain areas and are not widespread, this could indicate a Selection Bias where these areas are underrepresented in the sample. We demonstrate the effectiveness of the proposed method on both synthetic and real-world datasets. We also point out limitations and future research directions.
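The idea in the abstract (compare the sample's density with a Gaussian and add points where the sample falls short) can be illustrated with a toy one-dimensional sketch. This is not the authors' IMITATE algorithm: the function name `fill_to_gaussian`, the histogram-based density estimate, the choice of 20 bins, and fitting the Gaussian directly to the biased sample are all assumptions made for illustration only.

```python
import numpy as np

def fill_to_gaussian(sample, bins=20, seed=0):
    """Toy 1-D sketch: add points where a histogram falls below a fitted Gaussian.

    Hypothetical helper, not the published IMITATE method.
    Returns (generated_points, bin_centers, deficit_per_bin).
    """
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)

    # Histogram as a crude density estimate of the observed sample.
    counts, edges = np.histogram(sample, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]

    # Assumption: fit the Gaussian to the (possibly biased) sample itself.
    mu, sigma = sample.mean(), sample.std()
    pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    expected = len(sample) * pdf * width  # expected count per bin under the fit

    # Points are only added, never removed: keep the positive shortfall per bin.
    deficit = np.clip(np.round(expected - counts), 0, None).astype(int)

    # Draw the missing points uniformly inside each under-filled bin.
    generated = np.concatenate(
        [rng.uniform(edges[i], edges[i + 1], size=deficit[i]) for i in range(bins)]
    )
    return generated, centers, deficit

# Demo: a standard normal sample with the region (0.5, 1.5) removed.
data = np.random.default_rng(42).normal(size=2000)
biased = data[(data <= 0.5) | (data >= 1.5)]
generated, centers, deficit = fill_to_gaussian(biased)
```

In this demo the generated points should cluster around the removed interval rather than spread evenly, which is the kind of concentration the abstract describes as a possible sign of Selection Bias.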
Pages: 996-1001
Number of pages: 6